Intel AMX (Advanced Matrix Extensions) Explained

By Vladimir Kaplarevic
Published: November 2, 2023

For many SMEs, once-elusive concepts like AI, machine learning, and big data are now tangible business opportunities. To support this AI-driven transformation, organizations need significantly more processing power.

This is why Intel has equipped its latest Xeon Scalable processors with the AMX extension, designed to accelerate high-performance computing workloads.

Learn how AMX works and how to use it to optimize your AI development pipelines.


What Is Intel AMX (Advanced Matrix Extensions)?

Intel Advanced Matrix Extensions (AMX) is an instruction set extension integrated into 4th Gen (Sapphire Rapids), 5th Gen (Emerald Rapids), and now 6th Gen (Granite Rapids) Intel Xeon Scalable CPUs. The AMX extension is designed to accelerate matrix-oriented operations, which are primarily used in training deep neural networks (DNNs) and AI inference.

By introducing new data types and instructions, AMX aims to streamline matrix multiplication and accumulation operations and reduce power consumption.

Deep learning frameworks, like PyTorch and TensorFlow, can leverage AMX data types and instructions. This enables developers to run and optimize AI inference routines without needing to handle hardware specifics.
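For example, PyTorch's CPU automatic mixed precision can route eligible operations through BF16 kernels, which dispatch to AMX instructions on supporting Xeons via the oneDNN backend. A minimal sketch (the model and shapes here are illustrative; the same code still runs, just without acceleration, on hardware that lacks AMX):

```python
import torch

# A toy model standing in for a real inference workload.
model = torch.nn.Linear(64, 32)
x = torch.randn(8, 64)

# Under bf16 autocast, oneDNN can pick AMX-backed kernels on
# supporting CPUs; no hardware-specific code is needed.
with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)

print(y.dtype)  # torch.bfloat16
```

The key point is that the context manager, not the model code, selects the lower-precision path, so the same script runs unchanged across hardware generations.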

The relationship between machine learning, deep learning, and AI.

AMX Architecture

AMX uses an innovative tile structure to increase the density of matrix operations. By facilitating parallel computations, it opens the way for significant performance enhancements in AI and machine learning tasks.

The most important components of Intel AMX architecture include:

1. Tile-Based Architecture. Large data chunks are stored in two-dimensional (2D) registers called tiles. AMX provides a set of eight tile registers, each holding 1 kilobyte of data arranged as 16 rows of 64 bytes. Tiles are designed to keep data near the execution units, which improves data reuse and reduces memory bandwidth requirements for matrix operations.

2. Tile Matrix Multiplication (TMUL): TMUL is an accelerator engine that performs matrix multiply-accumulate computations directly on tiles. It handles the dense linear algebra workloads essential for AI training and inference.
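Conceptually, a TMUL tile instruction such as TDPBF16PS computes C += A × B, multiplying lower-precision tile elements and accumulating the products into an FP32 tile. A pure-Python sketch of that multiply-accumulate semantics (illustrating the math, not the instruction itself):

```python
def tile_matmul_accumulate(C, A, B):
    """Sketch of TMUL semantics: C += A @ B, where C is the FP32
    accumulator tile and A, B hold the lower-precision inputs.
    All arguments are lists of lists (rows of numbers)."""
    rows, inner, cols = len(A), len(B), len(B[0])
    for i in range(rows):
        for j in range(cols):
            acc = C[i][j]
            for k in range(inner):
                acc += A[i][k] * B[k][j]  # products accumulate into C
            C[i][j] = acc
    return C

# 2x2 example: C starts at zero, so the result is simply A @ B.
C = [[0, 0], [0, 0]]
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(tile_matmul_accumulate(C, A, B))  # [[19, 22], [43, 50]]
```

Because C is an accumulator, repeated TMUL operations can sum partial products from successive tile loads, which is how larger matrices are processed piecewise.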

Intel AMX high level architecture.

The tile-based architecture allows Intel AMX to store more data in each core and compute larger matrices in a single operation.
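Some back-of-the-envelope arithmetic makes that capacity concrete. With 1-kilobyte tiles of 16 rows by 64 bytes, the number of matrix elements per tile depends on the element width; the sketch below derives the counts from the documented tile geometry:

```python
# AMX tile geometry: each tile register is 1 KB, up to 16 rows
# of 64 bytes each.
TILE_BYTES = 1024
ROWS = 16
ROW_BYTES = TILE_BYTES // ROWS  # 64 bytes per row

def elements_per_tile(bytes_per_element: int) -> int:
    """How many matrix elements fit in one fully populated tile."""
    return ROWS * (ROW_BYTES // bytes_per_element)

print(elements_per_tile(1))  # INT8:      16 x 64 = 1024 elements
print(elements_per_tile(2))  # BF16/FP16: 16 x 32 = 512 elements
print(elements_per_tile(4))  # FP32:      16 x 16 = 256 elements
```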

AMX Data Types

The FP32 floating-point format is used in AI workloads for its precision. It is ideal for higher accuracy but requires more computing resources and longer computation times, which may not be practical for all applications.

AMX supports lower precision INT8, BF16, and FP16 data types:

  • INT8 for Inferencing. AMX provides enhanced support for INT8 operations, which are critical for inferencing workloads. INT8 trades precision for throughput, packing more operations into each compute cycle. It requires fewer computing resources, which makes it ideal for real-time applications and matrix multiplication tasks where speed and efficiency take precedence.
  • Bfloat16 (BF16) Support. AMX provides native support for the BF16 floating-point format. This data type occupies 16 bits of computer memory. BF16 delivers intermediate accuracy for most AI training workloads and can also deliver higher accuracy for inferencing if needed. It is particularly useful in ML because it allows models to be trained with almost the same accuracy as when using 32-bit floating-point numbers but at a fraction of the computational cost.
  • FP16 (16-bit IEEE) for training and inference. 6th Gen Xeon (Granite Rapids) processors support the FP16 data type, which fully complies with the IEEE 754 half-precision standard. Since FP16 uses half the memory of FP32, more values can fit into the processor's cache, resulting in faster computations. Many AI frameworks already use FP16 on GPUs. This enables users to deploy FP16-trained models from GPU environments directly on CPUs, without needing to convert data types or retrain the model.
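The reason BF16 retains FP32's dynamic range is that it keeps FP32's 8-bit exponent and simply truncates the 23-bit mantissa to 7 bits: a BF16 value is the top 16 bits of the corresponding FP32 bit pattern. A small illustration of that relationship (using plain truncation; real hardware may apply rounding):

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """Return the 16-bit BF16 pattern: the top half of the FP32 bits."""
    fp32_bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return fp32_bits >> 16  # drop the low 16 mantissa bits

def bf16_to_fp32(bits: int) -> float:
    """Widen a BF16 bit pattern back to FP32 (low mantissa bits = 0)."""
    return struct.unpack(">f", struct.pack(">I", bits << 16))[0]

# Values whose mantissa fits in 7 bits round-trip exactly...
print(bf16_to_fp32(fp32_to_bf16_bits(1.5)))      # 1.5
# ...while others lose low-order precision but keep their magnitude.
print(bf16_to_fp32(fp32_to_bf16_bits(3.14159)))  # 3.140625
```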

Note: Open-source frameworks such as TensorFlow and PyTorch are optimized to use INT8, BF16, or FP16 automatically, based on what the underlying hardware supports.

The tiled architecture and native support for the BF16 and FP16 data types give Intel CPUs with integrated AMX acceleration a significant performance advantage over their predecessors.

The table below shows operations per cycle (per core) for different data types, comparing 3rd Gen Intel Xeon (Intel AVX-512 VNNI) and AMX-equipped Intel Xeon processors.

| Data Type | 3rd Gen Intel Xeon (Intel AVX-512 VNNI) | 4th Gen Intel Xeon (AMX) | 5th Gen Intel Xeon (AMX) | 6th Gen Intel Xeon (AMX+FP16) | Speed Increase |
|---|---|---|---|---|---|
| FP32 | 64 | 1024 | 1024 | 1024 | Up to 16x faster |
| INT8 | 256 | 2048 | 2048 | 2048 | Up to 8x faster |
| BF16 | N/A | 1024 | 1024 | 1024 | Up to 12x faster (when switching from FP32 on 3rd Gen to BF16 on AMX) |
| FP16 | N/A | N/A | N/A | 1024 | Only available in 6th Gen Xeon (Granite Rapids) processors |

Raw operations per cycle for AMX remain constant from Gen 4 through Gen 6. However, 5th and 6th Gen Intel Xeon Scalable processors also benefit from DDR5 memory, higher memory bandwidth, and deliver more consistent performance during sustained workloads.
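The "up to Nx" figures in the table follow directly from the per-cycle numbers. For example:

```python
# Theoretical peak speedups implied by the ops-per-cycle table above.
avx512_fp32, avx512_int8 = 64, 256   # 3rd Gen (AVX-512 / VNNI)
amx_bf16, amx_int8 = 1024, 2048      # AMX (4th Gen and later)

print(amx_bf16 // avx512_fp32)  # 16 -> "up to 16x faster" vs. FP32
print(amx_int8 // avx512_int8)  # 8  -> "up to 8x faster" for INT8
```

These are per-core theoretical peaks; sustained end-to-end gains also depend on memory bandwidth and how well a workload keeps the tiles fed.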

AMX Performance

Relative Throughput

The semiconductor industry has consistently doubled computing power roughly every two years.

The following table shows that AMX delivers performance gains far beyond what the incremental core-count increases alone would provide. Although the number of cores has only increased 2.3 times since the first Intel Xeon Scalable processor, relative throughput has increased more than 15 times.

Performance test parameters include:

  • Reference Point: 1st Gen Intel Xeon Scalable processor (Intel DL Boost Instruction Set).
  • Deep Learning Model: ResNet-50 v1.5 (Batch Inferencing).
  • Framework: TensorFlow.
  • Data Type: INT8.
| Processor Generation | Instruction Set Extension | Cores | Relative Throughput |
|---|---|---|---|
| 1st Gen Intel Xeon Scalable CPU | Intel DL Boost | 28 cores | Baseline |
| 2nd Gen Intel Xeon Scalable CPU | Intel DL Boost | 28 cores | 2x |
| 3rd Gen Intel Xeon Scalable CPU | Intel DL Boost | 40 cores | 4x |
| 4th Gen Intel Xeon Scalable CPU | Intel AMX | 56 cores | 11x |
| 5th Gen Intel Xeon Scalable CPU | Intel AMX | 56-64 cores | ~13x |
| 6th Gen Intel Xeon Scalable CPU | Intel AMX | 64 cores | ~15x |

Architectural enhancements, such as improved AMX tile utilization and higher sustained base frequencies, are the primary reasons for the relative throughput gains in 5th Gen (Emerald Rapids) and 6th Gen (Granite Rapids) Xeon processors.

AI Training Performance Boost

This table illustrates the acceleration in PyTorch training performance across successive generations of Intel Xeon processors. The baseline is the 3rd Gen Intel Xeon Platinum 8380 processor, which uses FP32 precision on AVX-512. The 4th Gen Intel Xeon Platinum 8480+ processor introduces Intel AMX with BF16, while 6th Gen (Granite Rapids) further accelerates AI training throughput by adding support for the FP16 data type.

Performance test details are:

  • Reference Point: 3rd Gen Intel Xeon Platinum 8380 processor (FP32).
  • Framework Used: PyTorch.
  • Data Types: FP32 (Gen 3), BF16 (Gen 4 & 5), FP16 (Gen 6).
| Task/Model | Category | Gen 4 (AMX+BF16) Performance Increase | Gen 5 (AMX+DDR5+BF16) Performance Increase | Gen 6 (AMX+DDR5+FP16) Performance Increase |
|---|---|---|---|---|
| ResNet-50 v1.5 | Image classification | 3x | 3.3x | 6x |
| BERT-large | Natural Language Processing (NLP) | 4x | 4.8x | 9.6x |
| DLRM | Recommendation system | 4x | 4.6x | 8.8x |
| Mask R-CNN | Image segmentation | 4.5x | 5.2x | 9.4x |
| SSD-ResNet-34 | Object detection | 5.4x | 6.2x | 11x |
| RNN-T | Speech recognition | 10.3x | 12.4x | 24x |

Real-Time Inference Performance Boost

As in the previous section, the 3rd Gen Intel Xeon Platinum 8380 processor (FP32) is used as a baseline to illustrate real-time inference performance across AMX-powered Intel Xeon processors. The 4th and 5th Gen Xeon processors introduce support for BF16, while the 6th Gen (Granite Rapids) adds FP16 support.

These lower-precision data types increase throughput without sacrificing model accuracy. Combined with architectural improvements such as DDR5 memory, they enable faster real-time inference at a lower computational cost.

Performance test parameters are:

  • Reference Point: 3rd Gen Intel Xeon Platinum 8380 processor (FP32).
  • Framework: PyTorch.
  • Data Types: FP32 (Gen 3), BF16 (Gen 4 & 5), FP16 (Gen 6).
| Task/Model | Category | Gen 4 (AMX+BF16) Performance Increase | Gen 5 (AMX+DDR5+BF16) Performance Increase | Gen 6 (AMX+DDR5+FP16) Performance Increase |
|---|---|---|---|---|
| ResNeXt101 32x16d | Image classification | 5.70x | 6.6x | 12.5x |
| ResNet-50 v1.5 | Image classification | 6.19x | 7.1x | 13.5x |
| BERT-large | Natural Language Processing (NLP) | 6.25x | 7.5x | 15x |
| Mask R-CNN | Image segmentation | 6.24x | 7.2x | 13x |
| RNN-T | Speech recognition | 8.61x | 10.3x | 20x |
| SSD-ResNet-34 | Object detection | 10.01x | 11.5x | 23x |

Note: Check out our list of the best AI processors.

Intel AMX on phoenixNAP BMC Platform

Owning and maintaining AI infrastructure is not viable for many companies due to the high costs and lack of flexibility.

Transitioning to an AI-oriented environment with OpEx-based access to infrastructure offers substantial financial and strategic benefits.

Note: AI-focused cloud environments that combine edge computing, multi-cloud orchestration, and OpEx-modeled access are increasingly being described as neocloud. Learn more about this emerging industry buzzword in our What Is Neocloud? article.

phoenixNAP's Bare Metal Cloud (BMC) is an OpEx-modeled platform that allows quick provisioning and scaling of dedicated servers via API, CLI, or Web UI.

BMC offers pre-configured server instances powered by Intel Xeon Scalable CPUs with built-in AMX accelerators. By combining the capabilities of Intel AMX and the BMC platform, users can:

  • Deploy, in minutes, enterprise-ready environments optimized for extracting value from large datasets.
  • Leverage tools like Terraform and Ansible to automate deployments and scale AI infrastructure as needed.
  • Accelerate matrix operations to boost AI application accuracy and speed.
  • Reduce time-to-insight with one-click access to CPUs and workload acceleration engines.

The latest generations of Intel Xeon Scalable processors available on Bare Metal Cloud deliver immediate value for the following use cases:

| Application Category | Use Cases |
|---|---|
| Artificial Intelligence | Recommendation systems, natural language processing, image recognition, object detection, machine learning applications, video analytics |
| Data Analytics | Relational database management systems, in-memory databases, big data analytics, data warehousing |
| Networking | Hardware cryptography, packet processing, content delivery networks, security gateways |
| Storage Deployment | Distributed and virtual storage |
| High-Performance Computing (HPC) | Computational fluid dynamics, molecular dynamics, weather simulation, heavy-duty AI training and inference, FinTech, drug discovery |
| Data Security | Confidential computing, regulatory and compliance workloads, federated learning systems |
| Ecommerce | Reduced transaction times, peak-demand management, UX and behavior analysis, automated customer support |

Conclusion

AI-driven solutions will become the norm for most end-users. As the cost of performing matrix computations on large datasets continues to rise, companies must explore solutions that will keep them competitive without breaking the bank.

Use the phoenixNAP BMC platform and Intel AMX to deploy and manage a flexible and scalable AI-focused infrastructure. This combination not only supports varied matrix sizes today but is also adaptable to potentially new matrix types down the line.
