Intel AMX (Advanced Matrix Extensions) Explained

November 2, 2023

Introduction

The recent launch of AI-based end-user applications, like ChatGPT, captured the public's attention and sent ripples throughout the business community. For many SMEs, once-elusive concepts like AI, machine learning, and big data suddenly became tangible business opportunities. To support this AI-driven transformation, organizations need significantly more processing power.

For this reason, Intel has equipped its 4th Gen Xeon Scalable processors with the AMX extension to speed up AI and high-performance computing workloads.

Learn what AMX is and how to use it to optimize your AI development pipelines.

Intel AMX accelerator on the horizon.

What Is Intel AMX (Advanced Matrix Extensions)?

Intel Advanced Matrix Extensions (AMX) is an instruction set extension integrated into 4th Gen Intel Xeon Scalable CPUs. The AMX extension is designed to accelerate matrix-oriented operations, which are primarily used in training deep neural networks (DNNs) and AI inference.

By introducing new data types and instructions, AMX aims to streamline matrix multiplication and accumulation operations and reduce power consumption.

Deep learning frameworks like PyTorch and TensorFlow can leverage AMX data types and instructions. This allows developers to run and optimize AI inferencing routines without handling hardware specifics.
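Before relying on framework-level acceleration, it helps to confirm that the CPU actually exposes AMX. The following is a minimal sketch, assuming a Linux host where the kernel lists CPU features in /proc/cpuinfo using the flag names amx_tile, amx_bf16, and amx_int8. The helper name amx_features is illustrative and not part of any library.

```python
from pathlib import Path

# Assumption: Linux reports AMX support through these CPU feature flags.
AMX_FLAGS = ("amx_tile", "amx_bf16", "amx_int8")


def amx_features(cpuinfo_text):
    """Return a dict showing which AMX flags appear in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags.update(value.split())
    return {name: name in flags for name in AMX_FLAGS}


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")  # Linux only; absent on other systems
    text = path.read_text() if path.exists() else ""
    for name, present in amx_features(text).items():
        print(f"{name}: {'yes' if present else 'no'}")
```

If all three flags are reported, the processor supports AMX tiles together with the BF16 and INT8 instructions discussed in this article.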

The relationship between machine learning, deep learning, and AI.

AMX Architecture

AMX uses an innovative tile structure to increase the density of matrix operations. By facilitating parallel computations, it opens the way for significant performance enhancements in AI and machine learning tasks.

The most important components of Intel AMX architecture include:

1. Tile-Based Architecture. Large data chunks are stored in eight two-dimensional (2D) registers called tiles, each 1 kilobyte in size. Tiles are designed to keep data near the execution units, which improves data reuse and reduces memory bandwidth requirements for matrix operations.

2. Tile Matrix Multiplication (TMUL). TMUL is an accelerator engine attached to the tiles that performs matrix-multiply computations on the data they hold. It targets dense linear algebra workloads, which are essential for AI training and inference.
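To make the tile sizes concrete, the sketch below works out how many elements fit in one tile. The count of eight tiles and the 1 KB size come from the description above. The maximum of 16 rows per tile (and therefore 64 bytes per row) is an assumption taken from Intel's architecture documentation, not from this article.

```python
TILE_COUNT = 8     # tile registers per core (from the description above)
TILE_BYTES = 1024  # 1 KB per tile (from the description above)
TILE_ROWS = 16     # assumption: maximum number of rows in one tile
ROW_BYTES = TILE_BYTES // TILE_ROWS  # bytes available in each row

ELEMENT_BYTES = {"INT8": 1, "BF16": 2}  # storage size of each data type


def tile_shape(dtype):
    """Return (rows, elements_per_row) for a full-size tile of the given type."""
    return TILE_ROWS, ROW_BYTES // ELEMENT_BYTES[dtype]


def total_tile_bytes():
    """Return the combined size of all tile registers in one core."""
    return TILE_COUNT * TILE_BYTES


if __name__ == "__main__":
    for dtype in ELEMENT_BYTES:
        rows, cols = tile_shape(dtype)
        print(f"{dtype}: {rows} x {cols} elements per tile")
    print(f"Tile storage per core: {total_tile_bytes()} bytes")
```

Under these assumptions, halving the element size doubles the number of values a tile can hold, which is one reason lower-precision data types raise throughput.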

Intel AMX high level architecture.

The tile-based architecture allows Intel AMX to store more data in each core and compute larger matrices in a single operation.
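The idea of computing a large matrix product one tile at a time can be sketched in plain Python. This is a simplified illustration, not how the hardware is programmed: real AMX instructions work on packed INT8 or BF16 data with wider accumulators and a specific memory layout, and the function names and the tiny tile size of 2 are chosen here only for readability.

```python
def tile_multiply_accumulate(c, a, b):
    """Add the product of tile a (M x K) and tile b (K x N) to tile c (M x N), in place."""
    for i in range(len(a)):
        for j in range(len(b[0])):
            acc = c[i][j]
            for k in range(len(b)):
                acc += a[i][k] * b[k][j]
            c[i][j] = acc
    return c


def blocked_matmul(a, b, tile=2):
    """Multiply a (M x K) by b (K x N) one tile-sized block at a time."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            # The accumulator block is reused while the K dimension is swept,
            # which mirrors how tiles keep data close to the execution units.
            c_blk = [row[j0:j0 + tile] for row in c[i0:i0 + tile]]
            for k0 in range(0, k, tile):
                a_blk = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_blk = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                tile_multiply_accumulate(c_blk, a_blk, b_blk)
            # Write the finished accumulator block back into the result.
            for di, row in enumerate(c_blk):
                c[i0 + di][j0:j0 + tile] = row
    return c


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[7, 8], [9, 10], [11, 12]]
    print(blocked_matmul(a, b))
```

Each accumulator block is loaded once and updated many times, so fewer values travel between memory and the execution units than in a naive element-by-element loop.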

AMX Data Types

The FP32 floating-point format is used in AI workloads for its precision. It is ideal for higher accuracy but requires more computing resources and longer computation times, which may not be practical for all applications.

AMX supports lower precision INT8 and BF16 data types:

  • INT8 for Inferencing. AMX provides enhanced support for INT8 operations, which are critical for inferencing workloads. The INT8 data type sacrifices precision to process more operations in each compute cycle. It requires fewer computing resources, which makes it ideal for deployment in real-time applications and matrix multiplication tasks where speed and efficiency take precedence.
  • Bfloat16 (BF16) Support. AMX provides native support for the BF16 floating-point format. This data type occupies 16 bits of computer memory. BF16 delivers sufficient accuracy for most AI training workloads and can also deliver higher accuracy than INT8 for inferencing if needed. It is particularly useful in ML because it allows models to be trained with almost the same accuracy as when using 32-bit floating-point numbers but at a fraction of the computational cost.
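The INT8 trade-off between precision and speed can be illustrated with a small quantization sketch. It assumes symmetric quantization with a single shared scale, which is one common scheme. Frameworks may use different schemes, and the function names below are illustrative.

```python
def quantize_int8(values):
    """Map floats to INT8 codes using one shared scale (symmetric quantization)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127 if max_abs else 1.0
    # Round to the nearest code and clamp to the symmetric INT8 range.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from INT8 codes."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    weights = [-2.0, 0.0, 0.5, 2.0]
    codes, scale = quantize_int8(weights)
    restored = dequantize(codes, scale)
    for w, c, r in zip(weights, codes, restored):
        print(f"{w:+.4f} -> {c:+4d} -> {r:+.4f} (error {abs(w - r):.4f})")
```

Each value is stored in one byte instead of four, at the cost of a rounding error of at most half the scale per element.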

Note: Open-source frameworks such as TensorFlow and PyTorch ship with optimizations for INT8 and BF16 operations.
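The relationship between BF16 and FP32 can be shown with a short conversion sketch. BF16 keeps the sign bit and the 8 exponent bits of FP32 and shortens the mantissa to 7 bits, so a BF16 value is effectively the upper 16 bits of an FP32 value. The code below rounds to nearest even and, as a simplification, does not treat NaN or overflow specially. The function names are illustrative.

```python
import struct


def fp32_to_bf16_bits(x):
    """Round an FP32 value to BF16 (nearest even) and return the 16-bit pattern."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add a bias so that dropping the low 16 bits rounds instead of truncating.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_bits_to_fp32(bits16):
    """Expand a BF16 bit pattern back to a float by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", bits16 << 16))[0]


if __name__ == "__main__":
    for value in (1.0, 3.141592653589793, 0.1, 1000.5):
        bf16 = fp32_to_bf16_bits(value)
        print(f"{value!r} -> 0x{bf16:04X} -> {bf16_bits_to_fp32(bf16)!r}")
```

Because the exponent width is unchanged, BF16 covers the same numeric range as FP32 and only loses precision in the trailing digits, which helps explain why training accuracy stays close to FP32.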

The tiled architecture and native support for the BF16 data type give Intel CPUs with integrated AMX acceleration a significant performance advantage over their predecessors.

The table shows performance utilization for different data types, detailing operations per cycle for the 3rd Gen Intel Xeon (Intel AVX-512 VNNI) and 4th Gen Intel Xeon (Intel AMX) processors.

| Metric | 3rd Gen Intel Xeon (Intel AVX-512 VNNI) | 4th Gen Intel Xeon (Intel AMX) | Speed Increase |
| --- | --- | --- | --- |
| Operations per Cycle (Data Type) | 64 (FP32) | 1024 (BF16) | AMX is 16x faster |
| Operations per Cycle (Data Type) | 256 (INT8) | 2048 (INT8) | AMX is 8x faster |
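The speed-increase figures follow directly from the operations-per-cycle values in the table. The short sketch below reproduces the arithmetic, using only the numbers listed above. The dictionary keys are shorthand labels chosen for this example.

```python
# Operations per cycle, copied from the table above.
OPS_PER_CYCLE = {
    "3rd Gen (AVX-512 VNNI)": {"FP32": 64, "INT8": 256},
    "4th Gen (AMX)": {"BF16": 1024, "INT8": 2048},
}


def speedup(new_ops, old_ops):
    """Return how many times more operations per cycle the newer CPU performs."""
    return new_ops / old_ops


if __name__ == "__main__":
    old = OPS_PER_CYCLE["3rd Gen (AVX-512 VNNI)"]
    new = OPS_PER_CYCLE["4th Gen (AMX)"]
    # The first comparison crosses data types: BF16 on AMX against FP32 on AVX-512.
    print(f"BF16 vs FP32: {speedup(new['BF16'], old['FP32']):.0f}x")
    # The second comparison is like for like: INT8 on both generations.
    print(f"INT8 vs INT8: {speedup(new['INT8'], old['INT8']):.0f}x")
```

Note that the 16x figure compares BF16 on the newer processor with FP32 on the older one, so it reflects both the new hardware and the switch to a lower-precision data type.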

Note: 4th Gen Intel Xeon processors can transition between Intel AMX and Intel AVX-512, selecting the most efficient instruction set based on workload requirements.

AMX Performance

Relative Throughput

The semiconductor industry has consistently doubled computing power roughly every two years.

The following table shows that the throughput gains delivered by AMX outpace the growth in core count across Xeon Scalable processor generations. Although the number of cores has only doubled since the 1st Gen Intel Xeon Scalable processor, the relative throughput has increased 11 times.

Performance test parameters include:

  • Reference Point: 1st Gen Intel Xeon Scalable processor (Intel DL Boost Instruction Set).
  • Deep Learning Model: ResNet-50 v1.5 (Batch Inferencing).
  • Framework: TensorFlow.
  • Data Type: INT8.

| Processor Generation | Instruction Set Extension | Cores | Relative Throughput |
| --- | --- | --- | --- |
| Intel Xeon Scalable CPU | Intel DL Boost | 28 cores | Baseline |
| 2nd Gen Intel Xeon Scalable CPU | Intel DL Boost | 28 cores | 2x |
| 3rd Gen Intel Xeon Scalable CPU | Intel DL Boost | 40 cores | 4x |
| 4th Gen Intel Xeon Scalable CPU | Intel AMX | 56 cores | 11x |
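To separate the effect of the instruction set from the effect of adding cores, the throughput figures can be normalized per core. The sketch below uses only the core counts and relative throughput values from the table above. The function name and the short generation labels are chosen for this example.

```python
# (generation, cores, relative throughput) copied from the table above.
GENERATIONS = [
    ("1st Gen", 28, 1.0),
    ("2nd Gen", 28, 2.0),
    ("3rd Gen", 40, 4.0),
    ("4th Gen (AMX)", 56, 11.0),
]


def per_core_gain(cores, throughput, base_cores=28, base_throughput=1.0):
    """Return throughput per core relative to the 1st Gen baseline."""
    return (throughput * base_cores) / (cores * base_throughput)


if __name__ == "__main__":
    for name, cores, throughput in GENERATIONS:
        gain = per_core_gain(cores, throughput)
        print(f"{name}: {throughput:g}x total, {gain:.2f}x per core")
```

Dividing out the core count shows how much of the 11x figure comes from each core doing more work rather than from having more cores.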

AI Training Performance Boost

This table illustrates the acceleration in PyTorch training performance when using the 4th Gen Intel Xeon Platinum 8480+ processor (Intel AMX BF16) compared to the 3rd Gen Intel Xeon Platinum 8380 processor (FP32).

Performance test details are:

  • Reference Point: 3rd Gen Intel Xeon Platinum 8380 processor (FP32).
  • Framework Used: PyTorch.

| Task/Model | Category | Performance Increase |
| --- | --- | --- |
| ResNet-50 v1.5 | Image classification | 3x |
| BERT-large | Natural Language Processing (NLP) | 4x |
| DLRM | Recommendation system | 4x |
| Mask R-CNN | Image segmentation | 4.5x |
| SSD-ResNet-34 | Object detection | 5.4x |
| RNN-T | Speech recognition | 10.3x |

Real-Time Inference Performance Boost

The table below illustrates the generation-to-generation performance increase in PyTorch real-time inference when using the 4th Gen Intel Xeon Platinum 8480+ processor (Intel AMX BF16) compared to the 3rd Gen Intel Xeon Platinum 8380 processor (FP32).

Performance test parameters are:

  • Reference Point: 3rd Gen Intel Xeon Platinum 8380 processor (FP32).
  • Framework: PyTorch.

| Task/Model | Category | Performance Increase |
| --- | --- | --- |
| ResNeXt101 32x16d | Image classification | 5.70x |
| ResNet-50 v1.5 | Image classification | 6.19x |
| BERT-large | Natural Language Processing (NLP) | 6.25x |
| Mask R-CNN | Image segmentation | 6.24x |
| RNN-T | Speech recognition | 8.61x |
| SSD-ResNet-34 | Object detection | 10.01x |

Note: Check out our list of the best AI processors.

Intel AMX on phoenixNAP BMC Platform

Owning and maintaining AI infrastructure is not a viable option for many companies due to the high costs and lack of flexibility.

Transitioning to an AI-oriented environment with OpEx-based access to infrastructure has substantial financial and strategic benefits.

Note: Cloud services and software as a service (SaaS) are prime examples of OpEx-modeled access.

phoenixNAP's Bare Metal Cloud (BMC) is an OpEx-modeled platform that allows quick provisioning and scaling of dedicated servers via API, CLI, or Web UI.

BMC offers pre-configured server instances powered by 4th Gen Intel Xeon Scalable CPUs with built-in AMX accelerators. By combining the capabilities of Intel AMX and the BMC platform, users can:

  • Deploy enterprise-ready environments optimized for extracting value out of large datasets in minutes.
  • Leverage tools like Terraform and Ansible to automate deployments and scale AI infrastructure as needed.
  • Accelerate matrix operations to boost AI application speed and responsiveness.
  • Reduce time-to-insight with one-click access to CPUs and workload acceleration engines.

4th Gen Intel Xeon Scalable processors on Bare Metal Cloud deliver immediate value for the following use cases:

| Application Category | Use Cases |
| --- | --- |
| Artificial Intelligence | Recommendation systems, natural language processing, image recognition, object detection, machine learning applications, video analytics |
| Data Analytics | Relational database management systems, in-memory databases, big data analytics, data warehousing |
| Networking | Hardware cryptography, packet processing, content delivery network, security gateway |
| Storage Deployment | Distributed and virtual storage |
| High-Performance Computing (HPC) | Computational fluid dynamics, molecular dynamics, weather simulation, heavy-duty AI training and inference, FinTech, drug discovery |
| Data Security | Confidential computing, regulatory or compliance workloads, federated learning systems |
| Ecommerce | Reduce transaction time, manage peak demands, UX and behavior analysis, automated customer support |

Conclusion

AI-driven solutions will become the norm for most end-users. As the cost of performing matrix computations on large datasets continues to rise, companies must explore solutions that will keep them competitive without breaking the bank.

Use the phoenixNAP BMC platform and Intel AMX to deploy and manage a flexible and scalable AI-focused infrastructure. This combination not only supports varied matrix sizes today but is also adaptable to potentially new matrix types down the line.

Vladimir Kaplarevic
Vladimir is a resident Tech Writer at phoenixNAP. He has more than 7 years of experience in implementing e-commerce and online payment solutions with various global IT services providers. His articles aim to instill a passion for innovative technologies in others by providing practical advice and using an engaging writing style.