Introduction
The recent launch of AI-based end-user applications, like ChatGPT, captured the public's attention and sent ripples throughout the business community. For many SMEs, once elusive concepts like AI, machine learning, and big data suddenly became tangible business opportunities. To support this AI-driven transformation, organizations need significantly more processing power.
For this reason, Intel has equipped its 4th Gen Xeon Scalable processors with the AMX extension to speed up high-performance computing workloads.
This article explains Intel AMX and shows how to use it to optimize your AI development pipelines.
What Is Intel AMX (Advanced Matrix Extensions)?
Intel Advanced Matrix Extensions (AMX) is an instruction set extension integrated into 4th Gen Intel Xeon Scalable CPUs. The AMX extension is designed to accelerate matrix-oriented operations, which are primarily used in training deep neural networks (DNNs) and AI inference.
By introducing new data types and instructions, AMX aims to streamline matrix multiplication and accumulation operations and reduce power consumption.
Deep learning frameworks like PyTorch and TensorFlow can leverage AMX data types and instructions. This allows developers to run and optimize AI inferencing routines without handling hardware specifics.
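Before relying on framework-level AMX acceleration, it is worth confirming that the CPU actually exposes the feature. A minimal sketch (Linux-only, since it parses `/proc/cpuinfo`) checks for the `amx_tile`, `amx_int8`, and `amx_bf16` flags that AMX-capable kernels advertise:

```python
def parse_cpu_flags(cpuinfo_text: str) -> set:
    """Extract the CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()

def supports_amx(flags: set) -> bool:
    """AMX-capable Linux kernels expose amx_tile, amx_int8, and amx_bf16."""
    return {"amx_tile", "amx_int8", "amx_bf16"} <= flags

if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print("AMX available:", supports_amx(parse_cpu_flags(f.read())))
    except FileNotFoundError:
        print("/proc/cpuinfo not found (non-Linux system)")
```

If the flags are present, recent builds of PyTorch and TensorFlow dispatch to AMX-optimized kernels automatically via their CPU backends.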
AMX Architecture
AMX uses an innovative tile structure to increase the density of matrix operations. By facilitating parallel computations, it opens the way for significant performance enhancements in AI and machine learning tasks.
The most important components of Intel AMX architecture include:
1. Tile-Based Architecture: Large data chunks are stored in a set of eight two-dimensional (2D) registers called tiles, each 1 kilobyte in size. Tiles keep data close to the execution units, which improves data reuse and reduces memory bandwidth requirements for matrix operations.
2. Tile Matrix Multiplication (TMUL): TMUL is an accelerator engine that performs matrix-multiply computations on the data held in tiles. It handles the dense linear algebra workloads essential for AI training and inference.
The tile-based architecture allows Intel AMX to store more data in each core and compute larger matrices in a single operation.
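The tiling idea can be illustrated in plain Python. A blocked matrix multiply processes small square sub-blocks so that each chunk of data, once loaded, is reused for many multiply-accumulate operations, which is the same principle AMX applies with its 2D tile registers (the tile size below is an arbitrary illustrative value, not the hardware tile dimension):

```python
def tiled_matmul(a, b, tile=4):
    """Blocked (tiled) matrix multiply: C = A @ B.

    Working on tile x tile sub-blocks mirrors how AMX operates on 2D
    tile registers: each sub-block is loaded once and reused across
    many multiply-accumulate operations.
    """
    n, k = len(a), len(a[0])
    m = len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # Multiply-accumulate one tile of A against one tile of B.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c
```

On real hardware, the payoff comes from cache locality and register reuse rather than from the loop order itself; this sketch only demonstrates the blocking pattern.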
AMX Data Types
The FP32 floating-point format is used in AI workloads for its precision. It is ideal for higher accuracy but requires more computing resources and longer computation times, which may not be practical for all applications.
AMX supports lower precision INT8 and BF16 data types:
- INT8 for Inferencing. AMX provides enhanced support for INT8 operations, which are critical for inferencing workloads. The INT8 data type trades precision for throughput, packing more operations into each compute cycle. It requires fewer computing resources, which makes it ideal for deployment in real-time applications and matrix multiplication tasks where speed and efficiency take precedence.
- Bfloat16 (BF16) Support. AMX provides native support for the BF16 floating-point format. This data type occupies 16 bits of computer memory. BF16 delivers intermediate accuracy for most AI training workloads and can also deliver higher accuracy for inferencing if needed. It is particularly useful in ML because it allows models to be trained with almost the same accuracy as when using 32-bit floating-point numbers but at a fraction of the computational cost.
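A short sketch makes the two trade-offs above concrete. BF16 keeps FP32's 8 exponent bits (so the numeric range is preserved) but truncates the mantissa to 7 bits, while symmetric INT8 quantization maps floats onto `[-127, 127]` integers via a scale factor. Real hardware rounds BF16 to nearest even; plain truncation is used here to keep the illustration short:

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """Truncate FP32 to BF16: keep sign, 8 exponent bits, top 7 mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return bits >> 16

def bf16_bits_to_fp32(bits: int) -> float:
    """Re-expand BF16 bits to FP32 by zero-filling the discarded mantissa bits."""
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]

def quantize_int8(values, scale):
    """Symmetric INT8 quantization: map floats onto [-127, 127] integers."""
    return [max(-127, min(127, round(v / scale))) for v in values]

# Pi survives the BF16 round trip with its magnitude intact but loses
# fine-grained precision (7 mantissa bits instead of 23).
bf16_pi = bf16_bits_to_fp32(fp32_to_bf16_bits(3.14159265))
```

Powers of two such as 1.0 round-trip exactly through BF16, which is why many normalized weights lose almost nothing in practice.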
Note: Open-source frameworks such as TensorFlow and PyTorch are optimized for INT8 and BF16 operations by default.
The tiled architecture and native support for the BF16 data type give Intel CPUs with integrated AMX acceleration a significant performance advantage over their predecessors.
The table shows performance utilization for different data types, detailing operations per cycle for the 3rd Gen Intel Xeon (Intel AVX-512 VNNI) and 4th Gen Intel Xeon (Intel AMX) processors.
| Metric | 3rd Gen Intel Xeon (Intel AVX-512 VNNI) | 4th Gen Intel Xeon (Intel AMX) | Speed Increase |
|---|---|---|---|
| Operations per Cycle (Data Type) | 64 (FP32) | 1024 (BF16) | AMX is 16x faster |
| Operations per Cycle (Data Type) | 256 (INT8) | 2048 (INT8) | AMX is 8x faster |
Note: 4th Gen Intel Xeon processors can transition between Intel AMX and Intel AVX-512, selecting the most efficient instruction set based on workload requirements.
AMX Performance
Relative Throughput
The semiconductor industry has consistently doubled computing power roughly every two years.
The following table shows that gains from the AMX architecture outpace the incremental growth in core count across Xeon processor generations. Although the number of cores has only doubled since the 1st Gen Intel Xeon Scalable processor, relative throughput has increased 11-fold.
Performance test parameters include:
- Reference Point: 1st Gen Intel Xeon Scalable processor (Intel DL Boost Instruction Set).
- Deep Learning Model: ResNet-50 v1.5 (Batch Inferencing).
- Framework: TensorFlow.
- Data Type: INT8.
| Processor Generation | Instruction Set Extension | Cores | Relative Throughput |
|---|---|---|---|
| 1st Gen Intel Xeon Scalable CPU | Intel DL Boost | 28 cores | Baseline |
| 2nd Gen Intel Xeon Scalable CPU | Intel DL Boost | 28 cores | 2x |
| 3rd Gen Intel Xeon Scalable CPU | Intel DL Boost | 40 cores | 4x |
| 4th Gen Intel Xeon Scalable CPU | Intel AMX | 56 cores | 11x |
AI Training Performance Boost
This table illustrates the acceleration in PyTorch training performance when using the 4th Gen Intel Xeon Platinum 8480+ processor (Intel AMX BF16) compared to the 3rd Gen Intel Xeon Platinum 8380 processor (FP32).
Performance test details are:
- Reference Point: 3rd Gen Intel Xeon Platinum 8380 processor (FP32).
- Framework Used: PyTorch.
| Task/Model | Category | Performance Increase |
|---|---|---|
| ResNet-50 v1.5 | Image classification | 3x |
| BERT-large | Natural Language Processing (NLP) | 4x |
| DLRM | Recommendation system | 4x |
| Mask R-CNN | Image segmentation | 4.5x |
| SSD-ResNet-34 | Object detection | 5.4x |
| RNN-T | Speech recognition | 10.3x |
Real-Time Inference Performance Boost
The table below illustrates the generation-to-generation performance increase in PyTorch real-time inference when using the 4th Gen Intel Xeon Platinum 8480+ processor (Intel AMX BF16) compared to the 3rd Gen Intel Xeon Platinum 8380 processor (FP32).
Performance test parameters are:
- Reference Point: 3rd Gen Intel Xeon Platinum 8380 processor (FP32).
- Framework: PyTorch.
| Task/Model | Category | Performance Increase |
|---|---|---|
| ResNeXt101 32x16d | Image classification | 5.70x |
| ResNet-50 v1.5 | Image classification | 6.19x |
| BERT-large | Natural Language Processing (NLP) | 6.25x |
| Mask R-CNN | Image segmentation | 6.24x |
| RNN-T | Speech recognition | 8.61x |
| SSD-ResNet-34 | Object detection | 10.01x |
Note: Check out our list of the best AI processors.
Intel AMX on phoenixNAP BMC Platform
Owning and maintaining AI infrastructure is not a viable option for many companies due to the high costs and lack of flexibility.
Transitioning to an AI-oriented environment with OpEx-based access to infrastructure has substantial financial and strategic benefits:
- Costs shift from capital expenditures to ongoing operational expenses.
- Businesses can seamlessly scale services based on their immediate needs.
- Expenses are more predictable, which is significant for cash flow management.
Note: Cloud services and software as a service (SaaS) are prime examples of OpEx-modeled access.
phoenixNAP's Bare Metal Cloud (BMC) is an OpEx-modeled platform that allows quick provisioning and scaling of dedicated servers via API, CLI, or Web UI.
BMC offers pre-configured server instances powered by 4th Gen Intel Xeon Scalable CPUs with built-in AMX accelerators. By combining the capabilities of Intel AMX and the BMC platform, users can:
- Deploy enterprise-ready environments optimized for extracting value out of large datasets in minutes.
- Leverage tools like Terraform and Ansible to automate deployments and scale AI infrastructure as needed.
- Accelerate matrix operations to boost AI application accuracy and speed.
- Reduce time-to-insight with one-click access to CPUs and workload acceleration engines.
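API-driven provisioning can be sketched as follows. The endpoint URL, server type, OS image, and location codes below are assumptions for illustration; consult the phoenixNAP BMC API documentation for the current values. To stay self-contained, the sketch only assembles the request body and notes where the authenticated call would go:

```python
import json

# Hypothetical endpoint and field values: verify against the current
# phoenixNAP BMC API reference before use.
BMC_SERVERS_ENDPOINT = "https://api.phoenixnap.com/bmc/v1/servers"

def build_server_request(hostname: str, server_type: str,
                         os_image: str, location: str) -> dict:
    """Assemble the JSON body for a BMC server provisioning request."""
    return {
        "hostname": hostname,
        "type": server_type,
        "os": os_image,
        "location": location,
    }

payload = build_server_request("ai-node-01", "s2.c1.medium",
                               "ubuntu/jammy", "PHX")
body = json.dumps(payload)
# An actual deployment would POST this body to BMC_SERVERS_ENDPOINT
# with an OAuth2 bearer token, e.g. requests.post(..., json=payload).
```

The same request shape can be templated in Terraform or Ansible to provision and tear down AMX-equipped instances as workloads scale.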
4th Gen Intel Xeon Scalable processors on Bare Metal Cloud deliver immediate value for the following use cases:
| Application Category | Use Cases |
|---|---|
| Artificial Intelligence | Recommendation systems, natural language processing, image recognition, object detection, machine learning applications, video analytics. |
| Data Analytics | Relational database management systems, in-memory databases, big data analytics, data warehousing. |
| Networking | Hardware cryptography, packet processing, content delivery networks, security gateways. |
| Storage Deployment | Distributed and virtual storage. |
| High-Performance Computing (HPC) | Computational fluid dynamics, molecular dynamics, weather simulation, heavy-duty AI training and inference, FinTech, drug discovery. |
| Data Security | Confidential computing, regulatory and compliance workloads, federated learning systems. |
| Ecommerce | Reduced transaction times, peak demand management, UX and behavior analysis, automated customer support. |
Conclusion
AI-driven solutions will become the norm for most end-users. As the cost of performing matrix computations on large datasets continues to rise, companies must explore solutions that will keep them competitive without breaking the bank.
Use the phoenixNAP BMC platform and Intel AMX to deploy and manage a flexible and scalable AI-focused infrastructure. This combination not only supports varied matrix sizes today but is also adaptable to potentially new matrix types down the line.