The application of deep-learning techniques to problem domains such as natural language processing, image processing, recommendation systems, and other AI/machine-learning workloads continues to proliferate, consuming massive amounts of data in search of (probable) answers. These algorithms are rooted in processing matrices of data drawn from very large datasets.
Classic matrix processing code consists of nested loops that often perform little real work per iteration. A common optimization is to unroll the loop(s), keeping the compute pipeline full and avoiding the stalls or bubbles caused by loop-control decisions. A drawback of this approach is that each combination of matrix size, data type, and operation needs its own dedicated, fine-tuned routine.
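To make this concrete, here is the kind of loop nest being described, as a generic sketch rather than code from any particular library:

/* Classic triple-nested-loop matrix multiply (C = A * B, all n x n).
 * Much of each inner iteration is loop control and index arithmetic
 * rather than useful arithmetic on the data. */
void matmul_naive(int n, const float *a, const float *b, float *c)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}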
The Intel Advanced Matrix Extensions (AMX) instruction set elevates matrix operation performance even further by providing dedicated matrix-processing hardware.
The Intel AMX [1] instruction set extension provides a set of tile registers that hold matrix data while dedicated instructions process it. The matrix dimensions are programmable, yielding a much more compute-efficient mechanism for handling enormous streams of matrix operations.
For a deeper dive into AMX technology itself, please examine the Intel documentation [2,3].
Intel’s Sapphire Rapids is the first processor to feature the AMX extension.
Oracle’s Unbreakable Enterprise Kernel 7 Update 1 includes kernel support for AMX. This covers both userspace use of AMX and virtualization support for it.
The kernel does not require an explicit configuration option for AMX. On AMX-capable CPUs, the kernel detects the feature at run time and enables its use.
One can determine whether the current CPU is AMX-capable via the following:
$ cpuid -1 | grep AMX
   AMX-BF16: tile bfloat16 support = true
   AMX-TILE: tile architecture support = true
   AMX-INT8: tile 8-bit integer support = true
   AMX-FP16: FP16 tile operations = true
Or, alternatively, from the Flags list reported by lscpu:
$ lscpu | grep amx
... amx_bf16 amx_tile amx_int8
When programming for AMX, be aware that the application must first request, and be granted, access to AMX via:
syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA);
This is because the matrix registers add about 8KiB (on Sapphire Rapids) to the kernel-managed state for processes. For performance reasons, it is best not to have to save/restore these registers unless actually needed by the process!
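As an illustrative sketch, that request might be wrapped as follows; the ARCH_REQ_XCOMP_PERM and XFEATURE_XTILEDATA constants are defined locally to match the kernel UAPI values, in case the installed headers do not yet provide them:

#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Values from the kernel UAPI (asm/prctl.h); defined here in case
 * the installed headers predate AMX support. */
#ifndef ARCH_REQ_XCOMP_PERM
#define ARCH_REQ_XCOMP_PERM 0x1023
#endif
#define XFEATURE_XTILEDATA 18

int main(void)
{
    /* Ask the kernel for permission to use the AMX tile data state. */
    if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA)) {
        perror("arch_prctl(ARCH_REQ_XCOMP_PERM)");
        return 1;
    }
    printf("AMX tile data permission granted\n");
    return 0;
}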
Once granted, the configuration and usage of AMX can commence.
AMX enjoys support in the Intel compiler, of course, as well as in GNU GCC 11[4] and LLVM 12[5]. Each compiler also provides AMX intrinsics that make it possible to use AMX without writing all the low-level nuts and bolts yourself.
As a starting point, the kernel selftest for AMX at tools/testing/selftests/x86/amx.c shows how to program AMX directly.
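To give a flavor of the intrinsics, here is a minimal, untuned sketch of a single INT8 tile multiply using the GCC/LLVM intrinsics from immintrin.h (built with, e.g., gcc -O2 -mamx-tile -mamx-int8). The tile_config struct name and the tile shapes are choices made for this example; the 64-byte layout itself follows the LDTILECFG format in the Intel documentation:

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef ARCH_REQ_XCOMP_PERM
#define ARCH_REQ_XCOMP_PERM 0x1023
#endif
#define XFEATURE_XTILEDATA 18

/* 64-byte tile configuration consumed by LDTILECFG. */
struct tile_config {
    uint8_t  palette_id;
    uint8_t  start_row;
    uint8_t  reserved[14];
    uint16_t colsb[16];  /* bytes per row of each tile */
    uint8_t  rows[16];   /* rows in each tile */
};

int main(void)
{
    /* Request AMX access first, as described above. */
    if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA))
        return 1;

    /* tmm0: 16x16 int32 accumulator; tmm1/tmm2: 16x64 int8 inputs. */
    struct tile_config cfg = {0};
    cfg.palette_id = 1;
    cfg.rows[0] = 16; cfg.colsb[0] = 64;
    cfg.rows[1] = 16; cfg.colsb[1] = 64;
    cfg.rows[2] = 16; cfg.colsb[2] = 64;
    _tile_loadconfig(&cfg);

    int8_t  a[16][64], b[16][64];
    int32_t c[16][16];
    memset(a, 1, sizeof(a));
    memset(b, 2, sizeof(b));
    memset(c, 0, sizeof(c));

    _tile_loadd(1, a, 64);   /* A into tmm1 */
    _tile_loadd(2, b, 64);   /* B into tmm2 */
    _tile_loadd(0, c, 64);   /* C into tmm0 */
    _tile_dpbssd(0, 1, 2);   /* tmm0 += int8 dot products of tmm1, tmm2 */
    _tile_stored(0, c, 64);  /* results back to memory */
    _tile_release();         /* return tiles to the init state */

    /* Each output accumulates 64 products of 1*2, so 128. */
    printf("c[0][0] = %d\n", c[0][0]);
    return 0;
}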
AMX operations are now being incorporated into software suites such as oneDNN[6] and libXSMM[7].
Virtualization support for AMX within QEMU is available with Oracle’s QEMU 6.1.1-5. When running on an AMX-capable host, supplying the “-cpu host” option enables AMX features for the guest:
qemu-system-x86_64 -cpu host ...
and then from within the guest, one can check for the existence of AMX features using the cpuid command as outlined above.
The Intel AMX technology is still emerging, so the following are good places to find more information: