Optimization of a GEMM Implementation using Intel AMX
Jan 25, 2026 by Yusuke Endo, Satoshi Ohshima, T. Nanri (SCA/HPC Asia)
DOI 10.1145/3773656.3773660
We implemented a BFloat16 GEMM on Intel AMX and tuned it with blocking and tile-register-level optimizations to get real-world speedups out of the accelerator. The result is a clean, practical AMX code path that outperforms the BFloat16 GEMM in MKL and OpenBLAS by about 7–20%, with code intuition and a tuning strategy that stay easy to share.
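To give a feel for what "blocking" means here, below is a minimal pure-Python sketch of a tile-blocked BFloat16 GEMM. It is not the kernel from the paper and it does not touch AMX hardware. The tile sizes are assumptions chosen to mirror the shape of an AMX BF16 tile multiply (16x32 BF16 inputs feeding a 16x16 FP32 accumulator), the names `to_bf16` and `gemm_blocked` are hypothetical, and accumulation uses Python floats (double precision) rather than FP32.

```python
import struct

# Assumed tile shape, mirroring an AMX BF16 tile multiply:
# 16x32 BF16 input tiles accumulate into a 16x16 tile.
TILE_M, TILE_N, TILE_K = 16, 16, 32


def to_bf16(x: float) -> float:
    """Round a finite float to the nearest BF16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round on the 16 bits being dropped
    bits &= 0xFFFF0000                   # keep sign, exponent, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def gemm_blocked(A, B, M, N, K):
    """Return C = A @ B (lists of lists), computed one output tile at a time."""
    C = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, TILE_M):
        i1 = min(i0 + TILE_M, M)
        for j0 in range(0, N, TILE_N):
            j1 = min(j0 + TILE_N, N)
            # The accumulator tile stays live across the whole K loop,
            # which is the role an accumulator tile register would play.
            acc = [[0.0] * (j1 - j0) for _ in range(i1 - i0)]
            for k0 in range(0, K, TILE_K):
                k1 = min(k0 + TILE_K, K)
                for i in range(i0, i1):
                    for j in range(j0, j1):
                        s = acc[i - i0][j - j0]
                        for k in range(k0, k1):
                            s += to_bf16(A[i][k]) * to_bf16(B[k][j])
                        acc[i - i0][j - j0] = s
            # Write the finished tile back to C once.
            for i in range(i0, i1):
                C[i][j0:j1] = acc[i - i0]
    return C
```

The point of the loop order is that each output tile is loaded and stored once while the K dimension streams through it. A real AMX kernel would replace the three innermost loops with a tile multiply instruction and would add further levels of blocking for the cache hierarchy.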
source S2, crossref