Optimization of a GEMM Implementation using Intel AMX

Jan 25, 2026 by Yusuke Endo, Satoshi Ohshima, T. Nanri (SCA/HPC Asia)

We implemented a BFloat16 GEMM on Intel AMX and tuned it with blocking and tile-register-level optimizations to get measurable gains from the accelerator. The result is a clean, practical AMX code path that outperforms the BFloat16 GEMM in MKL and OpenBLAS by about 7–20%, with code and a tuning strategy that are easy to follow and reuse.
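To make the tile-level structure concrete, here is a minimal, portable sketch of a tile-blocked BFloat16 GEMM. It is not the authors' implementation: it emulates in plain C the arithmetic of one AMX BF16 tile dot-product (a 16x16 FP32 accumulator tile updated from a 16x32 BF16 tile of A and a pair-interleaved tile of B), so it runs on any machine. The names `pack_b_vnni`, `tile_dp_bf16`, and `gemm_bf16_blocked` are hypothetical, the dimensions are assumed to be multiples of the tile sizes, and hardware rounding and denormal handling are not modeled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Largest AMX tile is 16 rows x 64 bytes: 16x32 BF16 inputs, 16x16 FP32 output. */
enum { TILE_M = 16, TILE_N = 16, TILE_K = 32 };

typedef uint16_t bf16_t; /* raw BF16 bits: the upper half of an IEEE-754 float */

/* Convert float to BF16 with round-to-nearest-even (NaN handling omitted). */
static bf16_t f32_to_bf16(float x) {
    uint32_t u;
    memcpy(&u, &x, sizeof u);
    u += 0x7FFFu + ((u >> 16) & 1u);
    return (bf16_t)(u >> 16);
}

/* Widen BF16 to float; this direction is exact. */
static float bf16_to_f32(bf16_t h) {
    uint32_t u = (uint32_t)h << 16;
    float x;
    memcpy(&x, &u, sizeof x);
    return x;
}

/* Repack row-major B (K x N) so that two consecutive k values for the same
 * column sit side by side: Bp has K/2 rows of 2*N elements. */
static void pack_b_vnni(const bf16_t *B, bf16_t *Bp, int K, int N) {
    for (int k = 0; k < K; ++k)
        for (int n = 0; n < N; ++n)
            Bp[(k / 2) * (2 * N) + 2 * n + (k & 1)] = B[k * N + n];
}

/* Emulate one tile dot-product: C(16x16) += A(16x32) * B(32x16),
 * with B supplied in the pair-interleaved layout produced above. */
static void tile_dp_bf16(float *C, int ldc, const bf16_t *A, int lda,
                         const bf16_t *Bp, int ldb) {
    for (int m = 0; m < TILE_M; ++m)
        for (int n = 0; n < TILE_N; ++n) {
            float acc = C[m * ldc + n];
            for (int kp = 0; kp < TILE_K / 2; ++kp) {
                acc += bf16_to_f32(A[m * lda + 2 * kp]) *
                       bf16_to_f32(Bp[kp * ldb + 2 * n]);
                acc += bf16_to_f32(A[m * lda + 2 * kp + 1]) *
                       bf16_to_f32(Bp[kp * ldb + 2 * n + 1]);
            }
            C[m * ldc + n] = acc;
        }
}

/* Blocked GEMM: C (M x N, FP32) += A (M x K, BF16) * B (K x N, packed BF16).
 * Each (i0, j0) pair owns one accumulator tile, updated across all k0 steps. */
static void gemm_bf16_blocked(const bf16_t *A, const bf16_t *Bp, float *C,
                              int M, int N, int K) {
    assert(M % TILE_M == 0 && N % TILE_N == 0 && K % TILE_K == 0);
    for (int i0 = 0; i0 < M; i0 += TILE_M)
        for (int j0 = 0; j0 < N; j0 += TILE_N)
            for (int k0 = 0; k0 < K; k0 += TILE_K)
                tile_dp_bf16(&C[i0 * N + j0], N, &A[i0 * K + k0], K,
                             &Bp[(k0 / 2) * (2 * N) + 2 * j0], 2 * N);
}
```

On AMX hardware, the body of `tile_dp_bf16` would correspond to tile loads, a `_tile_dpbf16ps` call, and a tile store from `<immintrin.h>`. Keeping the accumulator tile fixed across the `k0` loop, as the loop order here does, is one common way to cut tile load and store traffic; whether it matches the blocking used in the paper is not stated in this summary.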

dgfl, 2026