LiLo: Harnessing the on-Chip Accelerators in Intel CPUs for Compressed LLM Inference Acceleration
Jan 31, 2026 by Hyungyo Kim, Qirong Xia, Jinghan Huang, Nachuan Wang, Younjoo Lee, Jung Ho Ahn, W. Feghali, Ren Wang, Nam Sung Kim (International Symposium on High-Performance Computer Architecture)
DOI 10.1109/HPCA68181.2026.11408577
We built LiLo to run huge compressed LLMs on Intel CPUs by harnessing their on-chip accelerators: IAA decompresses parameters on demand as they stream in from storage, while LiLo orchestrates IAA, AVX, and AMX to keep throughput high. An MoE-aware selective-compression scheme further cuts decompression overhead, so you get up to ~5× latency wins for 400B-scale models on CPU-only servers.
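The core trick of overlapping on-demand decompression with compute can be sketched in a few lines. This is a hedged toy model, not the paper's implementation: `zlib` stands in for the IAA hardware decompressor, a byte-sum stands in for AMX matrix math, and all names (`run_pipelined`, `compute`, etc.) are illustrative.

```python
# Toy sketch of LiLo-style pipelining: decompress layer i+1's weights
# in the background while "computing" layer i. zlib substitutes for
# IAA; the byte-sum substitutes for AMX GEMMs. Names are hypothetical.
import zlib
from concurrent.futures import ThreadPoolExecutor

def make_layers(n, size=1 << 16):
    # Reproducible compressed "weights" for n layers.
    return [zlib.compress(bytes((i * j) % 251 for j in range(size)))
            for i in range(n)]

def compute(weights: bytes) -> int:
    # Stand-in for a matmul over the decompressed weights.
    return sum(weights) % 1000

def run_pipelined(layers):
    # Prefetch-and-overlap loop: one decompression always in flight.
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        nxt = pool.submit(zlib.decompress, layers[0])
        for i in range(len(layers)):
            weights = nxt.result()            # wait for decompressed weights
            if i + 1 < len(layers):           # kick off next layer's decompress
                nxt = pool.submit(zlib.decompress, layers[i + 1])
            results.append(compute(weights))  # overlaps with that decompress
    return results

print(run_pipelined(make_layers(4)))
```

In LiLo the same overlap happens in hardware, with IAA decompressing the next chunk of parameters while AMX/AVX units chew on the current one.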