APTQ: Attention-aware Post-Training Mixed-Precision Quantization for Large Language Models

Large Language Models (LLMs) have greatly advanced the natural language processing paradigm. However, their high computational cost and large model size pose a significant challenge for deployment on edge devices. To this end, we propose APTQ (Attention-aware Post-Training Mixed-Precision Quantization) for LLMs, which considers not only the second-order information of each layer's weights but also, for the first time, the nonlinear effect of attention outputs on the entire model.
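To make the idea of second-order, sensitivity-driven mixed precision concrete, the sketch below shows one illustrative way a diagonal-Hessian proxy could be used to score layers and assign bit widths. This is a minimal assumption-based example, not the APTQ implementation: the layer names, the random stand-in Hessian diagonal, and the 2/4-bit split are hypothetical, and the attention-aware term (the nonlinear effect of attention outputs) described in the abstract is not modeled here.

```python
# Minimal sketch (not the authors' implementation): a second-order
# sensitivity score per layer drives mixed-precision bit allocation.
# Layer names, the diagonal-Hessian stand-in, and the bit budget are
# illustrative assumptions, not values from the APTQ paper.
import numpy as np

def quantize(w, bits):
    """Uniform symmetric quantization of a weight matrix to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale) * scale

def sensitivity(w, hess_diag, bits):
    """Second-order proxy: 0.5 * sum(H_ii * (w - w_q)^2)."""
    dw = w - quantize(w, bits)
    return 0.5 * float(np.sum(hess_diag * dw ** 2))

rng = np.random.default_rng(0)
layers = {name: rng.normal(size=(64, 64)) for name in ["attn.q", "attn.k", "mlp.up"]}
# Stand-in for a per-weight Hessian diagonal estimated on calibration data.
hess = {name: rng.uniform(0.1, 1.0, size=(64, 64)) for name in layers}

# Rank layers by quantization sensitivity and give more bits to the most sensitive ones.
scores = {name: sensitivity(w, hess[name], bits=2) for name, w in layers.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
bit_plan = {name: (4 if i < len(ranked) // 2 + 1 else 2) for i, name in enumerate(ranked)}
print(bit_plan)
```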

Data and Resources

Cite this as

Ziyi Guan, Hantao Huang, Yupeng Su, Hong Huang, Ngai Wong, Hao Yu (2024). Dataset: APTQ: Attention-aware Post-Training Mixed-Precision Quantization for Large Language Models. https://doi.org/10.57702/i1dmoyg7

DOI retrieved: December 16, 2024

Additional Info

Field          Value
Created        December 16, 2024
Last update    December 16, 2024
Defined in     https://doi.org/10.1145/3649329.3658498
Authors        Ziyi Guan, Hantao Huang, Yupeng Su, Hong Huang, Ngai Wong, Hao Yu