APTQ: Attention-aware Post-Training Mixed-Precision Quantization for Large Language Models

doi:10.57702/i1dmoyg7

Large Language Models (LLMs) have greatly advanced the natural language processing paradigm. However, the high computational load and huge model sizes pose a grand challenge for deployment on edge devices. To this end, we propose APTQ (Attention-aware Post-Training Mixed-Precision Quantization) for LLMs, which considers not only the second-order information of each layer’s weights, but also, for the first time, the nonlinear effect of attention outputs on the entire model.
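The idea of using second-order information to decide per-layer precision can be illustrated with a small sketch. This is NOT the APTQ algorithm. It omits the attention-aware term that the abstract names as APTQ's contribution. It swaps in a generic HAWQ-style proxy instead: a layer's sensitivity is its mean Hessian diagonal multiplied by the squared weight-quantization error, and the least sensitive layers are greedily demoted from 4 to 2 bits until a parameter-weighted average bit budget is met. The round-to-nearest quantizer, the names `quantize_rtn`, `sensitivity` and `allocate_bits`, and the 4/2-bit choices with a 3-bit budget are all hypothetical assumptions made for illustration.

```python
import numpy as np


def quantize_rtn(w, bits):
    """Symmetric per-tensor round-to-nearest quantization (assumed, illustrative)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    if scale == 0:
        return w.copy()
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale


def sensitivity(w, x, bits):
    """HAWQ-style proxy (not APTQ): mean Hessian diagonal times squared error.

    For y = x @ w.T with a squared-error reconstruction loss, the Hessian with
    respect to each output row of w is H = 2 * x.T @ x, so tr(H) = 2 * sum(x**2).
    """
    trace_h = 2.0 * np.sum(x ** 2)
    mean_curvature = trace_h / w.shape[1]
    err = quantize_rtn(w, bits) - w
    return 0.5 * mean_curvature * np.sum(err ** 2)


def allocate_bits(layers, high=4, low=2, budget=3.0):
    """Greedily demote the least sensitive layers from `high` to `low` bits
    until the parameter-weighted average bit-width is at most `budget`.

    `layers` maps a layer name to (weight matrix, calibration inputs).
    """
    bits = {name: high for name in layers}
    sizes = {name: w.size for name, (w, _) in layers.items()}
    total = sum(sizes.values())

    def avg_bits():
        return sum(bits[n] * sizes[n] for n in layers) / total

    # Extra estimated loss per demoted parameter; cheapest demotions go first.
    cost = {
        name: (sensitivity(w, x, low) - sensitivity(w, x, high)) / w.size
        for name, (w, x) in layers.items()
    }
    for name in sorted(cost, key=cost.get):
        if avg_bits() <= budget:
            break
        bits[name] = low
    return bits


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((16, 32))
    # Same weights, different input scales: larger inputs mean larger curvature.
    layers = {
        f"layer{i}": (w, s * rng.standard_normal((64, 32)))
        for i, s in enumerate([0.1, 1.0, 2.0, 4.0])
    }
    print(allocate_bits(layers))
```

With identical weights, the layers that see smaller activations have lower estimated curvature, so they are the first to be pushed to 2 bits. A method such as APTQ would replace this per-layer proxy with a sensitivity that also reflects how attention outputs affect the whole model.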

BibTeX:

@dataset{Ziyi_Guan_and_Hantao_Huang_and_Yupeng_Su_and_Hong_Huang_and_Ngai_Wong_and_Hao_Yu_2024,
    abstract = {Large Language Models (LLMs) have greatly advanced the natural language processing paradigm. However, the high computational load and huge model sizes pose a grand challenge for deployment on edge devices. To this end, we propose APTQ (Attention-aware Post-Training Mixed-Precision Quantization) for LLMs, which considers not only the second-order information of each layer’s weights, but also, for the first time, the nonlinear effect of attention outputs on the entire model.},
    author = {Ziyi Guan and Hantao Huang and Yupeng Su and Hong Huang and Ngai Wong and Hao Yu},
    doi = {10.57702/i1dmoyg7},
    keywords = {Attention-aware, Large Language Models, Post-Training Mixed-Precision Quantization},
    month = {dec},
    publisher = {TIB},
    title = {APTQ: Attention-aware Post-Training Mixed-Precision Quantization for Large Language Models},
    url = {https://service.tib.eu/ldmservice/dataset/aptq--attention-aware-post-training-mixed-precision-quantization-for-large-language-models},
    year = {2024}
}