Xuan Shen, Peiyan Dong, Zhenglun Kong, Yifan Gong, Changdi Yang, Zhaoyang Han, Yanyue Xie, Lei Lu, Cheng Lyu, Chao Wu, Yanzhi Wang, Pu Zhao
Outline
In this paper, we propose Squat, a quantization-aware training (QAT) framework for efficient small language models (SLMs) on mobile devices. We point out that existing QAT methods target large-scale models on GPUs and are not optimized for the SIMD instructions available on mobile devices. Squat mitigates the attention distortion caused by quantization through entropy-based distillation and distribution-aligned distillation, and uses sub-8-bit token-adaptive quantization that allocates variable bit widths according to token importance. In addition, we develop a SIMD-based multiple-kernel mixed-precision (MKMP) multiplier that supports sub-8-bit mixed-precision MAC operations on mobile devices. Experimental results show that Squat outperforms other QAT methods, achieving up to a 2.37x speedup compared to FP16.
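The paper itself provides no code here; purely to illustrate the token-adaptive idea, the sketch below assigns a higher bit width to the most important tokens and applies simple per-token symmetric fake quantization. The importance estimate (attention mass received per token), the chosen bit widths, and the function names (`assign_bitwidths`, `quantize_per_token`) are assumptions for this example, not Squat's actual formulation.

```python
import torch

def assign_bitwidths(importance, high_bits=8, low_bits=4, top_frac=0.3):
    """Give the most important tokens a higher bit width (illustrative heuristic)."""
    # importance: (seq_len,) non-negative scores, e.g. attention mass each token receives
    k = max(1, int(top_frac * importance.numel()))
    bits = torch.full_like(importance, low_bits, dtype=torch.int64)
    bits[torch.topk(importance, k).indices] = high_bits
    return bits

def quantize_per_token(x, bits):
    """Symmetric uniform fake quantization of activations, one bit width per token."""
    # x: (seq_len, hidden), bits: (seq_len,)
    qmax = (2 ** (bits.float() - 1) - 1).unsqueeze(-1)              # per-token max level
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    return torch.clamp(torch.round(x / scale), -qmax, qmax) * scale

# Example: token importance taken as the attention mass each token receives
seq_len, hidden = 16, 64
x = torch.randn(seq_len, hidden)
attn = torch.softmax(torch.randn(seq_len, seq_len), dim=-1)
importance = attn.sum(dim=0)                                        # (seq_len,)
x_q = quantize_per_token(x, assign_bitwidths(importance))
print(x_q.shape)
```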
Takeaways, Limitations
• Takeaways:
  ◦ A novel QAT framework (Squat) for building efficient small language models on mobile devices
  ◦ Quantization distortion mitigation via entropy-based distillation and distribution-aligned distillation (see the sketch after this list)
  ◦ Efficient quantization through variable bit-width allocation according to token importance
  ◦ Optimization for mobile devices using the SIMD-based MKMP multiplier
  ◦ Up to 2.37x speedup compared to FP16
• Limitations:
  ◦ Squat's performance improvements may be limited to specific datasets and hardware environments.
  ◦ Further research is needed on the generalizability of sub-8-bit quantization.
  ◦ Compatibility verification is required across various mobile devices and operating systems.
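Squat's distillation terms are only described at a high level in this summary. As a rough illustration of aligning a quantized student's attention distributions to a full-precision teacher, the sketch below uses a row-wise KL divergence over attention maps; the loss form and the name `attention_distill_loss` are assumptions for illustration, not the paper's exact entropy-based or distribution-aligned objectives.

```python
import torch

def attention_distill_loss(student_attn, teacher_attn, eps=1e-8):
    """KL(teacher || student) over attention rows of a quantized student.

    student_attn, teacher_attn: (batch, heads, seq, seq) softmax-normalized
    attention maps. A generic attention-alignment loss, not the paper's
    exact formulation.
    """
    s = torch.log(student_attn.clamp(min=eps))
    t = teacher_attn.clamp(min=eps)
    # Sum KL over the key dimension, average over batch, heads and queries.
    return (t * (t.log() - s)).sum(dim=-1).mean()

# Example with random attention maps standing in for teacher/student outputs.
b, h, n = 2, 4, 16
teacher = torch.softmax(torch.randn(b, h, n, n), dim=-1)
student = torch.softmax(torch.randn(b, h, n, n), dim=-1)
print(attention_distill_loss(student, teacher).item())
```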