Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Improving Quantization with Post-Training Model Expansion

Created by
  • Haebom

Authors

Giuseppe Franco, Pablo Monteagudo-Lago, Ian Colbert, Nicholas Fraser, Michaela Blott

Outline

This paper presents a method for improving the performance of quantized models by increasing their size through post-training optimization. Whereas existing quantization techniques focus on shrinking models, this paper proposes expanding the model to compensate for the performance lost during quantization. Specifically, quantizing the Llama3 1B model to 4 bits while expanding it by 5% yields an average 9% improvement in perplexity over QuaRot and SpinQuant, together with a 3.8% size reduction relative to the BF16 baseline model. These results demonstrate that post-training model expansion is a viable strategy for improving model performance within the quantization co-design space.
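To make the quantization side of this concrete, here is a minimal sketch of generic symmetric round-to-nearest 4-bit weight quantization. This is an illustration only, not the paper's method: the paper builds on rotation-based schemes such as QuaRot and SpinQuant, and the expansion step itself is not shown here.

```python
import numpy as np

def quantize_rtn_4bit(w: np.ndarray):
    """Per-tensor symmetric round-to-nearest 4-bit quantization.

    Generic textbook scheme for illustration; real pipelines typically
    use per-channel or per-group scales and calibration.
    """
    qmax = 7  # symmetric int4 range is [-8, 7]; use 7 for a symmetric scale
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map 4-bit integer codes back to approximate float weights."""
    return q.astype(np.float32) * scale

# The expansion idea: the dequantized weights carry rounding error, and
# adding a small number of extra parameters (here, the paper's 5% growth)
# gives the quantized model capacity to absorb that error.
np.random.seed(0)
w = np.random.randn(8, 8).astype(np.float32)
q, s = quantize_rtn_4bit(w)
rounding_error = np.abs(dequantize(q, s) - w).mean()
```

With a per-tensor scale, the worst-case per-weight error is `scale / 2`; the quantization methods compared in the paper work largely by reshaping the weight distribution so this error hurts less.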

Takeaways, Limitations

Takeaways:
  • We demonstrate that post-training model expansion can effectively mitigate the performance degradation caused by quantization.
  • A novel approach is presented for finding the optimal balance between performance and efficiency by adjusting model size during LLM quantization.
  • It provides an efficient way to improve model performance without requiring full retraining.
Limitations:
  • Results are currently presented only for the Llama3 1B model, so generalizability to other models or quantization bit-widths is limited.
  • There is a lack of specific guidance on how to choose an expansion strategy and how much to expand the model.
  • There is no quantitative analysis of the additional memory and computational costs incurred by model expansion.