Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

ICQuant: Index Coding enables Low-bit LLM Quantization

Created by
  • Haebom

Author

Xinlin Li, Osama Hanna, Christina Fragouli, Suhas Diggavi

Outline

This paper presents ICQuant, an efficient low-bit post-training quantization (PTQ) technique that addresses the high memory overhead of large language models (LLMs). Existing outlier-suppression techniques either fail to effectively reduce the quantization range or incur a large bit overhead; ICQuant instead adopts an efficient index coding scheme that exploits outlier statistics, shrinking the quantization range with a much smaller bit overhead (approximately 0.3 bits per weight), and it can be applied on top of existing quantization techniques to further improve them. Experimental results show that, using only 2.3 bits per weight, ICQuant improves the zero-shot accuracy of the Llama3-70B model by 130% to 150% compared to existing techniques (QTIP, QuIP#), achieving performance comparable to the best-performing fine-tuned quantization technique (PV-tuning) without any fine-tuning.
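To make the range-splitting idea concrete, below is a minimal Python sketch of quantizing outliers and inliers on separate low-bit grids while recording outlier positions on the side. The function name split_range_quantize, the 4% outlier fraction, and the raw index storage are illustrative assumptions; ICQuant's actual contribution is an efficient index coding of those positions, which is what brings the overhead down to roughly 0.3 bits per weight and is not reproduced here.

```python
import numpy as np

# Toy illustration (not the paper's actual codec): split a weight vector into
# "outlier" and "inlier" ranges, quantize each range on its own low-bit grid,
# and record the outlier positions separately. ICQuant encodes those positions
# with an efficient index coding scheme; here we just store raw indices.

def split_range_quantize(w, bits=2, outlier_frac=0.04):
    """Quantize w, giving the largest-|w| fraction of entries their own grid."""
    k = max(1, int(len(w) * outlier_frac))
    outlier_idx = np.argsort(np.abs(w))[-k:]        # positions of the outliers
    is_outlier = np.zeros(len(w), dtype=bool)
    is_outlier[outlier_idx] = True

    levels = 2 ** bits
    deq = np.empty_like(w)
    for mask in (is_outlier, ~is_outlier):
        group = w[mask]
        lo, hi = group.min(), group.max()
        scale = (hi - lo) / (levels - 1) if hi > lo else 1.0
        q = np.round((group - lo) / scale)           # uniform quantization within this range
        deq[mask] = q * scale + lo
    return deq, np.sort(outlier_idx)

rng = np.random.default_rng(0)
w = rng.standard_normal(4096)
w[rng.choice(4096, 64, replace=False)] *= 8.0        # inject heavy-tailed outliers
w_hat, idx = split_range_quantize(w, bits=2)
print("2-bit RMSE with range splitting:", np.sqrt(np.mean((w - w_hat) ** 2)))
```

The point of the sketch is only that isolating a few outliers lets the inlier grid cover a much narrower range; the cost of the scheme is then dominated by how cheaply the outlier indices can be stored, which is the problem ICQuant's index coding targets.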

Takeaways, Limitations

Takeaways:
ICQuant is a novel framework for efficient low-bit post-training quantization.
It reduces the quantization range with lower bit overhead (approximately 0.3 bits per weight) than existing techniques.
It is compatible with existing quantization techniques and can be applied on top of them to improve performance.
It achieves strong performance without any fine-tuning.
It remains effective even in extreme compression regimes (2-3 bits per weight).
Limitations:
The method currently applies only to weights; further research on activation quantization is needed.
The reported experiments are limited to a specific model (Llama3-70B) and a narrow set of settings; further validation of generalization across a wider range of models and settings is needed.
Further analysis of the complexity and computational cost of the index coding scheme is needed.