Daily Arxiv

This page curates papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, simply cite the source.

ParetoQ: Improving Scaling Laws in Extremely Low-bit LLM Quantization

Created by
  • Haebom

Authors

Zechun Liu, Changsheng Zhao, Hanxian Huang, Sijia Chen, Jing Zhang, Jiawei Zhao, Scott Roy, Lisa Jin, Yunyang Xiong, Yangyang Shi, Lin Xiao, Yuandong Tian, Bilge Soran, Raghuraman Krishnamoorthi, Tijmen Blankevoort, Vikas Chandra

Outline

This paper investigates which bit width yields the best trade-off between quantized model size and accuracy. It presents ParetoQ, a unified framework that enables a comprehensive, like-for-like comparison of 1-bit, 1.58-bit, 2-bit, 3-bit, and 4-bit quantization settings. The study identifies a learning transition between 2-bit and 3-bit quantization, and ParetoQ outperforms all previous methods tuned for specific bit widths. The ParetoQ ternary 600M-parameter model outperforms the previous SoTA ternary 3B-parameter model, and ternary, 2-bit, and 3-bit quantization exhibit comparable performance on the size-accuracy trade-off, highlighting the potential of 2-bit quantization to reduce memory and improve speed.
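To make the bit-width settings concrete, below is a minimal, illustrative Python/PyTorch sketch of round-to-nearest weight quantizers for the ternary (~1.58-bit) and general low-bit cases. The function names and the per-row scaling choices are assumptions for illustration only, not ParetoQ's actual quantization functions, which the paper tailors to each bit width.

import torch

def ternary_quantize(w: torch.Tensor) -> torch.Tensor:
    # Illustrative ternary (~1.58-bit) quantizer: scale each weight row by its
    # mean absolute value, then snap every entry to {-1, 0, +1}.
    scale = w.abs().mean(dim=-1, keepdim=True).clamp(min=1e-8)
    q = torch.clamp(torch.round(w / scale), -1, 1)
    return q * scale  # dequantized weights used in the forward pass

def lowbit_quantize(w: torch.Tensor, bits: int) -> torch.Tensor:
    # Generic symmetric quantizer for the 2-, 3-, and 4-bit settings:
    # map weights onto the signed integer grid [-2^(bits-1), 2^(bits-1) - 1].
    qmax = 2 ** (bits - 1) - 1
    scale = (w.abs().amax(dim=-1, keepdim=True) / qmax).clamp(min=1e-8)
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale

In quantization-aware training setups such as the one the paper studies, quantizers like these are typically applied in the forward pass with a straight-through estimator for the backward pass; the sketch above omits that detail.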

Takeaways, Limitations

Takeaways:
ParetoQ provides a unified framework for comprehensively comparing 1-bit, 1.58-bit, 2-bit, 3-bit, and 4-bit quantization.
A learning transition between 2-bit and 3-bit quantization was identified.
The ParetoQ ternary 600M-parameter model outperforms the previous SoTA ternary 3B-parameter model.
Ternary, 2-bit, and 3-bit quantization show comparable performance on the size-accuracy trade-off.
2-bit quantization has the potential to reduce memory and improve speed (a rough size comparison follows after this section).
Limitations:
The paper does not explicitly discuss its Limitations.
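As a rough back-of-the-envelope illustration of the memory point noted in the Takeaways above (the parameter counts come from the summary; the 16-bit baseline and decimal gigabytes are assumptions for illustration, not figures from the paper):

def model_size_gb(num_params: float, bits_per_weight: float) -> float:
    # Weight-only storage: params * bits / 8 bytes, reported in decimal GB.
    return num_params * bits_per_weight / 8 / 1e9

print(model_size_gb(3e9, 16.0))    # ~6.0 GB for a 3B-parameter FP16 model
print(model_size_gb(600e6, 1.58))  # ~0.12 GB for a 600M-parameter ternary model
print(model_size_gb(3e9, 2.0))     # ~0.75 GB for the same 3B model at 2 bits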