Daily Arxiv

This page curates AI-related papers published around the world.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

PTQ1.61: Push the Real Limit of Extremely Low-Bit Post-Training Quantization Methods for Large Language Models

Created by
  • Haebom

Authors

Jiaqi Zhao, Miao Zhang, Ming Wang, Yuzhang Shang, Kaihao Zhang, Weili Guan, Yaowei Wang, Min Zhang

Outline

To address the severe performance degradation that large language models (LLMs) suffer under extremely low-bit (below 2-bit) quantization, this paper proposes PTQ1.61, a post-training quantization (PTQ) method that enables weight quantization at an effective 1.61 bits per weight. Whereas existing mixed-precision approaches spend more than one extra bit per weight on their masks, PTQ1.61 introduces a one-dimensional structured mask derived from input activations that adds only a negligible overhead of about 0.0002 bits per weight: salient weight channels are kept at 4 bits, while the remaining channels are binarized within a block-wise scaling-factor optimization framework. The paper further introduces a quantization-preprocessing paradigm that reshapes the pre-trained model's weight distribution before quantization, easing the difficulty of extremely low-bit channel-wise PTQ. Experimental results show that PTQ1.61 achieves state-of-the-art performance in extremely low-bit quantization.
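
To make the mixed-precision idea concrete, here is a minimal PyTorch sketch of channel-wise salient/non-salient weight quantization. It is an illustration under assumptions, not the authors' implementation: the function name, the mean-absolute-activation saliency heuristic, the 5% salient fraction, and the per-row scales are illustrative choices, and the paper's block-wise scaling-factor optimization and quantization preprocessing are not reproduced here.

```python
import torch


def quantize_layer_sketch(weight: torch.Tensor, act: torch.Tensor,
                          salient_frac: float = 0.05, block_size: int = 128):
    """Illustrative channel-wise mixed-precision quantization in the spirit of PTQ1.61.

    `weight` is a [out_features, in_features] linear-layer weight; `act` is a
    [num_tokens, in_features] batch of calibration activations. Salient input
    channels (chosen from activation statistics) keep 4-bit weights; all other
    channels are binarized with a per-block scaling factor.
    """
    _, in_features = weight.shape

    # 1) Rank input channels by mean absolute activation (a common saliency proxy).
    saliency = act.abs().mean(dim=0)                        # [in_features]
    n_salient = max(1, int(salient_frac * in_features))
    mask = torch.zeros(in_features, dtype=torch.bool)
    mask[saliency.topk(n_salient).indices] = True           # 1-D structured channel mask

    w_q = torch.empty_like(weight)

    # 2) Salient channels: symmetric 4-bit quantization, one scale per output row.
    w_sal = weight[:, mask]
    scale4 = w_sal.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / 7.0
    w_q[:, mask] = torch.clamp((w_sal / scale4).round(), -8, 7) * scale4

    # 3) Non-salient channels: sign binarization with per-block scaling factors
    #    along the input dimension (the paper additionally optimizes these scales).
    w_bin = weight[:, ~mask].clone()
    for start in range(0, w_bin.shape[1], block_size):
        blk = w_bin[:, start:start + block_size]
        alpha = blk.abs().mean(dim=1, keepdim=True)         # per-row scale for this block
        w_bin[:, start:start + block_size] = alpha * blk.sign()
    w_q[:, ~mask] = w_bin

    return w_q, mask
```

The effective bit-width of such a scheme is a weighted average of the 4-bit and 1-bit channels plus the mask and scale overhead, which is how a fractional figure like 1.61 bits per weight arises.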

Takeaways, Limitations

Takeaways:
  • Demonstrates the potential to drastically reduce the memory footprint and compute cost of LLMs through 1.61-bit quantization (a rough memory estimate follows after this list).
  • Presents a new extremely low-bit PTQ method that overcomes the limitations of conventional mixed-precision approaches.
  • Introduces quantization preprocessing as a new paradigm for tackling the difficulty of extremely low-bit quantization.
  • Experimental results confirm the strong performance of PTQ1.61.
Limitations:
  • Further research is needed to determine whether the proposed method delivers the same performance across all types of LLMs.
  • Practical deployment and hardware support for 1.61-bit quantization remain open considerations.
  • The generalizability of the proposed quantization-preprocessing step requires further study.
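
As a rough illustration of the memory takeaway above, under stated assumptions (a hypothetical 7B-parameter model, counting weights only and ignoring the small mask and scaling-factor overhead):

```python
# Weight-only memory for a hypothetical 7B-parameter model (illustrative numbers).
params = 7e9
fp16_gb = params * 16 / 8 / 1e9      # ~14.0 GB at 16 bits per weight
ptq_gb = params * 1.61 / 8 / 1e9     # ~1.4 GB at an effective 1.61 bits per weight
print(f"FP16: {fp16_gb:.1f} GB, 1.61-bit: {ptq_gb:.1f} GB, ~{fp16_gb / ptq_gb:.0f}x smaller")
```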