Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
The copyright of each paper belongs to its authors and their institutions; when sharing, simply cite the source.

Why Do Some Inputs Break Low-Bit LLM Quantization?

Created by
  • Haebom

Author

Ting-Yun Chang, Muru Zhang, Jesse Thomason, Robin Jia

Outline

This paper analyzes why low-bit weight-only quantization, which significantly reduces the memory footprint of large language models (LLMs), disproportionately degrades certain examples. We apply diverse 3- and 4-bit quantization methods to LLMs ranging from 7B to 70B parameters and find that the per-example quantization errors of 50 pairs of methods are strongly correlated (average 0.82) on FineWeb examples. Furthermore, we show that the residual stream magnitudes of the full-precision model are indicative of future quantization error, and we propose a hypothesis relating residual stream magnitude to error amplification and accumulation across layers. Using LLM localization techniques, early exiting, and activation patching, we show that examples with large errors rely on precise residual activations in the late layers, and that the outputs of MLP gates play a crucial role in maintaining perplexity. In conclusion, this study identifies why certain examples incur large quantization errors and which model components matter most for preserving performance.
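For concreteness, below is a minimal sketch of how per-example quantization error can be measured and compared across methods. The model name, the bitsandbytes NF4 configuration, and the example texts are illustrative placeholders, not the authors' exact pipeline.

```python
# Sketch: per-example quantization error as the loss gap between a 4-bit and a
# full-precision copy of the same model. Model name, NF4 config, and texts are
# placeholders standing in for the paper's 7B-70B models and FineWeb documents.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

name = "meta-llama/Llama-2-7b-hf"  # any 7B-70B checkpoint
tok = AutoTokenizer.from_pretrained(name)

fp_model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.float16, device_map="auto"
)
q_model = AutoModelForCausalLM.from_pretrained(
    name,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4"),
    device_map="auto",
)

@torch.no_grad()
def per_example_loss(model, texts):
    """Mean next-token cross-entropy (in nats) for each text."""
    losses = []
    for t in texts:
        ids = tok(t, return_tensors="pt", truncation=True, max_length=1024).input_ids
        out = model(ids.to(model.device), labels=ids.to(model.device))
        losses.append(out.loss.item())
    return torch.tensor(losses)

# In the paper these would be FineWeb documents; two toy strings keep the sketch runnable.
texts = ["Low-bit quantization shrinks model memory.", "Some inputs break quantized models."]
err_nf4 = per_example_loss(q_model, texts) - per_example_loss(fp_model, texts)
# Repeating this for another 3- or 4-bit method (e.g., a GPTQ checkpoint) and
# correlating the two per-example error vectors is the kind of measurement behind
# the ~0.82 average correlation reported across 50 method pairs, e.g.:
#   scipy.stats.pearsonr(err_nf4.numpy(), err_gptq.numpy())
```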

Takeaways, Limitations

Takeaways:
• Errors arising under low-bit quantization are predictable across methods, and the paper identifies why they occur on certain examples.
• Understanding which layers and components of an LLM matter most can contribute to developing efficient quantization strategies.
• Residual stream magnitude could be used to predict, and potentially mitigate, quantization error (see the sketch after this list).
• Techniques such as LLM localization, early exiting, and activation patching offer ways to diagnose and mitigate quantization errors.
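As a companion to the takeaway on residual stream magnitude, the sketch below reduces the full-precision model's hidden states (the residual stream after each layer) to a single per-example statistic. The late-layer cutoff and the max-norm aggregation are illustrative assumptions, not necessarily the statistic the authors use.

```python
# Sketch: a residual-stream-magnitude statistic from the full-precision model,
# intended as a cheap predictor of per-example quantization error.
import torch

@torch.no_grad()
def residual_magnitude(model, tok, text, late_frac=0.75):
    ids = tok(text, return_tensors="pt", truncation=True, max_length=1024).input_ids
    # hidden_states is a tuple of (num_layers + 1) tensors of shape [1, seq_len, dim]:
    # the residual stream after the embedding and after every transformer layer.
    hs = model(ids.to(model.device), output_hidden_states=True).hidden_states
    start = int(len(hs) * late_frac)
    late = torch.stack([h[0].float().cpu() for h in hs[start:]])  # [n_late, seq, dim]
    return late.norm(dim=-1).max().item()  # largest token-wise L2 norm in the late layers

# A positive rank correlation between this statistic (computed with fp_model from
# the earlier sketch) and the per-example errors err_nf4 would reflect the
# predictive relationship the paper reports.
```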
Limitations:
• Generalizability beyond the dataset used in the analysis (FineWeb) requires further validation.
• The theoretical basis for the proposed hypothesis (the relationship between residual stream magnitude and error amplification/accumulation) needs to be strengthened.
• Further experiments with other LLM architectures and quantization methods are needed.
• Further experimental verification is needed to determine the actual performance gains of the proposed error-mitigation techniques.