Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Outlier-Safe Pre-Training for Robust 4-Bit Quantization of Large Language Models

Created by
  • Haebom

Authors

Jungwoo Park, Taewoo Lee, Chanwoong Yoon, Hyeon Hwang, Jaewoo Kang

Outline

This paper proposes Outlier-Safe Pre-Training (OSP), a proactive alternative to post-hoc mitigation for the extreme activation outliers that degrade the quantization performance of large language models (LLMs). OSP combines three key innovations (the Muon optimizer, Single-Scale RMSNorm, and a learnable embedding projection) to prevent outliers from forming during training. A 1.4-billion-parameter model trained on 1 trillion tokens achieves an average score of 35.7 across ten benchmarks under aggressive 4-bit quantization (vs. 26.5 for an Adam-trained model), with only about 2% training overhead. This demonstrates that outliers in LLMs are artifacts of the training strategy, not inherent properties. The source code and pre-trained checkpoints are available on GitHub.
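To see why a single extreme activation can ruin 4-bit quantization, here is a minimal NumPy sketch (illustrative only, not the paper's code). Symmetric per-tensor int4 quantization derives one scale from the largest magnitude, so a single outlier inflates the scale and crushes all other values into a handful of bins:

```python
import numpy as np

def quantize_int4(x):
    # Symmetric per-tensor 4-bit quantization: one scale from the max
    # magnitude, round to integers in [-8, 7], then dequantize.
    scale = np.abs(x).max() / 7.0
    q = np.clip(np.round(x / scale), -8, 7)
    return q * scale

rng = np.random.default_rng(0)
acts = rng.normal(size=1024).astype(np.float32)  # typical activations
outlier_acts = acts.copy()
outlier_acts[0] = 100.0                          # one extreme outlier

err_clean = np.abs(quantize_int4(acts) - acts).mean()
err_outlier = np.abs(quantize_int4(outlier_acts) - outlier_acts).mean()
# The outlier inflates the scale ~30x, so most normal-range values
# round to zero and the mean error grows by several times.
```

Preventing such outliers during pre-training, as OSP does, is what keeps this per-tensor scale small and the quantization error low.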

Takeaways, Limitations

Takeaways:
• Proposes OSP, which effectively prevents the extreme activation outlier problem in LLMs.
• Outperforms existing methods under 4-bit quantization (average score of 35.7 vs. 26.5).
• Low training overhead (about 2%).
• Demonstrates that outliers are a consequence of the training strategy, not an inherent property of LLMs.
• Opens new possibilities for efficient LLM deployment.
• Validated at realistic scale (1.4 billion parameters, 1 trillion tokens).
• Source code and pre-trained checkpoints are publicly available.
Limitations:
• Further study is needed to determine whether OSP generalizes to other model scales or other quantization techniques.
• A more detailed analysis of the interactions among the three proposed components is needed.
• Results are reported on a specific set of benchmarks, so broader generalizability requires further investigation.