Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

FP4 All the Way: Fully Quantized Training of LLMs

Created by
  • Haebom

Authors

Brian Chmiel, Maxim Fishman, Ron Banner, Daniel Soudry

Outline

This paper demonstrates, for the first time, fully quantized training (FQT) of large language models (LLMs) using predominantly 4-bit floating-point (FP4) precision for all weights, activations, and gradients. Using datasets of up to 200 billion tokens, we extensively explore the key design choices for FP4, including block size, scaling format, and rounding method. Our analysis reveals that the NVFP4 format, in which blocks of 16 FP4 (E2M1) values share a scale represented in E4M3, yields the best results. Stability is further improved by employing stochastic rounding in the backward and weight-update passes and round-to-nearest in the forward pass. We also identify a theoretical and empirical threshold for effective quantized training: when the gradient norm falls below approximately $\sqrt{3}$ times the quantization noise, quantized training becomes less effective. Leveraging these insights, we successfully train a 7-billion-parameter model on 256 Intel Gaudi2 accelerators. Models trained with FP4 achieve downstream task performance comparable to the standard BF16 baseline, demonstrating that FP4 training is a practical and highly efficient approach for large-scale LLM training. A reference implementation is available at https://github.com/Anonymous1252022/fp4-all-the-way .
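To make the quantization scheme concrete, here is a minimal NumPy sketch of NVFP4-style block quantization as described above: blocks of 16 values share one scale (E4M3 in the actual format, approximated here by a plain float), and each element is rounded onto the FP4 (E2M1) grid either stochastically or to the nearest value. The function names and details below are illustrative assumptions, not the paper's reference implementation.

```python
# Illustrative sketch of NVFP4-style block quantization (not the paper's code).
# Blocks of 16 values share one scale; each element is cast to the FP4 (E2M1) grid.
import numpy as np

# Representable magnitudes of FP4 E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block_fp4(block, stochastic=False, rng=None):
    """Quantize a block of 16 values to FP4 with a shared scale (hypothetical helper)."""
    scale = np.abs(block).max() / FP4_GRID[-1] + 1e-12  # shared scale (E4M3 in NVFP4)
    x = np.abs(block) / scale
    signs = np.sign(block)
    # Locate the two neighbouring FP4 grid points for each element.
    hi_idx = np.searchsorted(FP4_GRID, x).clip(1, len(FP4_GRID) - 1)
    lo, hi = FP4_GRID[hi_idx - 1], FP4_GRID[hi_idx]
    if stochastic:
        # Stochastic rounding: round up with probability proportional to the distance.
        p_up = (x - lo) / (hi - lo)
        rng = rng or np.random.default_rng()
        q = np.where(rng.random(x.shape) < p_up, hi, lo)
    else:
        # Round to nearest grid point.
        q = np.where(x - lo < hi - x, lo, hi)
    return signs * q * scale  # dequantized values for inspection

# Example: quantize a tensor block-wise with block size 16.
w = np.random.randn(64).astype(np.float32)
w_q = np.concatenate([quantize_block_fp4(b) for b in w.reshape(-1, 16)])
```

In the training recipe described above, the forward pass would use the round-to-nearest branch and the backward/update passes the stochastic branch.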

Takeaways, Limitations

Takeaways:
Demonstrates for the first time the feasibility of fully quantized training using 4-bit floating point (FP4) for large-scale language models.
Achieves efficient and stable FP4-based LLM training by combining the NVFP4 format, stochastic rounding (backward and update passes), and round-to-nearest (forward pass).
Presents a theoretical and empirical threshold for when quantized training remains effective (see the sketch after this list).
Shows the practicality of FP4-based training by achieving downstream performance comparable to the BF16 baseline.
Ensures reproducibility through a public reference implementation.
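As an illustration of the threshold criterion, the following hedged sketch compares the gradient norm against $\sqrt{3}$ times the norm of the quantization noise (taken here as the difference between the quantized and full-precision gradients); the function name and the way the noise is measured are assumptions for illustration, not the paper's procedure.

```python
# Hedged sketch of the sqrt(3) threshold from the summary: quantized training is
# expected to become less effective once the gradient norm drops below roughly
# sqrt(3) times the quantization noise. Names and the noise estimate are illustrative.
import numpy as np

def below_fp4_threshold(grad_fp32, grad_fp4):
    """Return True when the gradient norm is below sqrt(3) x the quantization-noise norm."""
    noise_norm = np.linalg.norm(grad_fp4 - grad_fp32)  # quantization-noise estimate
    return np.linalg.norm(grad_fp32) < np.sqrt(3) * noise_norm
```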
Limitations:
The threshold presented in the paper ($\sqrt{3}$ times the quantization noise) may be specific to the experimental setting; further study may be needed for other models or datasets.
The experimental results were obtained on 256 Intel Gaudi2 accelerators and are hardware-dependent; performance on other hardware may vary.
Further research is needed to determine how well the released implementation generalizes to other model architectures.
The experiments used datasets of up to 200 billion tokens; scalability to larger datasets remains to be verified.