Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Oscillation-Reduced MXFP4 Training for Vision Transformers

Created by
  • Haebom

Authors

Yuxiang Chen, Haocheng Xi, Jun Zhu, Jianfei Chen

Outline

This paper targets the speedup potential of pre-training Transformers in FP4 precision and proposes a novel training method, TetraJet, to address the accompanying accuracy degradation. The authors identify weight oscillation, where a weight near a rounding boundary flips between adjacent quantized values across training steps, as the main cause of accuracy loss when training in the conventional MXFP4 data format, and introduce two new methods to address it: the EMA Quantizer (Q-EMA) and the Adaptive Ramping Optimizer (Q-Ramping). Extensive experiments on Vision Transformers show that TetraJet outperforms existing 4-bit training methods, reduces the accuracy degradation by more than 50% compared to the baseline, and is even competitive with full-precision training.
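To make the oscillation problem concrete, here is a minimal sketch (not the paper's code) of round-to-nearest MXFP4 quantization, where blocks of 32 values share a power-of-two scale and each element is snapped to the FP4 (E2M1) grid, followed by a toy demonstration of a weight flipping between neighboring grid points. The scale computation is a simplification of the MX specification:

```python
import torch

# Representable FP4 (E2M1) magnitudes used by the MX formats.
FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_mxfp4(x: torch.Tensor, block: int = 32) -> torch.Tensor:
    """Round-to-nearest MXFP4 (simplified): blocks of `block` values
    share a power-of-two scale, and each element is snapped to the
    nearest FP4 grid point. Assumes x.numel() is divisible by `block`."""
    flat = x.reshape(-1, block)
    # Shared per-block scale: smallest power of two that maps the
    # block maximum inside the FP4 range [-6, 6].
    amax = flat.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    scale = 2.0 ** torch.ceil(torch.log2(amax / 6.0))
    scaled = flat / scale
    # Nearest-neighbor rounding onto the signed grid.
    idx = (scaled.abs().unsqueeze(-1) - FP4_GRID).abs().argmin(dim=-1)
    return (FP4_GRID[idx] * scaled.sign() * scale).reshape_as(x)

# Toy demonstration of weight oscillation: at this block's scale
# (0.25), the grid points nearest 1.24 are 1.0 and 1.5, with the
# rounding boundary at 1.25. Tiny alternating updates make the
# quantized value flip on every step.
w = torch.full((32,), 1.24)
for step in range(4):
    w += 0.02 if step % 2 == 0 else -0.02
    print(quantize_mxfp4(w)[0].item())  # 1.5, 1.0, 1.5, 1.0
```

In real low-precision pre-training, many weights sit near such boundaries at any given time, which is the source of noise that TetraJet sets out to suppress.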

Takeaways, Limitations

Takeaways:
  • Systematically analyzes the causes of accuracy degradation in FP4-precision training and proposes solutions.
  • Effectively alleviates the weight oscillation problem through Q-EMA and Q-Ramping, improving accuracy (see the sketch after this list).
  • Achieves superior performance over existing 4-bit training methods and competitive performance with full-precision training.
  • Presents an efficient training method based on the MXFP4 data format.
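The exact Q-EMA algorithm is given in the paper; the sketch below only illustrates the underlying idea, assuming the quantizer tracks an exponential moving average of the full-precision master weights and bases the rounding decision on that average. The `EMAQuantizer` class, its `decay` value, and the update rule are illustrative assumptions, and `quantize_mxfp4` is the function from the earlier sketch:

```python
import torch

class EMAQuantizer:
    """Sketch of the Q-EMA idea: base the rounding decision on an
    exponential moving average (EMA) of the master weights, so a
    weight that merely jitters around a rounding boundary no longer
    flips its quantized value every step. The class name, `decay`
    value, and update rule are illustrative assumptions."""

    def __init__(self, w: torch.Tensor, decay: float = 0.999):
        self.ema = w.detach().clone()  # smoothed copy of the weights
        self.decay = decay

    def quantize(self, w: torch.Tensor) -> torch.Tensor:
        # Update the slowly moving average of the full-precision weights.
        self.ema.mul_(self.decay).add_(w.detach(), alpha=1.0 - self.decay)
        # Quantize the smoothed weights (quantize_mxfp4 from the sketch
        # above) instead of the raw, jittery ones.
        return quantize_mxfp4(self.ema)
```

Q-Ramping takes a complementary route: rather than smoothing the quantizer input, it targets weights identified as oscillating and adaptively reduces how often they are updated; it is not sketched here.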
Limitations:
  • Experimental results are presented only for Vision Transformers; generalizability to other model architectures requires further study.
  • The effectiveness of the proposed method may depend on specific hardware (Blackwell GPUs).
  • Experiments on more diverse and complex models and datasets are needed.