Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; please cite the source when sharing.

Exploiting Block Coordinate Descent for Cost-Effective LLM Model Training

Created by
  • Haebom

Authors

Zeyu Liu, Yan Li, Yunquan Zhang, Boyang Zhang, Guoyong Jiang, Xin Zhang, Limin Xiao, Weifeng Zhang, Daning Cheng

A block coordinate descent-based framework for efficient large-scale language model training.

Outline

In this paper, we propose a full-parameter pre-training and fine-tuning framework based on Block Coordinate Descent (BCD), aimed at small- and medium-sized teams for whom GPU memory and financial constraints make training large-scale language models difficult. With additional engineering optimizations, the framework efficiently trains large models on RTX 4090, A100, and A800 GPU clusters. Compared with standard full-parameter training under the same hardware conditions, BCD reduces the training cost of a 7B model to 33% of the baseline on A100/A800 clusters and to 2.6% on RTX 4090 clusters. It also makes it possible to train, on RTX 4090 clusters and without performance degradation, large models that previously could only be trained on A100 clusters. In most cases, BCD achieves accuracy comparable to or better than full-parameter training and fine-tuning, while reducing GPU usage and improving hardware utilization.
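
The core idea can be illustrated with a short sketch (this is not the authors' released code): train one block of layers at a time while all other parameters stay frozen, so gradients and optimizer states are only allocated for the active block. In the sketch below, `make_blocks`, `bcd_train`, `layers_per_block`, and the `model.layers` attribute are hypothetical names assumed for illustration.

```python
# Minimal sketch of block coordinate descent (BCD) training, assuming a
# decoder-only Transformer whose layers can be partitioned into blocks.
# Block size, optimizer choice, and the `model.layers` attribute are
# illustrative assumptions, not the paper's exact configuration.
import torch
from torch import nn


def make_blocks(model: nn.Module, layers_per_block: int):
    """Partition the model's Transformer layers into parameter blocks."""
    layers = list(model.layers)  # assumes the model exposes a `layers` list
    return [
        [p for layer in layers[i:i + layers_per_block] for p in layer.parameters()]
        for i in range(0, len(layers), layers_per_block)
    ]


def bcd_train(model, data_loader, loss_fn, layers_per_block=4,
              inner_steps=100, lr=1e-4):
    blocks = make_blocks(model, layers_per_block)
    data_iter = iter(data_loader)  # assumes enough batches for all blocks
    for block_params in blocks:            # outer loop: one block at a time
        for p in model.parameters():       # freeze everything ...
            p.requires_grad_(False)
        for p in block_params:             # ... except the active block
            p.requires_grad_(True)
        # Optimizer states (e.g. Adam moments) are allocated only for the
        # active block, which is where the GPU-memory saving comes from.
        opt = torch.optim.AdamW(block_params, lr=lr)
        for _ in range(inner_steps):       # inner loop: update the active block
            batch, targets = next(data_iter)
            opt.zero_grad(set_to_none=True)
            loss = loss_fn(model(batch), targets)
            loss.backward()
            opt.step()
```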

Takeaways, Limitations

Takeaways:
  • Significantly reduces the cost of training large-scale language models by leveraging cost-effective GPUs (RTX 4090).
  • Models that could only be trained on the A100 can now be trained on the RTX 4090 without performance degradation.
  • Reduced GPU usage and improved hardware utilization compared to full-parameter training methods.
Limitations:
  • Lack of detailed information on specific performance comparison metrics and the degree of accuracy improvement.
  • Lack of information on the extensibility of the BCD framework and its applicability to various model architectures.
  • Further research is needed on performance variations across real application cases and environments.