Daily Arxiv

This page collects and organizes papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

ACCO: Accumulate While You Communicate for Communication-Overlapped Sharded LLM Training

Created by
  • Haebom

Author

Adel Nabli (MLIA, Mila), Louis Fournier (MLIA), Pierre Erbacher (MLIA), Louis Serrano (MLIA), Eugene Belilovsky (Mila), Edouard Oyallon (MLIA)

Outline

This paper proposes ACCO (Accumulate While You Communicate), a distributed optimization algorithm that reduces communication overhead and improves memory efficiency in data-parallel training of large language models (LLMs). ACCO overlaps the synchronization of delayed gradients with the computation of new gradients, reducing GPU idle time and supporting heterogeneous hardware. It also introduces a technique that aligns the training dynamics with those of standard distributed optimization, mitigating the convergence issues caused by delayed updates. The proposed algorithm is significantly faster than ZeRO-1 and scales effectively across heterogeneous hardware environments.
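The core overlap idea can be illustrated with a short sketch. The following is a minimal, illustrative Python example (not the authors' implementation), assuming a PyTorch torch.distributed process group has already been initialized on each worker. It shows only the generic delayed-gradient pattern, overlapping the all-reduce of the previous step's gradients with the computation of the current step's gradients, and omits ACCO's correction of the training dynamics as well as its optimizer-state sharding. Function and variable names (e.g., overlapped_step_loop) are illustrative.

```python
import torch.distributed as dist


def overlapped_step_loop(model, optimizer, data_loader, loss_fn):
    """One data-parallel worker: overlap the all-reduce of step t's
    gradients with the computation of step t+1's gradients."""
    pending = None  # (async work handles, gradient buffers in flight)

    for inputs, targets in data_loader:
        # Compute fresh local gradients for the current mini-batch.
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()

        # Snapshot the fresh local gradients before they are overwritten,
        # so they can be communicated asynchronously below.
        fresh_grads = [p.grad.detach().clone() for p in model.parameters()]

        # The all-reduce launched at the end of the previous iteration has
        # been overlapping with the forward/backward above. Finish it and
        # apply the (delayed) averaged gradients.
        if pending is not None:
            handles, synced_grads = pending
            for h in handles:
                h.wait()
            for p, g in zip(model.parameters(), synced_grads):
                p.grad.copy_(g / dist.get_world_size())
            optimizer.step()

        # Launch an asynchronous all-reduce on the fresh gradients; it will
        # overlap with the next iteration's computation.
        handles = [dist.all_reduce(g, op=dist.ReduceOp.SUM, async_op=True)
                   for g in fresh_grads]
        pending = (handles, fresh_grads)
```

Because the parameters are updated with gradients that are one step stale, a plain loop like this drifts from standard synchronous SGD; ACCO's contribution includes a mechanism to keep the training dynamics aligned with the standard setting, which this sketch does not reproduce.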

Takeaways, Limitations

Takeaways:
  • Reduces communication overhead in distributed LLM training
  • Proposes a memory-efficient optimization algorithm
  • Reduces GPU idle time and supports heterogeneous hardware
  • Introduces a new technique to address convergence problems caused by delayed updates
  • Demonstrates performance improvements and scalability compared to ZeRO-1
Limitations:
  • Lacks specific performance figures and details of the experimental environment
  • Algorithm complexity and implementation difficulty
  • Lacks comparative analysis against other distributed optimization algorithms