ACCO: Accumulate While You Communicate for Communication-Overlapped Sharded LLM Training
Created by: Haebom
Author: Adel Nabli (MLIA, Mila), Louis Fournier (MLIA), Pierre Erbacher (MLIA), Louis Serrano (MLIA), Eugene Belilovsky (Mila), Edouard Oyallon (MLIA)
Outline
This paper proposes ACCO (Accumulate While You Communicate), a novel distributed optimization algorithm that reduces communication overhead and improves memory efficiency in sharded, data-parallel training of large language models (LLMs). ACCO synchronizes delayed gradients while new gradients are being computed, which reduces GPU idle time and supports heterogeneous hardware. It also introduces a technique that aligns the training dynamics with those of standard distributed optimization, mitigating the convergence issues caused by delayed updates. The proposed algorithm is significantly faster than ZeRO-1 and scales effectively across heterogeneous hardware environments.
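To make the overlap concrete, below is a minimal PyTorch-style sketch of the general idea: launch an asynchronous all-reduce on the previous step's gradients while the current step's gradients are being computed, then apply the delayed, averaged gradients. This is not the authors' implementation; the training loop, loss function, and one-step-delay bookkeeping are illustrative assumptions, and the sketch omits ACCO's sharding of optimizer states and its correction that realigns the dynamics with standard synchronous training.

```python
# Illustrative sketch only: overlapping gradient communication with computation
# via a one-step gradient delay. Assumes torch.distributed is initialized and
# that every parameter receives a gradient each step.
import torch
import torch.distributed as dist
import torch.nn.functional as F

def train_overlapped(model, optimizer, batches, device):
    prev_grads = None        # gradients accumulated at the previous step
    handles = []             # async all-reduce work handles

    for inputs, targets in batches:
        # 1) Start communicating last step's gradients (non-blocking).
        if prev_grads is not None:
            handles = [
                dist.all_reduce(g, op=dist.ReduceOp.SUM, async_op=True)
                for g in prev_grads
            ]

        # 2) Overlap: compute this step's gradients while communication runs.
        loss = F.cross_entropy(model(inputs.to(device)), targets.to(device))
        model.zero_grad(set_to_none=True)
        loss.backward()
        new_grads = [p.grad.detach().clone() for p in model.parameters()]

        # 3) Finish communication and apply the delayed, averaged gradients.
        if prev_grads is not None:
            for h in handles:
                h.wait()
            world = dist.get_world_size()
            for p, g in zip(model.parameters(), prev_grads):
                p.grad = g / world
            optimizer.step()

        prev_grads = new_grads
```

In a sharded variant, the all-reduce would be replaced by a reduce-scatter so that each worker only updates the optimizer-state shard it owns, in the spirit of ZeRO-1; that detail is left out of the sketch above.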
Takeaways, Limitations
• Takeaways:
◦ Reduces communication overhead in distributed LLM training.
◦ Proposes a memory-efficient optimization algorithm.
◦ Reduces GPU idle time and supports heterogeneous hardware.
◦ Introduces a new technique to address convergence problems caused by delayed updates.
◦ Demonstrates performance improvements and scalability compared to ZeRO-1.
• Limitations:
◦ The summary lacks specific performance figures and details of the experimental setup.
◦ The algorithm's complexity may make it difficult to implement.
◦ Lacks comparative analysis with other distributed optimization algorithms.