This paper highlights that training omnimodal LLMs (large language models) remains a significant challenge: the heterogeneous model architectures required to handle different modalities demand sophisticated system designs for large-scale training. Existing frameworks typically intertwine model definition and parallel logic, which limits scalability and incurs significant engineering overhead for end-to-end omnimodal training. In response, the authors present VeOmni, a modular and efficient training framework for accelerating omnimodal LLM development. VeOmni introduces model-centric distributed recipes that decouple communication from computation, enabling efficient 3D parallelism for omnimodal LLMs. It also features a flexible configuration interface that allows new modalities to be integrated with minimal code changes. Using VeOmni, the authors train an omnimodal mixture-of-experts (MoE) model with 30B parameters at 2,800 tokens/second/GPU throughput and scale to a 160K context length with 3D parallelism on 128 GPUs, demonstrating strong efficiency and scalability for large-scale omnimodal LLM training.
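The key architectural idea is that the model code stays independent of parallelism and modality plumbing, both of which live in configuration. The sketch below is a minimal, hypothetical configuration layer (all class and field names are illustrative assumptions, not VeOmni's actual API) showing how a new modality or a different 3D-parallel layout could be declared without touching the model definition.

```python
# A minimal sketch (not VeOmni's actual API) of a configuration-driven design
# that decouples the model definition from parallelism and modality choices.
from dataclasses import dataclass, field

@dataclass
class ParallelConfig:
    # Degrees of the three parallel dimensions; their product should equal
    # the total number of GPUs (e.g., 8 * 4 * 4 = 128). Values are assumptions.
    data_parallel: int = 8
    tensor_parallel: int = 4
    context_parallel: int = 4

@dataclass
class ModalityConfig:
    # Registering a new modality becomes a config change, not a code change.
    name: str
    encoder: str          # e.g., "vit-large", "whisper-small" (illustrative)
    token_budget: int     # max tokens this modality contributes per sample

@dataclass
class TrainingConfig:
    model: str = "omni-moe-30b"
    seq_len: int = 160_000
    parallel: ParallelConfig = field(default_factory=ParallelConfig)
    modalities: list[ModalityConfig] = field(default_factory=list)

cfg = TrainingConfig(
    modalities=[
        ModalityConfig("image", encoder="vit-large", token_budget=4096),
        ModalityConfig("audio", encoder="whisper-small", token_budget=2048),
    ],
)
# Sanity check: the 3D mesh must cover all 128 GPUs in this assumed setup.
p = cfg.parallel
assert p.data_parallel * p.tensor_parallel * p.context_parallel == 128
print(cfg)
```

The point of the separation is that swapping in a new modality encoder or re-balancing the parallel degrees only edits this configuration object, while the model definition and training loop stay untouched.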
Takeaways, Limitations
• Takeaways:
◦ The VeOmni framework significantly improves the efficiency and scalability of omnimodal LLM training.
◦ Decoupling communication from computation (and model definition from parallel logic) enables efficient large-scale training via 3D parallelism.
◦ A flexible configuration interface allows new modalities to be integrated with minimal code changes.
◦ Experiments demonstrate that a 30B-parameter omnimodal MoE model can be trained efficiently on 128 GPUs at 2,800 tokens/second/GPU with a 160K context length (see the mesh sketch after this list).
• Limitations:
◦ Further research is needed to establish the practical applicability and generalization of the VeOmni framework.
◦ Additional performance evaluations are needed across different model scales and hardware environments.
◦ The reported results are tied to a specific hardware setup (128 GPUs); efficiency in other environments remains to be verified.
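As a rough illustration of the 3D layout behind the 128-GPU result, the sketch below maps global ranks onto a data × context × tensor parallel mesh. The specific degrees (8 × 4 × 4) are an assumption chosen to multiply to 128; the paper does not specify this split, and the helper function is purely illustrative.

```python
# Hypothetical 3D mesh layout for 128 GPUs: data x context x tensor parallel.
# The degrees below are assumptions, not values reported in the paper.
DP, CP, TP = 8, 4, 4          # 8 * 4 * 4 = 128 GPUs

def mesh_coords(rank: int) -> tuple[int, int, int]:
    """Map a global rank to (data, context, tensor) coordinates,
    with tensor parallel as the innermost (fastest-varying) dimension."""
    tp = rank % TP
    cp = (rank // TP) % CP
    dp = rank // (TP * CP)
    return dp, cp, tp

# Ranks sharing the same (data, context) coordinates form one tensor-parallel
# group; for rank 0 this is the first TP consecutive ranks.
tp_group_of_rank_0 = [
    r for r in range(DP * CP * TP) if mesh_coords(r)[:2] == mesh_coords(0)[:2]
]
print(tp_group_of_rank_0)     # [0, 1, 2, 3]
```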