Training omnimodal Large Language Models (LLMs) remains a significant challenge because the heterogeneous model architectures required to handle diverse modalities demand sophisticated system design for large-scale training. Existing frameworks typically intertwine model definition with parallel logic, which limits scalability and inflates the engineering overhead of end-to-end omnimodal training. This paper presents VeOmni, a modular and efficient training framework for accelerating omnimodal LLM development. VeOmni introduces model-centric distributed recipes that decouple communication from computation, enabling efficient 3D parallelism for omnimodal LLMs. It also provides a flexible configuration interface that allows new modalities to be integrated with minimal code changes. Using VeOmni, the authors train an omnimodal Mixture-of-Experts (MoE) model with 30B parameters at a throughput of 2,800 tokens/second/GPU and scale it to a 160K context length with 3D parallelism on 128 GPUs, demonstrating strong efficiency and scalability for large-scale omnimodal LLM training.
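To make the "model-centric distributed recipe" idea concrete, the sketch below shows what separating a plain model definition from its parallel layout might look like in code. It is a minimal illustration only: the ParallelRecipe dataclass, its fields, and the OmniBlock module are hypothetical names introduced here for explanation and are not VeOmni's actual API.

```python
# Minimal sketch (hypothetical, not the actual VeOmni API): the model is written as a
# plain PyTorch module with no communication code, while a separate "recipe" object
# describes how each sub-module should be laid out for 3D parallelism
# (FSDP x sequence parallel x expert parallel). A framework like VeOmni would consume
# such a recipe and inject the collective communication at runtime.
from dataclasses import dataclass, field

import torch.nn as nn


@dataclass
class ParallelRecipe:
    """Model-centric parallel plan, kept entirely outside the model definition."""
    fsdp_shard: bool = True          # shard parameters/optimizer state (data-parallel dim)
    sequence_parallel_size: int = 8  # split long sequences across GPUs (context dim)
    expert_parallel_size: int = 4    # place MoE experts on different GPUs (expert dim)
    per_module: dict = field(default_factory=dict)  # optional per-module overrides


class OmniBlock(nn.Module):
    """A plain transformer-style block: no parallelism or communication logic here."""

    def __init__(self, hidden: int = 1024, num_experts: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, num_heads=16, batch_first=True)
        self.experts = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(num_experts))

    def forward(self, x):
        x, _ = self.attn(x, x, x)
        # (routing logic omitted; a real MoE layer would dispatch tokens to experts)
        return x


# The distributed layout lives in the recipe, so changing the parallel strategy
# requires no change to OmniBlock itself.
recipe = ParallelRecipe(sequence_parallel_size=8, expert_parallel_size=4,
                        per_module={"experts": "expert_parallel"})
```

Keeping the parallel plan in a separate object is one plausible way to read the paper's claim that communication is decoupled from computation; the exact mechanism VeOmni uses may differ.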
Takeaways, Limitations
•
Takeaways:
◦
Presents VeOmni, a novel framework that significantly improves the efficiency and scalability of omnimodal LLM training by decoupling model definition from communication and parallel logic.
◦
Enables large-scale omnimodal LLM training through efficient 3D parallelism.
◦
Allows easy integration of new modalities through a flexible configuration interface (see the configuration sketch after this list).
◦
Experiments on a 30B-parameter omnimodal MoE model (2,800 tokens/second/GPU throughput and 160K context lengths on 128 GPUs) demonstrate VeOmni's efficiency and scalability.
•
Limitations:
◦
Further research is needed on the practical applications of VeOmni and its generalizability to various omnimodal LLM architectures.
◦
The framework may be optimized for a specific hardware environment; its portability to other hardware setups needs to be verified.
◦
Further experiments and analysis are needed to assess training efficiency and stability at even larger model scales.
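The configuration sketch referenced in the takeaway on modality integration follows here. It is a rough illustration of how a registry-style modality interface could keep new-modality support to a few lines of code; the register_modality decorator, MODALITY_ENCODERS registry, and AudioEncoder class are hypothetical and do not reflect VeOmni's real configuration interface.

```python
# Hypothetical registry-style modality interface (not VeOmni's actual API): a new
# modality is added by registering an encoder class under a name and referencing that
# name in the training configuration, leaving the core training loop untouched.
import torch
import torch.nn as nn

MODALITY_ENCODERS: dict[str, type[nn.Module]] = {}


def register_modality(name: str):
    """Class decorator that makes an encoder discoverable by name."""
    def wrapper(cls):
        MODALITY_ENCODERS[name] = cls
        return cls
    return wrapper


@register_modality("audio")
class AudioEncoder(nn.Module):
    """Maps raw audio features to the LLM's hidden size."""

    def __init__(self, in_dim: int = 128, hidden: int = 1024):
        super().__init__()
        self.proj = nn.Linear(in_dim, hidden)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.proj(feats)


# A config then only needs to name the encoder; no framework code changes are required.
config = {"modalities": ["text", "image", "audio"], "encoders": {"audio": "audio"}}
encoder = MODALITY_ENCODERS[config["encoders"]["audio"]]()
print(encoder(torch.randn(2, 16, 128)).shape)  # torch.Size([2, 16, 1024])
```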