Daily Arxiv

This page collects and organizes papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, simply cite the source.

LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via a Hybrid Architecture

Created by
  • Haebom

Authors

Xidong Wang, Dingjie Song, Shunian Chen, Junyin Chen, Zhenyang Cai, Chen Zhang, Lichao Sun, Benyou Wang

Outline

This paper focuses on improving the long-context processing capability of multimodal large language models (MLLMs), which is crucial for advancing video understanding and high-resolution image analysis. Doing so requires systematic improvements in model architecture, data construction, and training strategy, particularly to address the performance degradation and high computational cost that arise as the number of input images grows. The paper proposes a hybrid architecture that integrates Mamba and Transformer blocks, a data construction method that captures both temporal and spatial dependencies among multiple images, and a progressive training strategy. The resulting model, LongLLaVA, strikes an effective balance between efficiency and performance, achieving competitive results across various benchmarks while maintaining high throughput and low memory consumption. Notably, it can process nearly 1,000 images on a single A100 80GB GPU.
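Below is a minimal PyTorch sketch of the hybrid stacking idea described above: most layers are linear-time Mamba (state-space) blocks, with Transformer (self-attention) layers interleaved at a fixed interval. The MambaBlockStub, the layer ratio, and all dimensions are illustrative assumptions rather than the paper's actual configuration; a real implementation would use a proper selective-state-space block (e.g., from the mamba_ssm package).

```python
# Illustrative sketch of a hybrid Mamba/Transformer stack. Layer ratio,
# dimensions, and the stub block are assumptions, not the authors' config.
import torch
import torch.nn as nn


class MambaBlockStub(nn.Module):
    """Placeholder for a real selective-state-space (Mamba) block.

    A gated causal depthwise convolution stands in here so the sketch runs
    without extra dependencies; mamba_ssm.Mamba would be used in practice.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=4,
                              padding=3, groups=d_model)
        self.gate = nn.Linear(d_model, d_model)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, dim)
        h = self.norm(x)
        # Causal depthwise conv over the sequence dimension (extra padding
        # is trimmed so position t never sees future tokens).
        c = self.conv(h.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        return x + self.proj(c * torch.sigmoid(self.gate(h)))


class HybridStack(nn.Module):
    """Interleaves Transformer layers among Mamba layers (ratio is assumed)."""

    def __init__(self, d_model: int = 512, n_heads: int = 8,
                 n_layers: int = 8, transformer_every: int = 8):
        super().__init__()
        # Attention mask is omitted for brevity in this encoder-style sketch.
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            if (i + 1) % transformer_every == 0
            else MambaBlockStub(d_model)
            for i in range(n_layers)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x)
        return x


if __name__ == "__main__":
    tokens = torch.randn(1, 1024, 512)  # e.g., flattened multi-image tokens
    print(HybridStack()(tokens).shape)  # torch.Size([1, 1024, 512])
```

Because only a small fraction of layers use quadratic self-attention, memory and compute grow close to linearly with sequence length, which is what makes very long multi-image contexts tractable.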

Takeaways, Limitations

Takeaways:
  • Presents an effective hybrid architecture, data construction method, and training strategy for enhancing MLLMs' long-context processing capability (a sketch of the data-construction idea follows this list).
  • The LongLLaVA model achieves competitive performance while maintaining high throughput and low memory consumption.
  • Demonstrates that nearly a thousand images can be processed on a single GPU, suggesting applicability to a wide range of multimodal tasks.
Limitations:
  • Further research is needed on the generalization performance of the proposed method.
  • Its applicability and performance on other types of multimodal data remain to be evaluated.
  • Experiments on more diverse and larger datasets may be needed.
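As a companion to the takeaways above, here is a hypothetical sketch of the data-construction idea: distinct textual separators let the model tell temporal sequences (video frames) apart from spatial grids (sub-images of one high-resolution image). The separator tokens and helper names below are illustrative assumptions, not the paper's exact format.

```python
# Hypothetical sketch of multi-image prompt packing: separators encode whether
# image tokens are related temporally (video frames) or spatially (sub-images).
# The "<t>", ",", and newline separators are assumptions for illustration.
from typing import List


def pack_temporal(frame_tokens: List[str]) -> str:
    """Video frames: preserve temporal order with a frame delimiter."""
    return "<t>".join(frame_tokens)


def pack_spatial(grid: List[List[str]]) -> str:
    """High-res image split into a grid: commas within a row, newlines between rows."""
    return "\n".join(",".join(row) for row in grid)


if __name__ == "__main__":
    print(pack_temporal(["<frame0>", "<frame1>", "<frame2>"]))
    print(pack_spatial([["<p00>", "<p01>"], ["<p10>", "<p11>"]]))
```

Keeping the two dependency types distinguishable in the token stream is what lets a single model handle both video-style and high-resolution inputs within the same long context.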