This paper focuses on improving the long-context processing capability of multimodal large language models (MLLMs), which is crucial for advancing video understanding and high-resolution image analysis. Achieving this requires systematic improvements in model architecture, data organization, and training strategy, particularly to address the performance degradation and high computational cost that arise as the number of input images grows. We propose a hybrid architecture that integrates Mamba and Transformer blocks, a data organization method that captures both temporal and spatial dependencies among images, and an incremental training strategy. The resulting model, LongLLaVA, strikes an effective balance between efficiency and performance, achieving competitive results across various benchmarks while maintaining high throughput and low memory consumption. Notably, it can process nearly 1,000 images on a single A100 80GB GPU.
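To make the hybrid-architecture idea concrete, the following is a minimal PyTorch sketch of one possible way to interleave Mamba-style blocks with Transformer blocks in a decoder stack; the module names (`MambaBlockStub`, `HybridDecoder`), dimensions, and the attention-to-Mamba ratio are illustrative assumptions, not LongLLaVA's actual configuration.

```python
# Hypothetical sketch of a hybrid Mamba-Transformer layer stack.
# All names, sizes, and the block ratio are assumptions for illustration only.
import torch
import torch.nn as nn


class MambaBlockStub(nn.Module):
    """Placeholder for a Mamba (selective state-space) block; a real
    implementation would use a dedicated SSM kernel instead of a Linear."""

    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mixer = nn.Linear(d_model, d_model)  # stand-in for the SSM mixer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.mixer(self.norm(x))


class HybridDecoder(nn.Module):
    """Interleaves Mamba blocks (linear-time sequence mixing) with occasional
    Transformer blocks (full self-attention), so most layers avoid the
    quadratic attention cost on very long multimodal token sequences."""

    def __init__(self, d_model: int = 1024, n_heads: int = 16,
                 n_layers: int = 24, attn_every: int = 8):
        super().__init__()
        layers = []
        for i in range(n_layers):
            if (i + 1) % attn_every == 0:
                # One attention layer every `attn_every` layers (assumed ratio).
                layers.append(nn.TransformerEncoderLayer(
                    d_model, n_heads, dim_feedforward=4 * d_model,
                    batch_first=True, norm_first=True))
            else:
                layers.append(MambaBlockStub(d_model))
        self.layers = nn.ModuleList(layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x)
        return x


if __name__ == "__main__":
    # A long multimodal sequence, e.g. many frames x tokens-per-frame.
    tokens = torch.randn(1, 8192, 1024)
    out = HybridDecoder()(tokens)
    print(out.shape)  # torch.Size([1, 8192, 1024])
```

The design intuition is that state-space layers carry most of the long-sequence mixing at linear cost, while the sparser attention layers preserve the global token interactions that pure SSM stacks tend to lose.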