Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

MoCHA: Advanced Vision-Language Reasoning with MoE Connector and Hierarchical Group Attention

Created by
  • Haebom

Authors

Yuqi Pang, Bowen Yang, Yun Cao, Rong Fan, Xiaoyu Li, Chen He

Outline

MoCHA is a novel visual framework proposed to address the high training and inference costs of vision large language models (VLLMs) and the difficulty of extracting fine-grained visual details. It integrates four vision backbones (CLIP, SigLIP, DINOv2, and ConvNeXt) to extract complementary visual features. A sparse Mixture of Experts Connectors (MoECs) module dynamically selects experts tailored to different visual dimensions, while Hierarchical Group Attention (HGA) with an adaptive gating strategy mitigates the redundant or underutilized visual information encoded by the MoECs module. MoCHA was trained with mainstream LLMs such as Phi2-2.7B and Vicuna-7B and evaluated on various benchmarks, where it outperformed state-of-the-art open-weight models on several tasks. In particular, compared with CuMo (Mistral-7B), MoCHA (Phi2-2.7B) reduced hallucinations by 3.25% on the POPE benchmark and improved visual instruction following by 153 points on the MME benchmark. Ablation studies further confirmed the effectiveness and robustness of the proposed MoECs and HGA.
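To make the connector idea concrete, below is a minimal PyTorch sketch of a sparse mixture-of-experts connector: a router scores each visual token, only the top-k experts process it, and their outputs are combined by the normalized router weights. The class name, dimensions, and top-k routing scheme are illustrative assumptions based on this summary, not the paper's exact MoECs implementation.

```python
# A minimal sketch of a sparse mixture-of-experts connector in PyTorch.
# Class names, dimensions, and the top-k routing scheme are assumptions
# for illustration; the paper's MoECs design may differ in detail.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEConnector(nn.Module):
    def __init__(self, vision_dim: int, llm_dim: int,
                 num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # One lightweight MLP expert per visual "dimension" (hypothetical).
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(vision_dim, llm_dim),
                nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )
            for _ in range(num_experts)
        ])
        # The router scores each visual token against every expert.
        self.router = nn.Linear(vision_dim, num_experts)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, seq, vision_dim), e.g. features fused
        # from the CLIP / SigLIP / DINOv2 / ConvNeXt backbones.
        logits = self.router(visual_tokens)             # (B, S, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)  # keep top-k experts
        weights = F.softmax(weights, dim=-1)            # normalize over top-k
        out = visual_tokens.new_zeros(
            *visual_tokens.shape[:2], self.experts[0][-1].out_features
        )
        # Each token is processed only by its k selected experts (sparse).
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e                 # (B, S) bool
                if mask.any():
                    out[mask] += (weights[..., k][mask].unsqueeze(-1)
                                  * expert(visual_tokens[mask]))
        return out  # (B, S, llm_dim) visual tokens handed to the LLM
```

For example, `SparseMoEConnector(vision_dim=1024, llm_dim=2560)` would project visual tokens toward an LLM with a 2560-dimensional embedding space (Phi2-2.7B's hidden size). The per-token top-k routing is what keeps the connector sparse: only a fraction of the experts run for each token, which matches the summary's emphasis on reducing training and inference cost.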

Takeaways, Limitations

Takeaways:
• Presents a framework that effectively addresses the high training and inference costs of VLLMs.
• Improves performance by extracting complementary visual features from multiple backbones.
• Increases the efficiency of visual information use through the MoECs and HGA modules.
• Reduces hallucinations and improves visual instruction following.
• Outperforms state-of-the-art open-weight models on several benchmarks.
Limitations:
• Further research is needed on the generalizability of the proposed framework.
• Dependence on specific LLMs and compatibility with other LLMs remain to be assessed.
• More diverse and comprehensive benchmark evaluations are needed.
• Parameter tuning for the MoECs and HGA modules is not explained in detail.