MoCHA is a novel visual framework proposed to address the high training and inference costs of vision large language models (VLLMs) and the difficulty of extracting fine-grained visual details. It integrates four vision backbones, CLIP, SigLIP, DINOv2, and ConvNeXt, to extract complementary visual features. A sparse Mixture of Experts Connectors (MoECs) module then dynamically selects experts tailored to different visual dimensions. Furthermore, MoCHA employs Hierarchical Group Attention (HGA) with an adaptive gating strategy to mitigate redundancy and underuse in the visual information encoded by the MoECs module. MoCHA was trained on top of leading LLMs, such as Phi2-2.7B and Vicuna-7B, and its performance was evaluated on various benchmarks. MoCHA outperformed state-of-the-art open-weight models on several tasks. Specifically, compared to CuMo (Mistral-7B), MoCHA (Phi2-2.7B) achieved a 3.25% improvement in mitigating hallucination on the POPE benchmark and a 153-point gain in following visual instructions on the MME benchmark. Additional ablation studies confirmed the effectiveness and robustness of the proposed MoECs and HGA.
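To make the two core components concrete, below is a minimal sketch, not the authors' implementation, of (a) a sparse expert connector that routes each visual token to a few expert projectors and (b) a gated fusion over per-backbone feature groups in the spirit of HGA. All module names (MoEConnector, GatedGroupFusion), layer sizes, the number of experts, and the top-k routing value are illustrative assumptions.

```python
# Hypothetical sketch of a sparse Mixture of Experts Connector and a gated
# group fusion; sizes and names are assumptions, not the MoCHA release code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEConnector(nn.Module):
    """Sparse expert connector: each visual token is routed to its top-k expert MLPs."""

    def __init__(self, vis_dim: int, llm_dim: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(vis_dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(vis_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))
            for _ in range(num_experts)
        )
        self.out_dim = llm_dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, vis_dim)
        logits = self.router(x)                                  # (B, T, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)           # keep k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros(*x.shape[:2], self.out_dim, device=x.device)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e                          # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out


class GatedGroupFusion(nn.Module):
    """Adaptive gate over per-backbone feature groups (simplified HGA-style fusion)."""

    def __init__(self, dim: int, num_groups: int = 4):
        super().__init__()
        self.gate = nn.Linear(dim * num_groups, num_groups)

    def forward(self, groups: list) -> torch.Tensor:
        # groups: list of (B, T, dim) features, one per vision backbone
        stacked = torch.stack(groups, dim=-2)                       # (B, T, G, dim)
        gate = torch.sigmoid(self.gate(torch.cat(groups, dim=-1)))  # (B, T, G)
        return (gate.unsqueeze(-1) * stacked).sum(dim=-2)           # weighted sum over groups


if __name__ == "__main__":
    B, T, VIS, LLM = 2, 16, 1024, 2560   # toy sizes; 2560 mimics a Phi2-2.7B hidden size
    backbones = [torch.randn(B, T, VIS) for _ in range(4)]          # stand-ins for CLIP/SigLIP/DINOv2/ConvNeXt features
    connector = MoEConnector(VIS, LLM)
    fusion = GatedGroupFusion(LLM, num_groups=4)
    fused = fusion([connector(f) for f in backbones])
    print(fused.shape)                                              # torch.Size([2, 16, 2560])
```

In this sketch the router's top-k softmax weights determine how much each selected expert contributes per token, while the sigmoid gate lets the model down-weight redundant backbone features before they reach the LLM, which is one plausible way to read the roles the abstract assigns to MoECs and HGA.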