In this paper, we propose a novel framework for open-ended video question answering that improves reasoning depth and robustness in the complex real-world scenarios of the CVRR-ES benchmark. Existing Video Large Multimodal Models (Video-LMMs) suffer from limited contextual understanding, weak temporal modeling, and poor generalization to ambiguous or misleading questions. To address these issues, we present a prompting and response-integration mechanism that coordinates multiple heterogeneous Video-Language Models (VLMs), each tailored to a different reasoning path through a structured chain of thought. An external Large Language Model (LLM) acts as an evaluator and integrator, selecting and merging the most reliable responses. Extensive experiments show that the proposed method significantly outperforms existing baselines across all evaluation metrics, demonstrating strong generalization and robustness. Our approach offers a lightweight and scalable strategy for advancing multimodal reasoning without model retraining, and lays a solid foundation for future Video-LMM development.
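As a rough illustration of the response-integration idea summarized above, the sketch below shows one way multiple Video-LMM outputs, each obtained under a different structured prompting strategy, could be merged by an external LLM acting as evaluator and integrator. All function names (query interfaces, build_prompts, integrate_responses) and prompt wordings are hypothetical placeholders under assumed interfaces, not the paper's actual implementation.

```python
from typing import Callable, Dict, List

# Hypothetical interfaces: each callable wraps one Video-LMM or the external judge LLM.
# These are illustrative stand-ins, not APIs defined in the paper.
VideoLMM = Callable[[str, str], str]   # (video_path, prompt) -> answer text
JudgeLLM = Callable[[str], str]        # (prompt) -> answer text


def build_prompts(question: str) -> Dict[str, str]:
    """Structured chain-of-thought prompts, one per reasoning path (illustrative wording)."""
    return {
        "temporal":   f"Describe the order of events in the video, then answer: {question}",
        "contextual": f"Summarize the scene context first, then answer: {question}",
        "direct":     f"Answer concisely: {question}",
    }


def integrate_responses(video: str, question: str,
                        vlms: Dict[str, VideoLMM], judge: JudgeLLM) -> str:
    """Query heterogeneous Video-LMMs along different reasoning paths, then let an
    external LLM select and merge the most reliable candidate answers."""
    prompts = build_prompts(question)
    candidates: List[str] = []
    for model_name, vlm in vlms.items():
        for path_name, prompt in prompts.items():
            answer = vlm(video, prompt)
            candidates.append(f"[{model_name}/{path_name}] {answer}")

    judge_prompt = (
        "You are given candidate answers to a video question from several models.\n"
        f"Question: {question}\n"
        + "\n".join(candidates)
        + "\nSelect the most reliable information and merge it into one final answer."
    )
    return judge(judge_prompt)
```

In this sketch, no model is retrained: the gain comes entirely from prompting the existing Video-LMMs along complementary reasoning paths and delegating selection and fusion to the judge LLM, which mirrors the lightweight, training-free strategy described in the abstract.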