
Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Team of One: Cracking Complex Video QA with Model Synergy

Created by
  • Haebom

Author

Jun Xie, Zhaoran Zhao, Xiongjun Guan, Yingjian Zhu, Hongzhu Yi, Xinming Wang, Feng Chen, Zhepeng Wang

Outline

In this paper, we propose a novel framework for open-ended video question answering that improves reasoning depth and robustness in complex real-world scenarios, evaluated on the CVRR-ES dataset. Existing Video-Large Multimodal Models (Video-LMMs) suffer from limited contextual understanding, weak temporal modeling, and poor generalization to ambiguous or compositional questions. To address these issues, we present a prompting and response integration mechanism that coordinates multiple heterogeneous Video-Language Models (VLMs), each tailored to a different reasoning path through structured chain-of-thought prompting. An external Large Language Model (LLM) acts as evaluator and integrator, selecting and merging the most reliable responses. Extensive experiments show that the proposed method significantly outperforms existing baseline models on all evaluation metrics, demonstrating excellent generalization and robustness. Our approach provides a lightweight, scalable strategy for advancing multimodal reasoning without model retraining, and lays a solid foundation for future Video-LMM development.
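The pipeline described above (several heterogeneous VLMs prompted along different reasoning paths, with an external LLM selecting and merging their answers) lends itself to a compact sketch. The code below is a minimal illustration, not the authors' implementation: the helper callables (query_vlm, query_llm), the prompt templates, and the reasoning-path names are all hypothetical placeholders.

```python
# Minimal sketch of the model-synergy pipeline summarized above.
# All prompts, path names, and helper callables are hypothetical
# placeholders, not the paper's actual implementation.

from typing import Callable

# Structured chain-of-thought templates, one per reasoning path.
REASONING_PROMPTS = {
    "temporal": "Describe the order of events step by step, then answer: {q}",
    "causal": "Explain why the events unfold as they do, then answer: {q}",
    "descriptive": "List the key objects and actions, then answer: {q}",
}

def answer_video_question(
    video_path: str,
    question: str,
    vlms: dict[str, Callable[[str, str], str]],  # name -> query_vlm(video, prompt)
    llm: Callable[[str], str],                   # query_llm(prompt) -> text
) -> str:
    # 1) Pair each heterogeneous VLM with a reasoning path and collect
    #    one candidate answer per pairing.
    candidates = []
    for (path, template), (name, query_vlm) in zip(
        REASONING_PROMPTS.items(), vlms.items()
    ):
        prompt = template.format(q=question)
        candidates.append((name, path, query_vlm(video_path, prompt)))

    # 2) An external LLM acts as evaluator and integrator: it judges the
    #    reliability of the candidates and fuses them into a final answer.
    listing = "\n".join(
        f"[{name} / {path}] {answer}" for name, path, answer in candidates
    )
    judge_prompt = (
        f"Question about a video: {question}\n"
        f"Candidate answers from different models:\n{listing}\n"
        "Evaluate which candidates are reliable and merge them into a "
        "single, well-grounded answer."
    )
    return llm(judge_prompt)

# Example wiring (stubs standing in for real model calls):
# final = answer_video_question(
#     "clip.mp4", "Why does the person stop walking?",
#     vlms={"vlm_a": query_vlm_a, "vlm_b": query_vlm_b, "vlm_c": query_vlm_c},
#     llm=query_llm,
# )
```

Because the coordination happens entirely at prompting and aggregation time, no model weights are updated, which is what makes the strategy lightweight and retraining-free.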

Takeaways, Limitations

Takeaways:
• Presents a novel framework that improves reasoning depth and robustness in open-ended video question answering.
• Addresses the limited contextual understanding, weak temporal modeling, and poor generalization of existing Video-LMMs.
• Improves performance through a prompting and response integration mechanism that coordinates multiple heterogeneous VLMs.
• Provides a lightweight, scalable strategy for advancing multimodal reasoning without model retraining.
• Significantly outperforms existing baseline models on all evaluation metrics.
Limitations:
• The performance of the proposed framework may depend on the performance of the underlying LLMs and VLMs.
• Only results on the CVRR-ES dataset are reported; generalization to other datasets requires further study.
• Further analysis is needed of the role and reliability of the external LLM evaluator.
• Computational cost may increase due to the complexity of the prompting and response integration mechanism.