Daily Arxiv

This page collects papers on artificial intelligence published around the world.
The summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, simply cite the source.

VidBridge-R1: Bridging QA and Captioning for RL-based Video Understanding Models with Intermediate Proxy Tasks

Created by
  • Haebom

Authors

Xinlong Chen, Yuanxing Zhang, Yushuo Guan, Weihong Lin, Zekun Wang, Bohan Zeng, Yang Shi, Sihan Yang, Qiang Liu, Pengfei Wan, Liang Wang, Tieniu Tan

Outline

Approaches that combine the "Reason-Then-Respond" paradigm with reinforcement learning have advanced multimodal large language models. When applied to the video domain, however, they have produced models specialized for either question answering (QA) or captioning, unable to handle both tasks well. Because the two tasks pull in conflicting directions, naively combining their reward signals degrades performance. To address this, the paper proposes a learning framework built on two intermediate proxy tasks: DarkEventInfer and MixVidQA. DarkEventInfer presents a video with masked event segments and requires the model to infer the masked content from the surrounding video context. MixVidQA presents an interleaved sequence composed of two distinct clips and requires the model to reason about one clip while ignoring the other. Together, these tasks cultivate both holistic, divergent understanding and precise, convergent reasoning. VidBridge-R1, the model that instantiates this framework, is presented as the first versatile video reasoning model to effectively reconcile this paradigm conflict. Extensive experiments show that VidBridge-R1 achieves significant performance gains on both QA and captioning within a single model, demonstrating the effectiveness of the approach in fostering more generalizable and robust video understanding models.
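To make the two proxy tasks concrete, below is a minimal sketch (not the authors' code) of how such training samples might be constructed. The frame representation, the `MASK_FRAME` placeholder, the frame-level interleaving scheme, and the prompt wording are all illustrative assumptions, not details from the paper.

```python
# Illustrative sketch of DarkEventInfer- and MixVidQA-style sample construction.
# Frames are represented as opaque items (e.g., paths or tensors); all names
# and parameters here are hypothetical.
import random

MASK_FRAME = "<MASKED_EVENT>"  # hypothetical placeholder for a masked segment


def make_dark_event_infer(frames, mask_ratio=0.3):
    """DarkEventInfer-style sample: mask a contiguous event segment and
    ask the model to infer its content from the surrounding context."""
    n = len(frames)
    span = max(1, int(n * mask_ratio))
    start = random.randint(0, n - span)
    masked = frames[:start] + [MASK_FRAME] * span + frames[start + span:]
    prompt = "Describe what happens in the masked portion of the video."
    return {"frames": masked, "prompt": prompt,
            "target_span": (start, start + span)}


def make_mix_vid_qa(clip_a, clip_b, question_about_a):
    """MixVidQA-style sample: interleave two distinct clips and ask a
    question about only one, so the model must isolate the relevant clip."""
    mixed = []
    for fa, fb in zip(clip_a, clip_b):  # simple alternating interleaving
        mixed.extend([fa, fb])
    prompt = f"Considering only the first video: {question_about_a}"
    return {"frames": mixed, "prompt": prompt}
```

In this reading, the first task rewards filling in missing content from context (divergent, captioning-like understanding), while the second rewards suppressing distractors to answer precisely (convergent, QA-like reasoning).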

Takeaways, Limitations

Takeaways:
  • Achieves significant performance improvements on both QA and captioning tasks within a single model.
  • Contributes to the generalization and robustness of video understanding models.
  • Presents a new learning framework that resolves the paradigm conflict between QA and captioning.
  • Improves the model's understanding through two new proxy tasks, DarkEventInfer and MixVidQA.
Limitations:
  • The paper itself does not discuss its limitations.