Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; please cite the source when sharing.

ReWatch-R1: Boosting Complex Video Reasoning in Large Vision-Language Models through Agentic Data Synthesis

Created by
  • Haebom

Authors

Congzhi Zhang, Zhibin Wang, Yinchao Ma, Jiawei Peng, Yihan Wang, Qiang Zhou, Jun Song, Bo Zheng

ReWatch: A New Dataset and RLVR Framework for Video Reasoning

Outline

This paper studies how to advance complex video reasoning with Large Vision-Language Models (LVLMs). To overcome the limitations of existing datasets, we propose ReWatch, a large-scale dataset with three components: ReWatch-Caption, ReWatch-QA, and ReWatch-CoT, whose video-grounded reasoning traces are synthesized with a Multi-Agent ReAct framework. On top of this data, we develop ReWatch-R1 through Supervised Fine-Tuning (SFT) followed by Reinforcement Learning with Verifiable Rewards (RLVR), using an Observation & Reasoning (O&R) reward mechanism that scores both the correctness of the final answer and the consistency of the reasoning with the observed video content. Experimental results show that ReWatch-R1 achieves state-of-the-art performance on five video reasoning benchmarks.
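The O&R reward described above combines two signals: whether the final answer matches a verifiable ground truth, and whether the model's stated observations are grounded in the video. The sketch below shows one way such a composite reward could be computed; the function names, the token-overlap grounding proxy, and the mixing weight `alpha` are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of an Observation & Reasoning (O&R) style reward:
# combine (a) verifiable final-answer correctness with (b) how well the model's
# stated observations are grounded in a reference video description.

def exact_match_reward(predicted_answer: str, gold_answer: str) -> float:
    """1.0 if the predicted answer matches the verifiable gold answer, else 0.0."""
    return float(predicted_answer.strip().lower() == gold_answer.strip().lower())

def observation_consistency(observations: str, reference_caption: str) -> float:
    """Crude token-overlap proxy (placeholder metric) for how well the model's
    observations are supported by the reference video caption."""
    obs_tokens = set(observations.lower().split())
    ref_tokens = set(reference_caption.lower().split())
    if not obs_tokens:
        return 0.0
    return len(obs_tokens & ref_tokens) / len(obs_tokens)

def o_and_r_reward(predicted_answer: str, gold_answer: str,
                   observations: str, reference_caption: str,
                   alpha: float = 0.5) -> float:
    """Weighted mix of answer accuracy and observation grounding;
    alpha is an assumed mixing weight, not a value reported in the paper."""
    accuracy = exact_match_reward(predicted_answer, gold_answer)
    grounding = observation_consistency(observations, reference_caption)
    return alpha * accuracy + (1.0 - alpha) * grounding
```

Rewarding grounded observations alongside the final answer is what lets this kind of objective push back against hallucinated reasoning, rather than only rewarding a lucky correct answer.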

Takeaways, Limitations

Takeaways:
We develop ReWatch, a new large-scale dataset for complex video reasoning, and use it to improve the performance of RLVR-trained models.
We synthesize video-grounded CoT data that mimics human-like reasoning using a Multi-Agent ReAct framework (a minimal sketch of such an agent loop appears after this section).
By introducing the O&R reward mechanism, we directly penalize hallucination and improve the model's answer accuracy.
We demonstrate the effectiveness of the proposed methodology with state-of-the-art performance on five video reasoning benchmarks.
Limitations:
The paper does not explicitly discuss its limitations.
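As a rough illustration of the agentic data-synthesis idea referenced above, the sketch below shows a ReAct-style loop that interleaves reasoning steps with observation actions against a video tool. The callables `propose_next_step` and `ask_video_tool` are hypothetical placeholders standing in for an LLM policy and a video captioning/retrieval tool; the paper's actual multi-agent pipeline is more elaborate than this single loop.

```python
# Minimal ReAct-style loop for synthesizing a video-grounded reasoning trace.
# propose_next_step and ask_video_tool are assumed interfaces, not the paper's API.
from typing import Callable

def synthesize_trace(question: str,
                     propose_next_step: Callable[[str], dict],
                     ask_video_tool: Callable[[str], str],
                     max_steps: int = 6) -> str:
    """Build a chain-of-thought trace by alternating thoughts, actions, and observations."""
    trace = f"Question: {question}\n"
    for _ in range(max_steps):
        step = propose_next_step(trace)  # LLM decides: think more, query the video, or answer
        if step["type"] == "answer":
            trace += f"Answer: {step['content']}\n"
            break
        if step["type"] == "action":
            observation = ask_video_tool(step["content"])  # grounded lookup in the video
            trace += f"Action: {step['content']}\nObservation: {observation}\n"
        else:  # plain reasoning step
            trace += f"Thought: {step['content']}\n"
    return trace
```

Because every observation in the trace comes from a tool call against the video rather than from the language model alone, traces produced this way stay verifiably grounded, which is the property the O&R reward later exploits during RLVR training.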