Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Scaling RL to Long Videos

Created by
  • Haebom

Authors

Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, Sifei Liu, Hongxu Yin, Yao Lu, Song Han

Outline

In this paper, we present a full-stack framework that scales up the reasoning capabilities of vision-language models (VLMs) to long videos using reinforcement learning. We introduce LongVideo-Reason, a large-scale dataset of 52,000 long-video question-answer pairs, and build a two-stage training pipeline that combines chain-of-thought supervised fine-tuning (CoT-SFT) with reinforcement learning (RL). We also develop Multi-modal Reinforcement Sequence Parallelism (MR-SP), a training infrastructure for long-video RL that uses cached video embeddings for efficient rollout and prefilling. Experiments show that LongVILA-R1-7B performs strongly on long-video QA benchmarks such as VideoMME, outperforming Video-R1-7B and achieving performance comparable to Gemini-1.5-Pro in temporal reasoning, goal and purpose reasoning, spatial reasoning, and plot reasoning on the LongVideo-Reason-eval benchmark. In addition, the MR-SP system speeds up long-video RL training by up to 2.1x, and LongVILA-R1 shows consistent performance gains as the number of input video frames increases. Finally, we release a training system for RL that supports various modalities (video, text, audio), various models (the VILA and Qwen series), and image and video generation models.
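To make the MR-SP idea concrete, here is a minimal Python sketch of the cached-embedding rollout pattern described above: the video encoder runs once per video, the embeddings are cached, and repeated RL rollouts over the same long video reuse the cache for prefilling. All names here (ToyVideoEncoder, EmbeddingCache, ToyPolicyLM, rollout) are hypothetical stand-ins, not the paper's actual code or API; the real system additionally shards long sequences across GPUs via sequence parallelism, which is omitted in this sketch.

```python
# Illustrative sketch only: encode a long video once, cache the embeddings,
# and reuse them across multiple RL rollouts instead of re-running the encoder.
from dataclasses import dataclass, field
from typing import Dict, List

import numpy as np


class ToyVideoEncoder:
    """Stand-in for a vision tower that turns raw frames into embeddings."""

    def __init__(self, dim: int = 64):
        self.dim = dim

    def encode(self, frames: np.ndarray) -> np.ndarray:
        # One embedding vector per frame (a random projection as a placeholder).
        rng = np.random.default_rng(0)
        proj = rng.standard_normal((frames.shape[-1], self.dim))
        return frames @ proj


@dataclass
class EmbeddingCache:
    """Caches per-video embeddings so rollouts never re-run the encoder."""

    encoder: ToyVideoEncoder
    _cache: Dict[str, np.ndarray] = field(default_factory=dict)

    def get(self, video_id: str, frames: np.ndarray) -> np.ndarray:
        if video_id not in self._cache:
            self._cache[video_id] = self.encoder.encode(frames)  # encode once
        return self._cache[video_id]


class ToyPolicyLM:
    """Stand-in for the language model being trained with RL."""

    def generate(self, video_emb: np.ndarray, question: str) -> str:
        # A real model would prefill on the cached embeddings and decode an answer.
        return f"answer-to({question})-using-{video_emb.shape[0]}-frames"


def rollout(policy: ToyPolicyLM, cache: EmbeddingCache,
            video_id: str, frames: np.ndarray,
            questions: List[str]) -> List[str]:
    emb = cache.get(video_id, frames)  # cheap after the first call for this video
    return [policy.generate(emb, q) for q in questions]


if __name__ == "__main__":
    frames = np.random.rand(512, 128)  # 512 frames, 128-dim raw features
    cache = EmbeddingCache(ToyVideoEncoder())
    policy = ToyPolicyLM()
    # Multiple rollouts over the same long video reuse one set of embeddings.
    for step in range(3):
        print(rollout(policy, cache, "video_0", frames, ["What happens at the end?"]))
```

The design point the sketch illustrates is simply that the expensive per-frame encoding is hoisted out of the RL loop, which is where the reported rollout and prefilling savings come from.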

Takeaways, Limitations

Takeaways:
Presents a novel full-stack framework that significantly improves the reasoning capability of VLMs on long videos.
Releases LongVideo-Reason, a large-scale long-video QA dataset.
Presents an effective two-stage training pipeline combining CoT-SFT and RL.
Develops and releases MR-SP, an efficient infrastructure for long-video RL training (up to 2.1x speedup).
Achieves superior performance over existing models (VideoMME, LongVideo-Reason-eval).
Releases an RL training system supporting various modalities and models.
Limitations:
The diversity and scale of the LongVideo-Reason dataset need further validation.
The scalability of the MR-SP system and its performance in different hardware environments need to be evaluated.
Further research is needed on the computational cost and training time of RL-based approaches.
Further research is needed on possible over-optimization to specific benchmarks and on generalization performance.