Daily Arxiv

This page collects artificial intelligence papers published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

Scaling RL to Long Videos

Created by
  • Haebom

Authors

Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, Sifei Liu, Hongxu Yin, Yao Lu, Song Han

LongVILA-R1-7B: A Full-Stack Framework for Long Video Reasoning

Outline

This paper introduces a full-stack framework that uses reinforcement learning to scale the reasoning capabilities of vision-language models (VLMs) to long videos. The framework integrates three components:
  • LongVideo-Reason, a large-scale dataset of 104,000 long-video QA pairs with high-quality reasoning annotations across diverse domains, including sports, games, and vlogs.
  • A two-stage training pipeline that extends VLMs with chain-of-thought supervised fine-tuning (CoT-SFT) followed by reinforcement learning (RL).
  • Multi-modal Reinforcement Sequence Parallelism (MR-SP), a training infrastructure for long-video RL that combines sequence parallelism with a vLLM-based engine tailored to long videos, and caches video embeddings for efficient rollout and prefilling (see the sketch below).
LongVILA-R1-7B achieves strong performance on video benchmarks, reaching 65.1% accuracy without subtitles and 71.1% with subtitles on VideoMME, and consistently outperforms LongVILA-7B across multiple benchmarks. It supports up to 8,192 video frames per video with configurable FPS settings. The MR-SP system delivers up to a 2.1x speedup in long-video RL training. Finally, the authors release an open training system that supports RL on multiple modalities (video, text, audio), multiple models (the VILA and Qwen series), and even image and video generation models; on a single A100 node (8 GPUs), it supports RL training on hour-long videos (e.g., 3,600 frames).
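To make the cached-embedding idea concrete, here is a minimal, self-contained sketch of how video embeddings can be computed once and reused across repeated RL rollouts, which is the efficiency trick the summary attributes to MR-SP's rollout and prefilling stage. The FrameEncoder class, the embedding dimension, and the cache interface are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    """Stand-in vision tower: maps raw frames to per-frame embeddings.
    (Illustrative only; the real system uses a full vision encoder.)"""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(3 * 32 * 32, dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (T, 3, 32, 32) -> (T, dim)
        return self.proj(frames.flatten(1))

class CachedVideoEmbeddings:
    """Encode each video once and reuse the embeddings across rollouts,
    so repeated RL sampling does not re-run the vision tower."""
    def __init__(self, encoder: nn.Module):
        self.encoder = encoder
        self.cache: dict[str, torch.Tensor] = {}

    @torch.no_grad()
    def get(self, video_id: str, frames: torch.Tensor) -> torch.Tensor:
        if video_id not in self.cache:
            self.cache[video_id] = self.encoder(frames)
        return self.cache[video_id]

if __name__ == "__main__":
    store = CachedVideoEmbeddings(FrameEncoder())
    # An hour-long video at 1 FPS, downscaled for this toy example.
    frames = torch.randn(3600, 3, 32, 32)

    # Group-style RL samples several rollouts per prompt; only the first
    # call pays the encoding cost, the remaining calls hit the cache.
    for _ in range(8):
        emb = store.get("video_0", frames)
    print(emb.shape)  # torch.Size([3600, 256])
```

In the actual MR-SP system the long token sequence is additionally sharded across GPUs via sequence parallelism; this sketch only illustrates the caching half of the design.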

Takeaways, Limitations

Takeaways:
  • Presents a large-scale dataset and a reinforcement-learning-based framework for long video reasoning.
  • Demonstrates strong performance across diverse domains.
  • Develops an efficient infrastructure (MR-SP) for long-video RL training.
  • Provides an open training system supporting various models and modalities.
Limitations:
  • Limitations are not discussed in the paper's abstract.