Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated using Google Gemini, and the page is operated on a non-profit basis.
Copyright in each paper belongs to its authors and their institutions; when sharing, please cite the source.

Think With Videos For Agentic Long-Video Understanding

Created by
  • Haebom

Author

Huaying Yuan, Zheng Liu, Junjie Zhou, Hongjin Qian, Yan Shu, Nicu Sebe, Ji-Rong Wen, Zhicheng Dou

VideoExplorer: Long Video Understanding with Iterative Reasoning

Outline

This paper presents VideoExplorer, a framework proposed to address long video understanding (LVU), a challenging problem in computer vision. Grounded in the principle of "thinking with videos," VideoExplorer naturally intertwines planning, temporal grounding, and scalable perception into a coherent reasoning process. Instead of reasoning over a static context, VideoExplorer iteratively formulates sub-questions, locates relevant moments, and performs task-oriented, temporally scalable video understanding until it reaches a final answer, enabling accurate, efficient, and interpretable reasoning. Furthermore, to address the scarcity of resources for LVU training, the authors construct a long-video reasoning dataset using difficulty-adaptive sampling to ensure high-quality trajectories for complex tasks. On this dataset, they design a two-stage training pipeline: supervised trajectory initialization followed by trajectory-level preference optimization. This pipeline encourages adaptive temporal grounding and iterative information integration, guided by downstream rewards.
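The iterative loop described above (plan a sub-question, ground it temporally, perceive the grounded clips, check whether the question is now answerable) can be sketched as follows. This is a minimal illustration of the idea, not the paper's implementation; all callables are hypothetical placeholders supplied by the caller.

```python
from typing import Any, Callable, List, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec) of a grounded video moment

def video_explorer(
    video: Any,
    question: str,
    plan: Callable[[str, List[str]], str],            # (question, evidence) -> next sub-question
    ground: Callable[[Any, str], List[Span]],         # (video, sub-question) -> relevant spans
    perceive: Callable[[Any, List[Span], str], str],  # (video, spans, sub-question) -> finding
    answerable: Callable[[str, List[str]], bool],     # is the evidence sufficient to answer?
    answer: Callable[[str, List[str]], str],          # (question, evidence) -> final answer
    max_steps: int = 8,
) -> str:
    """Iterative 'thinking with videos' loop: plan -> ground -> perceive, until answerable."""
    evidence: List[str] = []
    for _ in range(max_steps):
        sub_q = plan(question, evidence)                # planning
        spans = ground(video, sub_q)                    # temporal grounding
        evidence.append(perceive(video, spans, sub_q))  # task-oriented perception
        if answerable(question, evidence):              # stop once the evidence suffices
            break
    return answer(question, evidence)
```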

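The summary does not specify the exact trajectory-level preference objective used in the second training stage. One common instantiation is a DPO-style loss computed over whole reasoning trajectories rather than single responses; the sketch below assumes summed trajectory log-probabilities under the trained policy and a frozen reference model, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def trajectory_preference_loss(
    policy_logp_chosen: torch.Tensor,    # summed log-prob of preferred trajectories (policy)
    policy_logp_rejected: torch.Tensor,  # summed log-prob of dispreferred trajectories (policy)
    ref_logp_chosen: torch.Tensor,       # same trajectories under a frozen reference model
    ref_logp_rejected: torch.Tensor,
    beta: float = 0.1,                   # strength of the implicit KL constraint
) -> torch.Tensor:
    """DPO-style preference loss applied at the trajectory level (illustrative sketch)."""
    # Log-ratios of policy to reference for each trajectory.
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    # Widen the margin between preferred and dispreferred trajectories, where
    # preference can be decided by a downstream reward (e.g., answer correctness).
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```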
Takeaways, Limitations

Proposes VideoExplorer, a new framework for the long video understanding (LVU) problem
Adopts an iterative reasoning approach based on the principle of "thinking with videos"
Builds a new dataset and designs a two-stage training pipeline for LVU training
Demonstrates superior performance on existing LVU benchmarks
No specific Limitations are mentioned in the paper