Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
This page is summarized using Google Gemini and is operated on a non-profit basis.
The copyright of each paper belongs to its authors and their institutions; please cite the source when sharing.

FrameMind: Frame-Interleaved Video Reasoning via Reinforcement Learning

Created by
  • Haebom

Author

Haonan Ge, Yiwei Wang, Kai-Wei Chang, Hang Wu, Yujun Cai

Outline

FrameMind is an end-to-end reinforcement learning framework built on Frame-Interleaved Chain-of-Thought (FiCOT), developed to overcome the limitations of existing video understanding models that rely on fixed frame sampling strategies. During inference, the model alternates between textual reasoning and active visual perception, using tools to request specific frames or short clips whenever it identifies a knowledge gap. Its dynamic sampling policy, trained with Dynamic Resolution Frame Sampling (DRFS) and the DRFS-GRPO algorithm, learns to balance spatial and temporal resolution trade-offs from outcome-based rewards alone, without frame-level annotations. FrameMind demonstrates superior performance over existing models on benchmarks such as MLVU and VideoMME.
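The outcome-only reward idea can be sketched with a group-relative advantage computation in the spirit of GRPO: several rollouts answer the same video question, and each rollout's advantage is its reward measured against the group, so no frame-level labels are needed. This is a minimal illustration, not the paper's actual DRFS-GRPO implementation; the function name and reward values below are assumptions.

```python
# Hedged sketch of group-relative advantages from outcome-only rewards,
# in the spirit of GRPO. Illustrative only; not the paper's DRFS-GRPO code.
from statistics import mean, stdev

def group_relative_advantages(rewards):
    """Normalize a group of outcome rewards so each rollout's advantage
    is measured relative to its peers in the same group."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    if sigma == 0.0:
        return [0.0 for _ in rewards]  # all rollouts tied: no learning signal
    return [(r - mu) / sigma for r in rewards]

# Example: 4 rollouts of one video question; only two answered correctly,
# so the correct rollouts get positive advantages and the others negative.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Because the reward is computed only from the final answer, the policy can learn which frame requests were useful purely from whether the rollout succeeded.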

Takeaways, Limitations

Takeaways:
Improved video understanding by letting the model dynamically request visual information via reinforcement learning.
Interleaving of textual reasoning and visual perception through FiCOT.
Effective dynamic sampling policies learned with DRFS and DRFS-GRPO, without frame-level annotations.
State-of-the-art results on the MLVU and VideoMME benchmarks.
Limitations:
Further research is needed on identifying specific knowledge gaps and selecting the appropriate frames to fill them.
The generalization ability of DRFS and DRFS-GRPO requires further evaluation.
Scalability to more complex video data remains to be verified.
High computational cost and long training time.