Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

FrameMind: Frame-Interleaved Video Reasoning via Reinforcement Learning

Created by
  • Haebom

Authors

Haonan Ge, Yiwei Wang, Kai-Wei Chang, Hang Wu, Yujun Cai

FrameMind: Dynamic Video Understanding with Frame-Interleaved Reasoning

Outline

This paper introduces FrameMind, a framework trained with reinforcement learning that dynamically requests visual information during reasoning, overcoming the limitation of existing video understanding models that rely on fixed frame sampling strategies. FrameMind alternates between textual reasoning and active visual perception via Frame-Interleaved Chain-of-Thought (FiCOT), and its sampling policy is trained with Dynamic Resolution Frame Sampling (DRFS) and the DRFS-GRPO algorithm. The method outperforms existing models on benchmarks such as MLVU and VideoMME.
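To make the FiCOT idea concrete, here is a minimal sketch of a frame-interleaved reasoning loop. The interface names (`model.generate`, `sample_frames`, `step.frame_request`) are hypothetical placeholders, not the paper's actual API; this only illustrates the alternation between text reasoning and on-demand frame requests.

```python
# Hypothetical sketch of a Frame-Interleaved Chain-of-Thought (FiCOT) loop.
# The model alternates between emitting reasoning text and requesting frames.
# `model`, `sample_frames`, and the frame_request fields are illustrative
# assumptions, not the paper's actual interface.

def ficot_answer(model, video, question, max_rounds=4):
    context = [{"role": "user", "content": question}]
    for _ in range(max_rounds):
        step = model.generate(context)                # reasoning text, possibly ending in a frame request
        context.append({"role": "assistant", "content": step.text})
        if step.frame_request is None:                # model is confident enough to answer
            return step.text
        # Dynamic Resolution Frame Sampling (DRFS): fetch only the frames the
        # model asked for, over the requested window and at the requested resolution.
        frames = sample_frames(
            video,
            start=step.frame_request.start,
            end=step.frame_request.end,
            num_frames=step.frame_request.num_frames,
            resolution=step.frame_request.resolution,
        )
        context.append({"role": "user", "content": frames})
    return model.generate(context).text               # force a final answer after the last round
```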

Takeaways, Limitations

Takeaways:
Enhancing the flexibility and efficiency of video understanding models through dynamic visual information requests.
Improving the interaction between textual reasoning and visual perception through the FiCOT method.
Training effective dynamic sampling policies using DRFS and DRFS-GRPO (see the sketch after this list).
Achieving state-of-the-art results on the MLVU and VideoMME benchmarks.
Limitations:
Added complexity and computational cost of DRFS and DRFS-GRPO training.
Uncertain generalizability of FiCOT and dynamic sampling beyond the evaluated settings.
Performance evaluation on other types of video understanding tasks is still needed.
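On the training side, DRFS-GRPO appears to build on group-relative policy optimization. Below is a minimal sketch of the group-relative advantage computation used by standard GRPO; the specific reward terms (e.g., answer correctness or frame-sampling cost) and any DRFS-specific modifications are assumptions, not the paper's exact formulation.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages in the style of GRPO: each rollout's reward
    is normalized against the other rollouts sampled for the same question,
    which removes the need for a separate value model."""
    rewards = np.asarray(rewards, dtype=np.float32)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: rewards for 4 rollouts on one video question (hypothetical values,
# e.g., combining answer correctness with a frame-sampling efficiency bonus).
print(grpo_advantages([1.0, 0.0, 1.0, 0.5]))
```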