Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

VSI: Visual Subtitle Integration for Keyframe Selection to enhance Long Video Understanding

Created by
  • Haebom

Authors

Jianxiang He, Meisheng Hong, Jungang Li, Yijie Xu, Ziyang Chen, Weiyu Guo, Hui Xiong

Outline

This paper proposes Visual-Subtitle Integration (VSI), an efficient keyframe retrieval method for long-form video understanding. Existing keyframe retrieval methods suffer from weak multimodal alignment between textual queries and visual content, and they fail to capture complex temporal semantic information. To address these limitations, VSI integrates subtitles, timestamps, and scene boundaries into a unified multimodal retrieval process. It uses two streams, a video retrieval stream and a subtitle matching stream, to exploit both the visual content of video frames and the complementary textual information, and it improves keyframe retrieval accuracy through the interaction of the two streams. On the LongVideoBench dataset, VSI significantly outperforms competing methods in both keyframe localization accuracy and the long-form video question-answering (Video-QA) task, achieving state-of-the-art performance.
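The two-stream idea described above can be illustrated with a small sketch. This is not the authors' implementation: the summary does not specify how the streams interact, so the weighted-sum fusion, the token-overlap matcher, and every name below (Subtitle, token_overlap, subtitle_scores, select_keyframes, alpha) are assumptions made for illustration. The per-frame visual scores are assumed to come from some query-to-frame similarity model that is not shown here.

```python
from dataclasses import dataclass


@dataclass
class Subtitle:
    """A subtitle segment with its time span in seconds (hypothetical structure)."""
    start: float
    end: float
    text: str


def token_overlap(query: str, text: str) -> float:
    """Toy text matcher: fraction of query tokens that also appear in the text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def subtitle_scores(query, subtitles, frame_times):
    """Score each frame by the best-matching subtitle active at its timestamp."""
    scores = []
    for ts in frame_times:
        active = [token_overlap(query, s.text)
                  for s in subtitles if s.start <= ts <= s.end]
        scores.append(max(active, default=0.0))  # 0.0 when no subtitle is active
    return scores


def select_keyframes(visual, textual, k, alpha=0.5):
    """Fuse both streams with a weighted sum (an assumed fusion rule) and
    return the indices of the k best frames in temporal order."""
    if len(visual) != len(textual):
        raise ValueError("both streams must score the same frames")
    fused = [alpha * v + (1 - alpha) * t for v, t in zip(visual, textual)]
    top = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)[:k]
    return sorted(top)


# Made-up example: four sampled frames, one subtitle covering seconds 9 to 16.
frame_times = [0.0, 5.0, 10.0, 15.0]
visual = [0.2, 0.9, 0.4, 0.1]  # assumed output of a visual retrieval model
subs = [Subtitle(9.0, 16.0, "the chef adds salt to the soup")]
textual = subtitle_scores("chef adds salt", subs, frame_times)
print(select_keyframes(visual, textual, k=2))
```

In this toy setup the subtitle stream lifts the frames at 10 s and 15 s above the frame that looks most similar visually, which is the kind of complementary signal the summary attributes to subtitles. With an empty subtitle list the textual scores are all zero and the selection falls back to the visual stream alone.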

Takeaways, Limitations

Takeaways:
Demonstrates the effectiveness of multimodal keyframe retrieval that uses subtitle, timestamp, and scene boundary information.
Presents an efficient and accurate keyframe retrieval method for long-form video understanding.
Achieves state-of-the-art (SOTA) performance on the LongVideoBench dataset.
Verifies the robustness and generalizability of the multimodal retrieval strategy.
Limitations:
Performance was evaluated on a single dataset (LongVideoBench), so further research is needed to assess generalizability.
Further analysis of the computational complexity and efficiency of VSI is needed.
Performance evaluation is needed for various types of long-form videos.
Limited applicability to videos without subtitles.