Daily Arxiv

This page curates AI-related papers from around the world.
All content is summarized with Google Gemini and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

IPFormer-VideoLLM: Enhancing Multi-modal Video Understanding for Multi-shot Scenes

Created by
  • Haebom

Authors

Yujia Liang, Jile Jiao, Xuetao Feng, Zixuan Ye, Yuan Wang, Zhicheng Wang

Outline

In this paper, we present MultiClip-Bench, a new dataset featuring dense descriptions and instruction-based question-answer pairs tailored to multi-shot scenarios (video clips containing different camera angles or scene changes), to address the difficulties existing Video Large Language Models (VideoLLMs) face in such settings. We analyze how existing models encode object information incompletely across shots, and propose a new model, IPFormer-VideoLLM, which injects object-level features as instance prompts through an efficient attention-based concatenation. Experiments show that the proposed dataset and model significantly improve multi-scene video understanding and deliver clear advantages on various video benchmarks.
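
To make the idea concrete, below is a minimal sketch of injecting object-level features as instance prompts via cross-attention followed by concatenation. The module name, shapes, and fusion details are assumptions for illustration only, not the paper's actual IPFormer-VideoLLM implementation.

```python
# Hypothetical sketch (PyTorch): fuse per-object instance features into video
# tokens with cross-attention, then append them as explicit prompt tokens.
# All names and shapes are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn


class InstancePromptInjector(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # Cross-attention: video tokens (queries) attend to object-level features.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_tokens: torch.Tensor, instance_feats: torch.Tensor) -> torch.Tensor:
        # video_tokens:   (B, N_video, dim) frame/patch tokens from the vision encoder
        # instance_feats: (B, N_obj,   dim) per-object features (e.g., pooled detections)
        attended, _ = self.cross_attn(
            query=video_tokens, key=instance_feats, value=instance_feats
        )
        # Residual fusion keeps the original video tokens while adding object cues.
        fused = self.norm(video_tokens + attended)
        # Concatenate the instance prompts so the LLM also sees them as explicit tokens.
        return torch.cat([fused, instance_feats], dim=1)


# Toy usage with made-up shapes
B, N_video, N_obj, dim = 2, 256, 8, 1024
injector = InstancePromptInjector(dim)
tokens = injector(torch.randn(B, N_video, dim), torch.randn(B, N_obj, dim))
print(tokens.shape)  # torch.Size([2, 264, 1024])
```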

Takeaways, Limitations

Takeaways:
  • Introduces MultiClip-Bench, a new dataset for multi-shot video understanding.
  • Proposes IPFormer-VideoLLM, a new model that addresses the loss of object information.
  • Improves multi-scene video understanding and demonstrates superior results on various benchmarks.
Limitations:
  • Additional consideration is needed regarding the size and diversity of the MultiClip-Bench dataset.
  • Further analysis of the computational cost and efficiency of IPFormer-VideoLLM is needed.
  • Further research is needed on the generalization ability of the proposed model.