This page curates AI-related papers published worldwide. All content is summarized using Google Gemini, and the site is operated on a non-profit basis. Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.
LVBench: An Extreme Long Video Understanding Benchmark
Created by
Haebom
Authors
Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Xiaotao Gu, Shiyu Huang, Bin Xu, Yuxiao Dong, Ming Ding, Jie Tang
Outline
This paper points out that existing multimodal large language models and evaluation datasets focus on short video understanding (under one minute) and therefore cannot meet the needs of real-world applications that require long video understanding, such as embodied intelligence for long-term decision-making, in-depth movie reviews and discussions, and real-time sports commentary. To address this, the authors propose LVBench, a new benchmark for long video understanding. LVBench consists of diverse publicly available videos and tasks targeting long video understanding and information extraction, designed to evaluate the long-term memory and extended comprehension capabilities of multimodal models. Experiments show that current multimodal models still underperform on these challenging long video understanding tasks. LVBench is intended to stimulate the development of more advanced models that can handle the complexities of long video understanding, and its data and code are publicly available.
Takeaways, Limitations
• Takeaways: The paper presents LVBench, a new benchmark for long video understanding, which clarifies the limitations of existing models and suggests future research directions. The publicly available dataset and code can accelerate the development of multimodal models, providing a crucial foundation for the long video understanding technologies required by real-world applications.
• Limitations: LVBench is still in its early stages and needs to incorporate more diverse types of long videos and tasks. An in-depth analysis of why current models perform poorly is lacking. The dataset also needs to be expanded to cover diverse linguistic and cultural backgrounds.