Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Audio-centric Video Understanding Benchmark without Text Shortcut

Created by
  • Haebom

Authors

Yudong Yang, Jimin Zhuang, Guangzhi Sun, Changli Tang, Yixuan Li, Peihan Li, Yifan Jiang, Wei Li, Zejun Ma, Chao Zhang

Outline

This paper proposes the Audio-centric Video Understanding Benchmark (AVUT), a video understanding benchmark that focuses on audio information. Moving beyond existing visual-centric approaches, it emphasizes the context, emotional cues, and semantic information carried by audio as crucial elements of video understanding. AVUT comprises a variety of tasks that comprehensively assess the understanding of audio content and of audiovisual interactions. It also proposes an answer permutation-based filtering mechanism to address the "text shortcut problem" found in existing benchmarks, where answers can be inferred from the question text alone. The paper evaluates various open-source and proprietary multimodal LLMs and analyzes their shortcomings. Demos and data are available at https://github.com/lark-png/AVUT.
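
The summary does not spell out the exact filtering procedure, but one plausible reading of an answer permutation-based filter is: query a text-only model with the question and every ordering of the answer options, and discard any question it answers correctly under all orderings, since such a question can be solved without the audio or video. A minimal sketch under that assumption follows; `has_text_shortcut`, `ask_text_only_model`, and the question fields are hypothetical names, not the authors' API.

```python
from itertools import permutations

def has_text_shortcut(question: str, options: list[str], correct: str,
                      ask_text_only_model) -> bool:
    """Return True if a text-only model picks the correct answer under
    every ordering of the options, i.e. the question is likely answerable
    from text alone and should be filtered out of the benchmark."""
    for ordered in permutations(options):
        labels = [chr(ord("A") + i) for i in range(len(ordered))]
        prompt = question + "\n" + "\n".join(
            f"{lab}. {opt}" for lab, opt in zip(labels, ordered)
        )
        picked_label = ask_text_only_model(prompt)          # e.g. "B"
        picked_option = dict(zip(labels, ordered)).get(picked_label)
        if picked_option != correct:
            return False  # fails on at least one ordering: no shortcut detected
    return True  # correct under every permutation: likely a text shortcut

# Keep only questions that cannot be solved from the question text alone:
# filtered = [q for q in questions
#             if not has_text_shortcut(q.text, q.options, q.answer,
#                                      ask_text_only_model)]
```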

Takeaways, Limitations

Takeaways:
Introduces AVUT, a new video understanding benchmark that emphasizes the importance of audio information.
Proposes an answer permutation-based filtering mechanism to address the "text shortcut problem" of existing benchmarks.
Provides a comprehensive assessment and analysis of audio-visual comprehension across a variety of multimodal LLMs.
Points to a new direction for audio-centric video understanding research.
Limitations:
Further research is needed on the universality and scalability of the AVUT benchmark.
Further validation is needed on the effectiveness and generalizability of the proposed answer permutation-based filtering mechanism.
The evaluation covers a limited range and diversity of multimodal LLMs.