This is a page that curates AI-related papers published worldwide. All content is summarized using Google Gemini, and the site is operated on a non-profit basis. Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.
This paper proposes the Audio-centric Video Understanding Benchmark (AVUT), a video understanding benchmark that focuses on audio information. Moving beyond existing visual-centric approaches, it treats the context, emotional cues, and semantic information carried by audio as crucial elements of video understanding. AVUT comprises a variety of tasks that comprehensively assess the understanding of audio content and of audiovisual interactions. The paper also proposes an answer permutation-based filtering mechanism to address the "text shortcut problem" found in existing benchmarks, where answers can be inferred from the question text alone. The authors evaluate a range of open-source and proprietary multimodal LLMs and analyze their shortcomings. Demos and data are available at https://github.com/lark-png/AVUT.
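The filtering idea can be sketched roughly as follows: a question is suspected of having a text shortcut if a text-only model picks the correct answer consistently across permutations of the answer choices, without ever seeing the audio. This is a minimal illustration, not the paper's actual implementation; `answer_fn` and the toy `longest_choice` model are hypothetical stand-ins.

```python
import itertools

def permutation_filter(question, choices, correct_idx, answer_fn, max_perms=6):
    """Return True if the question looks like a text shortcut: the text-only
    model answers correctly under every tested permutation of the choices."""
    correct = choices[correct_idx]
    for perm in itertools.islice(itertools.permutations(choices), max_perms):
        picked = answer_fn(question, list(perm))  # index into the permuted list
        if perm[picked] != correct:
            return False  # model fails at least once: no reliable shortcut
    return True  # always right from text alone: filter this question out

# Toy text-only "model" (hypothetical): always picks the longest option,
# a common surface heuristic that shortcut questions can reward.
def longest_choice(question, choices):
    return max(range(len(choices)), key=lambda i: len(choices[i]))

q = "What instrument plays the melody?"
choices = ["piano", "a full symphony orchestra", "drum", "flute"]
print(permutation_filter(q, choices, correct_idx=1, answer_fn=longest_choice))
```

Here the toy model's bias happens to match the correct answer under every permutation, so the question would be flagged and removed; a question the model gets wrong under any permutation is kept.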