Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

What's Making That Sound Right Now? Video-centric Audio-Visual Localization

Created by
  • Haebom

Authors

Hahyeon Choi, Junhoo Lee, Nojun Kwak

Outline

This paper identifies two shortcomings of previous audio-visual localization (AVL) studies, namely neglect of temporal dynamics and oversimplified scenario settings, and proposes AVATAR, a new video-centric AVL benchmark that incorporates high-resolution temporal information to address them. AVATAR covers four scenarios (single sound, mixed sounds, multiple objects, and off-screen) to enable a more comprehensive evaluation of AVL models. The paper also presents TAVLO, a new video-centric AVL model that explicitly incorporates temporal information. Experimental results show that TAVLO achieves robust and accurate audio-visual alignment by leveraging high-resolution temporal modeling, whereas previous methods struggle to track temporal changes because they rely on global audio features and frame-by-frame mapping. These results provide empirical evidence for the importance of temporal dynamics in AVL and present a new standard for video-centric AVL.
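The contrast between a global audio feature and time-aligned audio features can be illustrated with a small toy example. The sketch below is not TAVLO's architecture or AVATAR's evaluation code; the function names (`localize_global`, `localize_time_aligned`, `toy_clip`) and the hand-built features are assumptions made purely for illustration. It compares localization maps computed from one clip-level audio vector with maps computed from one audio vector per frame.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors to (approximately) unit length; zero vectors stay zero.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def similarity_maps(visual, audio):
    # visual: (T, H, W, D) per-frame visual features
    # audio:  (T, D) one audio feature per frame
    # returns (T, H, W) cosine similarity between each location and the audio
    v = l2_normalize(visual)
    a = l2_normalize(audio)
    return np.einsum("thwd,td->thw", v, a)

def localize_global(visual, audio_segments):
    # Hypothetical baseline: pool the audio over the whole clip and
    # reuse that single vector for every frame.
    num_frames = visual.shape[0]
    global_audio = audio_segments.mean(axis=0, keepdims=True)
    return similarity_maps(visual, np.repeat(global_audio, num_frames, axis=0))

def localize_time_aligned(visual, audio_segments):
    # Hypothetical alternative: match each frame with the audio feature
    # taken from the same moment in time.
    return similarity_maps(visual, audio_segments)

def peak_locations(maps):
    # Return the (row, col) of the highest-scoring location in each frame.
    num_frames, height, width = maps.shape
    flat = maps.reshape(num_frames, -1).argmax(axis=1)
    return [(int(i // width), int(i % width)) for i in flat]

def toy_clip():
    # Two frames on a 2x2 grid with 2-D features.
    # Object A sits at (0, 0), object B at (1, 1); both are always visible.
    visual = np.zeros((2, 2, 2, 2))
    visual[:, 0, 0] = [1.0, 0.0]  # object A
    visual[:, 1, 1] = [0.0, 1.0]  # object B
    # A makes sound in frame 0, B makes sound in frame 1.
    audio = np.array([[1.0, 0.0], [0.0, 1.0]])
    return visual, audio

if __name__ == "__main__":
    visual, audio = toy_clip()
    print(peak_locations(localize_time_aligned(visual, audio)))
    print(localize_global(visual, audio))
```

In this toy clip the time-aligned maps peak on object A in the first frame and on object B in the second, while the pooled audio vector scores both objects equally in every frame, so it cannot tell which one is sounding at a given moment. Real AVL models learn their features rather than using hand-built ones, so this only illustrates the general idea.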

Takeaways, Limitations

Takeaways:
Sets a new direction for AVL research by introducing a video-centric AVL benchmark (AVATAR) and model (TAVLO) that use high-resolution temporal information.
Addresses a limitation of existing AVL models: their lack of consideration for temporal dynamics.
Enables comprehensive evaluation across a variety of scenarios (single sound, mixed sounds, multiple objects, off-screen).
Achieves more accurate and robust audio-visual alignment by integrating temporal information.
Limitations:
Further validation of the AVATAR benchmark and the generalization performance of the TAVLO model is needed.
The proposed approach may not fully reflect the complexity of real-world environments.
Further analysis of the computational complexity and efficiency of the TAVLO model is needed.