Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
The summaries on this page are generated using Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions. When sharing, please cite the source.

What's Making That Sound Right Now? Video-centric Audio-Visual Localization

Created by
  • Haebom

Authors

Hahyeon Choi, Junhoo Lee, Nojun Kwak

Outline

AVATAR is a video-centric audio-visual localization (AVL) benchmark that incorporates high-resolution temporal information. It is proposed to overcome limitations of previous studies, which focus only on image-level audio-visual associations, fail to capture temporal dynamics, and assume simplified scenarios in which the sound source is always visible and involves only a single object. AVATAR introduces four scenarios (single sound, mixed sound, multiple objects, and off-screen), allowing a more comprehensive evaluation of AVL models. The authors also present TAVLO, a new video-centric AVL model that explicitly incorporates temporal information. Experimental results show that previous methods struggle to track temporal changes because they rely on global audio features and frame-level mappings, whereas TAVLO achieves robust and accurate audio-visual alignment by leveraging high-resolution temporal modeling. The study empirically demonstrates the importance of temporal dynamics in AVL and sets a new standard for video-centric audio-visual localization.
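To make the contrast between a global audio feature and time-aligned audio features concrete, the following is a minimal toy sketch. It is not TAVLO or any method from the paper: the feature vectors, the dot-product scoring, and the function names (`localize_global`, `localize_time_aligned`) are all assumptions chosen for illustration. The clip has three frames in which sound A plays for the first two frames and sound B for the last one, and each frame contains two image patches, one resembling each source.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product used here as a stand-in audio-visual similarity score."""
    return sum(x * y for x, y in zip(a, b))


def mean_pool(frames: List[Vector]) -> Vector:
    """Average per-frame audio features into one clip-level (global) feature."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def best_patch(patches: List[Vector], audio: Vector) -> int:
    """Index of the image patch most similar to the given audio feature."""
    scores = [dot(patch, audio) for patch in patches]
    return scores.index(max(scores))


def localize_global(visual: List[List[Vector]], audio: List[Vector]) -> List[int]:
    """Match every frame against the same clip-level audio feature."""
    global_audio = mean_pool(audio)
    return [best_patch(patches, global_audio) for patches in visual]


def localize_time_aligned(visual: List[List[Vector]], audio: List[Vector]) -> List[int]:
    """Match each frame against the audio feature from the same time step."""
    return [best_patch(patches, a) for patches, a in zip(visual, audio)]


# Hypothetical toy data: 2-D features, sound A = [1, 0], sound B = [0, 1].
audio = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # A, A, then B
patches = [[1.0, 0.0], [0.0, 1.0]]             # patch 0 looks like A, patch 1 like B
visual = [patches, patches, patches]           # same two patches in every frame

print(localize_global(visual, audio))          # [0, 0, 0]
print(localize_time_aligned(visual, audio))    # [0, 0, 1]
```

In this toy setting the global feature is dominated by sound A, so the switch to sound B in the last frame is missed, while the time-aligned variant follows it. Real AVL models use learned embeddings and dense localization maps rather than a single patch index.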

Takeaways, Limitations

Takeaways:
A video-centric AVL benchmark (AVATAR) that incorporates high-resolution temporal information is presented.
A novel video-centric AVL model (TAVLO) that explicitly integrates temporal information is presented.
The importance of temporal dynamics in AVL is demonstrated empirically.
A new approach is presented that addresses the limitations of existing methods.
A new standard is set for video-centric AVL research.
Limitations:
Further research is needed on the generalizability of the AVATAR benchmark.
Further research is needed on the computational cost and real-time processing performance of the TAVLO model.
The benchmark needs to be expanded to include more diverse and complex scenarios.