AVATAR is a video-centric audio-visual localization (AVL) benchmark that incorporates high-resolution temporal information. It addresses the limitations of prior studies, which focus only on image-level audio-visual associations, fail to capture temporal dynamics, and assume simplified scenarios in which sound sources are always visible and consist of a single object. AVATAR introduces four scenarios: single sound, mixed sounds, multiple objects, and off-screen, enabling a more comprehensive evaluation of AVL models. In addition, we present TAVLO, a new video-centric AVL model that explicitly incorporates temporal information. Experimental results show that previous methods struggle to track temporal changes because they rely on global audio features and frame-level mappings, whereas TAVLO achieves robust and accurate audio-visual alignment by leveraging high-resolution temporal modeling. This study empirically demonstrates the importance of temporal dynamics in AVL and sets a new standard for video-centric audio-visual localization.