This paper addresses the shortcomings of previous audio-visual localization (AVL) studies, namely their neglect of temporal dynamics and their oversimplified scenario settings, by proposing AVATAR, a new video-centric AVL benchmark that incorporates high-resolution temporal information. AVATAR enables a more comprehensive evaluation of AVL models by covering four scenarios: single sound, mixed sounds, multiple objects, and off-screen. In addition, we present TAVLO, a new video-centric AVL model that explicitly incorporates temporal information. Experimental results show that previous methods struggle to track temporal changes because they rely on global audio features and frame-by-frame mapping, whereas TAVLO achieves robust and accurate audio-visual alignment through high-resolution temporal modeling. These findings empirically confirm the importance of temporal dynamics in AVL and establish a new standard for video-centric AVL.