In this paper, we propose Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning (TARO), a novel framework for high-fidelity and temporally consistent video-to-audio synthesis. Built on flow-based transformers, whose continuous transformation enables stable training and improves synchronization and audio quality, TARO introduces two key innovations. First, Timestep-Adaptive Representation Alignment (TRA) dynamically aligns latent representations by modulating the alignment strength according to the noise schedule, ensuring smooth latent evolution and improved fidelity. Second, Onset-Aware Conditioning (OAC) improves synchronization with dynamic visual events by incorporating onset cues, which serve as sharp event-based markers of audio-relevant visual moments. Extensive experiments on the VGGSound and Landscape datasets demonstrate that TARO outperforms existing methods, achieving a 53% lower Fréchet Distance (FD), a 29% lower Fréchet Audio Distance (FAD), and 97.19% alignment accuracy, highlighting its superior audio quality and synchronization precision.