EMSYNC is a video-based symbolic music generation model that produces music tailored to the emotional content and temporal boundaries of a video. It follows a two-stage framework: a pre-trained video emotion classifier extracts emotional features, and a conditional music generator produces MIDI sequences guided by these emotional and temporal cues. Specifically, we introduce a novel temporal conditioning mechanism, boundary offsets, which enables the model to anticipate scene transitions and align musical chords with them. Unlike existing models, we retain event-based encoding, preserving fine-grained timing control and expressive musical nuance. Furthermore, we propose a mapping scheme that bridges the video emotion classifier, which produces discrete emotion categories, and the emotion-conditioned MIDI generator, which operates on continuous valence-arousal inputs. In subjective listening tests, EMSYNC outperformed state-of-the-art models across all metrics, for both music-theory-savvy and casual listeners.
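
The abstract describes conditioning the generator on how far each musical event lies from the next scene transition. The following is a minimal sketch, not the authors' implementation, of one way such boundary-offset conditioning tokens could be computed; the function name `boundary_offsets`, the bin size, and the saturation threshold are illustrative assumptions.

```python
import bisect

def boundary_offsets(event_times, scene_boundaries, max_offset=8.0, bin_size=0.25):
    """For each music-event time (seconds), return a discrete token index
    encoding the time remaining until the next scene boundary.

    Illustrative sketch: offsets are clipped at max_offset and quantized
    into bins of bin_size seconds before being used as conditioning tokens.
    """
    tokens = []
    for t in event_times:
        # Index of the first scene boundary at or after time t.
        i = bisect.bisect_left(scene_boundaries, t)
        if i < len(scene_boundaries):
            offset = min(scene_boundaries[i] - t, max_offset)
        else:
            offset = max_offset  # no upcoming boundary: saturate
        tokens.append(int(offset / bin_size))
    return tokens

# Example: music events every 0.5 s, scene cuts at 2.0 s and 5.0 s.
events = [i * 0.5 for i in range(12)]
cuts = [2.0, 5.0]
print(boundary_offsets(events, cuts))
```

In a setup like this, the discretized offsets would be embedded and fed to the generator alongside the event-based MIDI tokens, so the model can shape harmonic changes toward upcoming scene cuts.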