In this paper, we propose a perceptually optimized video compression framework built on a conditional diffusion model, which excels at reconstructing video content aligned with human visual perception. We reframe video compression as a conditional generative task in which a generative model synthesizes video from sparse but information-rich signals. The framework introduces three main components: multi-grained conditioning that captures both static scene structure and dynamic spatiotemporal cues; a compressed conditional representation designed for efficient transmission without sacrificing semantic richness; and a multi-conditional training scheme with modality dropout and role-aware embeddings that prevents over-reliance on any single modality and improves robustness. Extensive experiments demonstrate that the proposed method significantly outperforms both conventional and neural codecs on perceptual quality metrics such as Fréchet Video Distance (FVD) and LPIPS, especially at high compression ratios.
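To make the multi-conditional training scheme concrete, the sketch below illustrates one plausible realization of modality dropout combined with role-aware embeddings. It is a minimal, hypothetical PyTorch example, assuming invented names (`ConditionFuser`, `p_drop`, the modality list) and a simple additive fusion; the paper's actual conditioning, fusion, and dropout mechanisms may differ.

```python
# A minimal sketch (hypothetical names, not the paper's released code) of
# multi-conditional training with modality dropout and role-aware embeddings.
import torch
import torch.nn as nn

class ConditionFuser(nn.Module):
    """Fuses several conditioning modalities, each tagged with a learned role embedding."""

    def __init__(self, modalities, cond_dim, p_drop=0.3):
        super().__init__()
        self.modalities = list(modalities)          # e.g. ["keyframe", "motion", "semantics"]
        self.p_drop = p_drop                        # per-modality dropout probability
        # One learned "role" vector per modality, added to its features so the
        # generator can tell which kind of signal it is receiving.
        self.role_emb = nn.ParameterDict(
            {m: nn.Parameter(torch.randn(cond_dim) * 0.02) for m in self.modalities}
        )

    def forward(self, cond_feats):
        """cond_feats: dict mapping modality name -> tensor of shape (B, T, cond_dim)."""
        fused = []
        for m in self.modalities:
            feat = cond_feats[m] + self.role_emb[m]             # role-aware embedding
            if self.training and torch.rand(()) < self.p_drop:  # modality dropout
                feat = torch.zeros_like(feat)                   # drop the whole modality
            fused.append(feat)
        # Simple fusion by summation; attention-based fusion would also fit here.
        return torch.stack(fused, dim=0).sum(dim=0)
```

Randomly zeroing an entire modality during training forces the generator to reconstruct plausible video even when one conditioning stream is missing or degraded, which is the stated purpose of avoiding over-reliance on any single modality.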