This paper proposes a perceptually optimized video compression framework built on conditional diffusion models, which excel at reconstructing video content aligned with human visual perception. We reframe video compression as a conditional generative task in which a generative model synthesizes video from sparse but information-rich signals. We introduce three main modules: multi-modal conditioning, which captures both static scene structure and dynamic spatiotemporal cues; a compact representation designed for efficient transmission without sacrificing semantic richness; and multi-conditional training with modality dropout and role-aware embeddings, which prevents overreliance on any single modality and improves robustness. Extensive experiments demonstrate that the proposed method significantly outperforms both conventional and neural codecs on perceptual quality metrics such as Fréchet Video Distance (FVD) and LPIPS, especially at high compression ratios.
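
As a concrete illustration of the multi-conditional training strategy summarized above, the following PyTorch sketch shows one way modality dropout and role-aware embeddings could be combined in a condition encoder for a diffusion denoiser. All module names, dimensions, and the dropout probability are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Minimal sketch (assumed design): fuse several conditioning modalities with
# role-aware embeddings, and randomly drop whole modalities during training.
import torch
import torch.nn as nn


class ConditionEncoder(nn.Module):
    """Fuses conditioning modalities (e.g. static-structure tokens, motion-cue
    tokens) into one conditioning sequence for a diffusion denoiser."""

    def __init__(self, num_modalities: int, dim: int = 256, p_drop: float = 0.3):
        super().__init__()
        self.p_drop = p_drop
        # Role-aware embeddings: one learned vector per modality, added to its
        # tokens so the denoiser can tell which role each token plays.
        self.role_embed = nn.Embedding(num_modalities, dim)
        self.proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_modalities)])

    def forward(self, modalities: list[torch.Tensor]) -> torch.Tensor:
        """modalities: list of (batch, tokens_i, dim) tensors, one per modality."""
        fused = []
        for i, feats in enumerate(modalities):
            # Modality dropout: during training, zero out an entire modality at
            # random so the model cannot over-rely on any single signal.
            if self.training and torch.rand(()) < self.p_drop:
                feats = torch.zeros_like(feats)
            role = self.role_embed(torch.tensor(i, device=feats.device))
            fused.append(self.proj[i](feats) + role)  # tag tokens with their role
        return torch.cat(fused, dim=1)  # (batch, sum_i tokens_i, dim)


if __name__ == "__main__":
    enc = ConditionEncoder(num_modalities=2)
    keyframe = torch.randn(4, 16, 256)  # static scene-structure tokens
    motion = torch.randn(4, 8, 256)     # dynamic spatiotemporal-cue tokens
    cond = enc([keyframe, motion])
    print(cond.shape)                   # torch.Size([4, 24, 256])
```

In such a setup, the fused sequence would serve as cross-attention context for the denoiser; the dropout rate and per-modality projections are hyperparameters one would tune, not values taken from the paper.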