This paper focuses on improving the feature representations of video diffusion models. Whereas prior work on video diffusion models has primarily pursued architectural innovations or new learning objectives, we aim to improve performance by aligning the intermediate features of a video generator with the representations of pre-trained vision encoders. To identify suitable encoders, we analyze the discriminability and temporal coherence of various vision encoders, and based on this analysis we propose Align4Gen, a novel multi-feature fusion and alignment method. Align4Gen demonstrates performance improvements in both conditional and unconditional video generation tasks.
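To make the alignment idea concrete, below is a minimal PyTorch sketch of aligning a generator's intermediate features with fused features from pre-trained vision encoders. It is an illustrative assumption rather than the paper's exact Align4Gen recipe: the module name `FeatureAlignment`, the mean-based fusion, the projection heads, and the cosine-similarity loss are all placeholders for the multi-feature fusion and alignment step described above.

```python
# Sketch (assumed): align generator features with fused frozen-encoder features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAlignment(nn.Module):
    def __init__(self, gen_dim: int, enc_dims: list[int], proj_dim: int = 768):
        super().__init__()
        # One projection per pre-trained vision encoder, plus a head for the
        # generator's intermediate features, mapping both into a shared space.
        self.enc_projs = nn.ModuleList([nn.Linear(d, proj_dim) for d in enc_dims])
        self.gen_proj = nn.Sequential(
            nn.Linear(gen_dim, proj_dim), nn.SiLU(), nn.Linear(proj_dim, proj_dim)
        )

    def forward(self, gen_feats: torch.Tensor, enc_feats: list[torch.Tensor]) -> torch.Tensor:
        # gen_feats: (B, N, gen_dim) intermediate tokens from the video generator
        # enc_feats: list of (B, N, enc_dim_i) tokens from frozen vision encoders
        fused = torch.stack(
            [proj(f) for proj, f in zip(self.enc_projs, enc_feats)], dim=0
        ).mean(dim=0)                      # simple mean fusion of projected encoder features
        pred = self.gen_proj(gen_feats)    # project generator features into the shared space
        # Negative cosine similarity pulls the generator's features toward the
        # fused target representation (targets are detached, encoders stay frozen).
        return -F.cosine_similarity(pred, fused.detach(), dim=-1).mean()
```

In practice such an alignment term would be added to the diffusion training objective with a weighting coefficient; the specific fusion and weighting used by Align4Gen are detailed in the method section.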