Despite recent advances in high-quality, consistent video generation, controllable video generation remains a critical challenge. Most existing methods treat a video as a monolithic whole, ignoring its fine-grained spatiotemporal structure, which limits both the precision and the efficiency of control. In this paper, we propose a controllable video generative adversarial network (CoVoGAN) that disentangles video concepts so that each can be controlled efficiently and independently. Specifically, we separate static from dynamic latent variables based on the minimal variation principle, and achieve component-wise identifiability of the dynamic latent variables by exploiting the sufficient variation property, enabling decoupled control over video generation. We establish the identifiability of this approach through rigorous theoretical analysis and, guided by these insights, design a temporal transition module that disentangles the latent dynamics. To enforce the minimal variation principle and the sufficient variation property, the module minimizes the dimensionality of the dynamic latent variables and imposes temporal conditional independence. We integrate this module as a plug-in to a GAN to validate our approach. Extensive qualitative and quantitative experiments on various video generation benchmarks demonstrate that the proposed method significantly improves both generation quality and controllability across a range of real-world scenarios.
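To make the mechanism concrete, below is a minimal sketch of what such a temporal transition module could look like, assuming a PyTorch-style implementation. All names (`TemporalTransition`, `dyn_dim`, `rollout`) and design details are hypothetical illustrations, not the paper's actual code: each dynamic latent dimension evolves through its own small transition network given the previous step and fresh independent noise (reflecting temporal conditional independence across components), and keeping `dyn_dim` small reflects the minimal variation principle.

```python
import torch
import torch.nn as nn


class TemporalTransition(nn.Module):
    """Hypothetical sketch of a temporal transition module.

    Each dynamic latent dimension is updated by its own scalar
    network, conditioned only on its previous value and fresh
    per-dimension noise, so components are conditionally
    independent given the past. A small dyn_dim keeps the
    dynamic latent space minimal.
    """

    def __init__(self, dyn_dim: int = 8, hidden: int = 32):
        super().__init__()
        # One tiny per-dimension transition network -> component-wise updates.
        self.transitions = nn.ModuleList(
            nn.Sequential(nn.Linear(2, hidden), nn.Tanh(), nn.Linear(hidden, 1))
            for _ in range(dyn_dim)
        )
        self.dyn_dim = dyn_dim

    def forward(self, z_prev: torch.Tensor) -> torch.Tensor:
        # z_prev: (batch, dyn_dim); independent noise per dimension.
        eps = torch.randn_like(z_prev)
        cols = [
            f(torch.stack([z_prev[:, i], eps[:, i]], dim=1))
            for i, f in enumerate(self.transitions)
        ]
        return torch.cat(cols, dim=1)  # (batch, dyn_dim)


def rollout(static_z: torch.Tensor, transition: TemporalTransition,
            z0: torch.Tensor, steps: int) -> torch.Tensor:
    """Unroll the dynamics; every frame shares the same static code."""
    z_t, per_frame = z0, []
    for _ in range(steps):
        z_t = transition(z_t)
        per_frame.append(torch.cat([static_z, z_t], dim=1))
    # (batch, steps, static_dim + dyn_dim), fed to the GAN generator.
    return torch.stack(per_frame, dim=1)
```

In a plug-in setup like this, the stacked per-frame latents would be passed to an existing GAN generator, so editing `static_z` changes appearance across all frames while perturbing a single dynamic dimension alters one motion factor, which is the decoupled control the abstract describes.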