This paper argues that video diffusion models fail to learn meaningful geometric structure when trained only on video data, which is a 2D projection of the 3D world. To address this, the authors propose Geometry Forcing, a technique that aligns the intermediate representations of the video diffusion model with features from a geometry-based model. The alignment uses two objectives: angular alignment and scale alignment. Angular alignment enforces directional consistency via cosine similarity, while scale alignment preserves scale information by regressing the unnormalized geometric features from the normalized diffusion representation. Experiments cover both camera-view-conditioned and action-conditioned video generation, and the authors report that the proposed method significantly improves visual quality and 3D consistency over existing methods.
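As a reading aid, the two objectives summarized above can be sketched roughly as follows. This is a minimal NumPy illustration of the general idea, not the authors' implementation: the function names, the tensor layout `(frames, tokens, channels)`, the `1 - cosine` form of the angular term, the mean-squared-error form of the scale term, and the linear regression head (`weight`, `bias`) are all assumptions made for illustration, and the paper's exact formulation may differ.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each feature vector (last axis) to approximately unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def angular_alignment_loss(diff_feats, geo_feats):
    """Assumed form: mean of (1 - cosine similarity) over all tokens.

    Only the direction of each feature vector matters here; magnitude is
    discarded by the normalization.
    """
    cos = np.sum(l2_normalize(diff_feats) * l2_normalize(geo_feats), axis=-1)
    return float(np.mean(1.0 - cos))

def scale_alignment_loss(diff_feats, geo_feats, weight, bias):
    """Assumed form: MSE between a linear prediction made from the
    *normalized* diffusion features and the *unnormalized* geometric features.

    The linear head (weight, bias) is a hypothetical stand-in for whatever
    predictor is actually used.
    """
    pred = l2_normalize(diff_feats) @ weight + bias
    return float(np.mean((pred - geo_feats) ** 2))

# Toy data: diffusion features share direction with the geometric features
# but differ in magnitude.
rng = np.random.default_rng(0)
geo = rng.normal(size=(4, 16, 8))   # (frames, tokens, channels), assumed layout
diff = 3.0 * geo                    # same orientation, different scale

weight = np.eye(8)                  # hypothetical untrained identity head
bias = np.zeros(8)

ang = angular_alignment_loss(diff, geo)
scl = scale_alignment_loss(diff, geo, weight, bias)
print("angular loss:", ang)
print("scale loss:", scl)
```

In this toy setup the angular term is close to zero because the two feature sets point in the same direction, while the scale term stays positive because the normalized input no longer carries magnitude and the untrained head cannot recover it. That gap is what the scale objective, as described in the summary, is meant to close.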