Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling

Created by
  • Haebom

Author

Haoyu Wu, Diankun Wu, Tianyu He, Junliang Guo, Yang Ye, Yueqi Duan, Jiang Bian

Outline

This paper identifies a shortcoming of video diffusion models: when trained only on video data, which is a 2D projection of the 3D world, they fail to learn meaningful geometric structure. To address this, the authors propose 'Geometry Forcing', a technique that aligns the intermediate representations of a video diffusion model with features from a geometric foundation model. Alignment is enforced through two objectives: angular alignment and scale alignment. Angular alignment enforces directional consistency via cosine similarity, while scale alignment preserves scale information by regressing the unnormalized geometric features from the normalized diffusion representation. Experiments on both camera-view-conditioned and action-conditioned video generation tasks show that the proposed method substantially improves visual quality and 3D consistency over existing baselines. A sketch of the two alignment objectives is given below.
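The following is a minimal, hypothetical PyTorch sketch of how the two alignment objectives could be implemented, based only on the description above; the module name, projection heads, and feature dimensions are illustrative assumptions and do not reproduce the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometryAlignment(nn.Module):
    """Hypothetical alignment module: projects intermediate diffusion
    features into the geometric feature space and applies the two
    alignment objectives (angular + scale) described above."""

    def __init__(self, diff_dim: int, geo_dim: int):
        super().__init__()
        self.proj = nn.Linear(diff_dim, geo_dim)        # maps diffusion features to geometric space
        self.scale_head = nn.Linear(geo_dim, geo_dim)   # regresses unnormalized geometric features

    def forward(self, diff_feat: torch.Tensor, geo_feat: torch.Tensor):
        # diff_feat: intermediate diffusion features, shape (B, N, diff_dim)
        # geo_feat:  features from a geometric foundation model, shape (B, N, geo_dim)
        z = self.proj(diff_feat)

        # Angular alignment: match feature directions via cosine similarity.
        angular_loss = 1.0 - F.cosine_similarity(z, geo_feat, dim=-1).mean()

        # Scale alignment: regress the unnormalized geometric features from
        # the normalized diffusion representation to retain scale information.
        z_norm = F.normalize(z, dim=-1)
        scale_loss = F.mse_loss(self.scale_head(z_norm), geo_feat)

        return angular_loss, scale_loss

# Toy usage with random tensors (dimensions are placeholders).
align = GeometryAlignment(diff_dim=1024, geo_dim=768)
d = torch.randn(2, 196, 1024)
g = torch.randn(2, 196, 768)
l_ang, l_scale = align(d, g)
total_alignment_loss = l_ang + l_scale
```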

Takeaways, Limitations

Takeaways:
Proposes Geometry Forcing, an effective technique for improving the 3D geometric understanding of video diffusion models.
Presents an effective strategy for integrating 3D information through angular and scale alignment.
Improves video generation performance, with better visual quality and 3D consistency.
Limitations:
The effectiveness of the proposed method may depend on the specific geometric foundation model used.
Further evaluation of generalization across diverse types of video datasets is needed.
The additional alignment objectives may increase computational cost.