Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers

Posted by
  • Haebom

Author

Kwon Byung-Ki, Qi Dai, Lee Hyoseok, Chong Luo, Tae-Hyun Oh

Outline

JointDiT is a diffusion transformer that models the joint distribution of RGB images and depth maps. By building on the architectural strengths and strong image priors of state-of-the-art diffusion transformers, it generates high-quality images together with geometrically plausible, accurate depth maps. Two simple yet effective techniques, adaptive scheduling weights that depend on the noise level of each modality and an unbalanced timestep sampling strategy, allow the model to be trained across all noise levels of both modalities. As a result, controlling the timestep of each branch lets the model naturally handle a range of generation tasks, including joint generation, depth estimation, and depth-conditioned image generation. JointDiT demonstrates excellent joint generation performance and achieves comparable results on depth estimation and depth-conditioned image generation, suggesting that joint distribution modeling can be a viable alternative to conditional generation.
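As a rough illustration of how a single joint model can cover these tasks, the sketch below shows per-branch timestep control in a flow-matching-style sampler: holding one branch at timestep 0 treats it as a clean condition, while the other branch is denoised. The `sample_task` function, the `model(x_rgb, x_depth, t_rgb, t_depth)` interface, and the simple Euler update are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch (not the authors' code): switching one joint RGB-depth
# diffusion model between tasks by controlling the timestep of each branch.
import torch

def sample_task(model, task, num_steps=50, shape=(1, 4, 64, 64),
                rgb_cond=None, depth_cond=None, device="cpu"):
    """Assumed interface: model(x_rgb, x_depth, t_rgb, t_depth) returns a
    velocity (denoising) prediction for each branch."""
    x_rgb = torch.randn(shape, device=device)
    x_depth = torch.randn(shape, device=device)
    ts = torch.linspace(1.0, 0.0, num_steps + 1, device=device)
    t_clean = torch.tensor(0.0, device=device)

    for i in range(num_steps):
        t, t_next = ts[i], ts[i + 1]
        if task == "joint":                  # both branches start from noise
            t_rgb, t_depth = t, t
        elif task == "depth_estimation":     # RGB is a clean condition (t=0)
            assert rgb_cond is not None
            x_rgb, t_rgb, t_depth = rgb_cond, t_clean, t
        elif task == "depth_to_image":       # depth is a clean condition (t=0)
            assert depth_cond is not None
            x_depth, t_rgb, t_depth = depth_cond, t, t_clean
        else:
            raise ValueError(f"unknown task: {task}")

        v_rgb, v_depth = model(x_rgb, x_depth, t_rgb, t_depth)
        # Simple per-branch Euler step; a branch held at t=0 is left untouched.
        if t_rgb > 0:
            x_rgb = x_rgb + (t_next - t) * v_rgb
        if t_depth > 0:
            x_depth = x_depth + (t_next - t) * v_depth

    return x_rgb, x_depth
```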

Takeaways, Limitations

Takeaways:
Presents a novel method for effectively modeling the joint distribution of RGB images and depth maps.
Generates high-quality images and accurate depth maps simultaneously.
Applies to various tasks such as joint generation, depth estimation, and depth-conditioned image generation.
Shows that joint distribution modeling can be a viable alternative to conditional generation.
Limitations:
The paper does not explicitly state its limitations.
Results may have been reported only on specific datasets; generalization to other datasets requires further validation.
Lack of information about computational costs and memory usage.