Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

From Editor to Dense Geometry Estimator

Created by
  • Haebom

Author

JiYuan Wang, Chunyu Lin, Lei Sun, Rongying Liu, Lang Nie, Mingxing Li, Kang Liao, Xiangxiang Chu, Yao Zhao

Outline

Building on prior work that exploits the visual priors of pre-trained text-to-image (T2I) generative models for dense prediction, this paper hypothesizes that image editing models are a more suitable foundation than T2I generators for fine-tuning toward dense geometry estimation. To verify this, we systematically analyze the fine-tuning behavior of generative and editing models, showing that the editing model, thanks to its inherent structural priors, converges more stably and reaches higher performance. Based on these findings, we propose FE2E, a framework that adapts an advanced editing model built on the Diffusion Transformer (DiT) architecture to dense geometry prediction. FE2E reformulates the editing model's original flow matching loss into a "consistent velocity" training objective, resolves precision conflicts through logarithmic quantization, and leverages DiT's global attention to estimate depth and surface normals jointly in a single forward pass. Without scaling up the training data, FE2E achieves notable gains in zero-shot monocular depth and normal estimation across multiple datasets, including an over 35% performance improvement on ETH3D, and outperforms the DepthAnything series trained on 100x more data.
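The summary does not detail how FE2E's logarithmic quantization works, but the general idea behind log-space depth encoding can be sketched as follows. All function names, the depth range, and the bit width below are illustrative assumptions, not the paper's actual scheme: depths are mapped to integer codes spaced uniformly in log space, so relative precision stays roughly constant from near to far rather than absolute precision being uniform.

```python
import numpy as np

def log_quantize(depth, d_min=0.1, d_max=100.0, bits=16):
    """Map depths to integer codes uniformly spaced in log space,
    so relative (not absolute) precision is constant across the range."""
    levels = 2 ** bits - 1
    d = np.clip(depth, d_min, d_max)
    # Normalize log-depth to [0, 1], then round to the nearest code.
    t = (np.log(d) - np.log(d_min)) / (np.log(d_max) - np.log(d_min))
    return np.round(t * levels).astype(np.uint16)

def log_dequantize(code, d_min=0.1, d_max=100.0, bits=16):
    """Invert log_quantize: code -> depth in meters."""
    levels = 2 ** bits - 1
    t = code.astype(np.float64) / levels
    return np.exp(np.log(d_min) + t * (np.log(d_max) - np.log(d_min)))

depth = np.array([0.5, 2.0, 50.0])
recon = log_dequantize(log_quantize(depth))
# Relative reconstruction error is small and roughly uniform
# across near and far depths.
```

With 16 bits over a 0.1–100 m range, the step size in log space is about 1e-4, so near and far points are reconstructed with a similar relative error, which is the usual motivation for log-space depth representations.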

Takeaways, Limitations

Takeaways:
  • Experimentally demonstrates that image editing models are a more suitable foundation than T2I generative models for dense prediction tasks such as dense geometry estimation.
  • Presents FE2E, a framework that substantially improves zero-shot monocular depth and normal estimation by effectively leveraging a Diffusion Transformer-based editing model.
  • Shows that strong performance can be achieved without large-scale training data.
  • Provides an efficient method for estimating depth and normals simultaneously in a single pass.
Limitations:
  • The performance gains of FE2E may be limited to specific datasets.
  • Generalization to other types of dense prediction tasks still needs to be verified.
  • The framework depends on the Diffusion Transformer architecture; its applicability to other architectures remains to be examined.