Building on prior work that exploits the visual priors of pre-trained text-to-image (T2I) generative models for dense prediction, this paper hypothesizes that image editing models are a more suitable foundation than T2I generative models for fine-tuning toward dense geometry estimation. To verify this, we systematically analyze the fine-tuning behaviors of generative and editing models, demonstrating that the editing model, with its inherent structural priors, converges more stably and achieves higher performance. Based on these findings, we propose FE2E, a novel framework that adapts an advanced editing model built on the Diffusion Transformer (DiT) architecture to dense geometry prediction. FE2E reformulates the editing model's original flow matching loss into a "consistent velocity" training objective, resolves precision conflicts using logarithmic quantization, and leverages DiT's global attention to jointly estimate depth and normals in a single forward pass. Without scaling up the training data, FE2E achieves substantial gains in zero-shot monocular depth and normal estimation across multiple datasets, notably an over 35% performance improvement on the ETH3D dataset, and it outperforms the DepthAnything series trained on 100x more data.
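As a minimal sketch of the logarithmic-quantization idea mentioned above (details are not given in this abstract, so function names, the depth range, and the use of BFloat16 as the low-precision format are illustrative assumptions, not the FE2E implementation): encoding depth in log space before casting to a low-precision format spreads quantization error more evenly across near and far ranges than a linear encoding would.

```python
# Hypothetical illustration of log-space depth encoding with a BFloat16 round-trip.
# Names and the [0.1, 100] m depth range are assumptions for this sketch.
import math
import torch

def log_encode(depth, d_min=0.1, d_max=100.0):
    """Map depth in [d_min, d_max] to [-1, 1] on a logarithmic scale."""
    depth = depth.clamp(d_min, d_max)
    t = (torch.log(depth) - math.log(d_min)) / (math.log(d_max) - math.log(d_min))
    return 2.0 * t - 1.0

def log_decode(code, d_min=0.1, d_max=100.0):
    """Invert log_encode back to metric depth."""
    t = (code + 1.0) / 2.0
    return torch.exp(t * (math.log(d_max) - math.log(d_min)) + math.log(d_min))

# Round-trip through BFloat16 and measure the retained relative precision.
depth = torch.rand(4, 1, 64, 64) * 99.9 + 0.1        # depths in [0.1, 100] m
code_bf16 = log_encode(depth).to(torch.bfloat16)      # low-precision storage
recovered = log_decode(code_bf16.to(torch.float32))
rel_err = ((recovered - depth).abs() / depth).mean()
print(f"mean relative error after BF16 round-trip: {rel_err:.4e}")
```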