This paper proposes a novel approach for 3D mesh reconstruction from multi-view images. Inspired by large-scale reconstruction models such as LRM, our method combines a transformer-based triplane generator with a Neural Radiance Field (NeRF) model trained on multi-view images. We analyze the shortcomings of existing LRM architectures and address them to improve multi-view image representation and enable computationally efficient training. Furthermore, we extract meshes from the NeRF field in a differentiable manner and fine-tune the NeRF model through mesh rendering, which improves geometric reconstruction and enables supervision at full image resolution. Although our approach achieves state-of-the-art performance, with a PSNR of 28.67 on the Google Scanned Objects (GSO) dataset, it struggles to reconstruct complex textures (e.g., text, portraits). To address this, we introduce a lightweight, instance-specific texture enhancement procedure that fine-tunes the triplane representation and the NeRF color estimation model in only 4 seconds, raising the PSNR to 29.79 and enabling faithful reconstruction of complex textures. Finally, our approach supports various downstream applications, such as 3D generation from text or images.
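
As a rough illustration of the instance-specific texture enhancement step, the sketch below briefly optimizes a triplane feature volume and a small color head against input-view pixels while the geometry stays frozen. All names (`ColorHead`, `sample_triplane`), tensor shapes, step counts, and learning rates are hypothetical placeholders for exposition, not the authors' actual implementation.

```python
# Minimal, hypothetical sketch of instance-specific texture fine-tuning:
# only the triplane features and the NeRF color head are updated; the
# geometry (density head / extracted mesh) is assumed frozen.
import torch
import torch.nn.functional as F

class ColorHead(torch.nn.Module):
    """Tiny MLP mapping triplane features to RGB (placeholder)."""
    def __init__(self, feat_dim=32):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(feat_dim, 64), torch.nn.ReLU(),
            torch.nn.Linear(64, 3), torch.nn.Sigmoid())

    def forward(self, feats):
        return self.mlp(feats)

def sample_triplane(triplane, pts):
    # triplane: (3, C, H, W); pts: (N, 3) in [-1, 1].
    # Project each 3D point onto the XY, XZ, YZ planes and sum features.
    coords = [pts[:, [0, 1]], pts[:, [0, 2]], pts[:, [1, 2]]]
    feats = 0
    for plane, uv in zip(triplane, coords):
        grid = uv.view(1, -1, 1, 2)                        # (1, N, 1, 2)
        f = F.grid_sample(plane[None], grid, align_corners=True)
        feats = feats + f.view(plane.shape[0], -1).t()     # (N, C)
    return feats

# Placeholder state, standing in for the feed-forward reconstructor's output.
triplane = torch.nn.Parameter(torch.randn(3, 32, 64, 64) * 0.01)
color_head = ColorHead()
opt = torch.optim.Adam([triplane, *color_head.parameters()], lr=1e-2)

# Stand-ins for surface points and ground-truth colors sampled from the
# input multi-view images (in practice these come from mesh rendering).
surface_pts = torch.rand(4096, 3) * 2 - 1
target_rgb = torch.rand(4096, 3)

for step in range(200):  # a short budget, on the order of seconds on a GPU
    pred_rgb = color_head(sample_triplane(triplane, surface_pts))
    loss = F.mse_loss(pred_rgb, target_rgb)  # photometric loss only
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because only the color pathway is optimized against a handful of views, the loop converges in a few hundred steps, which is consistent with the seconds-scale budget claimed above; the exact schedule here is an assumption.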