This paper presents the Vision Transformer Digital Twin Surrogate Network (VT-DTSN), a deep learning framework for predictive modeling of 3D+T imaging data of biological tissues. Leveraging Vision Transformers pre-trained with DINO (Self-Distillation with No Labels) and a multi-view fusion strategy, VT-DTSN reconstructs high-fidelity, time-resolved dynamics of the Drosophila midgut. Trained with a composite loss function that balances pixel-level accuracy, perceptual structure, and feature-space alignment, the network produces biologically meaningful reconstructions, making it suitable for in silico experimentation and hypothesis testing. Evaluation across imaging layers and biological replicates shows low reconstruction error and high structural similarity while retaining efficient inference. VT-DTSN thus serves as a high-fidelity surrogate for cross-time reconstruction and studies of tissue dynamics, enabling computational exploration of cellular behavior and homeostasis and complementing time-resolved imaging in biological research.
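
To make the composite objective concrete, the following is a minimal sketch of a loss combining the three kinds of terms named in the abstract: a pixel-level term, a structural/perceptual term, and a feature-space alignment term computed with a frozen feature extractor (e.g., a DINO-pretrained ViT). The specific weights, the SSIM window size, and the function name composite_loss are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of a composite reconstruction loss: pixel MSE + (1 - SSIM) + feature alignment.
# Weights and hyperparameters below are placeholders for illustration.
import torch
import torch.nn.functional as F


def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2, window=7):
    """Simplified SSIM over (B, C, H, W) tensors in [0, 1], using average pooling
    for local means/variances."""
    pad = window // 2
    mu_x = F.avg_pool2d(x, window, stride=1, padding=pad)
    mu_y = F.avg_pool2d(y, window, stride=1, padding=pad)
    var_x = F.avg_pool2d(x * x, window, stride=1, padding=pad) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, window, stride=1, padding=pad) - mu_y ** 2
    cov_xy = F.avg_pool2d(x * y, window, stride=1, padding=pad) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return (num / den).mean()


def composite_loss(pred, target, feat_extractor, w_pix=1.0, w_ssim=0.5, w_feat=0.1):
    """Weighted sum of pixel-level, structural, and feature-space terms.

    feat_extractor: a frozen network mapping image batches to feature vectors
    (assumed here to stand in for a DINO-pretrained ViT backbone).
    """
    # Pixel-level fidelity.
    l_pix = F.mse_loss(pred, target)
    # Perceptual/structural similarity (1 - SSIM so that lower is better).
    l_ssim = 1.0 - ssim(pred, target)
    # Feature-space alignment: cosine distance between frozen-backbone embeddings
    # of the prediction and the target (target features need no gradient).
    with torch.no_grad():
        f_target = feat_extractor(target)
    f_pred = feat_extractor(pred)
    l_feat = 1.0 - F.cosine_similarity(f_pred, f_target, dim=-1).mean()
    return w_pix * l_pix + w_ssim * l_ssim + w_feat * l_feat
```

In this kind of setup the feature extractor is typically kept frozen so that gradients flow only into the reconstruction network, and the relative weights trade off raw pixel accuracy against structural and semantic consistency.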