This paper proposes a technique that explicitly enhances temporal consistency by extending LiDAR-BIND, a modular multimodal fusion framework that binds heterogeneous sensors (radar and sonar) into a LiDAR-defined latent space. Key contributions include (i) a temporal embedding similarity that aligns consecutive latent representations; (ii) a motion-aligned loss that matches displacements between predictions and ground-truth LiDAR; and (iii) window-based temporal fusion using a dedicated temporal module. We further update the model architecture to better preserve spatial structure. Evaluation of radar/sonar-to-LiDAR translation demonstrates improved temporal and spatial coherence, yielding lower absolute trajectory error and higher occupancy-map accuracy in Cartographer-based SLAM. We propose metrics based on the Fréchet Video Motion Distance (FVMD) and a correlation-peak distance metric, providing practical temporal quality indicators for evaluating SLAM performance. The proposed Temporal LiDAR-BIND (LiDAR-BIND-T) substantially improves temporal stability while maintaining modular modality fusion, thereby enhancing the robustness and performance of downstream SLAM.
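To make the two temporal objectives concrete, the following is a minimal PyTorch sketch of how such losses could be formed. The function names, the cosine-similarity formulation of the embedding term, and the L1 matching of frame-to-frame displacements are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def temporal_embedding_similarity(z_t: torch.Tensor, z_prev: torch.Tensor) -> torch.Tensor:
    """Encourage consecutive latent embeddings to stay aligned.

    Penalises low cosine similarity between the latents of frames
    t-1 and t (one plausible reading of 'temporal embedding similarity').
    """
    sim = F.cosine_similarity(z_t.flatten(1), z_prev.flatten(1), dim=1)
    return (1.0 - sim).mean()

def motion_aligned_loss(pred_t: torch.Tensor, pred_prev: torch.Tensor,
                        gt_t: torch.Tensor, gt_prev: torch.Tensor) -> torch.Tensor:
    """Match predicted frame-to-frame displacement to the ground-truth
    LiDAR displacement (assumed L1 penalty on the difference)."""
    pred_disp = pred_t - pred_prev   # displacement in the predicted sequence
    gt_disp = gt_t - gt_prev         # displacement in the ground-truth sequence
    return F.l1_loss(pred_disp, gt_disp)

# Example with dummy shapes: batch of 4 latent vectors and 64x64 range images.
z_prev, z_t = torch.randn(4, 256), torch.randn(4, 256)
p_prev, p_t = torch.randn(4, 1, 64, 64), torch.randn(4, 1, 64, 64)
g_prev, g_t = torch.randn(4, 1, 64, 64), torch.randn(4, 1, 64, 64)
loss = temporal_embedding_similarity(z_t, z_prev) + motion_aligned_loss(p_t, p_prev, g_t, g_prev)
```

In this sketch the two terms would simply be summed (possibly with weighting coefficients) into the training objective; how LiDAR-BIND-T actually weights and combines them is specified in the paper body, not here.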