This paper extends LiDAR-BIND, a modular multimodal fusion framework that integrates heterogeneous sensors (radar and sonar) into a LiDAR-based latent space, with mechanisms that explicitly enforce temporal consistency. We present three contributions: first, a temporal embedding similarity loss, which aligns consecutive latent representations; second, a motion-aligned translation loss, which matches frame-to-frame displacements between predicted and ground-truth LiDAR; and third, window-based temporal fusion using a dedicated temporal module. We also update the model architecture to better preserve spatial structure. Evaluation of radar/sonar-to-LiDAR translation demonstrates that the enhanced temporal and spatial consistency reduces absolute trajectory error and improves occupancy-map accuracy in Cartographer-based SLAM. To evaluate SLAM performance, we propose metrics based on the Fréchet Video Motion Distance (FVMD) and a correlation-peak-distance metric. The proposed temporal LiDAR-BIND (LiDAR-BIND-T) significantly improves temporal stability while maintaining plug-and-play modality fusion, thereby enhancing the robustness and performance of downstream SLAM.
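To make the two temporal objectives named above concrete, the following is a minimal illustrative sketch, not the paper's implementation: it assumes PyTorch, latent sequences of shape (batch, time, dim), planar positions of shape (batch, time, 2), a cosine-similarity form for the embedding term, and an L1 form for the displacement-matching term; all of these choices are assumptions made only for illustration.

```python
# Hedged sketch of the two temporal losses (shapes and loss forms are assumptions).
import torch
import torch.nn.functional as F


def temporal_embedding_similarity(z: torch.Tensor) -> torch.Tensor:
    """Encourage consecutive latent embeddings to stay close.

    z: (batch, time, dim) latent sequence.
    Returns the mean of (1 - cosine similarity) over consecutive frame pairs.
    """
    z_t, z_next = z[:, :-1], z[:, 1:]
    return (1.0 - F.cosine_similarity(z_t, z_next, dim=-1)).mean()


def motion_aligned_translation_loss(pred_pos: torch.Tensor,
                                    gt_pos: torch.Tensor) -> torch.Tensor:
    """Match frame-to-frame displacements of predicted vs. ground-truth LiDAR.

    pred_pos, gt_pos: (batch, time, 2) planar positions (illustrative proxy).
    """
    pred_disp = pred_pos[:, 1:] - pred_pos[:, :-1]
    gt_disp = gt_pos[:, 1:] - gt_pos[:, :-1]
    return F.l1_loss(pred_disp, gt_disp)


if __name__ == "__main__":
    z = torch.randn(4, 8, 256)              # hypothetical latent sequence
    pred = torch.randn(4, 8, 2)             # hypothetical predicted positions
    gt = pred + 0.01 * torch.randn_like(pred)
    total = temporal_embedding_similarity(z) + motion_aligned_translation_loss(pred, gt)
    print(total.item())
```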