In this paper, we propose a novel attention-based multiscale temporal fusion network (AMTFNet) for fault diagnosis in multimodal processes. To overcome the difficulty of extracting shared features caused by the distribution discrepancy among multimodal data, we extract multiscale local features together with short- and long-term temporal features using multiscale depth-wise convolutions and gated recurrent units, and suppress mode-specific information through instance normalization. In addition, a temporal attention mechanism improves fault diagnosis accuracy by focusing on the critical time points at which the shared information across modes is high. Experimental results on the Tennessee Eastman Process Dataset and the Three-Phase Flow Facility Dataset demonstrate that the proposed model achieves excellent diagnosis performance with a small model size. The source code will be released on GitHub.
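
For illustration, the building blocks named above (multiscale depth-wise convolution, instance normalization, a gated recurrent unit, and temporal attention) could be combined roughly as in the following minimal sketch. This is not the authors' released code; it assumes a PyTorch implementation with hypothetical names (`AMTFBlockSketch`, `channels`, `hidden`, `kernel_sizes`) and an input of shape (batch, channels, time).

```python
import torch
import torch.nn as nn


class AMTFBlockSketch(nn.Module):
    """Illustrative sketch: multiscale depth-wise convolution, instance
    normalization, a GRU, and temporal attention pooling."""

    def __init__(self, channels: int, hidden: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        # Multiscale local features: one depth-wise conv per kernel size.
        self.branches = nn.ModuleList([
            nn.Conv1d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernel_sizes
        ])
        # Instance normalization to suppress mode-specific statistics.
        self.norm = nn.InstanceNorm1d(channels * len(kernel_sizes))
        # GRU for short- and long-term temporal features.
        self.gru = nn.GRU(channels * len(kernel_sizes), hidden, batch_first=True)
        # Temporal attention: score each time step, then softmax over time.
        self.score = nn.Linear(hidden, 1)

    def forward(self, x):                            # x: (batch, channels, time)
        feats = torch.cat([b(x) for b in self.branches], dim=1)
        feats = self.norm(feats)                     # (batch, C * scales, time)
        h, _ = self.gru(feats.transpose(1, 2))       # (batch, time, hidden)
        alpha = torch.softmax(self.score(h), dim=1)  # attention weights over time
        return (alpha * h).sum(dim=1)                # attention-pooled feature
```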