Existing multimodal sentiment analysis methods tend to over-rely on strong inter-modal correlations and consequently perform poorly on data where those correlations are weak. To address this, and unlike existing modal interaction-based, modal transformation-based, and modal similarity-based methods, we propose a two-stage semi-supervised learning model called the Correlation-aware Multimodal Transformer (CorMulT). In the pre-training stage, CorMulT efficiently learns inter-modal correlation coefficients through a modal correlation contrastive learning module; in the prediction stage, it performs sentiment prediction by fusing the learned correlation coefficients with the modal representations. Experimental results on the CMU-MOSEI dataset show that CorMulT outperforms state-of-the-art multimodal sentiment analysis methods.
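For intuition, the sketch below illustrates the two-stage idea in PyTorch under stated assumptions: contrastive pre-training yields a per-sample inter-modal correlation coefficient, which then scales the contribution of the auxiliary modality at prediction time. All module and variable names (`CorrelationEstimator`, `CorrelationAwareFusion`, `feat_a`, `feat_b`) are hypothetical; this is a minimal illustration of the mechanism, not the authors' implementation.

```python
# Minimal sketch of CorMulT's two-stage idea (illustrative only; names and
# architecture details are assumptions, not the authors' released code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CorrelationEstimator(nn.Module):
    """Stage 1: learn inter-modal correlation via contrastive pre-training."""
    def __init__(self, dim_a, dim_b, dim_shared=64):
        super().__init__()
        self.proj_a = nn.Linear(dim_a, dim_shared)  # e.g., text features
        self.proj_b = nn.Linear(dim_b, dim_shared)  # e.g., audio features

    def forward(self, feat_a, feat_b):
        za = F.normalize(self.proj_a(feat_a), dim=-1)
        zb = F.normalize(self.proj_b(feat_b), dim=-1)
        return za, zb

    def correlation(self, feat_a, feat_b):
        za, zb = self(feat_a, feat_b)
        # cosine similarity in the shared space as the correlation coefficient
        return (za * zb).sum(-1)  # shape: (batch,)

def contrastive_loss(za, zb, temperature=0.1):
    """InfoNCE-style loss: same-sample modality pairs are positives."""
    logits = za @ zb.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(za.size(0))   # diagonal entries = positive pairs
    return F.cross_entropy(logits, targets)

class CorrelationAwareFusion(nn.Module):
    """Stage 2: weight cross-modal fusion by the learned correlation."""
    def __init__(self, dim_a, dim_b, num_classes=3):
        super().__init__()
        self.head = nn.Linear(dim_a + dim_b, num_classes)

    def forward(self, feat_a, feat_b, corr):
        # Scale the auxiliary modality by its correlation with the anchor,
        # so weakly correlated pairs contribute less to the prediction.
        fused = torch.cat([feat_a, corr.unsqueeze(-1) * feat_b], dim=-1)
        return self.head(fused)

# Usage on random stand-in features.
a = torch.randn(8, 32)   # e.g., text embeddings
b = torch.randn(8, 48)   # e.g., audio embeddings
est = CorrelationEstimator(32, 48)
za, zb = est(a, b)
loss = contrastive_loss(za, zb)          # pre-training objective
corr = est.correlation(a, b).detach()    # coefficients frozen for stage 2
logits = CorrelationAwareFusion(32, 48)(a, b, corr)
```

The design choice sketched here, gating the auxiliary modality by the learned coefficient, is one plausible way to integrate correlation with modal representations; the paper's actual fusion mechanism may differ.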