This paper presents a large-scale dataset and a novel multimodal feature-fusion framework for predicting survival in non-small cell lung cancer (NSCLC) patients receiving immune checkpoint inhibitor (ICI) therapy. The dataset comprises 3D CT images, clinical records, and progression-free survival (PFS) and overall survival (OS) outcomes from NSCLC patients. The proposed framework adopts a cross-modality masked learning approach with two branches, each tailored to one modality: a Slice-Depth Transformer for CT images and a Graph-based Transformer for clinical variables. The masked modality learning strategy reconstructs the masked components of one modality from the intact other modality, strengthening the integration of modality-specific features and encouraging effective inter-modality relationships and feature interactions. On NSCLC survival prediction, the resulting multimodal fusion surpasses existing methods and establishes a new benchmark.
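As an illustration of the cross-modality masked-reconstruction idea, the following is a minimal NumPy sketch. All dimensions, the token counts, the masking function, and the linear reconstruction head are hypothetical stand-ins; the paper's actual branches are a Slice-Depth Transformer and a Graph-based Transformer with a learned decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embedding dimension shared by the two modality branches.
D = 8

def mask_tokens(x, mask_ratio, rng):
    """Zero out a random subset of token embeddings (simplified masking)."""
    n = x.shape[0]
    idx = rng.choice(n, size=int(n * mask_ratio), replace=False)
    masked = x.copy()
    masked[idx] = 0.0
    return masked, idx

# Toy per-patient token sequences, standing in for the outputs of the
# image branch (CT) and the clinical branch.
img_tokens = rng.standard_normal((16, D))
clin_tokens = rng.standard_normal((16, D))

# Mask the image modality; the intact clinical modality drives reconstruction.
masked_img, idx = mask_tokens(img_tokens, mask_ratio=0.5, rng=rng)

# Hypothetical linear cross-modal reconstruction head: a least-squares fit
# stands in for a learned decoder conditioned on the intact modality.
W, *_ = np.linalg.lstsq(clin_tokens, img_tokens, rcond=None)
recon = clin_tokens @ W

# Reconstruction loss computed on the masked positions only, as in
# masked-modeling objectives.
loss = float(np.mean((recon[idx] - img_tokens[idx]) ** 2))
print(loss)
```

In the full framework, minimizing such a reconstruction objective forces each branch to encode information predictive of the other modality, which is what promotes the inter-modality feature interactions described above.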