In this paper, we present an efficient and effective framework for integrating computationally expensive pre-trained language models (PLMs) with graph neural networks (GNNs) on text-rich heterogeneous graphs. Our framework, the Graph Masked Language Model (GMLM), consists of two stages: a contrastive pre-training stage that uses a soft masking technique, and an end-to-end fine-tuning stage that combines a dynamic active node selection strategy with a bidirectional cross-attention module. Experimental results on five heterogeneous benchmarks show that GMLM achieves state-of-the-art performance on four of them, significantly outperforming existing GNN-based and large-scale LLM-based methods; for example, it improves accuracy by more than 8% on the Texas dataset and by nearly 5% on the Wisconsin dataset. These results demonstrate that carefully designed, deeply integrated architectures can be more effective and efficient for learning text-rich graph representations than larger, more general-purpose models.
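To make the fusion step concrete, the sketch below illustrates one plausible form of a bidirectional cross-attention module that fuses per-node GNN embeddings with PLM text embeddings. This is not the paper's implementation: the class name, dimensions, use of `nn.MultiheadAttention`, and the concatenation-plus-projection output are all assumptions made for illustration.

```python
# Minimal, illustrative sketch (PyTorch) of bidirectional cross-attention fusion
# between GNN and PLM node embeddings. All names and design choices here are
# hypothetical and not taken from the GMLM paper.
import torch
import torch.nn as nn


class BidirectionalCrossAttentionFusion(nn.Module):
    """Fuses per-node GNN and PLM embeddings with two cross-attention passes."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # One attention pass per direction: graph queries text, text queries graph.
        self.graph_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.text_to_graph = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.out = nn.Linear(2 * dim, dim)

    def forward(self, gnn_emb: torch.Tensor, plm_emb: torch.Tensor) -> torch.Tensor:
        # gnn_emb, plm_emb: (num_nodes, dim); treat each node as a length-1 sequence.
        g = gnn_emb.unsqueeze(1)
        t = plm_emb.unsqueeze(1)
        # Graph features attend over text features, and vice versa.
        g_attended, _ = self.graph_to_text(query=g, key=t, value=t)
        t_attended, _ = self.text_to_graph(query=t, key=g, value=g)
        # Concatenate both attended views and project back to the model dimension.
        fused = torch.cat([g_attended, t_attended], dim=-1).squeeze(1)
        return self.out(fused)


if __name__ == "__main__":
    fusion = BidirectionalCrossAttentionFusion(dim=256, num_heads=4)
    gnn_emb = torch.randn(32, 256)  # e.g. GNN outputs for 32 nodes (hypothetical)
    plm_emb = torch.randn(32, 256)  # e.g. projected PLM embeddings (hypothetical)
    print(fusion(gnn_emb, plm_emb).shape)  # torch.Size([32, 256])
```

In this sketch, each direction of attention lets one modality re-weight the other before the two views are concatenated and projected, which is one common way to realize "bidirectional" cross-modal fusion.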