This paper presents HESCAPE, a large-scale benchmark for evaluating multimodal learning methods that leverage both tissue morphology images and gene expression data in spatial transcriptomics. Using a curated whole-organ dataset comprising six gene panels and 54 donors, we systematically evaluate state-of-the-art image and gene expression encoders across various pre-training strategies and assess their effectiveness on two downstream tasks: gene mutation classification and gene expression prediction. Our study demonstrates that the gene expression encoder is a key determinant of robust cross-modal alignment, with gene models pre-trained on spatial transcriptomics data outperforming both models trained without spatial data and simple baseline approaches. However, our downstream evaluations reveal a paradoxical result: while contrastive pre-training consistently improves gene mutation classification, it degrades direct gene expression prediction relative to baseline encoders trained without cross-modal objectives. We identify batch effects as a key factor hindering effective cross-modal alignment, underscoring the need for batch-robust multimodal learning approaches in spatial transcriptomics. Finally, we open-source HESCAPE to provide a standardized dataset, evaluation protocol, and benchmarking tools.