This paper presents a novel self-supervised learning (SSL) framework for handwritten mathematical expression recognition (HMER), designed to reduce reliance on expensive labeled data. The framework pretrains an image encoder with a combination of global and local contrastive losses, so that it learns both holistic and fine-grained representations. In addition, we propose a self-supervised attention network trained with a progressive spatial masking strategy; without any supervision, the resulting attention mechanism focuses on meaningful regions such as operators, exponents, and nested mathematical notation. The progressive masking curriculum strengthens structural understanding by making the network increasingly robust to missing or occluded visual information. The overall pipeline consists of (1) self-supervised pretraining of the encoder, (2) self-supervised attention training, and (3) supervised fine-tuning with a Transformer decoder for LaTeX sequence generation. Extensive experiments on the CROHME benchmark demonstrate the effectiveness of the progressive attention mechanism, with the framework outperforming existing SSL and fully supervised baselines.
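As a concrete illustration of the pretraining objective and the masking curriculum, the sketch below shows one plausible way to combine a global NT-Xent contrastive loss over image-level embeddings with a local term over patch-level embeddings, alongside a simple linear schedule for the progressive mask ratio. The function names, the `lambda_local` weighting, the temperature, and the schedule endpoints are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch (illustrative assumptions, not the paper's exact formulation).
import torch
import torch.nn.functional as F


def nt_xent(z1, z2, temperature=0.1):
    """NT-Xent contrastive loss for two augmented views of shape (B, D)."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    z = torch.cat([z1, z2], dim=0)                 # (2B, D)
    sim = z @ z.t() / temperature                  # pairwise similarities
    sim.fill_diagonal_(float("-inf"))              # exclude self-pairs
    b = z1.size(0)
    # positives: view-1 sample i matches view-2 sample i (and vice versa)
    targets = torch.cat([torch.arange(b, 2 * b), torch.arange(0, b)]).to(z.device)
    return F.cross_entropy(sim, targets)


def combined_contrastive_loss(g1, g2, p1, p2, lambda_local=0.5):
    """Global term on (B, D) image embeddings plus a local term on (B, P, D)
    patch embeddings, flattened so corresponding patches form positive pairs."""
    return nt_xent(g1, g2) + lambda_local * nt_xent(p1.flatten(0, 1), p2.flatten(0, 1))


def mask_ratio(step, total_steps, start=0.1, end=0.5):
    """Progressive masking curriculum: linearly grow the fraction of masked
    spatial positions as training proceeds."""
    t = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return start + t * (end - start)
```

Under these assumptions, the encoder would be optimized with `combined_contrastive_loss` during stage (1), while `mask_ratio` would control how aggressively spatial positions are hidden from the attention network during stage (2).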