Electrocardiogram (ECG) diagnosis remains challenging due to limited labeled data and the difficulty of capturing subtle yet clinically relevant rhythm and morphological changes. In this paper, we present the Contrastive Regularized Masked Autoencoder (CREMA), a foundation model for 12-lead ECGs designed to learn generalizable representations through self-supervised pretraining. CREMA couples generative masked autoencoder (MAE) learning with a contrastive regularization loss and employs the Signal Transformer (SiT) architecture to capture both local waveform details and global temporal dependencies. We evaluate CREMA on benchmark datasets and in real-world clinical settings, including deployment scenarios with significant distributional shift. CREMA outperforms both supervised baselines and existing self-supervised models under linear probing and fine-tuning evaluations, and its robustness in real-world settings is evidenced by superior performance across diverse clinical domains, particularly emergency care. These results indicate that CREMA serves as a scalable and robust foundation model for ECG diagnosis, supporting downstream applications in heterogeneous and high-risk clinical settings.
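As a minimal sketch of the pretraining objective described above (the exact formulation and weighting are not given in this abstract and are assumptions), the combination of generative and contrastive learning can be written as a weighted sum of the two terms:

\[
\mathcal{L}_{\mathrm{CREMA}} \;=\; \mathcal{L}_{\mathrm{MAE}} \;+\; \lambda\, \mathcal{L}_{\mathrm{contrastive}},
\]

where \(\mathcal{L}_{\mathrm{MAE}}\) is the masked-signal reconstruction loss over the masked patches of the 12-lead input, \(\mathcal{L}_{\mathrm{contrastive}}\) is the contrastive regularization term applied to the latent representations, and \(\lambda\) is a hypothetical balancing hyperparameter.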