Reinforcement learning (RL) has achieved remarkable success on complex decision-making problems, but the lack of interpretability in its decision-making process hinders its adoption in critical domains. Existing explainable AI (xAI) approaches often fail to provide meaningful explanations for RL agents, largely because they overlook the contrastive nature of human reasoning (answering questions such as "Why did you choose this action over another?"). To address this gap, this paper proposes $\textbf{VisionMask}$, a novel framework that uses self-supervised contrastive learning to generate explanations for an RL agent's decisions by explicitly contrasting the chosen action with alternative actions in a given state. Experiments in various RL environments evaluate VisionMask in terms of fidelity, robustness, and complexity. The results show that VisionMask significantly enhances human understanding of agent behavior while maintaining accuracy and fidelity. We also present examples showing how VisionMask can be used for counterfactual analysis. This research bridges the gap between RL and xAI, paving the way for safer and more interpretable RL systems.
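To make the contrastive, self-supervised idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation: it assumes a frozen visual policy, a hypothetical `MaskGenerator` network, and a hypothetical `contrastive_explanation_loss` in which the masked state must still make the policy prefer the originally chosen action over the alternatives, with a `sparsity_weight` term keeping the explanation mask compact. The architecture, loss weighting, and environment interface are all illustrative assumptions.

```python
# Hypothetical sketch (not the paper's implementation): train a mask generator so that
# the frozen policy, when shown only the masked part of the state, still prefers the
# agent's originally chosen action over alternative actions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskGenerator(nn.Module):
    """Produces a per-pixel importance mask in [0, 1] for a visual state."""
    def __init__(self, in_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, state):
        return self.net(state)  # shape: (B, 1, H, W)

def contrastive_explanation_loss(policy, mask_gen, state, sparsity_weight=0.01):
    """Self-supervised objective: the masked state should reproduce the agent's
    chosen action (contrasted against all alternative actions via cross-entropy),
    while a sparsity term keeps the explanation focused."""
    with torch.no_grad():
        chosen_action = policy(state).argmax(dim=-1)  # action the frozen agent picked
    mask = mask_gen(state)
    masked_logits = policy(state * mask)  # policy re-evaluated on the masked input
    fidelity = F.cross_entropy(masked_logits, chosen_action)
    sparsity = mask.mean()
    return fidelity + sparsity_weight * sparsity

if __name__ == "__main__":
    # Toy stand-in for a pretrained agent: a frozen CNN policy over 6 actions.
    policy = nn.Sequential(
        nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 6),
    ).eval()
    for p in policy.parameters():
        p.requires_grad_(False)

    mask_gen = MaskGenerator()
    opt = torch.optim.Adam(mask_gen.parameters(), lr=1e-3)
    state = torch.rand(4, 3, 84, 84)  # batch of visual states
    loss = contrastive_explanation_loss(policy, mask_gen, state)
    loss.backward()
    opt.step()
    print(f"loss: {loss.item():.4f}")
```

In this sketch the agent itself is never modified: gradients flow through the frozen policy only to update the mask generator, which is one plausible way to keep an explanation module agent-agnostic.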