This paper presents DeGuV, a novel framework for generalizing the skills that reinforcement learning (RL) agents learn from visual inputs to new environments. DeGuV employs a learnable mask network that, conditioned on depth information, produces a mask retaining only task-relevant visual information and suppressing unnecessary pixels. This lets the agent focus on key features and improves robustness under data augmentation. In addition, DeGuV incorporates contrastive learning and stabilizes Q-value estimation under augmentation, improving both sample efficiency and training stability. Evaluations on the RL-ViGen benchmark with a Franka Emika robot demonstrate that DeGuV outperforms state-of-the-art methods in both generalization and sample efficiency under zero-shot simulation-to-real transfer, while also enhancing interpretability by highlighting the most relevant regions of the visual input.
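To make the masking idea concrete, the following is a minimal sketch of a depth-conditioned mask network in PyTorch. It is an illustrative assumption, not the paper's actual architecture: the class name `DepthMaskNet`, the layer sizes, and the use of a sigmoid-gated soft mask are all hypothetical choices; the sketch only shows the general mechanism of predicting a per-pixel mask from depth and applying it to the RGB observation before the RL encoder.

```python
import torch
import torch.nn as nn

class DepthMaskNet(nn.Module):
    """Hypothetical sketch of a learnable mask network in the spirit of DeGuV:
    a small CNN maps a depth map to a soft per-pixel mask in [0, 1], which is
    applied to the RGB observation so only salient regions reach the encoder."""

    def __init__(self, hidden_channels: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, hidden_channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden_channels, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),  # soft mask values in [0, 1]
        )

    def forward(self, depth: torch.Tensor, rgb: torch.Tensor) -> torch.Tensor:
        mask = self.net(depth)  # (B, 1, H, W) per-pixel saliency from depth
        return rgb * mask       # suppress task-irrelevant pixels in the RGB input

# Example usage with dummy tensors (batch of 8 observations at 84x84):
depth = torch.rand(8, 1, 84, 84)
rgb = torch.rand(8, 3, 84, 84)
masked_obs = DepthMaskNet()(depth, rgb)  # masked observation fed to the RL encoder
```

Because the mask is differentiable, such a network can be trained end-to-end with the RL objective, and the learned mask doubles as an interpretability tool by visualizing which regions the agent attends to.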