CAREL (Cross-modal Auxiliary REinforcement Learning) is a novel framework for language-guided, goal-conditioned reinforcement learning, where the agent must follow instructions issued within the environment. It combines an auxiliary loss function inspired by video-text retrieval with instruction tracking, a novel method for automatically monitoring progress on an instruction as the agent acts. The framework targets generalization across diverse tasks and environments, enabling the agent to ground the individual parts of an instruction in the environmental context and thereby complete the full task in goal-achievement scenarios. Experimental results demonstrate strong sample efficiency and systematic generalization in multimodal reinforcement learning problems.
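A video-text-retrieval-inspired auxiliary objective of this kind is commonly realized as a contrastive (InfoNCE-style) loss that pulls each trajectory embedding toward its paired instruction embedding and pushes it away from the instructions of other episodes in the batch. The sketch below illustrates that general idea only; the function name, the symmetric InfoNCE form, and the temperature value are assumptions for illustration, not the paper's exact formulation, and the trajectory/instruction encoders are assumed to exist upstream.

```python
import numpy as np

def cross_modal_infonce(traj_emb, instr_emb, temperature=0.1):
    """Illustrative symmetric InfoNCE loss between trajectory and
    instruction embeddings (hypothetical sketch, not CAREL's exact loss).

    traj_emb, instr_emb: (batch, dim) arrays; row i of each is a matched pair.
    """
    # L2-normalize so the dot product below is cosine similarity.
    t = traj_emb / np.linalg.norm(traj_emb, axis=1, keepdims=True)
    s = instr_emb / np.linalg.norm(instr_emb, axis=1, keepdims=True)
    logits = (t @ s.T) / temperature  # (batch, batch) similarity matrix

    def ce(l):
        # Cross-entropy with the diagonal (matched pair) as the positive class.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the trajectory-to-instruction and instruction-to-trajectory terms.
    return 0.5 * (ce(logits) + ce(logits.T))
```

In such a setup the contrastive term is added to the RL loss with a small weight, so the policy gradient still drives behavior while the auxiliary signal shapes the shared cross-modal representation.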