Designing scalable and generalizable rewards is crucial for building general-purpose agents in reinforcement learning (RL), especially in the challenging domain of robotic manipulation. Recent advances in reward design with vision-language models (VLMs) are promising, but their sparse-reward nature severely limits sample efficiency. In this paper, we propose TeViR, a novel method that generates dense rewards by comparing the image sequence predicted by a pre-trained text-to-video diffusion model with the agent's current observations. Experimental results on 11 complex robotic tasks demonstrate that TeViR outperforms existing sparse-reward baselines and state-of-the-art (SOTA) methods, achieving better sample efficiency and performance without ground-truth environment rewards. TeViR's ability to efficiently guide agents in complex environments highlights its potential for advancing reinforcement learning applications in robotic manipulation.
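To make the core idea concrete, the following is a minimal, illustrative sketch (not the paper's actual formulation) of a dense reward computed by matching the current observation against a predicted frame sequence. The encoder, the cosine-similarity matching, the progress bonus, and all names here are assumptions introduced for illustration; the real TeViR reward is defined in the method section.

```python
import numpy as np

def dense_reward(predicted_frames, observation, encode):
    """Toy dense reward: similarity between the current observation and the
    closest frame of a predicted video, plus a progress term for how far
    along the predicted sequence that frame lies. Illustrative sketch only;
    the actual TeViR reward is specified in the paper, not here."""
    obs_z = encode(observation)
    sims = []
    for frame in predicted_frames:
        z = encode(frame)
        sims.append(float(np.dot(obs_z, z) /
                          (np.linalg.norm(obs_z) * np.linalg.norm(z) + 1e-8)))
    best = int(np.argmax(sims))                      # closest predicted frame
    progress = best / max(len(predicted_frames) - 1, 1)  # fraction of the video reached
    return 0.5 * sims[best] + 0.5 * progress

# Hypothetical usage with a stand-in "encoder" (flattening the image):
frames = [np.random.rand(64, 64, 3) for _ in range(8)]    # predicted video frames
obs = frames[3] + 0.01 * np.random.rand(64, 64, 3)        # current camera image
print(dense_reward(frames, obs, encode=lambda img: img.ravel()))
```

In this sketch the reward grows as the observation resembles frames later in the predicted sequence, which is one simple way a video prediction could be turned into a dense training signal.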