This paper proposes a real-to-sim-to-real framework, called X-Sim. Instead of mimicking human motion, X-Sim extracts object motion from RGB-D videos to define object-centric rewards, which are then used to train a reinforcement learning (RL) policy in simulation. The learned policy is distilled into an image-conditioned diffusion policy using synthetic rollouts rendered under varied viewpoints and lighting conditions. To transfer to the real environment, we align real and simulated observations via online domain adaptation. Without requiring any robot teleoperation data, X-Sim improves task success by 30% on average across five manipulation tasks, matches the performance of existing methods with 10x less data-collection time, and generalizes well to new camera viewpoints and test-time variations.
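To make the object-centric reward idea concrete, the sketch below shows one plausible form such a reward could take: it scores how closely the simulated object's pose tracks a goal pose extracted from the human video, independent of the robot's own motion. All names, the pose parameterization, and the weighting are illustrative assumptions, not the paper's actual implementation.

```python
import math

def object_centric_reward(obj_pos, obj_quat, goal_pos, goal_quat,
                          w_pos=1.0, w_rot=0.5):
    """Hypothetical object-centric reward sketch.

    Penalizes the distance between the object's current pose
    (obj_pos, obj_quat) and the goal pose extracted from the human
    video (goal_pos, goal_quat). The robot's configuration never
    appears: only the object's motion is rewarded.
    """
    # Translational error: Euclidean distance between positions.
    pos_err = math.dist(obj_pos, goal_pos)

    # Rotational error: angle between the two unit quaternions.
    # |q1 . q2| handles the double-cover (q and -q are the same rotation).
    dot = abs(sum(a * b for a, b in zip(obj_quat, goal_quat)))
    rot_err = 2.0 * math.acos(min(1.0, dot))

    # Higher (less negative) reward as the object approaches the goal pose.
    return -(w_pos * pos_err + w_rot * rot_err)
```

In an RL loop, this function would be evaluated at each simulation step against the next waypoint of the extracted object trajectory, so the policy is free to discover any robot motion that reproduces the object's motion.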