This paper explores how to improve the learning efficiency of reinforcement learning (RL) by leveraging simulator derivatives. While existing gradient-based approaches have demonstrated superior performance compared to derivative-free approaches, accessing a simulator's gradients remains challenging due to implementation cost or outright inaccessibility. Model-based reinforcement learning (MBRL) can approximate these gradients with learned dynamics models, but prediction errors accumulate during training, potentially reducing solver efficiency and degrading policy performance. In this paper, we propose a method that decouples trajectory generation from gradient computation: trajectories are generated by the simulator, while gradients are computed by backpropagation through a learned differentiable model of the simulator. This hybrid design enables efficient and consistent first-order policy optimization even when simulator gradients are unavailable, and allows more accurate evaluators to be learned from simulated trajectories. The proposed method achieves the sample efficiency and speed of specialized optimizers such as SHAC while retaining the generality of standard approaches such as PPO and avoiding the misbehavior observed in other first-order MBRL methods. We experimentally validate the algorithm on benchmark control tasks and demonstrate its effectiveness on a real Go2 quadruped robot in both quadruped and biped walking tasks.
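The decoupling can be illustrated with a minimal sketch, not the authors' implementation: a policy is rolled out in a black-box simulator to produce states, and the policy gradient is then obtained by backpropagating returns through a separately learned differentiable dynamics model, re-anchored at the simulated states. All names, network sizes, and the one-step re-anchoring scheme below are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Illustrative dimensions and horizon (assumptions, not taken from the paper).
obs_dim, act_dim, horizon = 8, 2, 16

# Policy and a learned differentiable dynamics model (assumed trained
# separately on simulator data): (state, action) -> next state.
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
dynamics = nn.Sequential(nn.Linear(obs_dim + act_dim, 128), nn.Tanh(),
                         nn.Linear(128, obs_dim))
pi_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

def sim_step(state, action):
    """Black-box simulator step: returns the next state, exposes no gradients."""
    with torch.no_grad():
        return state + 0.1 * torch.tanh(action).sum(-1, keepdim=True) * torch.ones_like(state)

def reward(state, action):
    """Differentiable task reward (placeholder)."""
    return -(state.pow(2).sum(-1) + 1e-2 * action.pow(2).sum(-1))

# 1) Trajectory generation: roll the policy out in the (non-differentiable) simulator.
states = [torch.zeros(1, obs_dim)]
for t in range(horizon):
    with torch.no_grad():
        states.append(sim_step(states[-1], policy(states[-1])))

# 2) Gradient computation: backpropagate through the learned model, re-anchoring
#    each step at the simulator's state so that model prediction errors do not
#    compound over the horizon (one possible reading of the decoupling).
pi_opt.zero_grad()
total_return = torch.zeros(1)
for t in range(horizon):
    s = states[t].detach()                         # simulator state, no gradient
    a = policy(s)                                  # differentiable action
    s_next = dynamics(torch.cat([s, a], dim=-1))   # differentiable transition
    total_return = total_return + reward(s_next, a)
(-total_return.mean()).backward()                  # first-order policy gradient
pi_opt.step()
```

In the full method, a value estimator learned from the simulated trajectories would also enter the objective; it is omitted here to keep the sketch focused on the separation of rollout and gradient paths.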