Inspecting enclosed industrial infrastructure, such as ventilation ducts, requires robust, collision-tolerant navigation policies for autonomous unmanned aerial vehicles (UAVs). This paper investigates the trade-offs between on-policy and off-policy reinforcement learning algorithms for safety-critical, high-precision navigation tasks. Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC) were compared on the task of learning precision flight through procedurally generated ducts in a simulator. While PPO consistently learned stable, collision-free policies that completed the full duct traversal, SAC failed to find a complete solution, converging to a suboptimal policy that navigated only the initial segments before failing.
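As a minimal sketch of the comparison described above: the paper does not specify its training framework, so the stable-baselines3 implementations of PPO and SAC and a hypothetical Gym-style duct environment ("DuctEnv-v0") are assumed here purely for illustration.

```python
# Illustrative PPO-vs-SAC comparison on a duct-navigation task.
# Assumptions (not from the paper): stable-baselines3 as the RL library,
# and a hypothetical registered environment "DuctEnv-v0" exposing the
# procedurally generated ducts.
import gymnasium as gym
from stable_baselines3 import PPO, SAC

ENV_ID = "DuctEnv-v0"        # hypothetical duct environment
TOTAL_STEPS = 1_000_000      # illustrative training budget

for algo_cls in (PPO, SAC):
    env = gym.make(ENV_ID)
    # Both algorithms use the same MLP policy class for a like-for-like
    # comparison; only the learning algorithm differs.
    model = algo_cls("MlpPolicy", env, verbose=0)
    model.learn(total_timesteps=TOTAL_STEPS)
    model.save(f"duct_{algo_cls.__name__.lower()}")
    env.close()
```

Holding the environment, policy architecture, and step budget fixed across both runs isolates the on-policy versus off-policy distinction that the comparison is meant to probe.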