XYZ-Drive is a single vision-language model for autonomous driving that couples geometric accuracy with semantic understanding in complex environments. It takes as input a forward-facing camera frame, a 25 m x 25 m aerial map, and the next waypoint, and outputs steering and speed. A lightweight, goal-focused cross-attention layer highlights the image and map patches relevant to the waypoint tokens, and the fused tokens are fed into a partially fine-tuned LLaMA-3.2 11B model. On the MD-NEX outdoor driving benchmark, XYZ-Drive achieves a 95% success rate and a 0.80 SPL (Success weighted by Path Length), outperforming PhysNav-DG by 15% and halving collisions.
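
The goal-focused cross-attention can be sketched as below: the waypoint token acts as the query, and the concatenated camera and map patch tokens serve as keys and values, so attention weights highlight the patches most relevant to the goal. The embedding size, patch counts, and function names here are illustrative assumptions, not the model's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def goal_cross_attention(waypoint_q, patch_kv, d):
    """Single-head cross-attention: waypoint query over image+map patches."""
    scores = waypoint_q @ patch_kv.T / np.sqrt(d)  # (1, n) similarity scores
    weights = softmax(scores)                      # attention over all patches
    fused = weights @ patch_kv                     # (1, d) goal-focused summary
    return fused, weights

rng = np.random.default_rng(0)
d = 16                                        # hypothetical embedding size
img_patches = rng.standard_normal((49, d))    # e.g. 7x7 camera patch tokens
map_patches = rng.standard_normal((25, d))    # e.g. 5x5 aerial-map patch tokens
waypoint = rng.standard_normal((1, d))        # next-waypoint token

patches = np.concatenate([img_patches, map_patches])   # (74, d) joint context
fused, w = goal_cross_attention(waypoint, patches, d)
print(fused.shape, w.shape)  # (1, 16) (1, 74)
```

In the full model, the fused tokens would then be passed to the partially fine-tuned LLaMA backbone; here the sketch only shows the fusion step.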