XYZ-Drive is an autonomous driving system built on a single vision-language model: it takes a forward-facing camera frame, a 25 m × 25 m aerial map, and the next waypoint as input, and outputs steering and speed. Waypoint tokens support both action and textual descriptions; a lightweight, goal-focused cross-attention layer lets them highlight relevant image and map patches, and the fused tokens are fed into a partially fine-tuned LLaMA-3.2 11B model. On the MD-NEX Outdoor-Driving benchmark, XYZ-Drive achieves a 95% success rate and a success weighted by path length (SPL) of 0.80, a 15% improvement over PhysNav-DG, while halving the number of collisions and significantly improving efficiency by using only a single branch. We validate these gains through 16 ablation studies.
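The goal-focused fusion described above can be sketched as a single-head cross-attention step in which the waypoint token acts as the query and the concatenated image and map patch embeddings supply keys and values. This is a minimal NumPy illustration under assumed dimensions; the function name, weight matrices, and shapes are hypothetical and not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def goal_cross_attention(waypoint_tok, image_patches, map_patches, Wq, Wk, Wv):
    """Waypoint token (query) attends over concatenated image and map patches.

    waypoint_tok:  (d,)    embedding of the next-waypoint token
    image_patches: (Pi, d) camera patch embeddings
    map_patches:   (Pm, d) aerial-map patch embeddings
    Returns the fused token (d_k,) and attention weights (Pi + Pm,).
    """
    ctx = np.concatenate([image_patches, map_patches], axis=0)  # (P, d)
    q = waypoint_tok @ Wq                      # (d_k,)  goal query
    k = ctx @ Wk                               # (P, d_k)
    v = ctx @ Wv                               # (P, d_k)
    scores = k @ q / np.sqrt(q.shape[-1])      # scaled dot-product scores
    attn = softmax(scores)                     # weights highlight relevant patches
    fused = attn @ v                           # goal-conditioned fused token
    return fused, attn
```

The fused token (one per waypoint token in practice) would then be passed, alongside the other tokens, into the partially fine-tuned LLaMA-3.2 11B backbone.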