NovaDrive is a single-branch vision-language architecture for autonomous driving in complex situations that jointly processes front-camera images, HD map tiles, LiDAR depth, and text-based waypoints. A lightweight two-stage cross-attention block first aligns waypoint tokens with the HD map, then refines attention over fine-grained image and depth patches. Combined with a novel smoothing loss that penalizes abrupt steering and velocity changes, this design eliminates the need for recurrent memory. NovaDrive fine-tunes only the top 15 layers of an 11B-parameter LLaMA-3.2 vision-language backbone, enabling real-time inference. On the nuScenes/Waymo subset of the MD-NEX Outdoor benchmark, NovaDrive raises the success rate to 84% (+4 percentage points), improves path efficiency (SPL) to 0.66 (+0.11), and reduces collision frequency from 2.6% to 1.2% (a drop of 1.4 percentage points). These gains are attributed primarily to the waypoint tokens, partial VLM fine-tuning, and cross-attention fusion.
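The abstract does not state the form of the smoothing loss. As a minimal sketch, assuming it penalizes the mean squared change between consecutive steering and velocity outputs, it could look like the following. The function name `smoothing_loss` and the weights `w_steer` and `w_vel` are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def smoothing_loss(
    steering: Sequence[float],
    velocity: Sequence[float],
    w_steer: float = 1.0,  # hypothetical weight, not from the source
    w_vel: float = 1.0,    # hypothetical weight, not from the source
) -> float:
    """Mean squared change between consecutive control outputs.

    Assumed form (an illustration, not the authors' definition):
        L = w_steer * mean((s[t+1] - s[t])^2)
          + w_vel   * mean((v[t+1] - v[t])^2)
    """
    if len(steering) != len(velocity):
        raise ValueError("steering and velocity must have the same length")
    if len(steering) < 2:
        return 0.0  # a single step has no change to penalize

    n = len(steering) - 1  # number of consecutive pairs
    steer_term = sum((steering[t + 1] - steering[t]) ** 2 for t in range(n)) / n
    vel_term = sum((velocity[t + 1] - velocity[t]) ** 2 for t in range(n)) / n
    return w_steer * steer_term + w_vel * vel_term


# A gradual trajectory versus one that oscillates between two values.
smooth = smoothing_loss([0.0, 0.1, 0.2, 0.3], [5.0, 5.1, 5.2, 5.3])
jerky = smoothing_loss([0.0, 0.3, 0.0, 0.3], [5.0, 5.3, 5.0, 5.3])
print(smooth < jerky)
```

Under this assumed form, the oscillating trajectory receives a larger penalty than the gradual one, which is the behavior a term of this kind is meant to encourage during training. In practice such a term would be added to the main driving objective with a tunable weight.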