This paper presents NuPlanQA-Eval, a novel benchmark for evaluating the driving scene understanding capabilities of multimodal large language models (MLLMs), together with NuPlanQA-1M, a large-scale dataset. NuPlanQA-1M consists of 1 million real-world visual question-answering (VQA) pairs, organized into nine subtasks spanning three core skills: road environment perception, spatial relationship recognition, and egocentric reasoning. Our evaluation shows that conventional MLLMs struggle with driving-scene-specific perception and with spatial reasoning from an egocentric perspective. To address this, we propose BEV-LLM, which integrates bird's-eye-view (BEV) features from multi-view images into the MLLM. BEV-LLM outperforms other models on six of the nine subtasks, demonstrating that incorporating BEV features improves the performance of multi-view MLLMs. The NuPlanQA dataset is publicly available.
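To make the BEV-integration idea concrete, the following is a minimal sketch (not the authors' implementation) of one common way BEV features from multi-view images could be fed to a language model: the BEV feature map is flattened into tokens, projected to the LLM's hidden size, and prepended to the text embeddings. All module names and dimensions here are illustrative assumptions.

```python
# Hypothetical sketch of injecting BEV features into an MLLM's input sequence.
import torch
import torch.nn as nn


class BEVAdapter(nn.Module):
    """Projects a BEV feature map into LLM-compatible token embeddings."""

    def __init__(self, bev_channels: int = 256, llm_hidden: int = 4096):
        super().__init__()
        # Simple linear projection; a real system might use an MLP or cross-attention.
        self.proj = nn.Linear(bev_channels, llm_hidden)

    def forward(self, bev_feats: torch.Tensor) -> torch.Tensor:
        # bev_feats: (batch, channels, H, W) -> (batch, H*W, channels)
        b, c, h, w = bev_feats.shape
        tokens = bev_feats.flatten(2).transpose(1, 2)
        return self.proj(tokens)  # (batch, H*W, llm_hidden)


# Usage: concatenate BEV tokens with text token embeddings before the LLM backbone.
adapter = BEVAdapter()
bev_feats = torch.randn(1, 256, 50, 50)    # hypothetical BEV map from multi-view images
text_embeds = torch.randn(1, 32, 4096)     # hypothetical text token embeddings
llm_inputs = torch.cat([adapter(bev_feats), text_embeds], dim=1)
```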