This paper highlights the importance of predicting the short-term movements of vulnerable road users (VRUs) for safe autonomous driving, particularly in urban environments where ambiguous or risky behaviors are prevalent. While existing vision-language models (VLMs) enable open-vocabulary recognition, their application to fine-grained intent inference remains largely unexplored. To address this gap, this paper presents DRAMA-X, a fine-grained benchmark constructed through an automatic annotation pipeline built on the DRAMA dataset. DRAMA-X provides object bounding boxes, nine-directional intent labels, binary risk scores, expert-generated action suggestions for the autonomous vehicle, and descriptive motion summaries for 5,686 accident-risk frames. These annotations support a structured evaluation of four interrelated tasks central to autonomous-driving decision-making: object detection, intent prediction, risk assessment, and action suggestion. As a baseline, the paper proposes SGG-Intent, a lightweight, training-free framework that mirrors the inference pipeline of an autonomous vehicle: it first generates a scene graph from the visual input with a VLM-based detector, then sequentially infers intent, assesses risk, and recommends an action through compositional reasoning with a large language model (LLM). We evaluate several state-of-the-art VLMs and compare their performance across the four DRAMA-X tasks. Experimental results show that scene-graph-based inference improves intent prediction and risk assessment, especially when contextual cues are modeled explicitly.
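
To make the sequential structure of SGG-Intent concrete, the sketch below outlines the scene graph → intent → risk → action flow described above. It is illustrative only, under the assumption that a VLM-based detector and an LLM are exposed through simple callable interfaces; all names here (`detect_scene_graph`, `query_llm`, `sgg_intent`) are hypothetical and do not correspond to the authors' released code.

```python
# Minimal sketch of the SGG-Intent pipeline described in the abstract.
# All component interfaces are assumed for exposition, not taken from the paper.

from dataclasses import dataclass
from typing import List

@dataclass
class SceneGraph:
    objects: List[dict]     # e.g. {"label": "pedestrian", "bbox": [x1, y1, x2, y2]}
    relations: List[tuple]  # e.g. ("pedestrian_0", "near", "crosswalk_0")

def detect_scene_graph(image) -> SceneGraph:
    """Step 1: a VLM-based open-vocabulary detector builds the scene graph (assumed interface)."""
    raise NotImplementedError

def query_llm(prompt: str) -> str:
    """Thin wrapper around any LLM completion API (assumed interface)."""
    raise NotImplementedError

def sgg_intent(image) -> dict:
    """Training-free, sequential inference: scene graph -> intent -> risk -> action."""
    graph = detect_scene_graph(image)

    # Step 2: infer a nine-directional intent for each VRU from its scene-graph context.
    intents = query_llm(
        "Given these objects and relations, predict each VRU's short-term "
        f"direction of motion (one of nine directions):\n{graph}"
    )

    # Step 3: binary risk assessment conditioned on the predicted intents.
    risk = query_llm(
        f"Scene graph:\n{graph}\nPredicted intents:\n{intents}\n"
        "Is any road user at risk of collision with the ego vehicle? Answer yes or no."
    )

    # Step 4: recommend an ego-vehicle action given the assessed risk.
    action = query_llm(
        f"Risk assessment: {risk}\nPredicted intents: {intents}\n"
        "Suggest a short driving action for the ego vehicle (e.g., slow down, yield)."
    )
    return {"scene_graph": graph, "intents": intents, "risk": risk, "action": action}
```

The design choice reflected here is compositionality: each downstream step conditions on the explicit output of the previous one, so contextual cues from the scene graph are available to intent, risk, and action inference rather than being inferred implicitly end to end.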