This paper presents a comprehensive benchmark for evaluating the effectiveness and limitations of data science agents built on large language models (LLMs). We design the benchmark to reflect real-world user interactions, drawing on observations of commercial applications. We evaluate three LLMs (Claude-4.0-Sonnet, Gemini-2.5-Flash, and OpenAI-o4-Mini) using both a zero-shot multi-step approach and the SmolAgent framework. We assess performance across eight data science task categories, analyze the models' sensitivity to common prompting issues such as data leakage and ambiguous instructions, and investigate the impact of the temperature parameter. Our results illuminate performance differences across models and methodologies, highlight critical factors affecting real-world deployment, and provide a benchmark dataset and evaluation framework that lay the foundation for future research on more robust and effective data science agents.