Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

DSBC: Data Science task Benchmarking with Context engineering

Created by
  • Haebom

Authors

Ram Mohan Rao Kadiyala, Siddhant Gupta, Jebish Purbey, Giulio Martini, Ali Shafique, Suman Debnath, Hamza Farooq

Outline

This paper presents a comprehensive benchmark for evaluating the effectiveness and limitations of data science agents built on large language models (LLMs). The benchmark is designed to reflect real-world user interactions, drawing on observations from commercial applications. Three LLMs—Claude-4.0-Sonnet, Gemini-2.5-Flash, and OpenAI-o4-Mini—are evaluated across three approaches: zero-shot, multi-step, and SmolAgent. Performance is measured across eight data science task categories, the models' sensitivity to common prompting problems such as data leakage and ambiguous instructions is analyzed, and the impact of the temperature parameter is investigated. The results illuminate performance differences between models and methodologies, highlight critical factors affecting real-world deployments, and provide a benchmark dataset and evaluation framework that lay the foundation for future research on more robust and effective data science agents.
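As a rough illustration of the evaluation setup described above, the sketch below loops over models, approaches, and temperatures and records per-configuration accuracy. It is a minimal sketch, not the authors' actual harness: the `query_model` and `score_answer` helpers are hypothetical placeholders, and the temperature values are assumed for illustration; only the model names and approach names come from the summary.

```python
# Minimal sketch of an evaluation loop over models, approaches, and temperatures.
# query_model() and score_answer() are hypothetical stubs, not the paper's code.
from itertools import product

MODELS = ["claude-4.0-sonnet", "gemini-2.5-flash", "openai-o4-mini"]
APPROACHES = ["zero-shot", "multi-step", "smolagent"]
TEMPERATURES = [0.0, 0.7]  # assumed values for illustration


def query_model(model: str, approach: str, task: dict, temperature: float) -> str:
    """Hypothetical stub: send the task prompt to the model and return its answer."""
    raise NotImplementedError


def score_answer(answer: str, task: dict) -> bool:
    """Hypothetical stub: check the model's answer against the task's ground truth."""
    raise NotImplementedError


def run_benchmark(tasks: list[dict]) -> dict:
    """Compute accuracy for every (model, approach, temperature) combination."""
    results = {}
    for model, approach, temp in product(MODELS, APPROACHES, TEMPERATURES):
        correct = sum(
            score_answer(query_model(model, approach, task, temp), task)
            for task in tasks
        )
        results[(model, approach, temp)] = correct / len(tasks)
    return results
```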

Takeaways, Limitations

Takeaways:
  • Provides a comprehensive benchmark for evaluating data science agents that reflects real-world user interactions.
  • A comparative analysis of performance across LLMs and approaches identifies factors that influence real-world deployments.
  • Highlights the importance of prompt engineering and the temperature setting.
  • Lays the foundation for future research on data science agents.
Limitations:
  • The evaluation covers a limited set of LLM types and versions.
  • The data science task categories included in the benchmark may not be sufficiently diverse.
  • Because the benchmark design is based on observations of commercial application usage, its generalizability may be limited.