Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

SWE-SQL: Illuminating LLM Pathways to Solve User SQL Issues in Real-World Applications

Created by
  • Haebom

Authors

Jinyang Li, Xiaolong Li, Ge Qu, Per Jacobsson, Bowen Qin, Binyuan Hui, Shuzheng Si, Nan Huo, Xiaohan Xu, Yue Zhang, Ziwei Tang, Yuanshuai Li, Florensia Widjaja, Xintong Zhu, Feige Zhou, Yongfeng Huang, Yannis Papakonstantinou, Fatma Ozcan, Chenhao Ma, Reynold Cheng

Overview

This paper argues that resolving complex SQL issues remains a major bottleneck in real-world database applications. Current large language models (LLMs) are adept at text-to-SQL translation, but they have not been rigorously evaluated on the harder task of debugging SQL issues. To address this gap, the authors introduce BIRD-CRITIC, a new SQL issue debugging benchmark consisting of 530 PostgreSQL tasks (BIRD-CRITIC-PG) and 570 multi-dialect tasks (BIRD-CRITIC-Multi), distilled from real user issues and replayed in new environments. Even the leading reasoning model, O3-Mini, achieves a success rate of only 38.87% on BIRD-CRITIC-PG and 33.33% on BIRD-CRITIC-Multi, which underscores the difficulty of the task.

The authors also argue that advancing open-source models for database tasks is important for supporting local development while protecting data privacy. They therefore present Six-Gym (Sql-fIX-Gym), a training environment for improving the SQL debugging capabilities of open-source models. Six-Gym uses the SQL-Rewind strategy, which automatically generates executable issue-solution datasets by reverse-engineering issues from verified SQL. Because popular trajectory-based fine-tuning methods leave substantial supervision signals unexplored, the authors further propose f-Plan Boosting, which extracts high-level debugging plans from SQL solutions and enables teacher LLMs to produce 73.7% more successful trajectories for training. These components are integrated into an open-source agent, Bird-Fixer. Built on Qwen-2.5-Coder-14B, Bird-Fixer achieves a success rate of 38.11% on BIRD-CRITIC-PG and 29.65% on BIRD-CRITIC-Multi, outperforming leading proprietary models such as Claude-3.7-Sonnet and GPT-4.1, a significant step toward democratizing sophisticated SQL debugging capabilities.
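To make the idea of building issue-solution pairs from verified SQL more concrete, here is a minimal sketch. It is not the authors' implementation: the `orders` table, the `inject_bug` corruption rule, and the helper names `rewind` and `is_fixed` are all hypothetical, and SQLite stands in for PostgreSQL so the example runs without a server. It assumes that a candidate pair is kept only when the corrupted query fails or returns different rows than the verified one, and that a proposed fix counts as successful when its result matches the verified result.

```python
import sqlite3


def make_db():
    """Build a tiny in-memory database (hypothetical schema, for illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
        INSERT INTO orders (customer, amount) VALUES
            ('alice', 10.0), ('alice', 30.0), ('bob', 5.0);
        """
    )
    return conn


def run(conn, sql):
    """Execute a query and return its rows sorted, so comparison ignores row order."""
    return sorted(conn.execute(sql).fetchall())


def inject_bug(sql):
    """Hypothetical corruption rule: replace the intended aggregate with a wrong one."""
    return sql.replace("SUM(amount)", "COUNT(amount)")


def rewind(conn, verified_sql):
    """Derive an (issue, solution) pair from a verified query, or None if no real bug results."""
    expected = run(conn, verified_sql)
    buggy_sql = inject_bug(verified_sql)
    if buggy_sql == verified_sql:
        return None  # the corruption rule did not apply
    try:
        if run(conn, buggy_sql) == expected:
            return None  # same rows as the verified query, so not a genuine issue
    except sqlite3.Error:
        pass  # a query that raises an error is also a usable issue
    return {"issue_sql": buggy_sql, "solution_sql": verified_sql, "expected": expected}


def is_fixed(conn, task, candidate_sql):
    """Execution-based check: the candidate must run and reproduce the expected rows."""
    try:
        return run(conn, candidate_sql) == task["expected"]
    except sqlite3.Error:
        return False


VERIFIED_SQL = "SELECT customer, SUM(amount) FROM orders GROUP BY customer"

if __name__ == "__main__":
    conn = make_db()
    task = rewind(conn, VERIFIED_SQL)
    print(task["issue_sql"])
    print(is_fixed(conn, task, task["solution_sql"]))
```

Under these assumptions, a success rate like those reported above would simply be the fraction of tasks for which `is_fixed` returns `True`. The actual benchmark may use richer test cases than a single result comparison.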

Takeaways, Limitations

Takeaways:
Introduces BIRD-CRITIC, a new SQL issue debugging benchmark built from real user issues.
Proposes Six-Gym (Sql-fIX-Gym), a training environment for open-source SQL debugging models, together with the f-Plan Boosting technique.
Shows that the open-source agent Bird-Fixer outperforms leading proprietary models such as Claude-3.7-Sonnet and GPT-4.1.
Contributes to democratizing sophisticated SQL debugging capabilities.
Limitations:
The BIRD-CRITIC benchmark is relatively small (1,100 tasks); a larger and more diverse dataset may be needed.
Success rates of roughly 30-40% leave substantial room for improvement.
The benchmark may be biased toward PostgreSQL, which accounts for 530 of the 1,100 tasks; generalization across other database systems needs further validation.
The generalizability of the f-Plan Boosting technique requires further study.