Daily Arxiv

This page curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

CRED-SQL: Enhancing Real-world Large Scale Database Text-to-SQL Parsing through Cluster Retrieval and Execution Description

Created by
  • Haebom

Authors

Shaoming Duan, Zirui Wang, Chuanyi Liu, Zhibin Zhu, Yuhao Zhang, Peiyi Han, Liang Yan, Zewu Penge

Outline

This paper proposes CRED-SQL, a framework for improving the accuracy of Text-to-SQL systems, which convert natural language queries (NLQs) into SQL queries, on large-scale databases. In such databases, many attributes are semantically similar, which makes schema linking (matching an NLQ to the right tables and columns) harder and causes semantic drift during SQL generation; both reduce accuracy. CRED-SQL addresses these issues in two ways. First, it uses cluster-based, large-scale schema retrieval to identify the tables and columns relevant to an NLQ. Second, it introduces an intermediate representation, Execution Description Language (EDL), between the NLQ and SQL. Splitting the task into two steps, NLQ to EDL and then EDL to SQL, leverages the reasoning capabilities of LLMs while reducing semantic drift. Experiments on two large-scale cross-domain benchmarks, SpiderUnion and BirdUnion, show that CRED-SQL achieves state-of-the-art performance.
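The flow described above (narrow a large schema to a relevant cluster, then go from NLQ to an execution description and from there to SQL) can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not specify the clustering algorithm, the embedding model, or the EDL format, so the bag-of-words similarity, the greedy clustering, the prompt wording, and every name here (`SCHEMA`, `cluster_tables`, `retrieve_tables`, `text_to_sql`, the `llm` callable) are assumptions made for illustration only.

```python
import math
from collections import Counter

# Hypothetical toy schema: table name -> column names.
SCHEMA = {
    "singer": ["singer_id", "name", "age"],
    "concert": ["concert_id", "singer_id", "year"],
    "invoice": ["invoice_id", "customer_id", "total"],
    "customer": ["customer_id", "name", "email"],
}

def embed(text):
    """Toy bag-of-words vector; a real system would use a learned embedding."""
    return Counter(text.lower().replace("_", " ").split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def table_vec(schema, table):
    """Represent a table by its name plus its column names."""
    return embed(table + " " + " ".join(schema[table]))

def cluster_tables(schema, threshold=0.3):
    """Greedy one-pass clustering: a table joins the first cluster whose
    seed (first member) is similar enough, otherwise it starts a new one."""
    clusters = []
    for table in schema:
        vec = table_vec(schema, table)
        for cluster in clusters:
            if cosine(vec, table_vec(schema, cluster[0])) >= threshold:
                cluster.append(table)
                break
        else:
            clusters.append([table])
    return clusters

def retrieve_tables(nlq, schema, clusters, top_k=2):
    """Pick the cluster closest to the question, then rank tables inside it."""
    q = embed(nlq)
    score = lambda t: cosine(q, table_vec(schema, t))
    best = max(clusters, key=lambda c: max(score(t) for t in c))
    return sorted(best, key=score, reverse=True)[:top_k]

def text_to_sql(nlq, schema, clusters, llm):
    """Two-step generation: NLQ -> execution description -> SQL.
    `llm` is any callable mapping a prompt string to a completion string."""
    sub_schema = {t: schema[t] for t in retrieve_tables(nlq, schema, clusters)}
    edl = llm(f"Describe step by step how to answer: {nlq}\nSchema: {sub_schema}")
    sql = llm(f"Translate this execution description into SQL:\n{edl}\nSchema: {sub_schema}")
    return edl, sql
```

Passing the retrieved sub-schema to both prompts, rather than the full schema, is the point of the retrieval step: the model only sees candidate tables from one cluster, which is how similar-looking attributes elsewhere in the database are kept out of the generation context.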

Takeaways, Limitations

Takeaways:
Presents a novel method for improving the accuracy of Text-to-SQL systems on large-scale databases.
Achieves effective schema matching and reduces semantic drift through cluster-based schema retrieval and EDL.
Leverages the reasoning capabilities of LLMs to improve performance on Text-to-SQL tasks.
Achieves state-of-the-art performance on the SpiderUnion and BirdUnion benchmarks.
Supports reproducibility and extensibility by releasing the code as open source.
Limitations:
Further research is needed to determine the generalizability of the proposed framework.
Further performance evaluation and improvement are needed for specific types of complex NLQs.
Research is needed to optimize EDL design and adapt it to various database schemas.
Further experiments are needed to determine performance and scalability in real-world application environments.