Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

CRED-SQL: Enhancing Real-world Large Scale Database Text-to-SQL Parsing through Cluster Retrieval and Execution Description

Created by
  • Haebom

Authors

Shaoming Duan, Zirui Wang, Chuanyi Liu, Zhibin Zhu, Yuhao Zhang, Peiyi Han, Liang Yan, Zewu Peng

Outline

This paper proposes CRED-SQL, a framework for improving the accuracy of Text-to-SQL systems, which convert natural language queries (NLQs) into SQL, on large-scale databases. Because large databases contain many semantically similar attributes, existing Text-to-SQL systems lose accuracy to schema mismatch and semantic drift. CRED-SQL addresses schema mismatch with cluster-based retrieval over the large-scale schema, which identifies the tables and columns most relevant to the NLQ. It also introduces Execution Description Language (EDL), an intermediate representation between NLQ and SQL, and splits the task into two steps: Text-to-EDL and EDL-to-SQL. This decomposition draws on the reasoning capabilities of LLMs while reducing semantic drift. On two large-scale cross-domain benchmarks, SpiderUnion and BirdUnion, CRED-SQL achieves state-of-the-art performance, demonstrating its effectiveness and scalability.
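
The two ideas in the outline, narrowing the schema cluster by cluster and then translating in two stages, can be illustrated with a rough sketch. This is not the authors' implementation: the bag-of-words cosine scoring, the cluster layout, and the names retrieve_schema, text_to_sql, text_to_edl, and edl_to_sql are all assumptions made for illustration, and the two translation steps are passed in as placeholder callables because the summary does not describe EDL's syntax.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase word tokens; underscores in identifiers act as separators."""
    return re.findall(r"[a-z0-9]+", text.lower().replace("_", " "))


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def table_vector(table, columns):
    """Bag of words built from a table name and its column names."""
    return Counter(tokens(table + " " + " ".join(columns)))


def retrieve_schema(nlq, clusters, top_clusters=1, top_tables=2):
    """Two-stage retrieval over a hypothetical layout:
    clusters = {cluster_name: {table_name: [column_name, ...]}}."""
    query = Counter(tokens(nlq))

    # Stage 1: score each cluster by the pooled vocabulary of its tables.
    def cluster_score(name):
        pooled = Counter()
        for table, columns in clusters[name].items():
            pooled.update(table_vector(table, columns))
        return cosine(query, pooled)

    ranked = sorted(clusters, key=cluster_score, reverse=True)

    # Stage 2: rank tables only inside the selected clusters.
    candidates = []
    for name in ranked[:top_clusters]:
        for table, columns in clusters[name].items():
            candidates.append((cosine(query, table_vector(table, columns)), table))
    candidates.sort(reverse=True)
    return [table for _, table in candidates[:top_tables]]


def text_to_sql(nlq, clusters, text_to_edl, edl_to_sql):
    """Compose retrieval with the two translation steps.
    text_to_edl and edl_to_sql are placeholders for LLM calls (assumed)."""
    tables = retrieve_schema(nlq, clusters)
    edl = text_to_edl(nlq, tables)   # step 1: NLQ -> intermediate description
    return edl_to_sql(edl, tables)   # step 2: intermediate description -> SQL


if __name__ == "__main__":
    demo_clusters = {
        "music": {
            "singer": ["singer_id", "name", "age"],
            "concert": ["concert_id", "stadium_id", "year"],
        },
        "school": {
            "student": ["student_id", "name", "age"],
            "course": ["course_id", "title"],
        },
    }
    question = "What is the name and age of each singer?"
    print(retrieve_schema(question, demo_clusters, top_tables=1))
```

Restricting the table ranking to the best-matching cluster is what keeps look-alike columns elsewhere (here, name and age in the student table) out of the candidate set. A real system would presumably use learned embeddings rather than word overlap.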

Takeaways, Limitations

Takeaways:
The paper proposes a new framework, CRED-SQL, that significantly improves the accuracy of Text-to-SQL systems on large-scale databases.
It addresses schema mismatch and semantic drift through cluster-based schema retrieval and the EDL intermediate representation.
It achieves state-of-the-art performance on two large-scale benchmarks, SpiderUnion and BirdUnion.
The code is openly available, which supports reproducibility and extensibility.
Limitations:
The generalization performance of the proposed method needs further study, including whether it depends on specific database structures or query types.
Further research is needed to optimize the EDL design and improve the efficiency of the EDL-to-SQL conversion step.
The method still needs to be evaluated on databases of varying size and complexity.