Daily Arxiv

This page organizes papers related to artificial intelligence published around the world.
This page is summarized using Google Gemini and is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions. When sharing, please cite the source.

BiomedSQL: Text-to-SQL for Scientific Reasoning on Biomedical Knowledge Bases

Created by
  • Haebom

Authors

Mathew J. Koretsky, Maya Willey, Adi Asija, Owen Bianchi, Chelsea X. Alvarado, Tanay Nayak, Nicole Kuznetsov, Sungwon Kim, Mike A. Nalls, Daniel Khashabi, Faraz Faghri

Outline

This paper addresses text-to-SQL for biomedical researchers who rely on large-scale structured databases for complex analytical tasks. Existing text-to-SQL systems struggle to map qualitative scientific questions into executable SQL, particularly when implicit domain reasoning is required. The authors introduce BiomedSQL, the first benchmark designed to evaluate scientific reasoning on real-world biomedical knowledge bases. BiomedSQL consists of 68,000 question/SQL query/answer triples built on a BigQuery knowledge base that integrates gene-disease associations, causal inference from omics data, and drug approval records. In the reported results, GPT-o3-mini achieves 59.0% execution accuracy and the custom multi-step agent BMSQL achieves 62.6%, both short of the expert baseline of 90.0%. BiomedSQL provides a foundation for developing text-to-SQL systems that can support scientific discovery through robust reasoning over structured biomedical knowledge bases.
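To make the execution accuracy figures concrete, here is a minimal sketch of one common way to compute the metric: run the predicted and the gold SQL against the same database and count a prediction as correct when both return the same rows. Whether BiomedSQL scores results in exactly this way is an assumption; the in-memory SQLite database stands in for BigQuery, and the `drug_approvals` table, its columns, and the helper names are hypothetical.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Return the rows of a query as an order-insensitive multiset, or None on error."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None  # a query that fails to execute counts as wrong


def execution_accuracy(conn, pairs):
    """Fraction of (predicted_sql, gold_sql) pairs whose result rows match."""
    correct = 0
    for predicted_sql, gold_sql in pairs:
        gold_rows = run_query(conn, gold_sql)
        predicted_rows = run_query(conn, predicted_sql)
        if gold_rows is not None and predicted_rows == gold_rows:
            correct += 1
    return correct / len(pairs) if pairs else 0.0


# Hypothetical toy table standing in for the BigQuery knowledge base.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE drug_approvals (drug TEXT, indication TEXT, approved INTEGER)")
conn.executemany(
    "INSERT INTO drug_approvals VALUES (?, ?, ?)",
    [("drug_a", "disease_x", 1), ("drug_b", "disease_x", 0), ("drug_c", "disease_y", 1)],
)

pairs = [
    # Same logic written in a different order: counts as correct.
    ("SELECT drug FROM drug_approvals WHERE indication = 'disease_x' AND approved = 1",
     "SELECT drug FROM drug_approvals WHERE approved = 1 AND indication = 'disease_x'"),
    # Missing approval filter: executes, but returns an extra row.
    ("SELECT drug FROM drug_approvals WHERE indication = 'disease_x'",
     "SELECT drug FROM drug_approvals WHERE indication = 'disease_x' AND approved = 1"),
    # References a column that does not exist: fails to execute.
    ("SELECT name FROM drug_approvals",
     "SELECT drug FROM drug_approvals"),
]
accuracy = execution_accuracy(conn, pairs)
print(f"execution accuracy: {accuracy:.3f}")
```

Comparing rows as a multiset ignores row order, which suits questions whose answer is a set of entities; a stricter variant would also compare ordering when the gold query has an ORDER BY clause.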

Takeaways, Limitations

Takeaways:
Presents a new benchmark (BiomedSQL) that assesses scientific reasoning ability on real-world biomedical knowledge bases.
Assesses reasoning against domain-specific criteria (genome-wide significance thresholds, directionality of effect, clinical trial phase filtering, etc.).
Evaluates open-source and closed-source LLMs and identifies performance gaps.
Suggests areas for improvement in text-to-SQL systems so they can handle complex scientific questions.
Supports reproducibility and further research through open datasets and code.
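The domain-specific criteria named in the takeaways are the kind of implicit reasoning a text-to-SQL system has to make explicit. The sketch below illustrates this for a hypothetical question, "Which late-stage drugs target genes significantly associated with increased risk of disease_x?". The tables, columns, and data are invented for illustration and are not the BiomedSQL schema; SQLite stands in for BigQuery. The value 5e-8 is the conventional genome-wide significance threshold, and reading "late-stage" as phase 3 or later is an assumption.

```python
import sqlite3

GENOME_WIDE_P = 5e-8  # conventional genome-wide significance threshold

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gwas_hits (gene TEXT, disease TEXT, p_value REAL, beta REAL)")
conn.execute("CREATE TABLE trials (drug TEXT, target_gene TEXT, phase INTEGER)")
conn.executemany(
    "INSERT INTO gwas_hits VALUES (?, ?, ?, ?)",
    [
        ("GENE1", "disease_x", 3e-9, 0.21),    # significant, risk-increasing
        ("GENE2", "disease_x", 1e-6, 0.40),    # suggestive only, not genome-wide
        ("GENE3", "disease_x", 2e-10, -0.15),  # significant, but protective
    ],
)
conn.executemany(
    "INSERT INTO trials VALUES (?, ?, ?)",
    [("drug_a", "GENE1", 3), ("drug_b", "GENE1", 1), ("drug_c", "GENE3", 3)],
)

# Each qualitative phrase in the question becomes an explicit filter.
sql = """
SELECT g.gene, t.drug
FROM gwas_hits AS g
JOIN trials AS t ON t.target_gene = g.gene
WHERE g.disease = ?
  AND g.p_value < ?   -- significantly associated: genome-wide threshold
  AND g.beta > 0      -- increased risk: positive effect direction
  AND t.phase >= ?    -- late-stage: phase 3 or later (assumed reading)
ORDER BY g.gene, t.drug
"""
rows = conn.execute(sql, ("disease_x", GENOME_WIDE_P, 3)).fetchall()
print(rows)
```

None of the three filters appears literally in the question, which is why a model that only pattern-matches column names tends to produce SQL that executes but answers a different question.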
Limitations:
The execution accuracy of the evaluated systems (GPT-o3-mini at 59.0%, BMSQL at 62.6%) falls short of the expert baseline of 90.0%.
Because the benchmark is built on a single BigQuery knowledge base, further research is needed to determine how well it generalizes to other knowledge bases.
Because the reasoning and SQL generation process is complex, analyzing errors and deriving improvements may be difficult.