Daily Arxiv

This page curates papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

BixBench: a Comprehensive Benchmark for LLM-based Agents in Computational Biology

Created by
  • Haebom

Authors

Ludovico Mitchener, Jon M Laurent, Alex Andonian, Benjamin Tenmann, Siddharth Narayanan, Geemi P Wellawatte, Andrew White, Lorenzo Sani, Samuel G Rodriques

Outline

As large language models (LLMs) and LLM-based agents demonstrate their potential to accelerate scientific research, interest in autonomous AI-driven discovery in bioinformatics is growing. This paper presents the Bioinformatics Benchmark (BixBench), a novel benchmark for measuring the biological data analysis capabilities of LLM-based agents. Comprising over 50 real-world biological data analysis scenarios with nearly 300 open-ended questions, BixBench measures an agent's ability to explore biological datasets, carry out multi-step analysis trajectories, and interpret the results. Using a custom agent framework, the authors evaluated state-of-the-art LLMs such as GPT-4o and Claude 3.5 Sonnet; the models achieved only 17% accuracy in the open-answer setting and near-random performance in the multiple-choice setting. BixBench aims to expose the current limitations of LLMs and to spur the development of agents capable of rigorous bioinformatics analysis, thereby accelerating scientific discovery.
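To make the evaluation setup concrete, below is a minimal sketch of how accuracy over open-ended benchmark questions of this kind might be computed. The data layout, the `run_agent` stub, and the exact-match grader are hypothetical stand-ins for illustration only; they are not the authors' actual agent framework or grading procedure.

```python
# Hypothetical sketch of an open-answer evaluation loop over BixBench-style
# scenarios. run_agent(), grade(), and the scenario layout are illustrative
# assumptions, not the paper's actual harness.

def run_agent(data_dir: str, question: str) -> str:
    """Placeholder for an LLM-based agent that explores the dataset,
    runs a multi-step analysis, and returns a free-text answer."""
    raise NotImplementedError("plug in your agent framework here")

def grade(predicted: str, reference: str) -> bool:
    """Crude grader: normalized exact match. Real open-answer grading
    is more involved; this is only for illustration."""
    return predicted.strip().lower() == reference.strip().lower()

def evaluate(scenarios: list[dict]) -> float:
    """Each scenario bundles a data directory with its open-ended
    question/answer pairs; returns overall accuracy."""
    correct, total = 0, 0
    for scenario in scenarios:
        for qa in scenario["questions"]:
            answer = run_agent(scenario["data_dir"], qa["question"])
            correct += grade(answer, qa["reference_answer"])
            total += 1
    return correct / total if total else 0.0
```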

Takeaways, Limitations

Takeaways:
Presents a new benchmark (BixBench) for measuring the performance of LLM-based agents in bioinformatics.
Reveals the limitations of state-of-the-art LLMs (GPT-4o, Claude 3.5 Sonnet) in bioinformatics analysis.
Motivates the development of agents capable of rigorous bioinformatics analysis.
Limitations:
Current state-of-the-art LLMs perform poorly (17% accuracy in the open-answer setting, near-random in multiple choice).
The benchmark results may not generalize beyond the specific LLMs evaluated.
Results may depend on the authors' custom agent framework, whose generalizability is unverified.