Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; please cite the source when sharing.

Biology-Instructions: A Dataset and Benchmark for Multi-Omics Sequence Understanding Capability of Large Language Models

Created by
  • Haebom

Author

Haonan He, Yuchen Ren, Yining Tang, Ziyang Xu, Junxian Li, Minghao Yang, Di Zhang, Dong Yuan, Tao Chen, Shufei Zhang, Yuqiang Li, Nanqing Dong, Wanli Ouyang, Dongzhan Zhou, Peng Ye

Outline

This paper introduces Biology-Instructions, the first large-scale instruction-tuning dataset for multi-omics biological sequences, covering DNA, RNA, proteins, and multi-molecule tasks, in order to extend large language models (LLMs) to multi-omics biology. The dataset bridges LLMs and complex biological sequence-related tasks, enhancing their versatility and reasoning ability while preserving conversational fluency. The authors also highlight significant limitations of state-of-the-art LLMs on multi-omics tasks without specialized training, and propose ChatMultiOmics, a strong baseline model trained with a novel three-stage pipeline to overcome these limitations. Both Biology-Instructions and ChatMultiOmics are publicly available and pave the way for better integration of LLMs into multi-omics analysis.

Takeaways, Limitations

Takeaways:
  • Presents Biology-Instructions, the first large-scale instruction-tuning dataset for multi-omics biological sequences.
  • Proposes ChatMultiOmics, a baseline model with a novel three-stage training pipeline that improves LLMs' biological understanding on multi-omics tasks.
  • Contributes to the effective integration of LLMs into multi-omics analysis.
  • Increases research extensibility through the public release of Biology-Instructions and ChatMultiOmics.
Limitations:
  • Lacks a clear quantitative analysis of the performance limitations of state-of-the-art LLMs on multi-omics tasks without specialized training.
  • Provides little detail on the performance evaluation of ChatMultiOmics (e.g., specific metrics and comparison baselines).
  • Gives only a limited description of the size and diversity of the Biology-Instructions dataset.