Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

From Sentences to Sequences: Rethinking Languages in Biological Systems

Created by
  • Haebom

Author

Ke Liu, Shuaike Shen, Hao Chen

Outline

This paper explores the potential and limitations of transferring the large language model (LLM) paradigm, so successful in natural language processing (NLP), to modeling biological languages (proteins, RNA, DNA). Reviewing prior work that applies NLP's autoregressive generative paradigm and evaluation metrics to biological sequences, the authors highlight how the intrinsic structural correlations of biological languages differ from those of natural language. They treat the three-dimensional structure of a biomolecule as the semantic content of a sentence, argue that evaluation must account for the strong correlations between residues or bases, and demonstrate the potential of the autoregressive paradigm for biological language modeling. Code is available at github.com/zjuKeLiu/RiFold.
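The autoregressive paradigm the paper discusses factorizes the probability of a sequence into per-residue conditional predictions. As a rough illustration (this is not the authors' RiFold code, and the toy sequences below are made up), here is a minimal bigram model over the 20 standard amino acids:

```python
from collections import defaultdict
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues

def train_bigram(sequences):
    """Count residue-to-residue transitions with add-one smoothing."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    probs = {}
    for prev in AMINO_ACIDS:
        total = sum(counts[prev].values()) + len(AMINO_ACIDS)
        probs[prev] = {aa: (counts[prev][aa] + 1) / total for aa in AMINO_ACIDS}
    return probs

def log_likelihood(probs, seq):
    """Autoregressive factorization: log p(x) = sum_t log p(x_t | x_{t-1})."""
    return sum(math.log(probs[prev][cur]) for prev, cur in zip(seq, seq[1:]))

# Hypothetical toy training set; real models are trained on large corpora
# such as UniProt and condition on far longer contexts than one residue.
model = train_bigram(["ACDEAC", "ACAC"])
score = log_likelihood(model, "ACAC")
```

The paper's point is that a high likelihood under such a sequence-only model does not guarantee a plausible 3D structure, which is why structure-aware evaluation matters.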

Takeaways, Limitations

Takeaways: By emphasizing structural evaluation that incorporates three-dimensional structural information, the paper points to a new direction for overcoming the limitations of purely NLP-based approaches, and empirically demonstrates that the autoregressive generative paradigm is applicable to biological language modeling.
Limitations: It remains to be verified whether the proposed approach applies equally to all types of biological languages (proteins, RNA, DNA, etc.), and further work is needed to generalize and standardize structural evaluation metrics.