Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

A general language model for peptide identification

Created by
  • Haebom

Authors

Jixiu Zhai, Zikun Wang, Tianchi Lu, Haitian Zhong, Ziyang Xu, Yuhuan Liu, Shengrui Xu, Jingwan Wang, Dan Huang

Outline

PDeepPP is a unified deep learning framework that combines a pre-trained protein language model with a hybrid transformer-convolution architecture, enabling robust identification across a wide range of peptide functions and post-translational modification (PTM) sites. The authors curate extensive benchmark datasets and implement strategies to address data imbalance, and the model systematically extracts both global and local sequence features. Extensive analysis, including dimensionality reduction and comparative studies, shows that PDeepPP learns robust and interpretable peptide representations and achieves state-of-the-art performance on 25 of 33 biological identification tasks. Specifically, it reaches an accuracy of 0.9726 in antibacterial peptide identification and 0.9984 in phosphorylation site identification, attains 99.5% specificity in glycosylation site prediction, and significantly reduces false negatives in antimalarial tasks. By enabling large-scale, accurate peptide analysis, PDeepPP supports biomedical research and the discovery of novel therapeutic targets for disease treatment. All code, datasets, and pretrained models are publicly available on GitHub (https://github.com/fondress/PDeepPP) and Hugging Face (https://huggingface.co/fondress/PDeppPP).
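The outline reports three kinds of figures: accuracy, specificity, and false negatives. For readers less familiar with these metrics, the following is a minimal sketch of how each is derived from binary predictions. The function names, the `ConfusionCounts` helper, and the toy labels are all hypothetical; this is not code from the PDeepPP repository and the numbers are not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Counts of the four binary outcomes (hypothetical helper)."""
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def count_outcomes(y_true, y_pred):
    """Tally outcomes for binary labels, where 1 is the positive class."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 0 and p == 0:
            tn += 1
        else:  # t == 1 and p == 0: a positive the model missed
            fn += 1
    return ConfusionCounts(tp, fp, tn, fn)


def accuracy(c):
    """Fraction of all predictions that are correct."""
    total = c.tp + c.fp + c.tn + c.fn
    return (c.tp + c.tn) / total if total else 0.0


def specificity(c):
    """Fraction of true negatives among all actual negatives."""
    negatives = c.tn + c.fp
    return c.tn / negatives if negatives else 0.0


def false_negative_rate(c):
    """Fraction of actual positives that were predicted negative."""
    positives = c.tp + c.fn
    return c.fn / positives if positives else 0.0


# Made-up labels for illustration only (not data from the paper).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
counts = count_outcomes(y_true, y_pred)

print(f"accuracy={accuracy(counts):.4f}")
print(f"specificity={specificity(counts):.4f}")
print(f"false_negative_rate={false_negative_rate(counts):.4f}")
```

On imbalanced tasks, accuracy alone can look high even when most positives are missed, which is presumably why the summary also cites specificity and false negatives.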

Takeaways, Limitations

Takeaways:
Provides robust and accurate identification of diverse peptide functions and PTM sites.
Achieves state-of-the-art performance on a variety of biological tasks, including antimicrobial peptide, phosphorylation site, and glycosylation site identification.
Presents strategies that effectively address data imbalance.
Has strong potential to contribute to biomedical research and new drug development.
Makes all code, datasets, and pretrained models publicly available.
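This summary does not say which imbalance strategies PDeepPP uses. As a generic illustration only, one common heuristic is to weight each class inversely to its frequency so that rare positives contribute more to the training loss. The sketch below assumes that heuristic; the function name and the 90/10 toy split are hypothetical and are not taken from the paper.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return a weight per class: n_samples / (n_classes * class_count).

    Assumed heuristic for illustration; not necessarily what PDeepPP does.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Toy dataset: 90 negative peptides and 10 positive ones (made-up numbers).
labels = [0] * 90 + [1] * 10
weights = inverse_frequency_weights(labels)

for label in sorted(weights):
    print(f"class {label}: weight {weights[label]:.4f}")
```

With these weights, each class contributes the same total weight to the loss, so the minority class is not drowned out by the majority.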
Limitations:
State-of-the-art performance was not achieved in eight of the 33 tasks, indicating areas for future improvement.
The paper does not discuss its limitations in detail, so additional analysis and verification may be required.