Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Diffusion on language model encodings for protein sequence generation

Created by
  • Haebom

Authors

Viacheslav Meshchaninov, Pavel Strashnov, Andrey Shevtsov, Fedor Nikolaev, Nikita Ivanisenko, Olga Kardymon, Dmitry Vetrov

Outline

DiMA is a latent diffusion framework that operates on protein language model representations and generalizes across protein encoders ranging from 8M to 3B parameters. Across extensive experiments with multiple protein representations (ESM-2, ESMc, CHEAP, SaProt), it is evaluated on quality, diversity, novelty, and distribution matching, and it consistently generates novel, high-quality, and diverse protein sequences, comparing favorably with existing autoregressive, discrete diffusion, and flow matching language models. It also supports conditional generation tasks such as protein family generation, motif scaffolding and infilling, and fold-specific sequence design.
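To make the idea of diffusing in an encoder's latent space more concrete, here is a minimal sketch of the forward (noising) half of a continuous diffusion process applied to per-residue encodings. Everything in it is an assumption for illustration: the lookup-table `toy_encode`, the nearest-neighbour `toy_decode`, the cosine noise schedule, and the latent width are hypothetical stand-ins, not DiMA's actual encoder, decoder, schedule, or hyperparameters. The learned denoiser that would reverse the process is omitted.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
DIM = 8                            # toy latent width (assumption)

# Hypothetical stand-in for a protein language model: one fixed vector per residue type.
_TABLE = np.random.default_rng(0).normal(size=(len(ALPHABET), DIM))


def toy_encode(seq):
    """Map a sequence to a (length, DIM) array of latent vectors."""
    return _TABLE[[ALPHABET.index(a) for a in seq]]


def toy_decode(z):
    """Map each latent vector back to the residue with the nearest table entry."""
    dists = ((z[:, None, :] - _TABLE[None, :, :]) ** 2).sum(axis=-1)
    return "".join(ALPHABET[i] for i in dists.argmin(axis=1))


def cosine_alpha_bar(t):
    """Fraction of signal kept at time t in [0, 1] (cosine schedule, an assumed choice)."""
    return np.cos(0.5 * np.pi * t) ** 2


def forward_noise(z0, t, rng):
    """Sample z_t = sqrt(a) * z0 + sqrt(1 - a) * eps with eps ~ N(0, I)."""
    a = cosine_alpha_bar(t)
    eps = rng.normal(size=z0.shape)
    return np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps, eps


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    seq = "MKTAYIAKQR"
    z0 = toy_encode(seq)
    for t in (0.0, 0.5, 1.0):
        zt, _ = forward_noise(z0, t, rng)
        print(t, toy_decode(zt))
```

At `t = 0` the latents are untouched and decode back to the input sequence; as `t` approaches 1 they become nearly pure Gaussian noise. A generative model of this kind would train a network to undo that corruption and then decode the denoised latents into amino acids.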

Takeaways, Limitations

Takeaways:
It is one of the first successful applications of continuous diffusion to protein sequence design.
It achieves consistently high performance with the same architecture and training method across a variety of protein encoders and representations.
It outperforms existing methods such as autoregressive, discrete diffusion, and flow matching models.
It supports a range of conditional generation tasks, such as protein family generation and motif scaffolding.
It offers new architectural insights and practical applicability for the field of protein design.
Limitations:
The paper does not explicitly discuss its limitations. Further experiments or analyses may be needed on points such as scalability, computational cost, and performance on specific protein structures.