
Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Two-Stage Pretraining for Molecular Property Prediction in the Wild

Created by
  • Haebom

Authors

Kevin Tirta Wijaya, Minghao Guo, Michael Sun, Hans-Peter Seidel, Wojciech Matusik, Vahid Babaei

Outline

In this paper, we propose MoleVers, a molecular model pretrained with multiple objectives for predicting diverse molecular properties in the wild, i.e., in settings where experimentally validated labels are scarce. MoleVers employs a two-stage pretraining strategy. In the first stage, it learns molecular representations from unlabeled data via masked atom prediction and extreme denoising, enabled by a novel branching encoder architecture and dynamic noise scale sampling. In the second stage, these representations are refined by predicting auxiliary properties whose labels are derived from computational methods such as density functional theory (DFT) or large language models. Evaluated on a benchmark of 22 small experimental datasets, MoleVers achieves state-of-the-art performance, highlighting the effectiveness of the two-stage framework in producing molecular representations that generalize across diverse downstream properties.
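The first pretraining stage combines two self-supervised objectives. The sketch below is a minimal PyTorch-style illustration of that idea; the module names (an encoder with two output branches, `map_head`, `denoise_head`), their signatures, the log-uniform sigma range, and the equal loss weighting are all assumptions for illustration, not the paper's actual implementation.

```python
import math
import torch
import torch.nn.functional as F

def stage1_loss(encoder, map_head, denoise_head, atom_types, coords,
                mask_ratio=0.15, sigma_range=(0.1, 10.0)):
    # Masked atom prediction (MAP): hide a fraction of atom types.
    mask = torch.rand(atom_types.shape) < mask_ratio
    masked_types = atom_types.clone()
    masked_types[mask] = 0  # 0 = hypothetical [MASK] token id

    # Dynamic noise scale sampling: draw sigma log-uniformly so the model
    # sees everything from mild perturbations to extreme noise.
    lo, hi = sigma_range
    sigma = torch.empty(1).uniform_(math.log(lo), math.log(hi)).exp()
    noisy_coords = coords + sigma * torch.randn_like(coords)

    # Branching encoder: shared trunk, one output branch per objective.
    h_map, h_denoise = encoder(masked_types, noisy_coords, sigma)

    # MAP loss: recover the original atom types at masked positions.
    loss_map = F.cross_entropy(map_head(h_map)[mask], atom_types[mask])

    # Denoising loss: predict the (scaled) noise added to the coordinates.
    loss_denoise = F.mse_loss(denoise_head(h_denoise),
                              (noisy_coords - coords) / sigma)

    return loss_map + loss_denoise
```

Sampling the noise scale per step, rather than fixing it, is what exposes the model to the "extreme" denoising regime while keeping the easy regime in the training distribution.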

Takeaways, Limitations

Takeaways:
Presents MoleVers, a new model capable of predicting various molecular properties from limited experimental data.
Demonstrates that a two-stage pretraining strategy can yield molecular representations with strong generalization performance.
Presents an effective strategy for learning from unlabeled data via masked atom prediction and extreme denoising.
Validates the utility of auxiliary property prediction with labels derived from computational methods such as density functional theory and large language models (see the sketch after this list).
Achieves state-of-the-art performance on 22 small experimental datasets.
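As a companion to the auxiliary-prediction takeaway above, here is a hedged sketch of what stage-2 training could look like: the stage-1 encoder is further trained to regress cheap computational labels (e.g., DFT-computed values or LLM-derived annotations) before any downstream fine-tuning. The function names, the `batch` fields, and the mean-pooling readout are illustrative assumptions, not the paper's API.

```python
import torch
import torch.nn.functional as F

def stage2_step(encoder, aux_heads, optimizer, batch):
    """One stage-2 update on computationally labeled molecules.

    `batch.aux_labels` maps a property name (e.g. a DFT-computed
    HOMO-LUMO gap or an LLM-derived annotation) to its label tensor.
    """
    # Reuse the stage-1 encoder; no noise is added at this stage.
    h, _ = encoder(batch.atom_types, batch.coords, torch.zeros(1))
    pooled = h.mean(dim=0)  # mean-pool atom features (an assumption)

    loss = torch.zeros(())
    for name, label in batch.aux_labels.items():
        pred = aux_heads[name](pooled)         # one small head per property
        loss = loss + F.mse_loss(pred, label)  # regress the cheap label

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The point of this stage is that computational labels are abundant even when experimental ones are not, so the representations can be adapted toward property prediction before the scarce experimental data is ever touched.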
Limitations:
The 22 evaluation datasets are small, so generalization performance on larger datasets remains to be verified.
Further research is needed on the generality of the novel branching encoder architecture and dynamic noise scale sampling, and on their applicability to other domains.
Generating auxiliary labels with computationally expensive methods such as density functional theory consumes substantial computational resources.
The model is specialized for settings with scarce experimental data; its performance advantage over existing models when ample experimental data is available remains uncertain.