Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Tokenization for Molecular Foundation Models

Created by
  • Haebom

Authors

Alexius Wadell, Anoushka Bhutani, Venkatasubramanian Viswanathan

Outline

This paper highlights the importance of text-based foundation models in molecular science, where they accelerate progress in molecular design. Existing molecular models are limited by closed-vocabulary tokenizers that capture only a portion of molecular space. The study systematically evaluates 34 tokenizers, including 19 chemistry-specific ones, and finds significant differences in how well they handle the SMILES molecular representation. To measure the impact of tokenizer choice, the authors introduce n-gram language models as a low-cost proxy, then check that proxy by pre-training and fine-tuning 18 RoBERTa-style encoders for molecular property prediction. To overcome the limitations of existing tokenizers, they propose two new tokenizers, Smirk and Smirk-GPE, that fully support the OpenSMILES specification. The proposed tokenizers systematically integrate nuclear, electronic, and geometric degrees of freedom, which supports applications in pharmacology, agriculture, biology, and energy storage. The results highlight the need for open-vocabulary modeling and chemically diverse benchmarks in cheminformatics.
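To make the closed-vocabulary problem concrete, here is a minimal Python sketch. It is not the paper's implementation: the regular expressions (ATOM_PATTERN, BRACKET_PIECES), the function names, and the toy SMILES strings are all assumptions chosen for illustration, and the patterns do not validate the full OpenSMILES grammar. The first part keeps each bracket atom as a single token, so any bracket atom missing from the training vocabulary becomes unknown. The second part shows one general way to avoid that gap by splitting bracket atoms into pieces drawn from a small fixed alphabet.

```python
import re

# Simplified atom-level pattern (an assumption, not taken from the paper).
# Bracket atoms such as [NH3+] are kept whole, which makes the vocabulary closed.
ATOM_PATTERN = re.compile(
    r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|%\d{2}|\d|[()=#\-+\\/:.~@*$]"
)

# Pieces inside a bracket atom: single digits, element-like symbols,
# chirality marks, charge signs, and the atom-class colon.
BRACKET_PIECES = re.compile(r"\d|[A-Z][a-z]?|[a-z]|@|[+\-]|:")


def atom_tokenize(smiles):
    """Split a SMILES string into atom-level tokens (bracket atoms stay whole)."""
    return ATOM_PATTERN.findall(smiles)


def unknown_fraction(smiles_list, vocab):
    """Fraction of tokens that a closed vocabulary cannot represent."""
    tokens = [t for s in smiles_list for t in atom_tokenize(s)]
    unknown = [t for t in tokens if t not in vocab]
    return len(unknown) / len(tokens)


def open_tokenize(smiles):
    """Split bracket atoms into their pieces so no bracket atom is ever unseen."""
    out = []
    for tok in atom_tokenize(smiles):
        if tok.startswith("["):
            out.append("[")
            out.extend(BRACKET_PIECES.findall(tok[1:-1]))
            out.append("]")
        else:
            out.append(tok)
    return out


# Toy data (hypothetical): the vocabulary is built from the training strings only.
train = ["CCO", "CCN", "c1ccccc1", "CC(=O)O", "C[NH3+]"]
vocab = {t for s in train for t in atom_tokenize(s)}
test = ["[13CH4]", "C[C@H](N)C(=O)O", "[Fe+2]"]

print(unknown_fraction(test, vocab))  # share of test tokens mapped to "unknown"
print(open_tokenize("[13CH4]"))       # bracket atom split into small pieces
```

With this toy setup, the isotope-labelled, chiral, and charged-metal bracket atoms in the test strings are all missing from the closed vocabulary, while the decomposed form needs only digits, element symbols, and a few punctuation marks.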

Takeaways, Limitations

Takeaways:
Reveals the limitations of existing tokenizers for SMILES molecular representations.
Proposes two new tokenizers, Smirk and Smirk-GPE, that fully support the OpenSMILES specification.
Evaluates the impact of tokenizer choice using n-gram language models as a low-cost proxy.
Emphasizes the importance of open-vocabulary modeling and chemically diverse benchmarks in cheminformatics.
Suggests potential applications in fields such as pharmacology, agriculture, biology, and energy storage.
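The n-gram proxy mentioned in the takeaways can be illustrated with a small Python sketch. This summary does not describe the paper's exact n-gram setup, so the code below assumes a plain bigram model with add-one smoothing and character-level toy sequences; the function names and data are hypothetical. The idea is that a cheap count-based model is fitted on tokenized SMILES, and its cross-entropy on held-out molecules gives a quick signal about how well a tokenization fits the data.

```python
import math
from collections import Counter


def train_bigram(sequences):
    """Count contexts and bigrams over token sequences, with start/end padding."""
    contexts, bigrams = Counter(), Counter()
    vocab = {"<s>", "</s>", "<unk>"}
    for seq in sequences:
        padded = ["<s>", *seq, "</s>"]
        vocab.update(seq)
        contexts.update(padded[:-1])             # every token that has a successor
        bigrams.update(zip(padded, padded[1:]))  # (previous, current) pairs
    return contexts, bigrams, vocab


def cross_entropy(model, sequences):
    """Average bits per token under the bigram model with add-one smoothing."""
    contexts, bigrams, vocab = model
    vocab_size = len(vocab)
    total_bits, n_tokens = 0.0, 0
    for seq in sequences:
        # Tokens never seen in training are mapped to a shared unknown token.
        seq = [t if t in vocab else "<unk>" for t in seq]
        padded = ["<s>", *seq, "</s>"]
        for prev, cur in zip(padded, padded[1:]):
            prob = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
            total_bits -= math.log2(prob)
            n_tokens += 1
    return total_bits / n_tokens


# Toy character-level sequences (hypothetical data).
model = train_bigram([list("CCO"), list("CCN"), list("CCC")])
familiar = cross_entropy(model, [list("CCO")])
unfamiliar = cross_entropy(model, [list("OOO")])
print(familiar, unfamiliar)  # lower means the model finds the sequence less surprising
```

To compare tokenizers this way, one would tokenize the same corpus with each candidate and fit one model per tokenizer. Because tokenizers produce different numbers of tokens per molecule, normalizing the total bits by SMILES character count rather than by token count is one reasonable way to keep the comparison fair.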
Limitations:
Further comparative analysis is needed to determine how the proposed tokenizers perform against other advanced tokenizers.
Further studies are needed to generalize performance across different molecule types and sizes.
Performance has not yet been evaluated or validated in real-world applications.