Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Protein Structure Tokenization: Benchmarking and New Recipe

Created by
  • Haebom

Authors

Xinyu Yuan, Zichen Wang, Marcus Collins, Huzefa Rangwala

Outline

This paper presents StructTokenBench, an integrated framework for evaluating protein structure tokenization methods, which segment protein 3D structures into discrete or continuous representations. Unlike existing benchmarks, it comprehensively evaluates the quality and efficiency of tokenizers, with a focus on fine-grained local substructures. The evaluation shows that no single model leads in every benchmarking aspect, and it also reveals that existing tokenizers use only a small share of their codebooks. In response, the authors develop AminoAseed, a strategy that improves tokenizer utilization and quality by improving codebook gradient updates and balancing codebook size against codebook dimensionality. Compared to the ESM3 model, AminoAseed achieves an average performance improvement of 6.31% across 24 supervised learning tasks, with sensitivity and utilization increased by 12.83% and 124.03%, respectively. The source code and model weights are available on GitHub.
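To make "codebook utilization" concrete, here is a minimal sketch in plain Python. It is not taken from the paper's code. It assumes a nearest-neighbor vector quantizer and one common definition of utilization (the fraction of codebook entries that are assigned at least once), plus the perplexity of the token distribution as a companion measure. The function names `quantize`, `utilization_rate`, and `perplexity`, and the toy data, are hypothetical.

```python
import math
from collections import Counter


def quantize(embeddings, codebook):
    """Map each embedding to the index of its nearest codebook vector
    (squared Euclidean distance). Assumed quantizer, for illustration only."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(e, codebook[k]))
        for e in embeddings
    ]


def utilization_rate(token_ids, codebook_size):
    """Fraction of codebook entries that appear at least once."""
    return len(set(token_ids)) / codebook_size


def perplexity(token_ids):
    """exp(entropy) of the empirical token distribution. It equals the
    number of used codes when usage is perfectly uniform."""
    counts = Counter(token_ids)
    total = len(token_ids)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return math.exp(entropy)


# Toy data: a 4-entry codebook of 2-D vectors and 4 embeddings
# standing in for local-substructure embeddings.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
embeddings = [(0.1, 0.0), (0.9, 0.1), (0.2, 0.1), (1.1, -0.1)]

token_ids = quantize(embeddings, codebook)
print(token_ids)                                  # indices of the nearest codes
print(utilization_rate(token_ids, len(codebook)))  # share of codes in use
print(perplexity(token_ids))                       # effective number of codes
```

In this toy case only two of the four codes are ever selected, so half of the codebook is unused. Under-utilization of this kind is the problem that AminoAseed is described as addressing, although the paper's exact metric definitions may differ from the ones assumed here.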

Takeaways, Limitations

Takeaways:
Presents StructTokenBench, a novel framework for comprehensive evaluation of protein structure tokenization methods.
Develops AminoAseed, a strategy that addresses low codebook utilization, and verifies the resulting performance improvement.
Shows potential for advancing protein structure analysis by outperforming ESM3, the best-performing existing model used for comparison.
Supports reproducibility and extensibility of the research by releasing the source code and model weights of StructTokenBench and AminoAseed.
Limitations:
Additional validation is needed to confirm that StructTokenBench comprehensively evaluates all protein structure tokenization methods.
AminoAseed's performance gains may be biased toward specific datasets or tasks.
Generalization performance across different protein structure types still needs to be evaluated.