
Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Evaluation of Coding Schemes for Transformer-based Gene Sequence Modeling

Created by
  • Haebom

Authors

Chenlei Gong, Yuanhe Tian, Lei Mao, Yan Song

Outline

In this paper, we compare k-mer segmentation with BPE subword tokenization, building on previous studies that treat DNA sequences as a language and apply Transformer models. We train 3-, 6-, 12-, and 24-layer Transformer encoders using k-mer segmentation with k = 1, 3, 4, 5, and 6, a BPE vocabulary of 4,096 tokens, and three positional encoding methods (sinusoidal, ALiBi, and RoPE), and evaluate them on the GUE benchmark dataset. The results show that BPE compresses frequent motifs into variable-length tokens, shortening sequences and improving generalization, which yields higher and more stable performance. Among the positional encoding methods, RoPE excels at capturing periodic motifs and extrapolating to long sequences, while ALiBi performs well on tasks dominated by local dependencies. Regarding depth, performance improves markedly from 3 to 12 layers, whereas 24 layers bring little additional gain or show overfitting. This study provides practical guidance on the design of tokenization and positional encoding for DNA Transformer models.
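To make the tokenization comparison concrete, here is a minimal Python sketch contrasting overlapping k-mer segmentation with a toy BPE merge loop. The DNA string, the merge count, and training BPE on a single sequence are illustrative assumptions; the paper itself uses a 4,096-token BPE vocabulary and k = 1, 3, 4, 5, 6.

```python
# Minimal sketch: k-mer segmentation vs. a toy BPE merge loop on a DNA string.
# The example sequence and merge count are illustrative assumptions.
from collections import Counter

def kmer_tokens(seq, k):
    """Overlapping k-mer segmentation: fixed-length windows of size k, stride 1."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def bpe_tokens(seq, num_merges):
    """Tiny BPE: repeatedly merge the most frequent adjacent symbol pair,
    so frequent motifs collapse into single variable-length tokens."""
    tokens = list(seq)  # start from single nucleotides
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

dna = "ATGCGATGCGATGCGTTACG"
print(kmer_tokens(dna, 6))   # many overlapping 6-mers, roughly one per position
print(bpe_tokens(dna, 8))    # fewer, variable-length tokens for repeated motifs
```

On repetitive sequences the BPE output is noticeably shorter than the k-mer output, which is the compression effect the summary above credits for the improved generalization.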

Takeaways, Limitations

Takeaways:
We experimentally demonstrate that BPE tokenization outperforms k-mer segmentation in DNA sequence processing.
We demonstrate that RoPE positional encoding is effective for processing periodic motifs and long sequences.
ALiBi positional encoding is shown to be suitable for tasks with strong local dependencies (a brief positional-encoding sketch follows the Limitations list below).
A depth of roughly 12 Transformer layers is suggested; increasing to 24 layers yields little additional gain and can lead to overfitting.
The study provides practical guidelines for designing Transformer models for DNA sequence analysis.
Limitations:
Evaluation is limited to the GUE benchmark dataset; generalization to other datasets remains to be verified.
Only a single BPE vocabulary size (4,096 tokens) is used; other vocabulary sizes require further study.
Only a limited number of positional encoding methods are compared; other methods need further investigation.
Other types of Transformer architectures remain to be studied.
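As referenced in the Takeaways, below is a minimal NumPy sketch of the two relative positional-encoding schemes compared in the paper. The frequency base, the head-slope schedule, and the symmetric (encoder-style) ALiBi bias are common choices assumed here for illustration, not details taken from this paper's implementation.

```python
# Minimal sketch of RoPE and ALiBi; shapes and constants are illustrative assumptions.
import numpy as np

def rope(x, base=10000.0):
    """Rotary positional encoding (half-split variant): rotate each feature pair
    by a position-dependent angle, so attention depends on relative offsets."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)      # per-pair rotation frequencies
    angles = np.outer(np.arange(seq_len), freqs)   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def alibi_bias(seq_len, num_heads):
    """ALiBi (symmetric encoder-style variant): a fixed, head-specific linear
    penalty on attention logits that grows with query-key distance."""
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    distance = np.abs(np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :])
    return -slopes[:, None, None] * distance       # (heads, seq, seq), added to QK^T

# Toy usage: rotate query/key projections with RoPE, or add the ALiBi bias
# to the raw attention logits before softmax.
q = np.random.randn(128, 64)
q_rot = rope(q)
bias = alibi_bias(seq_len=128, num_heads=4)
```

RoPE rotates the query/key vectors themselves, so relative offsets are preserved in dot products even beyond the training length, while ALiBi only adds a distance penalty to the attention logits; this is consistent with the summary's observation that RoPE extrapolates well to long sequences and ALiBi favors locally dependent tasks.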