Daily Arxiv

This page curates AI-related papers published worldwide.
All summaries are generated with Google Gemini, and the site is run on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please be sure to credit the source when sharing.

SupraTok: Cross-Boundary Tokenization for Enhanced Language Model Performance

Created by
  • Haebom

Authors

Andrei-Valentin Tănase, Elena Pelican

Outline

This paper proposes SupraTok, a novel tokenization architecture that addresses the tokenization bottleneck in natural language processing. SupraTok reimagines subword segmentation through three techniques: cross-boundary pattern learning to discover multi-word semantic units, entropy-based data curation to optimize training-corpus quality, and multi-stage curriculum learning to ensure stable convergence. By extending byte-pair encoding, it learns "superword" tokens: coherent multi-word representations that maximize compression efficiency while preserving semantic coherence. Experimental results show that SupraTok improves English tokenization efficiency by over 30% compared with OpenAI's o200k tokenizer and Google's Gemma 3 tokenizer, while remaining competitive across 38 languages. Pairing it with a GPT-2-scale model also improves benchmark performance.
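To make the cross-boundary idea concrete, below is a minimal sketch of a BPE-style merge loop in which whitespace is treated as an ordinary symbol, so frequent merges can cross word boundaries and form "superword" tokens. This illustrates only the general idea, not the paper's method: the helper functions, merge count, and toy corpus are all assumptions for the example.

```python
from collections import Counter

def get_pair_counts(seqs):
    """Count adjacent token pairs across all sequences, weighted by frequency."""
    counts = Counter()
    for seq, freq in seqs.items():
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(seqs, pair):
    """Replace every occurrence of `pair` with a single concatenated token."""
    merged = {}
    for seq, freq in seqs.items():
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(pair[0] + pair[1])
                i += 2
            else:
                out.append(seq[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: each text starts as a sequence of characters, with spaces kept
# as ordinary symbols so that merges may cross word boundaries.
corpus = ["new york city", "new york times", "new year"]
seqs = Counter(tuple(text) for text in corpus)

for _ in range(12):  # a handful of merge steps, enough for illustration
    counts = get_pair_counts(seqs)
    if not counts:
        break
    best = max(counts, key=counts.get)
    seqs = merge_pair(seqs, best)

for seq in seqs:
    print(seq)  # frequent phrases like "new york " collapse into single tokens
```

On this toy corpus, the most frequent adjacent pairs merge first, so a token spanning the whole phrase "new york " emerges within a few steps, which is the compression effect that standard BPE, stopping at word boundaries, cannot achieve.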

Takeaways, Limitations

Takeaways:
  • Efficient tokenization alone can contribute to improved language model performance.
  • SupraTok is more efficient than existing tokenizers and remains competitive across multiple languages.
  • Language model performance can be improved without changing the model architecture.
Limitations:
  • It has so far been tested only on a GPT-2-scale model (124M parameters); validation at larger model scales is still needed.