Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Tokens, the oft-overlooked appetizer: Large language models, the distributional hypothesis, and meaning

Created by
  • Haebom

Authors

Julia Witte Zimmerman, Denis Hudon, Kathryn Cramer, Alejandro J. Ruiz, Calla Beauregard, Ashley Fehr, Mikaela Irene Fudolig, Bradford Demarest, Yoshi Meke Bird, Milo Z. Trujillo, Christopher M. Danforth, Peter Sheridan Dodds

Outline

This paper argues that tokenization is an essential component of the current architecture of many language models, including the Transformer-based large language models (LLMs) of generative AI, yet its impact on a model's cognition is often overlooked. The authors argue that LLMs demonstrate the Distributional Hypothesis (DH) to be sufficient for reasonably human-like language performance, and that the emergence of human-meaningful linguistic units among tokens motivates linguistically informed interventions in current, linguistically agnostic tokenization techniques, particularly with respect to tokens' roles as (1) semantic primitives and (2) vehicles for conveying salient distributional patterns of human language to the model. They examine tokenizations from a BPE tokenizer, existing model vocabularies obtained from Hugging Face and tiktoken, and the information carried by example token vectors as they pass through the layers of a RoBERTa (large) model. Beyond producing suboptimal semantic building blocks and obscuring the model's access to essential distributional patterns, the authors show that tokenization pretraining can serve as a backdoor for bias and other unwanted content, which current alignment practices may not remediate. They further present evidence that the tokenization algorithm's objective function affects LLM cognition, even though tokenization is meaningfully insulated from the main system intelligence.
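To make the kind of tokenizer inspection described above concrete, the sketch below shows how a BPE vocabulary splits words into subword pieces. This is a minimal, illustrative example rather than the paper's actual analysis code; it assumes the Hugging Face `transformers` package, and the probe words are arbitrary choices.

```python
# A minimal sketch of BPE tokenizer inspection, assuming the Hugging Face
# `transformers` package; "roberta-large" matches the model named above,
# but the probe words are illustrative choices, not the paper's data.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large")

for word in ["unhappiness", "tokenization", "appetizer"]:
    # A leading space puts the word in ordinary sentence position;
    # RoBERTa's BPE marks that space with the 'Ġ' symbol.
    pieces = tokenizer.tokenize(" " + word)
    ids = tokenizer.convert_tokens_to_ids(pieces)
    print(f"{word!r} -> {pieces} -> {ids}")
```

The printed pieces need not align with morpheme boundaries, which is the sense in which the paper calls such tokenizations linguistically agnostic semantic building blocks.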

Takeaways, Limitations

Takeaways:
It highlights the importance of the tokenization algorithm for LLM performance and cognition.
It points out that bias and other unwanted content can enter the model through tokenization, and raises the need for remedies.
It argues for improving existing linguistically agnostic tokenization techniques.
It emphasizes the role of tokens as semantic primitives and as carriers of distributional patterns (illustrated in the sketch after this list).
It presents evidence that the tokenization algorithm's objective function influences LLM cognition.
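As a hedged illustration of the distributional-patterns point above, the sketch below uses the `tiktoken` library (one of the vocabulary sources examined in the paper) to show how a single surface word maps to different token IDs depending on a leading space or casing, scattering its distributional statistics across several vocabulary entries. The encoding name is a real tiktoken encoding, but the probe strings are illustrative choices.

```python
# Illustrative sketch, assuming the `tiktoken` package; "cl100k_base" is
# a real tiktoken encoding, but the probe strings are arbitrary examples.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# The same surface word receives different token IDs depending on whether
# it follows a space and on its casing, so its occurrence statistics are
# spread over distinct vocabulary items rather than pooled in one.
for variant in ["meaning", " meaning", "Meaning", " Meaning"]:
    print(f"{variant!r} -> {enc.encode(variant)}")
```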
Limitations:
This paper lacks specific proposals for improving tokenization.
Generalizability across different LLM architectures and tokenization methods may be limited.
It does not offer concrete solutions for the bias and unwanted-content issues it identifies.