Daily Arxiv

This page collects artificial-intelligence papers published around the world.
Summaries are generated with Google Gemini, and the page is run on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, simply cite the source.

Date Fragments: A Hidden Bottleneck of Tokenization for Temporal Reasoning

Created by
  • Haebom

Authors

Gagan Bhatia, Maxime Peyrard, Wei Zhao

Outline

This paper addresses the problem that modern BPE tokenizers split dates into meaningless fragments. To study it, the authors introduce a new metric, the date fragment ratio, and release DateAugBench, a dataset covering three temporal reasoning tasks: context-based date resolution, format-invariant puzzles, and date arithmetic across historical, contemporary, and future timelines. Using layer-wise probing and causal attention-hop analysis, they then examine how large language models (LLMs) stitch date fragments back together for temporal reasoning. They show that excessive date fragmentation degrades accuracy, especially for rare dates (historical and future ones), and that the LLMs' fragment-assembly process differs from human interpretation (year → month → day). The dataset and code are publicly available.
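The summary only names the date fragment ratio without defining it. A rough sketch, assuming the metric measures how many subword tokens a tokenizer needs per semantic date field (year, month, day); `toy_bpe_tokenize` below is a hypothetical stand-in for a real BPE tokenizer, not the paper's implementation:

```python
def toy_bpe_tokenize(text):
    """Stand-in for a BPE tokenizer: frequent two-digit chunks were
    merged during training, rare ones fall back to single characters."""
    common_merges = {"19", "20", "01", "12", "25"}  # hypothetical vocabulary
    tokens, i = [], 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in common_merges:
            tokens.append(pair)
            i += 2
        else:
            tokens.append(text[i])
            i += 1
    return tokens

def date_fragment_ratio(date_str, n_fields=3):
    """Tokens produced per semantic field (year/month/day).
    1.0 means one token per field; higher means more fragmentation."""
    tokens = toy_bpe_tokenize(date_str.replace("-", ""))
    return len(tokens) / n_fields

# A contemporary date merges into few tokens; a rare historical date
# shatters into many, yielding a higher fragment ratio.
print(date_fragment_ratio("2025-12-01"))  # 4 tokens / 3 fields ≈ 1.33
print(date_fragment_ratio("1347-06-08"))  # 8 tokens / 3 fields ≈ 2.67
```

This mirrors the paper's observation at toy scale: dates outside the tokenizer's training distribution fragment more heavily, which is exactly the regime where the summary reports accuracy drops.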

Takeaways, Limitations

Takeaways:
  • A new metric (the date fragment ratio) for quantifying how severely tokenizers fragment dates.
  • DateAugBench, a new dataset for temporal reasoning tasks, is released.
  • Insight into how LLMs handle dates (the fragment-assembly process and computation path).
  • Larger LLMs are observed to combine date fragments more quickly.
  • Excessive date fragmentation is shown to degrade temporal reasoning accuracy.
Limitations:
  • The generality of the proposed metric and dataset needs further study.
  • A deeper analysis of the LLMs' date-assembly process is needed.
  • Further research across diverse languages and cultures is needed.