Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens

Created by
  • Haebom

Authors

Jiacheng Liu, Sewon Min, Luke Zettlemoyer, Yejin Choi, Hannaneh Hajishirzi

Outline

This paper aims to show that n-gram language models remain useful even in the era of large language models (LLMs) by modernizing them with a corpus of 5 trillion tokens. In particular, the authors develop an infinite n-gram (∞-gram) model, in which n can be made arbitrarily large, and the infini-gram engine, which computes ∞-gram probabilities with millisecond-level latency using a suffix array. With this engine, they analyze human-written and machine-generated text, confirming the ∞-gram model's fairly high next-token prediction accuracy (47%) and its ability to reduce LLM perplexity when the two are combined. The analysis of machine-generated text also reveals irregularities that point to deficiencies in Transformer positional embeddings and in LLM pre-training.
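To make the mechanism concrete, the following is a minimal Python sketch of the two ideas described above: counting pattern occurrences by binary search over a suffix array, and backing off to the longest context suffix that actually appears in the reference corpus. It is an illustrative toy, not the authors' infini-gram engine; the corpus and function names are assumptions made for this example.

```python
# Minimal sketch of the ∞-gram computation on a toy corpus (not the authors' engine).
# Idea from the paper: back off to the longest suffix of the context that occurs in
# the reference data, and obtain counts by binary search over a suffix array.
# The corpus and all function names below are hypothetical, for illustration only.

def build_suffix_array(tokens):
    """Suffix start positions, sorted lexicographically by the suffix they begin."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def _first_at_least(tokens, sa, pattern, strictly_greater):
    """Index of the first suffix whose length-|pattern| prefix is >= (or >) pattern."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = tuple(tokens[sa[mid]:sa[mid] + len(pattern)])
        if prefix < pattern or (strictly_greater and prefix == pattern):
            lo = mid + 1
        else:
            hi = mid
    return lo

def count(tokens, sa, pattern):
    """Number of occurrences of `pattern` in the corpus, via two binary searches."""
    pattern = tuple(pattern)
    return (_first_at_least(tokens, sa, pattern, True)
            - _first_at_least(tokens, sa, pattern, False))

def infinigram_prob(tokens, sa, context, next_token):
    """P(next_token | context) using the longest context suffix seen in the corpus."""
    for start in range(len(context) + 1):           # start = 0 is the longest suffix
        suffix = tuple(context[start:])
        denom = count(tokens, sa, suffix)
        if denom > 0:                               # effective n = len(suffix) + 1
            return count(tokens, sa, suffix + (next_token,)) / denom
    return 0.0

corpus = "the cat sat on the mat the cat sat on the hat".split()
sa = build_suffix_array(corpus)
print(infinigram_prob(corpus, sa, ("on", "the"), "mat"))    # -> 0.5 on this toy corpus
```

The actual engine builds its suffix-array index over trillions of tokens ahead of time, which is what makes millisecond-level queries possible; the toy version above simply rebuilds everything in memory.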

Takeaways, Limitations

Takeaways:
Re-evaluation of n-gram models by building an n-gram language model at the scale of 5 trillion tokens.
Development of the ∞-gram model and the infini-gram engine, which improve on traditional n-gram models and open up new kinds of analysis.
Demonstration that text analysis with the ∞-gram model can surface LLM limitations, and that combining ∞-gram estimates with LLMs can improve their performance (a simplified interpolation sketch follows this list).
Millisecond-level ∞-gram probability queries that make real-time applications feasible.
Limitations:
The performance evaluation of the ∞-gram model presented in this study may be limited to a specific dataset.
The computational efficiency of the infini-gram engine may vary depending on the data size and n value.
Although shortcomings of LLMs are pointed out, specific measures for improving them are not provided.
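As a companion to the takeaway on combining the ∞-gram model with neural LLMs, below is a simplified interpolation sketch: it mixes the two models' next-token probabilities with a fixed weight and measures perplexity on a held-out sequence. The fixed weight `lam` and the probability lists are assumptions for illustration; the paper's actual interpolation scheme is more refined.

```python
import math

# Simplified illustration of combining ∞-gram estimates with a neural LLM to lower
# perplexity. A fixed mixing weight `lam` is an assumption for this sketch, and the
# probability lists below are placeholders rather than real model outputs.

def interpolated_prob(p_neural, p_infinigram, lam=0.5):
    """Linear mixture of a neural-LM probability and an ∞-gram probability."""
    return lam * p_neural + (1.0 - lam) * p_infinigram

def perplexity(neural_probs, infinigram_probs, lam=0.5, eps=1e-12):
    """Perplexity of a held-out sequence under the interpolated model.
    Element i of each list is that model's probability of token i given its context."""
    log_sum = 0.0
    for p_n, p_i in zip(neural_probs, infinigram_probs):
        log_sum += math.log(max(interpolated_prob(p_n, p_i, lam), eps))
    return math.exp(-log_sum / len(neural_probs))

# Toy example: when the ∞-gram assigns higher probability to the observed tokens,
# the mixture's perplexity drops below the neural LM's own perplexity.
neural = [0.20, 0.05, 0.30, 0.10]
inf    = [0.60, 0.01, 0.50, 0.40]
print(perplexity(neural, inf))          # mixture perplexity
print(perplexity(neural, neural))       # the neural LM's own perplexity
```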