Daily Arxiv

This page organizes papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, simply cite the source.

Pretraining with hierarchical memories: separating long-tail and common knowledge

Created by
  • Haebom

Authors

Hadi Pouransari, David Grangier, C Thomas, Michael Kirchhof, Oncel Tuzel

Outline

This paper proposes a memory-augmentation architecture and pre-training strategy that address modern language models' reliance on parameter scaling for performance. The architecture gives a small language model access to a large hierarchical parametric memory bank that encodes world knowledge: during both pre-training and inference, small memory blocks are retrieved based on the context and added to the model. Long-tail world knowledge is thus stored in the memory parameters, while the small language model learns common knowledge and general reasoning capabilities. Experimentally, adding an 18M-parameter memory, fetched from a 4.6B-parameter memory bank, to a 160M-parameter model achieves performance comparable to a conventional model with more than twice as many parameters. The paper also investigates the optimal type and size of parametric memory for Transformer models, scaling up to 21B parameters, and shows that the proposed hierarchical feedforward memories work robustly across Transformer architectures, whether added during pre-training or post hoc.
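To make the mechanism concrete, below is a minimal PyTorch sketch of one plausible reading of the described design: a small model's feedforward layer is extended, per context, with key/value rows fetched from a much larger parametric memory bank. The class names (MemoryBank, MemoryAugmentedFFN), the fetch interface, and all sizes are illustrative assumptions, not the authors' implementation; in particular, the hierarchical selection of blocks from the bank is abstracted away here as a plain list of block indices.

```python
# Illustrative sketch (not the authors' code): a Transformer FFN augmented with
# extra key/value rows fetched from a large parametric memory bank.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MemoryBank(nn.Module):
    """A large pool of feedforward-style memory parameters, organized in blocks.

    Each block is a small set of (key, value) rows that can be plugged into a
    feedforward layer; only a few blocks are fetched per context.
    """

    def __init__(self, num_blocks: int, rows_per_block: int, d_model: int):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_blocks, rows_per_block, d_model) * 0.02)
        self.values = nn.Parameter(torch.randn(num_blocks, rows_per_block, d_model) * 0.02)

    def fetch(self, block_ids: torch.Tensor):
        # block_ids: (num_fetched,) indices chosen by a context-dependent retriever (assumed).
        k = self.keys[block_ids].reshape(-1, self.keys.shape[-1])      # (num_fetched*rows, d_model)
        v = self.values[block_ids].reshape(-1, self.values.shape[-1])  # (num_fetched*rows, d_model)
        return k, v


class MemoryAugmentedFFN(nn.Module):
    """Standard FFN whose hidden layer is extended by retrieved memory rows."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_hidden)
        self.w_out = nn.Linear(d_hidden, d_model)

    def forward(self, x, mem_keys=None, mem_values=None):
        # Base feedforward path: common knowledge / general reasoning.
        h = F.gelu(self.w_in(x))
        out = self.w_out(h)
        if mem_keys is not None:
            # Memory path: fetched rows act as extra hidden units contributing
            # long-tail knowledge for the current context.
            mem_h = F.gelu(x @ mem_keys.T)   # (batch, seq, num_mem_rows)
            out = out + mem_h @ mem_values   # (batch, seq, d_model)
        return out


if __name__ == "__main__":
    d_model, d_hidden = 256, 1024
    bank = MemoryBank(num_blocks=1000, rows_per_block=64, d_model=d_model)
    ffn = MemoryAugmentedFFN(d_model, d_hidden)

    x = torch.randn(2, 16, d_model)          # (batch, seq, d_model)
    block_ids = torch.tensor([3, 42, 917])   # chosen from the context by a retriever
    mem_k, mem_v = bank.fetch(block_ids)
    print(ffn(x, mem_k, mem_v).shape)        # torch.Size([2, 16, 256])
```

The point of the sketch is the separation the paper argues for: the base feedforward parameters stay small and fixed, while the knowledge the model can draw on grows with the size of the memory bank, of which only a small context-dependent slice is active at a time.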

Takeaways, Limitations

Takeaways:
  • Combining a small language model with a large external memory bank maintains performance while reducing the number of parameters the model itself needs.
  • Suggests that language models can be used efficiently in memory- and compute-constrained environments such as edge devices.
  • The hierarchical feedforward memory can be applied to a variety of Transformer architectures.
Limitations:
  • Building and managing the memory bank requires further research.
  • How different memory-retrieval methods affect performance, and how to optimize them, remains to be studied.
  • The effectiveness of the proposed architecture needs further validation in real-world applications.