Daily Arxiv

This page collects papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

Bottlenecked Transformers: Periodic KV Cache Consolidation for Generalized Reasoning

Created by
  • Haebom

Authors

Adnan Oomerjee, Zafeirios Fountas, Haitham Bou-Ammar, Jun Wang

Outline

To improve the reasoning performance of Transformer LLMs, the paper proposes a "Bottlenecked Transformer" architecture that periodically rewrites the model's memory (KV cache) during generation. Inspired by the brain's memory (re)consolidation process and grounded in information bottleneck theory, the architecture compresses the KV cache while retaining task-relevant information in order to improve generalization. A secondary transformer, the Cache Processor, consolidates newly written KV entries and selectively rewrites a subset of past entries. On mathematical reasoning benchmarks, the proposed model consistently outperforms both standard Transformer baselines and pause-token-based models.
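The sketch below illustrates, in PyTorch, the general idea of periodic KV cache consolidation with a secondary transformer. It is not the paper's implementation: the CacheProcessor module, its slot-based cross-attention, and the generate_with_consolidation loop (including names such as step_fn, period, and n_slots) are illustrative assumptions meant only to show how cached entries could be compressed into fewer consolidated entries at regular intervals.

```python
import torch
import torch.nn as nn


class CacheProcessor(nn.Module):
    """Hypothetical secondary transformer that consolidates a KV cache.

    Illustrative sketch only (not the authors' design): it compresses a
    sequence of cached key/value vectors into a fixed number of slots via
    learned queries and cross-attention.
    """

    def __init__(self, d_model: int = 64, n_slots: int = 16, n_heads: int = 4):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(n_slots, d_model))  # learned slot queries
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )

    def forward(self, cache: torch.Tensor) -> torch.Tensor:
        # cache: (batch, seq_len, d_model) -- flattened cached entries
        b = cache.size(0)
        queries = self.slots.unsqueeze(0).expand(b, -1, -1)
        consolidated, _ = self.cross_attn(queries, cache, cache)
        return consolidated + self.mlp(consolidated)  # (batch, n_slots, d_model)


def generate_with_consolidation(step_fn, processor, max_steps=256, period=32):
    """Toy decoding loop: every `period` steps, rewrite the cache.

    `step_fn(cache)` is a stand-in for one decoding step that returns the
    new KV entry for the generated token; all names here are hypothetical.
    """
    cache = torch.zeros(1, 0, 64)              # empty KV cache
    for t in range(max_steps):
        new_entry = step_fn(cache)             # (1, 1, d_model)
        cache = torch.cat([cache, new_entry], dim=1)
        if (t + 1) % period == 0:              # periodic consolidation
            cache = processor(cache)           # compress + selectively retain
    return cache


# Example usage with a dummy step function that emits random KV entries:
proc = CacheProcessor()
dummy_step = lambda cache: torch.randn(1, 1, 64)
final_cache = generate_with_consolidation(dummy_step, proc, max_steps=64, period=16)
print(final_cache.shape)  # torch.Size([1, 16, 64]) after the final consolidation
```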

Takeaways, Limitations

Takeaways:
Shows that reasoning performance can be improved by (re)writing the KV cache during generation.
Grounds the architectural design in information bottleneck theory.
Achieves consistent gains on mathematical reasoning benchmarks.
Offers a novel way to apply the brain's memory (re)consolidation process to LLMs.
Limitations:
Evaluation is limited to specific mathematical reasoning benchmarks.
Generalizability to other reasoning tasks and LLM architectures remains to be verified.
The optimal design of the Cache Processor (e.g., its size and consolidation frequency) requires further study.
The computational overhead and efficiency of the proposed architecture need further analysis.