Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Adaptive Computation Pruning for the Forgetting Transformer

Created by
  • Haebom

Authors

Zhixuan Lin, Johan Obando-Ceron, Xu Owen He, Aaron Courville

Outline

We propose Adaptive Computation Pruning (ACP) to improve the efficiency of the Forgetting Transformer (FoX). FoX improves on the standard Transformer by adding a forget gate to softmax attention, but many of its attention heads tend to forget information quickly, so much of the attention computation is spent on input-output dependencies that the forget gate has already strongly attenuated. ACP dynamically removes these computations, and pruning is performed safely via a dynamically set pruning threshold. Applying ACP to FoX in language model pretraining reduced FLOPs and memory accesses by approximately 70%, which translated into a 50-70% reduction in attention execution time (a 2-3x speedup) and a 10-40% increase in end-to-end training throughput, with larger computational savings at longer context lengths. These speedups were achieved without compromising performance.
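To make the pruning idea concrete, here is a minimal NumPy sketch of FoX-style attention with a simplified form of adaptive computation pruning. This is an illustration under assumptions, not the paper's implementation: the paper uses a dynamically set pruning threshold inside an optimized attention routine, while this sketch processes one query at a time and uses a fixed log-space cutoff. The function and argument names (fox_attention_with_acp, forget_log, prune_threshold) are illustrative.

```python
# Minimal sketch of the pruning idea, assuming single-head attention and
# NumPy for clarity. Names are illustrative, not from the paper's code;
# the paper sets the pruning threshold dynamically, whereas this sketch
# uses a fixed log-space cutoff for readability.
import numpy as np

def fox_attention_with_acp(q, k, v, forget_log, prune_threshold=-15.0):
    """FoX-style softmax attention with a simplified form of ACP.

    q, k, v         : (T, d) query / key / value matrices
    forget_log      : (T,) log forget-gate values, each <= 0
    prune_threshold : positions whose cumulative decay bias falls below
                      this cutoff are skipped entirely
    """
    T, d = q.shape
    cum = np.cumsum(forget_log)              # cum[i] = sum_{l <= i} log f_l
    out = np.zeros_like(v)
    for i in range(T):
        # Decay bias D[i, j] = sum_{l = j+1 .. i} log f_l = cum[i] - cum[j];
        # it is 0 at j = i and grows more negative for older positions.
        decay = cum[i] - cum[:i + 1]
        # Prune heavily forgotten positions: their logits are pushed so far
        # down that their softmax weights are negligible.
        keep = np.nonzero(decay >= prune_threshold)[0]
        logits = q[i] @ k[keep].T / np.sqrt(d) + decay[keep]
        w = np.exp(logits - logits.max())
        out[i] = (w / w.sum()) @ v[keep]
    return out
```

Because the decay bias only becomes more negative toward older positions, the pruned region is always a contiguous prefix of old keys, which is why skipping it saves both FLOPs and memory accesses rather than just masking individual entries.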

Takeaways, Limitations

Takeaways:
  • We present ACP, a technique that significantly improves the efficiency of FoX.
  • Attention computation is sped up by 2-3x by substantially reducing FLOPs and memory accesses.
  • End-to-end training throughput increases by 10-40%.
  • The savings are larger at longer context lengths.
  • The speedups are achieved without sacrificing performance.
Limitations:
  • The ACP technique is specialized for FoX; its applicability to other Transformer models requires further study.
  • The results are currently tied to a specific implementation (GitHub link provided), and further verification is needed to assess generalizability to other implementations and hardware environments.