Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content here is summarized using Google Gemini and operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Gumiho: A Hybrid Architecture to Prioritize Early Tokens in Speculative Decoding

Created by
  • Haebom

Author

Jinze Li, Yixing Xu, Haiduo Huang, Xuanwu Yin, Dong Li, Edith C.H. Ngai, Emad Barsoum

Outline

This paper addresses speculative decoding (SPD), which accelerates the autoregressive token generation of large language models (LLMs). Existing SPD methods use draft models with multiple heads to predict future token sequences, but they treat all draft tokens as equally important and rely on a single generation style (fully serial or fully parallel). The authors theoretically show that earlier draft tokens matter more than later ones, and based on this insight propose Gumiho, a hybrid draft model that combines serial and parallel heads. Gumiho uses a sophisticated Transformer-based serial head to improve the accuracy of the critical early tokens, and multiple lightweight MLP heads running in parallel to generate the later tokens efficiently. By assigning the more advanced model structure and longer execution time to the early heads, Gumiho improves overall performance, and experiments show it outperforms existing methods.
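The hybrid drafting idea described above can be sketched in a few lines. This is a hypothetical toy illustration, not the authors' implementation: a "serial head" stands in for the Transformer head that refines early draft tokens autoregressively, and cheap "parallel heads" stand in for the MLP heads that propose the remaining tokens in a single step.

```python
# Toy sketch of a hybrid serial+parallel draft step (hypothetical;
# real heads would be neural networks operating on hidden states).
from typing import Callable, List

def hybrid_draft(
    hidden: int,
    serial_head: Callable[[int, List[int]], int],
    parallel_heads: List[Callable[[int], int]],
    num_serial: int,
) -> List[int]:
    """Produce a draft token sequence for speculative decoding.

    Early tokens come from the (more accurate, slower) serial head,
    applied autoregressively; later tokens come from lightweight
    heads evaluated independently on the same context."""
    draft: List[int] = []
    # Serial phase: each early token conditions on the tokens before it.
    for _ in range(num_serial):
        draft.append(serial_head(hidden, draft))
    # Parallel phase: remaining tokens are predicted independently,
    # so they could run concurrently with the serial phase.
    draft.extend(head(hidden) for head in parallel_heads)
    return draft

# Stand-in heads: the serial head conditions on context length,
# the parallel heads emit fixed offsets.
serial = lambda h, ctx: h + len(ctx)
parallel = [lambda h: h + 10, lambda h: h + 20]
print(hybrid_draft(5, serial, parallel, num_serial=2))  # → [5, 6, 15, 25]
```

The design point this illustrates: the serial phase spends more compute per early token (each call sees the growing context), while the parallel phase trades per-token accuracy for throughput on the later, less important tokens.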

Takeaways, Limitations

Takeaways: Experimentally validates the effectiveness of a hybrid SPD approach that accounts for the greater importance of early tokens. Presents a novel method that combines the advantages of serial and parallel processing to improve both the speed and accuracy of token generation in LLMs. Gumiho outperforms existing SPD methods.
Limitations: The performance gains of Gumiho may be limited to particular LLMs and datasets. The generalizability of the theoretical argument for the importance of early tokens requires further study, and additional experiments across a wider range of LLMs and tasks are needed.