Daily Arxiv

This page organizes papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright of each paper belongs to its authors and their institutions; when sharing, please cite the source.

Speculative Decoding via Hybrid Drafting and Rollback-Aware Branch Parallelism

Created by
  • Haebom

Authors

Yuhao Shen, Junyi Shen, Quan Kong, Tianyu Liu, Yao Lu, Cong Wang

Outline

This paper addresses the limited parallelism of speculative decoding (SD), a promising technique for accelerating large language model (LLM) inference. To relieve the bottleneck caused by the serial execution of existing SD methods, the authors propose SpecBranch, a novel framework inspired by branch prediction in modern processors. SpecBranch introduces parallel speculative branches to mitigate anticipated rejections and enhances parallelism by adapting draft lengths using a combination of implicit and explicit model confidence signals. Experiments across various models and benchmarks show that SpecBranch achieves a 1.8x to 4.5x speedup over autoregressive decoding while maintaining the same sampling distribution, and reduces rollback tokens by about 50% even for poorly aligned models.
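For context, the sketch below illustrates the baseline draft-then-verify loop that SD relies on and where rollback tokens come from. It is a minimal illustration only: the `draft_model` and `target_model` objects and their `next_token_probs` method are hypothetical placeholders rather than the paper's API, and SpecBranch's parallel branches and adaptive draft length are not reproduced here.

```python
# Minimal sketch of vanilla speculative decoding (draft-then-verify), to show
# where rollbacks arise. Model interfaces are hypothetical placeholders.
import random

def speculative_step(prefix, draft_model, target_model, draft_len=4):
    """Propose `draft_len` tokens with the draft model, then verify them
    with the target model; rejected tokens count as rollback."""
    # 1) Drafting: the small model proposes a block of tokens serially.
    #    (Greedy selection shown for brevity; lossless SD samples from
    #    the draft distribution.)
    drafted, draft_probs = [], []
    ctx = list(prefix)
    for _ in range(draft_len):
        p = draft_model.next_token_probs(ctx)   # dict: token -> probability
        tok = max(p, key=p.get)
        drafted.append(tok)
        draft_probs.append(p)
        ctx.append(tok)

    # 2) Verification: the target model scores the drafted positions
    #    (in practice this is done in one parallel forward pass).
    accepted, rollback = [], 0
    ctx = list(prefix)
    for tok, p_draft in zip(drafted, draft_probs):
        q = target_model.next_token_probs(ctx)
        # Standard SD acceptance test: accept with probability min(1, q/p).
        if random.random() < min(1.0, q.get(tok, 0.0) / max(p_draft[tok], 1e-9)):
            accepted.append(tok)
            ctx.append(tok)
        else:
            # Reject: this token and the rest of the draft are discarded
            # (rolled back); the target model would resample the next token
            # here, which is omitted in this sketch.
            rollback = len(drafted) - len(accepted)
            break
    return accepted, rollback
```

Per the outline above, SpecBranch's contribution is to run such speculative branches in parallel and adapt the draft length from confidence signals, so that rollbacks like the one in this loop waste less work.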

Takeaways, Limitations

Takeaways:
Presents SpecBranch, a new framework that speeds up LLM inference by 1.8x to 4.5x.
Improves efficiency by reducing rollback tokens by about 50%, even when the draft and target models are poorly aligned.
Achieves this speedup while maintaining the same sampling distribution as autoregressive decoding.
Demonstrates a successful transfer of branch prediction ideas from modern processors to LLM inference.
Limitations:
SpecBranch's performance gains may vary with the model and benchmark used; further research is needed to establish how well these results generalize.
Parallel branch execution may introduce overhead that requires additional optimization.
Applicability to other LLM architectures and model sizes requires further study.