Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

DSDE: Dynamic Speculative Decoding with KLD Stability for Real-World Serving

Created by
  • Haebom

Authors

Mingyu Yang, Jae-Young Choi, Kihyo Moon, Minsung Jang, Eunjoo Jeon

Outline

This paper observes that speculative decoding, which accelerates large language model (LLM) inference, typically relies on a fixed speculation length, which is suboptimal in large-scale batch serving environments with heterogeneous requests. The paper therefore explores dynamic adaptation guided by a new class of post-hoc diagnostic signals. To this end, it proposes the Dynamic Speculative Decoding Engine (DSDE), a training-free framework built on two main components: first, a predictive signal based on the variance of the Kullback-Leibler divergence (KLD), which diagnoses the local stability of generation; and second, an adaptive upper bound on the speculation length, which mitigates per-sequence straggler latency in batched decoding. Experimental results demonstrate the potential of KLD-based stability signals for dynamic adaptation: algorithms guided by these signals achieve end-to-end latency competitive with state-of-the-art baselines and exhibit strong robustness across diverse workloads. This robustness is particularly valuable in challenging low-capacity regimes, where the proposed signal maintains its diagnostic utility. In conclusion, these findings validate post-hoc signals as a crucial component of more robust and intelligent LLM inference systems, and highlight promising directions for future research on dynamic speculation length adaptation.
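Since the summary above describes the mechanism only in prose, here is a minimal, illustrative Python sketch of the core idea: compute the variance of recent per-token KLDs between draft and target distributions, and use it to pick the next speculation length. The function names, the window size, and the thresholded signal-to-length mapping are assumptions for illustration, not the paper's actual estimator or schedule.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete token distributions."""
    p, q = np.clip(p, eps, None), np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

def kld_stability_signal(draft_dists, target_dists, window=8):
    """Variance of the per-token KLD over a recent window.

    Low variance suggests generation is locally stable, so longer
    speculations are more likely to be accepted; high variance
    suggests the draft model is drifting from the target.
    """
    klds = [kl_divergence(p, q) for p, q in zip(draft_dists, target_dists)]
    return float(np.var(klds[-window:]))

def next_speculation_length(signal, k_min=1, k_max=8, threshold=0.5):
    """Map the stability signal to a speculation length, clipped to
    an upper bound (a fixed k_max here; DSDE adapts it per sequence)."""
    if signal < threshold:  # stable region: speculate aggressively
        return k_max
    # Unstable region: back off toward k_min as the variance grows.
    scale = max(0.0, 1.0 - (signal - threshold))
    return max(k_min, int(round(k_min + scale * (k_max - k_min))))
```

Under this sketch, a run of tokens where the draft and target distributions agree closely yields low KLD variance, so the speculation length stays at its cap; a spike in variance shrinks the next speculation window.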

Takeaways, Limitations

Takeaways:
• Dynamic speculative decoding guided by KLD-based stability signals enables efficient and robust LLM inference in large-scale batch serving environments.
• A training-free framework (DSDE) built on post-hoc diagnostic signals is presented, showing that performance can be improved without retraining the model.
• The approach maintains robust performance even in low-capacity environments, increasing its adaptability to diverse workloads.
Limitations:
• Further research is needed on the generality of the proposed KLD-based stability signal and its applicability to other types of LLMs and tasks.
• DSDE's performance gains may be limited to specific environments and require evaluation across a broader range of settings.
• Computing the KLD can incur additional overhead, and methods to manage this cost efficiently are needed (a sketch of what the per-token computation entails follows this list).
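To make the last limitation concrete, the sketch below shows what a vectorized per-token KLD computation over the vocabulary entails, assuming the draft and target logits are already materialized during speculative verification. The function `batched_kld_from_logits` and its shapes are illustrative, not from the paper.

```python
import numpy as np

def batched_kld_from_logits(draft_logits, target_logits, eps=1e-12):
    """KL(draft || target) for a [num_tokens, vocab] block of logits.

    If both logit tensors are already produced during verification,
    the marginal cost is a softmax plus an elementwise product and a
    reduction per token, rather than any additional forward pass.
    """
    def softmax(x):
        x = x - x.max(axis=-1, keepdims=True)  # for numerical stability
        e = np.exp(x)
        return e / e.sum(axis=-1, keepdims=True)

    p = np.clip(softmax(draft_logits), eps, None)
    q = np.clip(softmax(target_logits), eps, None)
    return (p * np.log(p / q)).sum(axis=-1)  # one KLD value per token
```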