Daily Arxiv

This page organizes papers on artificial intelligence published around the world.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; when sharing, simply cite the source.

R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning

Created by
  • Haebom

Author

Zhuokun Chen, Zeren Chen, Jiahao He, Lu Sheng, Mingkui Tan, Jianfei Cai, Bohan Zhuang

Outline

Improving the problem-solving ability of large language models (LLMs) with Chain-of-Thought (CoT) reasoning comes at a high inference cost, because long reasoning traces must be decoded token by token. R-Stitch is a training-free hybrid decoding framework that uses token-level entropy as an uncertainty signal to split computation between a small language model (SLM) and an LLM: tokens the SLM decodes with low entropy are accepted cheaply, while high-entropy tokens are delegated to the LLM, preserving answer quality without rolling back the full sequence. R-Stitch$^{+}$ goes beyond a fixed entropy threshold by learning an adaptive routing policy that dynamically adjusts the token budget. By reducing both per-token decoding cost and the number of generated tokens, the method achieves substantial speedups with minimal accuracy loss: up to 3.00x on DeepSeek-R1-Distill-Qwen-7B, 3.85x on the 14B model, and 4.10x on QwQ-32B. It also offers an efficiency-accuracy trade-off that can be tuned to different computational budgets without retraining.
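
To make the entropy-gated routing concrete, below is a minimal sketch of the fixed-threshold variant in PyTorch. The `slm`/`llm` callables, the greedy argmax decoding, the threshold value, and the EOS id are illustrative assumptions, not the paper's implementation; R-Stitch$^{+}$ replaces the fixed threshold with a learned routing policy.

```python
import torch
import torch.nn.functional as F

def token_entropy(logits: torch.Tensor) -> float:
    """Shannon entropy (in nats) of the next-token distribution."""
    probs = F.softmax(logits, dim=-1)
    return float(-(probs * torch.log(probs + 1e-12)).sum())

@torch.no_grad()
def r_stitch_decode(slm, llm, prompt_ids: torch.Tensor,
                    entropy_threshold: float = 1.0,  # assumed value; tune per task
                    max_new_tokens: int = 256,
                    eos_id: int = 2):                # assumed EOS token id
    """Entropy-gated hybrid decoding: the SLM drafts each token; when its
    next-token entropy exceeds the threshold, the LLM emits the token instead.
    `slm` and `llm` are assumed to map a 1-D tensor of token ids to
    next-token logits of shape (vocab_size,)."""
    ids = prompt_ids.clone()
    for _ in range(max_new_tokens):
        slm_logits = slm(ids)
        if token_entropy(slm_logits) <= entropy_threshold:
            next_id = int(slm_logits.argmax())  # SLM is confident: keep its cheap token
        else:
            next_id = int(llm(ids).argmax())    # SLM is uncertain: delegate to the LLM
        ids = torch.cat([ids, torch.tensor([next_id], dtype=ids.dtype)])
        if next_id == eos_id:
            break
    return ids
```

A higher threshold keeps more tokens on the SLM (faster but riskier), while a lower one leans on the LLM; this single knob is what yields the tunable efficiency-accuracy trade-off described above.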

Takeaways, Limitations

Takeaways:
  • Presents a training-free hybrid decoding framework that improves LLM inference speed.
  • Distributes the computational load between an SLM and an LLM using token-level entropy.
  • R-Stitch$^{+}$ learns an adaptive routing policy with dynamic token-budget adjustment.
  • Improves speed while maintaining accuracy across a variety of models and settings.
  • Enables an efficiency-accuracy trade-off without retraining.
Limitations:
  • Little information is given about specific model architectures, training data, or hyperparameters.
  • Generalization across different tasks and LLMs needs further validation.
  • No performance evaluation in real-world usage environments.