Daily Arxiv

This is a page that curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning

Created by
  • Haebom

Author

Zhuokun Chen, Zeren Chen, Jiahao He, Mingkui Tan, Jianfei Cai, Bohan Zhuang

Outline

This paper proposes R-Stitch, a novel method for reducing the computational cost of Chain-of-Thought (CoT) reasoning. CoT reasoning improves the problem-solving ability of large language models (LLMs), but it is computationally expensive because it autoregressively decodes long token sequences. Existing acceleration strategies either shorten the sequence through early stopping or compressive reward schemes, or speed up decoding via speculative decoding with a smaller model. However, speculative decoding yields limited speedup when agreement between the small and large models is low, and it fails to exploit the potential advantage of small models in producing concise intermediate reasoning. R-Stitch is a token-level, confidence-based hybrid decoding framework that switches between a small language model (SLM) and a large language model (LLM): the SLM decodes by default, and the LLM takes over only when the SLM's confidence falls below a threshold, preserving both efficiency and accuracy. The method is model-agnostic, requires no training, and is compatible with standard decoding pipelines. Experiments on mathematical reasoning benchmarks show that R-Stitch reduces inference latency by up to 85% with minimal accuracy degradation.
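
The confidence-gated switching rule can be illustrated with a short sketch. The Python example below is not the authors' implementation: the model names, the threshold value, greedy token selection, per-token delegation to the LLM, and the assumption that both models share a tokenizer and vocabulary are all simplifications made for illustration.

```python
# Minimal sketch of token-level, confidence-gated hybrid decoding (illustrative only).
# Assumes the SLM and LLM share the same tokenizer/vocabulary; model names and the
# confidence threshold below are placeholder assumptions, not values from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

SLM_NAME = "path/to/small-model"   # hypothetical identifiers
LLM_NAME = "path/to/large-model"
THRESHOLD = 0.8                    # assumed confidence threshold

tokenizer = AutoTokenizer.from_pretrained(SLM_NAME)
slm = AutoModelForCausalLM.from_pretrained(SLM_NAME).eval()
llm = AutoModelForCausalLM.from_pretrained(LLM_NAME).eval()

@torch.no_grad()
def hybrid_generate(prompt: str, max_new_tokens: int = 256) -> str:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        # 1) Let the cheap SLM propose the next token and measure its confidence
        #    as the maximum probability of its next-token distribution.
        slm_logits = slm(ids).logits[:, -1, :]
        slm_probs = torch.softmax(slm_logits, dim=-1)
        confidence, token = slm_probs.max(dim=-1)

        # 2) If the SLM is not confident enough, delegate this token to the LLM.
        if confidence.item() < THRESHOLD:
            llm_logits = llm(ids).logits[:, -1, :]
            token = llm_logits.argmax(dim=-1)

        ids = torch.cat([ids, token.view(1, 1)], dim=-1)
        if token.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```

A practical system would additionally reuse KV caches for both models and might let the LLM continue for a span of tokens before handing control back to the SLM; the sketch above only shows the basic confidence check that decides which model emits the next token.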

Takeaways, Limitations

Takeaways:
Presents a novel method that effectively reduces the computational cost of CoT reasoning.
Experimentally demonstrates that inference latency can be reduced by up to 85% with virtually no accuracy degradation.
It is model-agnostic, requires no training, and is compatible with standard decoding pipelines, making it highly practical.
Limitations:
How to set the SLM confidence threshold may require further study.
Generalization to other problem types and model combinations may require further evaluation.
If the capability gap between the SLM and the LLM is large, the attainable gains may be limited.