This is a page that curates AI-related papers published worldwide. All content here is summarized using Google Gemini and operated on a non-profit basis. Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.
Yilin Guan, Wenyue Hua, Qingfeng Lan, Sun Fei, Dujian Ding, Devang Acharya, Chi Wang, William Yang Wang
Outline
This paper proposes Dynamic Speculative Planning (DSP), an asynchronous online reinforcement learning framework that addresses the high latency and inference cost of deploying large language model (LLM)-based agents. DSP achieves lossless acceleration and cost reduction simultaneously, without additional pre-deployment preparation, by explicitly optimizing a joint objective over end-to-end latency and cost. By adjusting a single parameter, users can choose fast response, low-cost operation, or a middle ground. Experiments on two standard agent benchmarks show that DSP matches the efficiency of the fastest lossless acceleration method while reducing total cost by 30% and unnecessary cost by up to 60%. The code and data are publicly available on GitHub (https://github.com/guanyilin428/Dynamic-Speculative-Planning).
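The single-parameter trade-off described above can be illustrated with a minimal sketch. Note that the function and symbol names below (`joint_objective`, `lam`, the candidate tuples) are illustrative assumptions for exposition, not the paper's actual formulation:

```python
# Hypothetical sketch of a joint latency/cost objective with one
# trade-off parameter `lam`: lam = 0 favors the fastest response,
# larger lam increasingly favors low-cost operation.
# (Names and structure are assumptions, not taken from the paper.)

def joint_objective(latency: float, cost: float, lam: float) -> float:
    """Weighted sum of end-to-end latency and cost."""
    return latency + lam * cost

def choose_config(candidates, lam):
    """Pick the configuration minimizing the joint objective.

    candidates: list of (config_id, expected_latency, expected_cost)
    """
    return min(candidates, key=lambda c: joint_objective(c[1], c[2], lam))[0]

# Example: an aggressive config is fast but costly, a conservative
# one is slow but cheap; `lam` selects between them.
options = [("conservative", 2.0, 1.0), ("aggressive", 1.0, 3.0)]
```

With `lam = 0` the aggressive (lowest-latency) configuration wins; with a large enough `lam` the conservative (lowest-cost) one does.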
Takeaways, Limitations
•
Takeaways:
◦
Presents a novel method that effectively addresses the latency and inference cost of LLM-based agents.
◦
Achieves lossless acceleration and cost savings simultaneously.
◦
Gives users control over the latency-cost trade-off via a single parameter.
◦
Improves efficiency without additional pre-deployment training.
•
Limitations:
◦
Further research is needed to determine the generality of the proposed method and its applicability to various models and tasks.
◦
Experiments were limited to two standard benchmarks; broader evaluation is needed.
◦
Lacks analysis of long-term operation and maintenance costs.