Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the site is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

Dynamic Quality-Latency Aware Routing for LLM Inference in Wireless Edge-Device Networks

Created by
  • Haebom

Authors

Rui Bao, Nan Xue, Yaping Sun, Zhiyong Chen

Outline

This paper addresses the integration of wireless communications and large language models (LLMs) to deliver ubiquitous intelligent services. In collaborative wireless edge-device environments, the trade-off between inference quality and end-to-end latency is a central issue: offloading simple queries to the edge server incurs unnecessary latency, while keeping complex queries on the device degrades quality, creating a mismatch between task complexity and resource allocation. To address this, the authors propose a dynamic quality-latency aware routing framework that coordinates inference between a lightweight model on the mobile device and a powerful model on the edge server. The framework uses two distinct cost models: for single-turn queries, it fuses BERT-predicted semantic scores with communication and computation overhead; for multi-turn conversations, it additionally quantifies context-aware costs arising from model switching and KV cache management. Extensive experiments on the MMLU, GSM8K, and MT-Bench-101 benchmarks show that the framework reduces average response latency by 5-15% and cuts large-model invocations by 10-20% compared to competing baselines, while maintaining inference quality.
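To make the routing logic concrete, the sketch below illustrates the kind of quality-latency cost comparison the summary describes: a BERT-derived difficulty score is traded off against per-route latency, with extra model-switching and KV-cache terms for multi-turn conversations. This is a minimal illustration, not the authors' implementation; the difficulty scorer, cost fields, and weights are hypothetical placeholders rather than the paper's actual formulation.

```python
# Minimal sketch of a quality-latency aware router (not the authors' code).
# Assumptions (hypothetical, for illustration only):
#   - `difficulty_score` comes from a BERT-based classifier and lies in [0, 1],
#     where higher means the query likely needs the larger edge model.
#   - Latency terms (uplink, compute, switching, KV-cache transfer) are supplied
#     as estimates; the paper's actual cost models and weights are not reproduced.
from dataclasses import dataclass

@dataclass
class RouteCosts:
    device_compute_s: float   # estimated on-device inference latency
    uplink_s: float           # estimated time to upload the query/context
    edge_compute_s: float     # estimated edge-server inference latency
    switch_penalty_s: float   # multi-turn only: cost of switching models
    kv_transfer_s: float      # multi-turn only: cost of moving/rebuilding the KV cache

def route(difficulty_score: float,
          costs: RouteCosts,
          multi_turn: bool = False,
          quality_weight: float = 1.0,
          latency_weight: float = 1.0) -> str:
    """Return 'device' or 'edge' by trading off expected quality loss vs. latency."""
    # Expected quality loss if the query stays on the small device model:
    # harder queries (higher score) are penalized more.
    device_quality_loss = quality_weight * difficulty_score
    # Latency of each route; multi-turn adds context-aware switching/KV costs
    # when the query is moved to the edge model.
    device_latency = latency_weight * costs.device_compute_s
    edge_latency = latency_weight * (costs.uplink_s + costs.edge_compute_s)
    if multi_turn:
        edge_latency += latency_weight * (costs.switch_penalty_s + costs.kv_transfer_s)
    # Total cost: the edge route is assumed to have negligible quality loss.
    device_cost = device_quality_loss + device_latency
    edge_cost = edge_latency
    return "device" if device_cost <= edge_cost else "edge"

# Example: a moderately hard single-turn query with a fast uplink.
example = RouteCosts(device_compute_s=0.8, uplink_s=0.5, edge_compute_s=0.3,
                     switch_penalty_s=0.4, kv_transfer_s=0.6)
print(route(difficulty_score=0.7, costs=example))  # -> 'edge' with the default weights
```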

Takeaways, Limitations

Takeaways:
  • Presents a dynamic routing framework that effectively reduces the latency of LLM-based services in wireless edge environments.
  • Improves inference quality and efficiency simultaneously through cost models tailored to single-turn and multi-turn queries.
  • Experimentally verifies performance gains on the MMLU, GSM8K, and MT-Bench-101 benchmarks.
Limitations:
  • Further research is needed on practical deployment of the proposed framework.
  • Robustness should be assessed under diverse wireless network conditions and edge-device resource constraints.
  • Whether results on the evaluated benchmarks generalize to other benchmarks and real-world applications remains to be examined.