Daily Arxiv

This page curates AI-related papers published worldwide.
Summaries are generated with Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to the authors and their institutions; please make sure to credit the source when sharing.

VoltanaLLM: Feedback-Driven Frequency Control and State-Space Routing for Energy-Efficient LLM Serving

Created by
  • Haebom

Authors

Jiahuan Yu, Aryan Taneja, Junfeng Lin, Minjia Zhang

Outline

To address the high energy cost of large language model (LLM) inference, this paper presents VoltanaLLM, an energy-efficient LLM serving system that is aware of service-level objectives (SLOs). Taking a control-theory perspective, VoltanaLLM co-designs frequency scaling and request routing for serving architectures that separate the prefill and decode stages. It has two components: a feedback-based frequency controller that dynamically adjusts GPU frequency for the prefill and decode stages independently, and a state-space router that explores routing decisions across instances to minimize energy under latency constraints. VoltanaLLM is implemented in SGLang and evaluated on several state-of-the-art LLMs and real-world datasets, where it achieves up to 36.3% energy savings while maintaining near-perfect SLO attainment. The source code is available on GitHub.
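The two components described above can be illustrated with a minimal Python sketch. This is not the VoltanaLLM implementation: the frequency levels, the step-up/step-down rule, the toy latency model, and all names (`FeedbackFrequencyController`, `Instance`, `route`) are hypothetical and chosen only to show the general idea of feedback-driven frequency control combined with SLO-constrained routing.

```python
from dataclasses import dataclass


@dataclass
class FeedbackFrequencyController:
    """Toy feedback loop: raise frequency on SLO violation, lower it when there is slack."""
    slo_ms: float
    freqs_mhz: tuple = (1005, 1200, 1410)  # hypothetical GPU frequency levels
    level: int = 0                         # index into freqs_mhz; start at the lowest level
    low_margin: float = 0.7                # step down only below 70% of the SLO (assumption)

    def update(self, observed_latency_ms: float) -> int:
        """Consume one latency measurement and return the frequency to apply next."""
        top = len(self.freqs_mhz) - 1
        if observed_latency_ms > self.slo_ms and self.level < top:
            self.level += 1   # SLO violated: speed up
        elif observed_latency_ms < self.low_margin * self.slo_ms and self.level > 0:
            self.level -= 1   # ample slack: slow down to save energy
        return self.freqs_mhz[self.level]


@dataclass
class Instance:
    name: str
    freq_mhz: int
    batch_size: int


def predicted_latency_ms(inst: Instance) -> float:
    """Toy model of the latency after admitting one more request (assumption)."""
    next_batch = inst.batch_size + 1
    return (5.0 + 1.0 * next_batch) * 1410.0 / inst.freq_mhz


def route(instances: list, slo_ms: float) -> Instance:
    """Pick the lowest-frequency instance that is predicted to stay within the SLO.

    Frequency is used here as a crude proxy for power draw. If no instance is
    feasible, fall back to the one with the lowest predicted latency.
    """
    feasible = [i for i in instances if predicted_latency_ms(i) <= slo_ms]
    if feasible:
        return min(feasible, key=lambda i: i.freq_mhz)
    return min(instances, key=predicted_latency_ms)


if __name__ == "__main__":
    ctrl = FeedbackFrequencyController(slo_ms=50.0)
    for latency in (60.0, 60.0, 40.0, 30.0):
        print(latency, "->", ctrl.update(latency), "MHz")

    pool = [Instance("a", 1005, 10), Instance("b", 1410, 2)]
    print(route(pool, slo_ms=20.0).name)
    print(route(pool, slo_ms=30.0).name)
```

In this sketch the prefill and decode stages would each get their own controller instance with its own latency target, which is the kind of per-stage control that separating the two stages makes possible.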

Takeaways, Limitations

Takeaways:
Presents a novel system that can significantly improve the energy efficiency of LLM inference.
Minimizes energy consumption while meeting SLOs through a control-theoretic approach.
Enables fine-grained, per-stage control by exploiting the separation of prefill and decode.
Demonstrates effectiveness through evaluation on several state-of-the-art LLMs and real-world datasets.
Releases the source code, which supports further research and practical adoption.
Limitations:
The approach may depend on a specific serving architecture (separated prefill and decode stages).
Further research is needed on how well the results generalize across other LLMs and datasets.
Long-term stability and scalability still need to be verified in production environments.
The control-theoretic design may make implementation and maintenance more difficult.