Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Block: Balancing Load in LLM Serving with Context, Knowledge and Predictive Scheduling

Created by
  • Haebom

Authors

Wei Da, Evangelia Kalyvianaki

Outline

This paper presents Block, a distributed scheduling framework that leverages contextual information about incoming requests to optimize load balancing and auto-provisioning across instances in large language model (LLM) serving systems. Unlike existing model serving systems that rely on monolithic, heuristic task schedulers, Block operates as a fully distributed, stateless, and predictive scheduler, yielding low overhead, reliability, and scalability. It exploits the deterministic and predictable properties of LLM inference, such as host configuration, response length, and hardware performance, to make scheduling decisions based on accurately predicted metrics. Evaluations on a 12-GPU cluster show that Block significantly outperforms heuristic schedulers, increasing serving capacity by up to 16.7% and reducing P99 tail latency by up to 49.5%. These gains hold across a variety of models, workloads, and configurations. The code and data are open source.
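The core idea, routing each request by its predicted cost rather than by observed load, can be sketched as follows. This is a minimal illustration under assumptions made here for clarity, not Block's actual algorithm: the class names, the token-throughput load model, and the response-length predictor input are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Instance:
    """One serving replica with a known, fixed decode throughput."""
    name: str
    tokens_per_sec: float        # hardware performance, assumed known
    pending_tokens: float = 0.0  # predicted work already routed here

@dataclass
class Request:
    prompt_tokens: int
    predicted_output_tokens: int  # from a response-length predictor

def predicted_finish_seconds(inst: Instance, req: Request) -> float:
    # Toy load model: queued plus new tokens drain at the instance's
    # known throughput; a real system would model the prefill and
    # decode phases separately.
    total = inst.pending_tokens + req.prompt_tokens + req.predicted_output_tokens
    return total / inst.tokens_per_sec

def route(instances: list[Instance], req: Request) -> Instance:
    # Pick the instance with the smallest predicted completion time,
    # then charge it the newly assigned (predicted) work.
    best = min(instances, key=lambda i: predicted_finish_seconds(i, req))
    best.pending_tokens += req.prompt_tokens + req.predicted_output_tokens
    return best

if __name__ == "__main__":
    cluster = [
        Instance("gpu-0", tokens_per_sec=90.0),
        Instance("gpu-1", tokens_per_sec=60.0),
    ]
    for predicted in (128, 512, 64):
        r = Request(prompt_tokens=200, predicted_output_tokens=predicted)
        print(f"{predicted:>4} predicted tokens -> {route(cluster, r).name}")
```

Even this toy version shows why prediction helps: a heuristic scheduler that only counts in-flight requests would treat a 64-token and a 512-token response identically, while a predictive one spreads the actual token load.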

Takeaways, Limitations

Takeaways:
  • Presents a novel distributed scheduling framework that can significantly improve the performance of large-scale LLM serving systems.
  • Increases serving capacity and reduces latency through efficient load balancing and auto-provisioning.
  • Enables accurate prediction-based scheduling by exploiting the deterministic and predictable properties of LLM inference.
  • The code and data are open source, so other researchers can build on them.
Limitations:
  • Evaluated only on a 12-GPU cluster; further research is needed to determine how performance scales on larger clusters.
  • Although evaluated on a variety of models and workloads, further validation is needed to establish generalizability across all types of LLMs and workloads.
  • Long-term stability and scalability in real-world production environments require further evaluation.