How to Efficiently Serve an LLM?
LLMs, or Large Language Models, get their name from their scale: they range from tens to hundreds of billions of parameters. Their utility is clear, as LLMs keep setting new records on benchmarks and now often match or exceed human performance on many tasks (see the GPT-4 Technical Report). Consequently, many companies are eager to deploy them in production. However, because of this unprecedented size, serving LLMs poses significant challenges: slow token generation (tokens/second), memory limits for holding the model parameters and the KV cache (explained later), compute limits, and more. In this article, we will cover several recent ideas that help build a robust LLM serving system.
LLM Inference Steps
- Multiple users send requests to the LLM Server through HTTPS/gRPC.
- The LLM Server receives the requests and schedules them based on a QoE definition.
- QoE (Quality of Experience): Defined by two metrics:
- TTFT (Time to First Token): The time it takes for a user to receive the first response token.
- TDS (Token Delivery Speed): The rate at which the user receives tokens; it should be steady and at or above the user's reading speed for a positive experience.
- After scheduling, the LLM Inference process is divided into two phases:
Prefill phase: The LLM processes all input tokens in parallel and stores their attention keys and values, known as the “KV cache”. This step makes excellent use of the GPU’s parallel compute, which is why input tokens are generally much cheaper than output tokens (as seen in the GPT-4o pricing chart). The prefill phase produces the first output token and is typically compute-bound.
Decode phase: The LLM then generates output tokens autoregressively, one at a time. This phase is much slower (it is typically memory-bandwidth-bound) and is where most optimizations focus. At each step, the new token’s keys and values are appended to the KV cache, and the cache is reused to produce the following token.
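To make the two phases concrete, here is a toy, framework-free sketch: a single fake attention “layer” with random weights and greedy argmax sampling (none of this is a real model or any library’s API). The prefill call pushes the whole prompt through in one batched pass and fills the KV cache; the decode loop then feeds one token at a time, reusing and extending that cache.

```python
import numpy as np

D_MODEL, VOCAB = 64, 1000
rng = np.random.default_rng(0)
E = rng.standard_normal((VOCAB, D_MODEL)) * 0.02         # embedding table
W_qkv = rng.standard_normal((D_MODEL, 3 * D_MODEL)) * 0.02
W_out = rng.standard_normal((D_MODEL, VOCAB)) * 0.02

def forward(hidden, cache):
    """One toy attention 'layer': append this chunk's K/V to the cache,
    then attend over everything cached so far."""
    q, k, v = np.split(hidden @ W_qkv, 3, axis=-1)
    cache["k"] = np.concatenate([cache["k"], k])
    cache["v"] = np.concatenate([cache["v"], v])
    scores = q @ cache["k"].T / np.sqrt(D_MODEL)
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    return probs @ cache["v"]

prompt = [5, 42, 7, 99]                                   # user prompt token ids
cache = {"k": np.zeros((0, D_MODEL)), "v": np.zeros((0, D_MODEL))}

# Prefill: every prompt token goes through in ONE batched pass,
# producing the KV cache and the first output token (compute-bound).
hidden = forward(E[prompt], cache)
generated = [int(np.argmax(hidden[-1] @ W_out))]

# Decode: one token per step, reusing and extending the KV cache
# (memory-bandwidth-bound, and where most serving optimizations apply).
for _ in range(8):
    hidden = forward(E[[generated[-1]]], cache)           # sequence length 1
    generated.append(int(np.argmax(hidden[-1] @ W_out)))
print(generated)
```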
Optimizations
Many teams are innovating across the inference stack, and multiple startups are competing to cut serving costs and attract more customers.
Here are some interesting optimizations shared recently in research; toy code sketches for several of them follow the list:
- Batching: Grouping concurrent requests into a single forward pass to keep the GPU busy; modern engines use continuous (iteration-level) batching, so new requests can join and finished ones leave at every step.
- Model Quantization (FP8/INT8): Storing weights (and sometimes the KV cache) in 8-bit formats roughly halves memory footprint and bandwidth compared to FP16, usually at a small accuracy cost.
- Paged Attention:
- The core idea behind vLLM, the most popular open-source serving engine, is to avoid the memory fragmentation caused by reserving the maximum context length for every request: the KV cache is instead managed in small fixed-size blocks using paging (an idea borrowed from OS virtual memory).
- Paged Attention in vLLM
- Prefill Chunking / Stall-free Batching:
- The Sarathi-Serve paper proposes splitting a long prefill into smaller chunks, which lets the scheduler merge the prefill and decode phases of different requests into the same batch.
- Prefill Decode Prioritizing
- Prefill/Decode Disaggregation:
- In contrast to the previous idea, the paper Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving proposes separating the prefill and decode phases onto different worker pools and transferring the KVCache between them through a specialized design.
- KVCache Transfer in Disaggregated Architecture
- KVCache Compression:
- CacheGen proposes compressing the KVCache to speed up its transfer over the network. The approach pays off for use cases with large contexts (e.g., content summarization) of roughly 16k input tokens or more, where the bandwidth savings justify the encoding/decoding CPU overhead.
- KV Cache Compression
- Speculative Decoding: A small draft model proposes several tokens ahead, and the large model verifies them in a single forward pass, so multiple tokens can be accepted per expensive model step.
- Radix Attention (Prefix Caching):
- This is the idea behind SGLang (SGLang: Efficient Execution of Structured Language Model Programs): a data structure similar to a prefix tree (trie) is built over the KVCache so that requests sharing a prefix can reuse it without recomputation. It only helps workloads with shared prefixes, like those shown in the image below:
- KV Cache Sharing Examples
- Early Rejection:
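The sketches below illustrate several of the optimizations above in plain Python; the names, shapes, and numbers are illustrative assumptions, not any library’s actual implementation. First, model quantization: a generic symmetric per-row INT8 scheme that stores int8 weights plus one float scale per output row.

```python
# Generic symmetric per-row INT8 weight quantization (an illustration,
# not any specific library's scheme): weights are stored as int8 plus one
# float32 scale per row and dequantized, or fed to int8 kernels, at runtime.
import numpy as np

def quantize_int8(w: np.ndarray):
    scales = np.abs(w).max(axis=1, keepdims=True) / 127.0   # one scale per row
    q = np.clip(np.round(w / scales), -127, 127).astype(np.int8)
    return q, scales.astype(np.float32)

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scales

w = np.random.default_rng(0).standard_normal((4096, 4096)).astype(np.float32)
q, s = quantize_int8(w)
print(f"memory: {w.nbytes / 2**20:.0f} MiB -> {(q.nbytes + s.nbytes) / 2**20:.0f} MiB")
print(f"max abs error: {np.abs(w - dequantize(q, s)).max():.4f}")
```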
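Paged Attention: a toy block manager in the spirit of vLLM’s design. Requests receive fixed-size KV blocks on demand through a per-request block table instead of a contiguous max-context reservation; the class and method names are made up for illustration and are not vLLM’s API.

```python
BLOCK_SIZE = 16                      # tokens whose K/V share one block

class BlockManager:
    """Toy paged KV-cache allocator: requests get fixed-size blocks on
    demand instead of a max-context-length reservation."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))   # physical block ids
        self.block_tables = {}   # request id -> list of physical blocks
        self.seq_lens = {}       # request id -> tokens stored so far

    def append_token(self, req_id: str) -> tuple[int, int]:
        """Reserve one KV slot for `req_id`; returns (block id, offset)."""
        n = self.seq_lens.get(req_id, 0)
        table = self.block_tables.setdefault(req_id, [])
        if n % BLOCK_SIZE == 0:                      # current block is full
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted: preempt or swap")
            table.append(self.free_blocks.pop())
        self.seq_lens[req_id] = n + 1
        return table[n // BLOCK_SIZE], n % BLOCK_SIZE

    def release(self, req_id: str) -> None:
        """Return a finished request's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(req_id, []))
        self.seq_lens.pop(req_id, None)

mgr = BlockManager(num_blocks=8)
for _ in range(20):                  # a 20-token sequence
    mgr.append_token("req-A")
print(mgr.block_tables["req-A"])     # only 2 blocks used, not a max-length slab
mgr.release("req-A")
```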
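Prefill chunking / stall-free batching: a toy scheduler in the spirit of Sarathi-Serve. Each iteration first packs one decode token for every running request and then fills the remaining token budget with a chunk of a pending prefill, so a long prompt never stalls other users’ token delivery. The token budget and request fields are assumptions for the sketch.

```python
from dataclasses import dataclass

TOKEN_BUDGET = 256                  # max tokens processed per model step

@dataclass
class Request:
    rid: str
    prompt_len: int
    prefilled: int = 0              # prompt tokens already processed

    @property
    def in_decode(self) -> bool:
        return self.prefilled >= self.prompt_len

def build_batch(requests):
    """Return [(rid, n_tokens), ...] for one model iteration."""
    batch, budget = [], TOKEN_BUDGET
    for r in requests:              # decodes first: keeps token delivery steady
        if r.in_decode and budget > 0:
            batch.append((r.rid, 1))
            budget -= 1
    for r in requests:              # then fill the rest with prefill chunks
        if not r.in_decode and budget > 0:
            chunk = min(budget, r.prompt_len - r.prefilled)
            r.prefilled += chunk
            batch.append((r.rid, chunk))
            budget -= chunk
    return batch

reqs = [Request("A", prompt_len=4000), Request("B", prompt_len=8, prefilled=8)]
for step in range(3):
    print(step, build_batch(reqs))
# B gets its decode token every step while A's long prefill is chunked.
```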
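Prefill/decode disaggregation: a minimal sketch of the handoff, where a prefill worker publishes a request’s KV cache to a shared store and a separate decode worker pulls it and continues generation. The in-memory dict and random tensors are stand-ins; Mooncake’s actual transfer engine and cache layout are far more involved.

```python
import numpy as np

KV_STORE: dict[str, bytes] = {}          # stand-in for the KV-cache transfer layer
rng = np.random.default_rng(0)

def prefill_worker(req_id: str, prompt_len: int) -> int:
    """Process the whole prompt, publish its KV cache, return the first token."""
    kv = rng.standard_normal((prompt_len, 2, 128)).astype(np.float32)  # [T, K/V, d]
    KV_STORE[req_id] = kv.tobytes()      # "transfer" to the decode side
    return 42                            # placeholder first token

def decode_worker(req_id: str, first_token: int, steps: int = 4) -> list:
    """Fetch the prefilled KV cache and continue generating locally."""
    kv = np.frombuffer(KV_STORE.pop(req_id), dtype=np.float32).reshape(-1, 2, 128)
    out = [first_token]
    for _ in range(steps):
        new_kv = rng.standard_normal((1, 2, 128)).astype(np.float32)
        kv = np.concatenate([kv, new_kv])        # decode extends the cache
        out.append(int(rng.integers(0, 1000)))   # placeholder sampling
    return out

tok = prefill_worker("req-7", prompt_len=512)
print(decode_worker("req-7", tok))
```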
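KVCache compression: a toy stand-in for the CacheGen idea, using per-tensor INT8 quantization plus zlib before sending the cache over the network. The real system uses a purpose-built codec; this only shows where compression sits in the pipeline.

```python
import zlib
import numpy as np

def compress_kv(kv: np.ndarray):
    """Quantize the KV tensor to int8 and compress it for network transfer."""
    scale = float(np.abs(kv).max() / 127.0) or 1.0
    q = np.clip(np.round(kv / scale), -127, 127).astype(np.int8)
    return zlib.compress(q.tobytes(), 6), scale

def decompress_kv(blob: bytes, scale: float, shape) -> np.ndarray:
    """Inverse of compress_kv on the receiving (decode) side."""
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int8).reshape(shape)
    return q.astype(np.float32) * scale

kv = np.random.default_rng(0).standard_normal((16_000, 128)).astype(np.float32)
blob, scale = compress_kv(kv)
restored = decompress_kv(blob, scale, kv.shape)
print(f"{kv.nbytes / len(blob):.1f}x smaller, "
      f"max error {np.abs(kv - restored).max():.3f}")
```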
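Speculative decoding: a toy of the propose/verify control flow with greedy verification. Both “models” are deterministic stand-ins purely so the example runs; in practice the draft is a small LLM (or extra heads) and the target verifies all proposed positions in one batched forward pass.

```python
K = 4                                  # tokens proposed per speculative step

def target_next(ctx):                  # "large" model (the ground truth here)
    return (sum(ctx) * 31 + 7) % 100

def draft_next(ctx):                   # "small" model: right most of the time
    # The draft peeks at the target purely to give the toy a controllable
    # agreement rate; a real draft model is simply cheaper and approximate.
    t = target_next(ctx)
    return t if len(ctx) % 3 else (t + 1) % 100

def speculative_step(ctx):
    # 1) The draft proposes K tokens autoregressively (cheap).
    proposal, d_ctx = [], list(ctx)
    for _ in range(K):
        t = draft_next(d_ctx)
        proposal.append(t)
        d_ctx.append(t)
    # 2) The target checks every position (conceptually one parallel pass).
    accepted, v_ctx = [], list(ctx)
    for t in proposal:
        expected = target_next(v_ctx)
        if t == expected:
            accepted.append(t)
            v_ctx.append(t)
        else:
            accepted.append(expected)          # target's correction, then stop
            break
    else:
        accepted.append(target_next(v_ctx))    # bonus token when all accepted
    return accepted

ctx = [3, 1, 4]
for _ in range(3):
    new = speculative_step(ctx)
    ctx += new
    print(f"accepted {len(new)} token(s): {new}")
```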
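Radix Attention / prefix caching: a toy prefix cache built as a plain trie keyed by token ids (SGLang uses a compressed radix tree and stores real KV blocks; strings stand in for them here). A second request sharing the system prompt only needs prefill for its unmatched suffix.

```python
class RadixNode:
    def __init__(self):
        self.children = {}           # token id -> RadixNode
        self.kv = None               # handle to the cached KV for this token

class PrefixCache:
    def __init__(self):
        self.root = RadixNode()

    def match(self, tokens):
        """Length of the longest already-cached prefix of `tokens`."""
        node, n = self.root, 0
        for t in tokens:
            nxt = node.children.get(t)
            if nxt is None or nxt.kv is None:
                break
            node, n = nxt, n + 1
        return n

    def insert(self, tokens, kv_blocks):
        """Record per-token KV handles so later requests can reuse them."""
        node = self.root
        for t, kv in zip(tokens, kv_blocks):
            node = node.children.setdefault(t, RadixNode())
            node.kv = kv

system_prompt = [101, 7, 7, 9]               # shared by both requests
req_a = system_prompt + [1, 2, 3]
req_b = system_prompt + [4, 5]

cache = PrefixCache()
print(cache.match(req_a))                    # 0 -> full prefill needed
cache.insert(req_a, [f"kv{i}" for i in range(len(req_a))])
print(cache.match(req_b))                    # 4 -> only [4, 5] needs prefill
```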
Conclusion
Efficiently serving large language models is essential for businesses to reduce costs and increase generation speed (tokens/second), and this efficiency opens the door to more LLM use cases. With the ideas presented here, you can optimize your LLM inference stack to achieve these goals and more!
References
- Improving LLM Inference with Prefill Chunking / Stall-free Batching (USENIX)
- Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
- KVCache Compression and Streaming for Faster LLM Serving (arXiv)
- Dynamic Memory Management for LLMs: vAttention (arXiv)
- Enhancing Quality-of-Experience in LLM-Based Services (arXiv)
- Prefix Caching for Efficient LLM Inference (arXiv)
- Mastering LLM Techniques: Inference Optimization (NVIDIA Technical Blog)
- Token Probability Distribution (Hugging Face)
- Welcome to vLLM! — vLLM Documentation
- Serving Large Language Models: Technologies and Choices (run.ai)
- Efficient Large Language Model Serving (arXiv)