How to Efficiently Serve an LLM?

LLMs, or Large Language Models, are named so because they range from tens to hundreds of billions of parameters. Their utility is clear: LLMs are setting new records on various benchmarks and now often match or exceed human performance on multiple tasks (GPT-4 Technical Report). Consequently, many companies are eager to deploy them in production. However, due to the unprecedented size of LLMs, serving them poses significant challenges: slow token generation (tokens/second), memory limits for loading model parameters and the KV cache (explained later), compute limits, and more. In this article, we will cover several recent ideas to help set up a robust LLM serving system.

LLM Serving Overview

LLM Inference Steps

  1. Multiple users send requests to the LLM Server through HTTPS/gRPC.
  2. The LLM Server receives the requests and schedules them based on a QoE definition.
    • QoE (Quality of Experience): Defined by two metrics:
      • TTFT (Time to First Token): The time it takes for a user to receive the first response token.
      • TDS (Token Delivery Speed): The rate at which the user receives tokens, which should be uniform and above the reader’s reading speed for a positive user experience.

    QoE-Aware LLM Serving

  3. After scheduling, the LLM Inference process is divided into two phases:
    • Prefill phase: The LLM processes all input tokens in parallel and stores their attention keys and values as the “KV cache”. This step makes efficient use of the GPU’s parallel processing capabilities, which is why input tokens are generally much cheaper than output tokens (as seen in the GPT-4o pricing chart). This phase produces the first output token and is typically compute-bound.

    • GPT-4o Pricing

    • Decode phase: The LLM then generates output tokens autoregressively, one at a time. This phase is slower and is where most optimizations are needed. At each step, the new token’s keys and values are appended to the KV cache and used to generate the next token. A minimal sketch of both phases follows this list.

    • KV Cache Explanation & Reuse
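
To make the two phases concrete, here is a minimal sketch using the Hugging Face transformers API (the gpt2 checkpoint and greedy decoding are assumptions, used purely for illustration): the prefill step builds the KV cache in one parallel forward pass, and the decode loop reuses it to generate one token per step.

```python
# Minimal sketch of the prefill and decode phases. Assumes the Hugging Face
# "transformers" library; the gpt2 checkpoint and greedy decoding are
# illustrative choices only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tokenizer("Efficient LLM serving is", return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill: all prompt tokens are processed in one parallel forward pass.
    # The attention keys/values for every token come back as past_key_values
    # (the KV cache), and the first output token comes from the last logits.
    out = model(input_ids, use_cache=True)
    kv_cache = out.past_key_values
    next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)

    # Decode: tokens are generated autoregressively, one forward pass per token.
    # Each pass feeds only the newest token and appends its keys/values to the cache.
    generated = [next_token]
    for _ in range(16):
        out = model(next_token, past_key_values=kv_cache, use_cache=True)
        kv_cache = out.past_key_values
        next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_token)

print(tokenizer.decode(torch.cat(generated, dim=1)[0]))
```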

Optimizations

Many experts are innovating on the inference stack, and multiple startups are competing to reduce costs and attract more customers.

  • LLaMA 405B Pricing by Different Providers

Here are some interesting optimizations shared recently in research:

  1. Batching:
    • Instead of serving one request at a time and wasting compute resources (since the decode phase has low arithmetic intensity and is memory-bound), we can amortize the cost of reading weights and the KV cache from memory by serving multiple requests simultaneously; a toy continuous-batching loop is sketched after this list.
    • Continuous Batching
  2. Model Quantization (FP8/INT8):
    • Decreasing the precision of model weights and/or activations (e.g., with AWQ/GPTQ) frees up GPU VRAM, which allows serving larger batches of requests (see the loading sketch after this list).
    • Model Quantization
  3. Paged Attention:
    • The core idea behind vLLM, the most popular open-source serving engine: avoid the memory fragmentation caused by reserving the maximum context length for every request, and instead manage the KV cache in fixed-size blocks using paging (borrowed from OS memory management). A toy block allocator is sketched after this list.
    • Paged Attention in vLLM
  4. Prefill Chunking / Stall-free Batching:
    • Proposed in the Sarathi-Serve paper: dividing a long prompt’s prefill into smaller chunks allows the prefill and decode phases of different requests to be merged into the same batch, so ongoing decodes are not stalled behind long prompts. A toy batch builder is sketched after this list.
    • Sarathi-Serve: Prefill/Decode Prioritizing
  5. Prefill/Decode Disaggregation:
    • Running the prefill and decode phases on separate GPU pools, as in Mooncake’s KVCache-centric disaggregated architecture, lets each phase be scaled and optimized independently: prefill workers compute the KV cache and transfer it to the decode workers.
  6. KVCache Compression:
    • As proposed by CacheGen, the KV cache can be compressed to speed up its transfer over the network. This is most beneficial for long-context use cases (e.g., content summarization with more than 16k input tokens), where the transfer savings justify the encoding/decoding CPU overhead.
    • KV Cache Compression
  7. Speculative Decoding:
    • A smaller draft model quickly proposes several tokens, which the original model then verifies in a single parallel pass; accepted tokens are kept, speeding up inference for easy-to-predict text. Note that as the request batch size increases, the speed-up from speculative decoding diminishes. A sketch using assisted generation follows this list.
    • Speculative Decoding
  8. Radix Attention (Prefix Caching):
    • This is the idea behind SGLang (SGLang: Efficient Execution of Structured Language Model Programs): a prefix tree (Trie)–like data structure indexes the KV cache so that requests sharing a prefix can reuse it without recomputation. This only helps certain use cases, like those shown in the image below; a toy prefix cache is sketched after this list.
    • KV Cache Sharing Examples
  9. Early Rejection:
    • Predicting whether a request can be served as soon as it is received avoids wasted resources (e.g., computing the prefill only to fail at the decode phase due to memory limitations), improving server utilization and preventing downtime.
    • Early Rejection Based on Prediction
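
The toy continuous-batching loop referenced in the batching item above: no real model is involved, and the request lengths and batch size are made-up numbers. The point is only that requests join and leave the batch at every decode iteration instead of waiting for a full batch to finish.

```python
# Toy simulation of continuous (iteration-level) batching. Request lengths and
# the batch size are hypothetical; a real engine would run one batched forward
# pass per iteration instead of the inner loop below.
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    tokens_left: int  # decode steps this request still needs

waiting = deque(Request(rid=i, tokens_left=n) for i, n in enumerate([3, 8, 2, 5, 4]))
running, max_batch_size, step = [], 3, 0

while waiting or running:
    # Admit new requests at every iteration, not only between full batches.
    while waiting and len(running) < max_batch_size:
        running.append(waiting.popleft())

    # One decode iteration: every running request emits one token.
    for req in running:
        req.tokens_left -= 1

    finished = [r.rid for r in running if r.tokens_left == 0]
    running = [r for r in running if r.tokens_left > 0]
    step += 1
    print(f"step {step}: batch={[r.rid for r in running]} finished={finished}")
```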
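
The loading sketch referenced in the quantization item: 8-bit weights via the bitsandbytes integration in transformers. The checkpoint name is an assumption for illustration; AWQ/GPTQ checkpoints ship with their own quantization configs and loaders.

```python
# Sketch of 8-bit weight quantization at load time. Assumes the transformers,
# accelerate, and bitsandbytes packages; the checkpoint name is illustrative.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",  # places the quantized weights on the available GPU(s)
)
# The VRAM freed by smaller weights can hold a larger KV cache, i.e. bigger batches.
```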
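
The toy block allocator referenced in the Paged Attention item captures the core bookkeeping of the idea; it is a simplified illustration, not vLLM’s actual implementation, and the block and pool sizes are made up.

```python
# Toy PagedAttention-style KV-cache allocator: fixed-size blocks are handed out
# on demand and tracked per request in a block table, instead of reserving
# max-context-length memory up front. Sizes are illustrative assumptions.
BLOCK_SIZE = 16          # tokens per KV-cache block
NUM_BLOCKS = 1024        # total blocks available on the GPU (assumed)

free_blocks = list(range(NUM_BLOCKS))
block_tables = {}        # request_id -> list of physical block ids

def append_token(request_id: int, num_tokens_so_far: int) -> None:
    """Allocate a new block only when the previous one is full."""
    table = block_tables.setdefault(request_id, [])
    if num_tokens_so_far % BLOCK_SIZE == 0:   # first token of a new block
        table.append(free_blocks.pop())

def free_request(request_id: int) -> None:
    """Return all blocks of a finished request to the free pool."""
    free_blocks.extend(block_tables.pop(request_id, []))

# A 40-token request occupies ceil(40 / 16) = 3 blocks instead of a
# max-context-length reservation, so fragmentation stays minimal.
for t in range(40):
    append_token(request_id=7, num_tokens_so_far=t)
print(len(block_tables[7]), "blocks used")  # -> 3
free_request(7)
```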
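
The toy batch builder referenced in the prefill-chunking item, in the spirit of Sarathi-Serve’s stall-free batching (the token budget and request fields are assumptions): each iteration serves one token per running decode first, then fills the remaining budget with a chunk of a pending prompt.

```python
# Toy chunked-prefill batch builder: decodes are never stalled behind a long
# prompt because only a budget-sized chunk of the prompt joins each iteration.
TOKEN_BUDGET = 512  # assumed per-iteration token budget

def build_batch(decode_requests, prefill_queue):
    batch, budget = [], TOKEN_BUDGET
    # Decodes go first: each contributes exactly one token to the iteration.
    for req in decode_requests:
        batch.append(("decode", req["id"], 1))
        budget -= 1
    # Fill what's left of the budget with a chunk of the next pending prompt.
    if prefill_queue and budget > 0:
        req = prefill_queue[0]
        chunk = min(budget, req["prompt_tokens_left"])
        req["prompt_tokens_left"] -= chunk
        batch.append(("prefill", req["id"], chunk))
        if req["prompt_tokens_left"] == 0:
            prefill_queue.pop(0)
    return batch

decodes = [{"id": 0}, {"id": 1}]
prefills = [{"id": 2, "prompt_tokens_left": 2000}]
print(build_batch(decodes, prefills))
# -> [('decode', 0, 1), ('decode', 1, 1), ('prefill', 2, 510)]
```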
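
The assisted-generation sketch referenced in the speculative decoding item uses Hugging Face transformers’ built-in assisted generation; the gpt2/gpt2-xl checkpoints are illustrative, and the draft and target models must share a tokenizer.

```python
# Sketch of speculative (assisted) decoding: a small draft model proposes
# tokens, the target model verifies them in parallel. Checkpoints are
# illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2-xl")
target = AutoModelForCausalLM.from_pretrained("gpt2-xl")  # the model we actually serve
draft = AutoModelForCausalLM.from_pretrained("gpt2")      # small, fast draft model

inputs = tokenizer("Efficient LLM serving is", return_tensors="pt")
# The draft model proposes several tokens per step; the target model checks them
# in a single parallel forward pass and keeps the longest accepted prefix.
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```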
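
The toy prefix cache referenced in the Radix Attention item: requests that share a prompt prefix (for example, the same system prompt) can skip recomputing that prefix’s KV cache. This is a simplified trie, not SGLang’s actual radix tree, and the KV handles are placeholders.

```python
# Toy prefix (radix-style) cache for KV reuse across requests.
class PrefixNode:
    def __init__(self):
        self.children = {}       # token id -> PrefixNode
        self.kv_handle = None    # placeholder for the cached KV block(s)

root = PrefixNode()

def insert(tokens, kv_handle):
    """Record that the KV cache for this exact token prefix is stored."""
    node = root
    for tok in tokens:
        node = node.children.setdefault(tok, PrefixNode())
    node.kv_handle = kv_handle

def longest_cached_prefix(tokens):
    """Return how many leading tokens already have reusable KV cache."""
    node, matched = root, 0
    for i, tok in enumerate(tokens):
        if tok not in node.children:
            break
        node = node.children[tok]
        if node.kv_handle is not None:
            matched = i + 1
    return matched

# Example: a shared system prompt [1, 2, 3] is cached once...
insert([1, 2, 3], kv_handle="kv-blocks-for-system-prompt")
# ...and a later request reuses it, only prefilling its own suffix [4, 5].
print(longest_cached_prefix([1, 2, 3, 4, 5]))  # -> 3
```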

Conclusion

Efficiently serving large language models is essential for businesses to reduce costs and increase generation speed (tokens/second), which opens the door to more LLM use cases. With the ideas presented here, you can optimize your LLM inference stack to achieve these goals and more!

References

  1. Improving LLM Inference with Prefill Chunking / Stall-free Batching (USENIX)
  2. Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
  3. KVCache Compression and Streaming for Faster LLM Serving (arXiv)
  4. Dynamic Memory Management for LLMs: vAttention (arXiv)
  5. Enhancing Quality-of-Experience in LLM-Based Services (arXiv)
  6. Prefix Caching for Efficient LLM Inference (arXiv)
  7. Mastering LLM Techniques: Inference Optimization (NVIDIA Technical Blog)
  8. Token Probability Distribution (Hugging Face)
  9. Welcome to vLLM! — vLLM Documentation
  10. Serving Large Language Models: Technologies and Choices (run.ai)
  11. Efficient Large Language Model Serving (arXiv)
This post is licensed under CC BY 4.0 by the author.