vLLM is an open-source inference engine designed for serving large language models (LLMs) with efficiency and scalability in mind. In this post, I’ll walk you through the lifecycle of an inference request as it traverses vLLM’s V1 architecture, reveal how vLLM maximizes GPU utilization, and highlight the key code components that power its performance.

Terminology

Here are some recurring terms, consistent with vLLM’s codebase and docs:

  • Request: An OpenAI-compatible chat completion request sent by a client.
  • Sequence: The stream of tokens—prompt and response—linked to a request.
  • Batching: Grouping multiple requests for joint inference to maximize GPU use.
  • Prefill: Processing prompt tokens, building the Key (K) and Value (V) tensors in the KV cache.
  • Decode: Generating output tokens by reusing the KV cache and computing new Query (Q) tensors.
  • KV Cache: GPU memory holding all transformer attention keys/values for every token across all requests.
  • KV Block: A fixed-size chunk of the KV cache; allocating the cache in blocks keeps memory overhead and fragmentation low.
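
To make prefill, decode, and the KV cache concrete, here is a minimal single-head attention sketch in plain PyTorch. The dimensions and weights are toy values and the code has no relation to vLLM’s internals; it only shows that prefill fills the cache with the prompt’s K/V tensors, while each decode step computes Q for one new token and appends that token’s K/V to the cache.

```python
import torch

d = 64                                    # toy head dimension
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

def attend(q, K, V):
    # Scaled dot-product attention over everything cached so far.
    scores = (q @ K.T) / d ** 0.5
    return torch.softmax(scores, dim=-1) @ V

# Prefill: process all prompt tokens at once and fill the KV cache.
prompt = torch.randn(5, d)                # five prompt-token embeddings
K_cache, V_cache = prompt @ Wk, prompt @ Wv

# Decode: one new token per step, reusing the cached K/V.
new_tok = torch.randn(1, d)
q = new_tok @ Wq
K_cache = torch.cat([K_cache, new_tok @ Wk])   # append this token's K ...
V_cache = torch.cat([V_cache, new_tok @ Wv])   # ... and its V
out = attend(q, K_cache, V_cache)              # attends over prompt + new token
```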

High-Level Architecture

vLLM is composed of several collaborating modules:

  • AsyncLLM: Interfaces with the API server, submits requests asynchronously to the core engine, and handles tokenization and detokenization.
  • EngineCore: Runs the main inference busy loop, handling request scheduling and execution.
  • Scheduler: Decides which requests are batched and how many tokens each gets at every step.
  • ModelExecutor: Orchestrates model loading and execution across GPU worker processes (using Ray for multi-GPU deployments).
  • ModelRunner: Loads the model on each GPU and executes the forward pass over each batch.
  • KVCacheManager: Treats GPU memory as a paging system, allocating and tracking fixed-size KV cache blocks.

This modular design allows vLLM to operate robustly, scale horizontally, and deliver high throughput.
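
All of this machinery sits behind a small user-facing API. As a quick orientation before following a request through the system, here is the offline entry point (the model name is only an example; the serving path discussed below goes through the OpenAI-compatible server instead):

```python
from vllm import LLM, SamplingParams

# The LLM object wires up EngineCore, the Scheduler, the executors,
# and the KVCacheManager behind a single call.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["Explain continuous batching in one sentence."], params)
print(outputs[0].outputs[0].text)
```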

1. Receiving the Request – API Server and Async IPC

The lifecycle begins with an HTTP request, such as a POST to /v1/chat/completions, reaching the OpenAI-compatible API server. This server performs authentication, parses the request, and passes the data to the AsyncLLM engine via the generate() method.
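
Concretely, a client talks to this server exactly as it would talk to the OpenAI API. Assuming a server started with `vllm serve <model>` on the default port 8000 (the model name and prompt below are placeholders):

```python
from openai import OpenAI

# Point the standard OpenAI client at the locally running vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    messages=[{"role": "user", "content": "What is continuous batching?"}],
)
print(resp.choices[0].message.content)
```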

AsyncLLM tokenizes the prompt (turning text into model-ready token IDs), then forwards the request to EngineCore over asynchronous inter-process communication (IPC). Since AsyncLLM and EngineCore run as separate processes, vLLM sidesteps Python’s Global Interpreter Lock (GIL), letting CPU-heavy work such as tokenization and detokenization overlap with the GPU-driving busy loop.
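
The tokenization step itself is standard Hugging Face tokenization. Conceptually (again with a placeholder model name, and without vLLM’s internal plumbing), it amounts to:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

# Chat requests are first rendered through the model's chat template,
# then encoded into the token IDs that the engine batches and schedules.
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "What is continuous batching?"}],
    tokenize=False,
    add_generation_prompt=True,
)
token_ids = tok(prompt).input_ids
print(len(token_ids), token_ids[:8])
```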

2. Scheduling and Continuous Batching

The Scheduler is responsible for managing all incoming requests and orchestrating them into batches for efficient GPU use.

  • Continuous Batching: Rather than building static, fixed-size batches, vLLM rebuilds the batch at every scheduling step, admitting as many tokens as fit within a configurable token budget. This keeps the GPU busy while preserving fair request ordering.
  • Prefill vs. Decode: A new request starts in the prefill phase, where all of its prompt tokens are processed together. Once the prompt has been processed, the request moves to the decode phase, generating output tokens one per step.
  • KV Cache Blocks: The KVCacheManager allocates or retrieves KV cache blocks for requests as determined by the scheduler, keeping GPU memory usage flexible and efficient.

Example:
Suppose three requests arrive with 3, 5, and 12 prompt tokens, respectively, and the token budget is 10. In the first scheduling step, vLLM takes all 3 tokens from request #1, all 5 from #2, and the first 2 from #3, filling the budget exactly. The remaining 10 tokens of #3 are processed in subsequent steps, minimizing idle GPU time and balancing latency across requests.
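
A toy version of this token-budget loop might look like the sketch below. The names (`Request`, `schedule_step`, `token_budget`) are made up for illustration and are not vLLM’s scheduler API.

```python
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    remaining: int            # prompt tokens not yet prefilled

def schedule_step(waiting: list[Request], token_budget: int = 10) -> dict[int, int]:
    """Greedily hand out the token budget across waiting requests."""
    plan, budget = {}, token_budget
    for req in waiting:
        if budget == 0:
            break
        n = min(req.remaining, budget)
        if n > 0:
            plan[req.rid] = n
            req.remaining -= n
            budget -= n
    return plan

reqs = [Request(1, 3), Request(2, 5), Request(3, 12)]
print(schedule_step(reqs))    # {1: 3, 2: 5, 3: 2} -> matches the example above
print(schedule_step(reqs))    # {3: 10}            -> request #3 continues next step
```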

3. Model Execution – Forward Pass on the GPU

After scheduling, the ModelExecutor dispatches the batch to the ModelRunner on each GPU, which executes the forward pass:

  • All tokens from the selected batch are merged into a single tensor for highly parallel computation.
  • Attention tensors (K, V, Q) are computed at every transformer layer, leveraging FlashAttention for maximum speed.
  • For decode steps, logits from the model output are run through a decoding strategy (greedy, sampling, etc.) to select the next token.
  • The results for all sequences are written to an internal output queue.

Each iteration of this busy loop packs as much work as possible into every forward pass, which is what keeps GPU utilization and throughput high.
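
The decoding step mentioned above, turning the final-position logits into a concrete next token, is just greedy selection or temperature sampling. A generic PyTorch illustration (not vLLM’s actual sampler code):

```python
import torch

def pick_next_token(logits: torch.Tensor, temperature: float = 1.0) -> int:
    """Select the next token id from one sequence's final-position logits."""
    if temperature == 0.0:
        return int(torch.argmax(logits))                      # greedy decoding
    probs = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))       # temperature sampling

vocab_size = 32_000
logits = torch.randn(vocab_size)                  # stand-in for a model's output
print(pick_next_token(logits, temperature=0.0))   # deterministic
print(pick_next_token(logits, temperature=0.8))   # stochastic
```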

4. Output Processing – Streaming Tokens Back

  • AsyncLLM retrieves the generated tokens, detokenizes them, and returns them to the API server.
  • In streaming mode, partial outputs are sent immediately to the client as soon as tokens are ready, ensuring low latency for interactive applications.
  • In non-streaming mode, the complete result is gathered and sent once the request is finished.
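
On the client side, streaming simply means iterating over response chunks. With the same OpenAI-compatible setup as before (placeholder model name):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries newly detokenized text as it becomes available.
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```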

Key Design Innovations

  • Paged KV Cache: Breaking GPU memory into fixed-size blocks (pages) for the KV cache, as introduced in the earlier PagedAttention post, lets vLLM serve many concurrent requests without memory fragmentation or waste; a toy block allocator is sketched after this list.
  • Async, Multi-Process Architecture: The sharp separation between the API server, the engine, and GPU execution keeps CPU-bound work from stalling the GPUs.
  • Continuous Batching: This technique allows vLLM to approach GPU saturation even with varying request sizes and rates.
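
To make the paging analogy concrete, here is a toy block allocator. The block size, class name, and methods are invented for illustration and do not mirror vLLM’s KVCacheManager.

```python
class ToyBlockAllocator:
    """Maps each request to the physical KV cache blocks it occupies."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))      # pool of physical block ids
        self.block_tables: dict[str, list[int]] = {}    # per-request "page table"

    def append_token(self, request_id: str, tokens_so_far: int) -> None:
        table = self.block_tables.setdefault(request_id, [])
        # Grab a new physical block only when the current one is full.
        if tokens_so_far % self.block_size == 0:
            table.append(self.free_blocks.pop())

    def release(self, request_id: str) -> None:
        # A finished request returns its blocks to the pool immediately.
        self.free_blocks.extend(self.block_tables.pop(request_id, []))

alloc = ToyBlockAllocator(num_blocks=4)
for i in range(20):                        # 20 tokens -> ceil(20 / 16) = 2 blocks
    alloc.append_token("req-1", i)
print(alloc.block_tables["req-1"])         # two physical block ids
alloc.release("req-1")
```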

Conclusion

By following an inference request from API ingress to streamed token egress, we see how vLLM achieves high performance for LLM serving—efficiently batching workloads, maximizing GPU use, and delivering prompt results at scale.