
Proposed Solution

Worker Controller

Deep Dive

To overcome the cold-start latency inherent in the default vLLM architecture, we introduced a Worker Controller. This component fundamentally shifts the resource management paradigm from "process-per-model" to "process-as-a-resource."

Key Innovations

  • Worker Pooling: Instead of spawning new processes for every model, we maintain a persistent pool of pre-initialized GPU workers (processes with CUDA context already established).
  • Dynamic Binding: vLLM engines are modified to dynamically attach to these pre-warmed workers on demand, rather than creating them during initialization.
  • Inter-Process Communication (IPC): A robust IPC mechanism lets the runtime API servers issue commands to these pre-existing workers and exchange data with them efficiently (see the sketch after this list).
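
Below is a minimal, self-contained sketch of the pooling and IPC ideas above. It is illustrative only: the names (`WorkerPool`, `worker_main`) and the command protocol are assumptions, not vLLM's actual implementation, and it uses plain `multiprocessing` queues in place of a production IPC layer.

```python
import multiprocessing as mp


def worker_main(worker_id: int, cmd_queue: mp.Queue, result_queue: mp.Queue) -> None:
    """Runs in its own process: pay the CUDA initialization cost once, then wait for commands."""
    import torch  # imported in the child so each worker owns its own CUDA context

    if torch.cuda.is_available():
        torch.cuda.init()                # establish the CUDA context up front
        torch.zeros(1, device="cuda")    # touch the device to force context creation
    model = None
    while True:
        cmd, payload = cmd_queue.get()   # block until the controller sends a command
        if cmd == "load_model":
            model = payload              # placeholder: a real worker would load weights here
            result_queue.put((worker_id, f"loaded {payload}"))
        elif cmd == "shutdown":
            break


class WorkerPool:
    """Keeps `size` pre-warmed worker processes alive and hands them out on demand."""

    def __init__(self, size: int):
        self.cmd_queues = [mp.Queue() for _ in range(size)]
        self.result_queue = mp.Queue()
        self.free = list(range(size))
        self.procs = [
            mp.Process(target=worker_main, args=(i, q, self.result_queue), daemon=True)
            for i, q in enumerate(self.cmd_queues)
        ]
        for p in self.procs:
            p.start()

    def acquire(self) -> int:
        return self.free.pop()           # hand out an idle, already-initialized worker

    def send(self, worker_id: int, cmd: str, payload=None) -> None:
        self.cmd_queues[worker_id].put((cmd, payload))
```

The key point is that `worker_main` performs the expensive one-time setup (process start, CUDA context) before any model is requested, so binding a worker to an engine later only pays for weight loading.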

High Level Architecture


In this new architecture, the Worker Controller acts as the orchestrator:

  1. Initialization: At system startup, the Controller spawns a configurable number of "dummy" workers. These workers initialize their Python environments and CUDA contexts but do not load any model weights.
  2. Request Handling: When a request to serve a specific model (e.g., Llama-2-7b) arrives, the Controller identifies available workers from the pool.
  3. Assignment: The selected workers are assigned to the new Engine instance.
  4. Model Loading: The workers load the specific model weights. Note that while weight loading is still necessary, the expensive process startup and CUDA initialization overhead is completely eliminated. The end-to-end flow is sketched below.
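
Continuing the hypothetical sketch above, the four steps map onto a small controller class. Again, `WorkerController`, its `serve` method, and the model identifier are assumed for illustration and are not vLLM's actual API.

```python
class WorkerController:
    """Orchestrates the pre-warmed pool: spawn at startup, bind and load on request."""

    def __init__(self, pool_size: int):
        self.pool = WorkerPool(pool_size)        # step 1: spawn "dummy" workers with CUDA contexts

    def serve(self, model_name: str, num_workers: int = 1) -> list[int]:
        # steps 2-3: pick idle, pre-warmed workers and bind them to the new engine instance
        worker_ids = [self.pool.acquire() for _ in range(num_workers)]
        # step 4: only the weight load remains; process startup and CUDA init are already done
        for wid in worker_ids:
            self.pool.send(wid, "load_model", model_name)
        return worker_ids


if __name__ == "__main__":
    controller = WorkerController(pool_size=4)
    controller.serve("meta-llama/Llama-2-7b-hf", num_workers=1)
    print(controller.pool.result_queue.get())    # worker id plus a load acknowledgement
```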