
Proposed Solution

Worker Controller

Deep Dive

To overcome the cold-start latency inherent in the default vLLM architecture, we introduced a Worker Controller. This component fundamentally shifts the resource management paradigm from "process-per-model" to "process-as-a-resource."

Key Innovations

  • Worker Pooling: Instead of spawning new processes for every model, we maintain a persistent pool of pre-initialized GPU workers (processes with CUDA context already established).
  • Dynamic Binding: vLLM engines are modified to dynamically attach to these pre-warmed workers on demand, rather than creating them during initialization.
  • Inter-Process Communication (IPC): A robust IPC mechanism lets the runtime API servers issue commands to these pre-existing workers and exchange data with them efficiently (see the sketch after this list).
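
Below is a minimal, self-contained sketch of the pooling and IPC ideas above. It is illustrative only: the names (`WorkerPool`, `worker_main`) and the command protocol are assumptions, not vLLM's actual implementation, and it uses plain `multiprocessing` queues in place of a production IPC layer.

```python
import multiprocessing as mp


def worker_main(worker_id: int, cmd_queue: mp.Queue, result_queue: mp.Queue) -> None:
    """Runs in its own process: pay the CUDA initialization cost once, then wait for commands."""
    import torch  # imported in the child so each worker owns its own CUDA context

    if torch.cuda.is_available():
        torch.cuda.init()                # establish the CUDA context up front
        torch.zeros(1, device="cuda")    # touch the device to force context creation
    model = None
    while True:
        cmd, payload = cmd_queue.get()   # block until the controller sends a command
        if cmd == "load_model":
            model = payload              # placeholder: a real worker would load weights here
            result_queue.put((worker_id, f"loaded {payload}"))
        elif cmd == "shutdown":
            break


class WorkerPool:
    """Keeps `size` pre-warmed worker processes alive and hands them out on demand."""

    def __init__(self, size: int):
        self.cmd_queues = [mp.Queue() for _ in range(size)]
        self.result_queue = mp.Queue()
        self.free = list(range(size))
        self.procs = [
            mp.Process(target=worker_main, args=(i, q, self.result_queue), daemon=True)
            for i, q in enumerate(self.cmd_queues)
        ]
        for p in self.procs:
            p.start()

    def acquire(self) -> int:
        return self.free.pop()           # hand out an idle, already-initialized worker

    def send(self, worker_id: int, cmd: str, payload=None) -> None:
        self.cmd_queues[worker_id].put((cmd, payload))
```

The key point is that `worker_main` performs the expensive one-time setup (process start, CUDA context) before any model is requested, so binding a worker to an engine later only pays for weight loading.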

High Level Architecture


In this new architecture, the Worker Controller acts as the orchestrator:

  1. Initialization: At system startup, the Controller spawns a configurable number of "dummy" workers. These workers initialize their Python environments and CUDA contexts but do not load any model weights.
  2. Request Handling: When a request to serve a specific model (e.g., Llama-2-7b) arrives, the Controller identifies available workers from the pool.
  3. Assignment: The selected workers are assigned to the new Engine instance.
  4. Model Loading: The workers load the specific model weights. Note that while weight loading is still necessary, the expensive process startup and CUDA initialization overhead is completely eliminated. The end-to-end flow is sketched below.
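
Continuing the hypothetical sketch above, the four steps map onto a small controller class. Again, `WorkerController`, its `serve` method, and the model identifier are assumed for illustration and are not vLLM's actual API.

```python
class WorkerController:
    """Orchestrates the pre-warmed pool: spawn at startup, bind and load on request."""

    def __init__(self, pool_size: int):
        self.pool = WorkerPool(pool_size)        # step 1: spawn "dummy" workers with CUDA contexts

    def serve(self, model_name: str, num_workers: int = 1) -> list[int]:
        # steps 2-3: pick idle, pre-warmed workers and bind them to the new engine instance
        worker_ids = [self.pool.acquire() for _ in range(num_workers)]
        # step 4: only the weight load remains; process startup and CUDA init are already done
        for wid in worker_ids:
            self.pool.send(wid, "load_model", model_name)
        return worker_ids


if __name__ == "__main__":
    controller = WorkerController(pool_size=4)
    controller.serve("meta-llama/Llama-2-7b-hf", num_workers=1)
    print(controller.pool.result_queue.get())    # worker id plus a load acknowledgement
```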