📄️ Background
This project, my Final Year Project at NTU, re-engineers the vLLM worker-controller architecture to reduce cold-start latency in large language model inference. You can view the repository here
📄️ Proposed Solution
Worker Controller
📄️ Technical Implementation
The Worker Controller is built on Python's native support for process management and inter-process communication, primarily the multiprocessing module. This gives fine-grained control over GPU worker lifecycles and lets workers share resources efficiently.
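
To make the pattern concrete, here is a minimal sketch of a controller that owns one long-lived worker process and talks to it over a pipe. It is not the project's actual code: `worker_loop`, `WorkerController`, the `None` shutdown sentinel, and the `handled:` reply format are all hypothetical names chosen for illustration, and the `fork` start method (POSIX only) is assumed purely to keep the example short. No GPU work is done here; the worker just echoes requests.

```python
import multiprocessing as mp


def worker_loop(conn):
    """Hypothetical worker: stays alive and serves requests until told to stop."""
    while True:
        msg = conn.recv()      # blocks until the controller sends something
        if msg is None:        # None is this sketch's shutdown sentinel
            break
        conn.send(f"handled:{msg}")
    conn.close()


class WorkerController:
    """Hypothetical controller that owns one long-lived worker process."""

    def __init__(self):
        # Assumption: "fork" start method, available on POSIX systems only.
        ctx = mp.get_context("fork")
        self._conn, child_conn = ctx.Pipe()
        self._proc = ctx.Process(target=worker_loop, args=(child_conn,), daemon=True)
        self._proc.start()
        child_conn.close()     # the parent keeps only its own end of the pipe

    def submit(self, request):
        """Send one request to the already-running worker and wait for the reply."""
        self._conn.send(request)
        return self._conn.recv()

    def shutdown(self):
        """Ask the worker to exit, wait for it, and return its exit code."""
        self._conn.send(None)
        self._proc.join(timeout=5)
        self._conn.close()
        return self._proc.exitcode


if __name__ == "__main__":
    controller = WorkerController()
    print(controller.submit("request-1"))
    print(controller.submit("request-2"))
    print("exit code:", controller.shutdown())
```

The point of the sketch is that the worker process is started once and then reused across requests, so the per-request cost is a pipe round trip rather than a fresh process launch.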
📄️ Results
Summary