A from-scratch implementation of PagedAttention and continuous batching, serving TinyLlama on an A10G. Built on the same ideas as vLLM. 21.2x faster than HuggingFace at 64 concurrent requests.
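
To make the two ideas concrete, here is a minimal CPU-only sketch of how they can fit together. It is not this project's code: `BlockAllocator`, `Sequence`, `Scheduler`, and the `BLOCK_SIZE` value are hypothetical names and numbers chosen for illustration, and the model forward pass is left out. Each sequence keeps a block table that maps its logical KV-cache blocks to whatever physical blocks happen to be free (the paging part), and the scheduler admits waiting requests at every decode step rather than waiting for the whole batch to drain (the continuous batching part).

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV-cache block (assumed value for this sketch)


class BlockAllocator:
    """Hands out fixed-size physical KV-cache block ids from a free list."""

    def __init__(self, num_blocks: int) -> None:
        self.free: deque[int] = deque(range(num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("out of KV-cache blocks")
        return self.free.popleft()

    def release(self, blocks: list[int]) -> None:
        self.free.extend(blocks)


@dataclass
class Sequence:
    seq_id: int
    num_tokens: int  # tokens whose KV entries are cached (prompt + generated)
    max_tokens: int  # the sequence finishes when num_tokens reaches this
    block_table: list[int] = field(default_factory=list)  # logical -> physical

    def blocks_needed(self) -> int:
        return -(-self.num_tokens // BLOCK_SIZE)  # ceiling division


class Scheduler:
    """Admits and retires sequences at every decode step."""

    def __init__(self, allocator: BlockAllocator, max_batch: int) -> None:
        self.allocator = allocator
        self.max_batch = max_batch
        self.waiting: deque[Sequence] = deque()
        self.running: list[Sequence] = []

    def add(self, seq: Sequence) -> None:
        self.waiting.append(seq)

    def _grow(self, seq: Sequence) -> bool:
        """Extend the block table to cover num_tokens; False if memory is short."""
        need = seq.blocks_needed() - len(seq.block_table)
        if need > len(self.allocator.free):
            return False
        for _ in range(need):
            seq.block_table.append(self.allocator.allocate())
        return True

    def step(self) -> list[int]:
        """Run one decode iteration and return the ids of finished sequences."""
        # Fill free batch slots from the queue before decoding.
        while self.waiting and len(self.running) < self.max_batch:
            if not self._grow(self.waiting[0]):
                break
            self.running.append(self.waiting.popleft())

        finished: list[int] = []
        still_running: list[Sequence] = []
        for seq in self.running:
            seq.num_tokens += 1  # a real server would run the model here
            if seq.num_tokens >= seq.max_tokens:
                self.allocator.release(seq.block_table)
                seq.block_table = []
                finished.append(seq.seq_id)
            elif self._grow(seq):
                still_running.append(seq)
            else:
                # Out of blocks: preempt, free the memory, and requeue the
                # sequence so its cache is recomputed on re-admission.
                self.allocator.release(seq.block_table)
                seq.block_table = []
                self.waiting.appendleft(seq)
        self.running = still_running
        return finished
```

Calling `step()` in a loop drives the server: a short request that finishes early frees its blocks immediately, and the next waiting request takes its batch slot on the following step. Because blocks are allocated one at a time as a sequence grows, a sequence's block table does not need to be contiguous in physical memory.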