LLM Inference Engine

A from-scratch implementation of PagedAttention and continuous batching, serving TinyLlama on an A10G GPU. It is built on the same ideas as vLLM and runs 21.2x faster than HuggingFace at 64 concurrent requests.
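
To make the two ideas concrete, here is a minimal, CPU-only sketch of how a paged KV cache and a continuous-batching scheduler can fit together. It is not this project's code: `BlockAllocator`, `Sequence`, `step`, `run_to_completion`, and the `BLOCK_SIZE` value are hypothetical names and numbers chosen for illustration, the decode step is a placeholder that only counts tokens, and preemption when blocks run out is left out.

```python
from __future__ import annotations

from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV-cache block (assumed value, for illustration)


class BlockAllocator:
    """Hands out fixed-size physical KV-cache blocks from a free list."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free:
            # A real engine would preempt or swap a sequence here.
            raise MemoryError("out of KV-cache blocks")
        return self.free.pop()

    def release(self, blocks: list[int]) -> None:
        self.free.extend(blocks)


@dataclass
class Sequence:
    seq_id: int
    num_tokens: int      # tokens held in the KV cache (prompt + generated)
    max_new_tokens: int
    generated: int = 0
    # Maps logical block index -> physical block id (the "page table").
    block_table: list[int] = field(default_factory=list)

    def blocks_needed(self) -> int:
        return -(-self.num_tokens // BLOCK_SIZE)  # ceiling division


def step(running, waiting, allocator, max_batch):
    """One scheduler iteration; returns the ids of sequences that finished."""
    # Continuous batching: refill free batch slots on every iteration
    # instead of waiting for the whole batch to drain.
    while waiting and len(running) < max_batch:
        running.append(waiting.pop(0))

    finished = []
    for seq in running:
        # Placeholder for one decode step: the sequence grows by one token.
        seq.num_tokens += 1
        seq.generated += 1
        # Paged allocation: add a block only when the sequence outgrows
        # the blocks it already owns.
        while len(seq.block_table) < seq.blocks_needed():
            seq.block_table.append(allocator.allocate())
        if seq.generated >= seq.max_new_tokens:
            allocator.release(seq.block_table)
            seq.block_table = []
            finished.append(seq.seq_id)

    running[:] = [s for s in running if s.generated < s.max_new_tokens]
    return finished


def run_to_completion(waiting, allocator, max_batch):
    """Drive the scheduler until every sequence has finished."""
    running, order = [], []
    while running or waiting:
        order.extend(step(running, waiting, allocator, max_batch))
    return order
```

In this sketch a short request that finishes early frees its slot and its blocks immediately, so a waiting request can join on the next iteration. That is the property continuous batching relies on, and the per-sequence block table is what lets the cache grow in small fixed-size pieces rather than one large contiguous reservation.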
