Day 95 of 133
Inference optimization: quantization, distillation, pruning, batching
PTQ vs QAT, GPTQ/AWQ, continuous batching.
DSA · NeetCode Math & Geometry
- Pow(x, n)
Interview questions to prep
- Where do integer overflow, negative input, and zero hide here, and how do you guard against them?
- Can you derive a closed-form solution, and how does it compare to the iterative one?
- Walk through edge cases: 0, 1, max int, min int, negative input (see the sketch below).
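A minimal sketch of the standard answer, binary (fast) exponentiation, with the edge cases above handled explicitly; the function name `my_pow` and the asserts are my own illustration:

```python
def my_pow(x: float, n: int) -> float:
    """Compute x**n in O(log |n|) multiplications via binary exponentiation."""
    # Guard: in C/Java, negating n == INT_MIN overflows a 32-bit int,
    # so handle the sign via 1/x before the loop.
    if n < 0:
        x = 1.0 / x
        n = -n
    result = 1.0
    while n:
        if n & 1:        # low bit set: fold the current square into the result
            result *= x
        x *= x           # square the base
        n >>= 1          # move to the next bit of the exponent
    return result

assert my_pow(2.0, 10) == 1024.0
assert my_pow(2.0, -2) == 0.25
assert my_pow(1.0, -2**31) == 1.0   # the INT_MIN edge case
```

For the closed-form comparison: x^n = exp(n * ln x) for x > 0 trades exact integer-power semantics for a single transcendental call and floating-point error.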
Infra · Inference optimization
Interview questions to prep
- Compare post-training quantization (PTQ) vs quantization-aware training (QAT).
- How do GPTQ and AWQ work, and how much quality do you typically lose?
- Why can INT8 or INT4 quantization maintain quality while reducing memory and latency?
- Why are INT1 / 1-bit approaches much harder to deploy than standard integer quantization?
- Walk through quantize and dequantize math: scale, zero point, clipping, and calibration (sketched below).
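A NumPy sketch of asymmetric per-tensor INT8 quantization, calibrating scale and zero point from the observed min/max; all names here are illustrative, not any library's API:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Asymmetric per-tensor INT8: q = clip(round(x / scale) + zero_point, -128, 127)."""
    qmin, qmax = -128, 127
    scale = (x.max() - x.min()) / (qmax - qmin)      # map the real range onto 256 levels
    zero_point = int(round(qmin - x.min() / scale))  # integer that represents real 0.0
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point                      # (assumes x.max() > x.min())

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    return scale * (q.astype(np.float32) - zero_point)

x = np.random.randn(4096).astype(np.float32)
q, s, zp = quantize_int8(x)
print("max abs error:", np.abs(x - dequantize(q, s, zp)).max())  # bounded by ~scale/2
```

PTQ calibrates `scale` and `zero_point` from a small sample after training; QAT simulates exactly this round/clip inside the forward pass during training so the weights learn to tolerate it.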
Interview questions to prep
- Walk me through knowledge distillation — what is the soft-target loss?
- Why does temperature in the soft target matter, and how do you pick it? (See the loss sketch below.)
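A PyTorch sketch of the combined distillation loss; the `T` and `alpha` defaults are illustrative choices, not canonical values:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """alpha * hard CE(student, labels) + (1 - alpha) * T^2 * KL(teacher_T || student_T)."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),  # student log-probs at temperature T
        F.softmax(teacher_logits / T, dim=-1),      # softened teacher targets
        reduction="batchmean",
    ) * (T * T)  # T^2 rescales gradients so the soft term doesn't vanish as T grows
    return alpha * hard + (1.0 - alpha) * soft

student = torch.randn(8, 100, requires_grad=True)
teacher = torch.randn(8, 100)
labels = torch.randint(0, 100, (8,))
distillation_loss(student, teacher, labels).backward()
```

Higher T flattens the teacher distribution, exposing the relative probabilities of wrong classes ("dark knowledge"); T is normally tuned on a validation set, with small values like 2 to 5 as common starting points.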
Interview questions to prep
- Compare structured vs unstructured pruning — which actually speeds up inference?
- Why does unstructured pruning rarely move latency on GPU, even when sparsity is high? (See the sketch below.)
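A NumPy sketch contrasting the two (function names are mine): unstructured pruning zeros individual weights but keeps the dense shape, while structured pruning removes whole rows so the matmul itself shrinks:

```python
import numpy as np

def unstructured_prune(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero the smallest-|w| entries; shape is unchanged, so a dense GPU
    kernel still does all the FLOPs and latency barely moves."""
    k = int(w.size * sparsity)
    threshold = np.partition(np.abs(w), k, axis=None)[k]
    return np.where(np.abs(w) < threshold, 0.0, w)

def structured_prune(w: np.ndarray, keep_rows: int) -> np.ndarray:
    """Keep the output rows (neurons) with the largest L2 norm; the layer
    genuinely gets smaller, which is what speeds up dense inference."""
    keep = np.argsort(np.linalg.norm(w, axis=1))[-keep_rows:]
    return w[np.sort(keep)]

w = np.random.randn(512, 512).astype(np.float32)
print(unstructured_prune(w, 0.9).shape)  # (512, 512): 90% zeros, same matmul cost
print(structured_prune(w, 256).shape)    # (256, 512): half the rows, half the FLOPs
```

The main GPU exception is hardware-supported semi-structured sparsity (e.g., NVIDIA's 2:4 pattern), which sparse tensor cores can actually accelerate.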
Interview questions to prep
- How does continuous batching beat static batching for LLM serving? (See the toy scheduler below.)
- What's the trade-off between max batch size and per-request latency under continuous batching?
- An internal Llama 3 8B assistant suddenly hits thousands of requests per second. What serving changes would you prioritize before buying more GPUs?
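A toy, step-level simulation of the idea behind continuous (in-flight) batching; real servers like vLLM schedule at this token granularity, but everything below is a deliberately simplified illustration:

```python
from collections import deque

def serve(requests, max_batch=4):
    """requests: list of (request_id, tokens_to_generate). Each loop iteration
    is one decode step; finished sequences leave the batch immediately and a
    waiting request takes the freed slot, instead of the whole static batch
    draining before new work is admitted."""
    waiting, running, step = deque(requests), {}, 0
    while waiting or running:
        while waiting and len(running) < max_batch:  # admit into free slots every step
            rid, toks = waiting.popleft()
            running[rid] = toks
        for rid in list(running):                    # one token per running sequence
            running[rid] -= 1
            if running[rid] == 0:
                print(f"step {step:2d}: {rid} finished")
                del running[rid]                     # slot frees mid-flight
        step += 1
    return step

serve([("a", 3), ("b", 8), ("c", 2), ("d", 5), ("e", 4)])
```

In the trace, e is admitted the step after c's slot frees; a static batch would hold e until b, the slowest member, drained. The trade-off: a larger max batch raises throughput but stretches per-step time, so every in-flight request's token latency grows.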
Interview questions to prep
- Where do OpenVINO, ONNX Runtime, and TensorRT fit relative to model-level quantization?
- How would you choose an inference runtime for CPU, edge, and NVIDIA GPU deployments?
- What benchmarking would prove that a runtime optimization improved real P95 latency rather than only microbenchmarks? (See the harness below.)
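A minimal harness for the last question: warm up, then measure end-to-end request latency and report p50/p95. `run_request` is whatever callable issues one real request; the harness itself is illustrative:

```python
import time

def latency_percentiles(run_request, n_warmup=20, n_iters=200):
    """Warm up (caches, JIT, CUDA graphs), then time n_iters full requests
    and report (p50, p95) in milliseconds. An optimization that wins a
    kernel microbenchmark but doesn't move this p95 hasn't shipped anything."""
    for _ in range(n_warmup):
        run_request()
    samples = []
    for _ in range(n_iters):
        t0 = time.perf_counter()
        run_request()
        samples.append((time.perf_counter() - t0) * 1e3)
    samples.sort()
    return samples[len(samples) // 2], samples[int(0.95 * len(samples)) - 1]

# usage with a hypothetical client:
# p50, p95 = latency_percentiles(lambda: client.generate(prompt))
```

Measure at a realistic concurrency and prompt-length mix; a single-stream loop like this is only the baseline.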
Interview questions to prep
- Why does KV-cache memory become the bottleneck for long-context LLM serving? (See the arithmetic below.)
- How does PagedAttention reduce memory fragmentation compared with a naive KV cache?
- What serving knobs would you tune when long prompts cause out-of-memory errors?
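Back-of-envelope KV-cache arithmetic makes the bottleneck concrete; the Llama 3 8B shapes below (32 layers, 8 KV heads via GQA, head_dim 128) are public, and fp16 storage is assumed:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    """2 tensors (K and V) * layers * kv_heads * head_dim * dtype_bytes per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

per_request = kv_cache_bytes(32, 8, 128, seq_len=8192, batch=1)
print(f"{per_request / 2**30:.2f} GiB per 8k-token request")            # 1.00 GiB
print(f"{40 * per_request / 2**30:.0f} GiB for 40 concurrent requests")  # 40 GiB
```

PagedAttention allocates this memory in fixed-size blocks on demand instead of reserving max_seq_len per request up front, so fragmentation and over-reservation stop eating the GPU.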
References & further reading
- vLLM documentation
- MIT — ML Efficiency playlist
- 75Hard GenAI/LLM Challenge — LLM quantization complete guide
- OpenVINO documentation (Intel)