Day 95 of 133

Inference optimization: quantization, distillation, pruning, batching

PTQ vs QAT, GPTQ/AWQ, continuous batching.

DSA · NeetCode Math & Geometry

  • Pow(x, n)

    Interview questions to prep (solution sketch after the list)

    1. Where does integer overflow / negative input / zero hide here, and how do you guard against it?
    2. Can you derive a closed-form solution, and how does it compare to the iterative one?
    3. Walk through edge cases: 0, 1, max int, min int, negative input.
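
    A minimal sketch of the usual answer, iterative binary exponentiation; the function name and edge-case handling are my own choices, not a reference solution:

      def my_pow(x: float, n: int) -> float:
          """Compute x**n via binary exponentiation in O(log |n|) multiplications."""
          if n < 0:
              # Python ints don't overflow, so negating n is safe; in fixed-width
              # languages, -INT_MIN overflows and needs special handling.
              x, n = 1.0 / x, -n          # raises ZeroDivisionError for 0 ** negative
          result = 1.0
          while n:
              if n & 1:                   # low bit set: fold this power into the result
                  result *= x
              x *= x                      # square the base
              n >>= 1                     # move to the next bit of the exponent
          return result

      # Edge cases from question 3:
      assert my_pow(2.0, 10) == 1024.0
      assert my_pow(2.0, 0) == 1.0
      assert my_pow(2.0, -2) == 0.25
      assert my_pow(1.0, 2**31 - 1) == 1.0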

Infra · Inference optimization

  • Quantization · Interview questions to prep (quantize/dequantize sketch below)

    1. Compare post-training quantization (PTQ) vs quantization-aware training (QAT).
    2. How do GPTQ and AWQ work, and how much quality do you typically lose?
    3. Why can INT8 or INT4 quantization maintain quality while reducing memory and latency?
    4. Why are INT1 / 1-bit approaches much harder to deploy than standard integer quantization?
    5. Walk through quantize and dequantize math: scale, zero point, clipping, and calibration.
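
    For question 5, a minimal NumPy sketch of asymmetric min-max PTQ with a per-tensor scale and zero point; real toolchains use per-channel scales and smarter calibration:

      import numpy as np

      def calibrate(x: np.ndarray, num_bits: int = 8):
          """Min-max calibration: map [x_min, x_max] onto the integer grid."""
          qmin, qmax = 0, 2 ** num_bits - 1                 # unsigned INT8 grid: 0..255
          x_min = min(float(x.min()), 0.0)                  # keep real zero exactly representable
          x_max = max(float(x.max()), 0.0)
          scale = (x_max - x_min) / (qmax - qmin)
          zero_point = int(round(qmin - x_min / scale))
          return scale, zero_point, qmin, qmax

      def quantize(x, scale, zero_point, qmin, qmax):
          q = np.round(x / scale) + zero_point
          return np.clip(q, qmin, qmax).astype(np.uint8)    # clipping = saturate outliers

      def dequantize(q, scale, zero_point):
          return (q.astype(np.float32) - zero_point) * scale

      x = np.random.randn(4096).astype(np.float32)          # pretend these are layer weights
      scale, zp, qmin, qmax = calibrate(x)
      x_hat = dequantize(quantize(x, scale, zp, qmin, qmax), scale, zp)
      print(f"scale={scale:.5f} zero_point={zp} max |error|={np.abs(x - x_hat).max():.5f}")
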
  • Distillation · Interview questions to prep (loss sketch below)

    1. Walk me through knowledge distillation — what is the soft-target loss?
    2. Why does temperature in the soft target matter, and how do you pick it?
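
    A short PyTorch sketch of the standard Hinton-style soft-target loss; the temperature T and mixing weight alpha shown are illustrative defaults, not recommendations:

      import torch
      import torch.nn.functional as F

      def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
          """Hinton-style KD: KL between temperature-softened distributions + hard-label CE."""
          soft = F.kl_div(
              F.log_softmax(student_logits / T, dim=-1),     # student log-probs at temperature T
              F.softmax(teacher_logits / T, dim=-1),         # teacher "soft targets"
              reduction="batchmean",
          ) * (T * T)                                        # rescale so gradients stay comparable across T
          hard = F.cross_entropy(student_logits, labels)     # ordinary supervised loss
          return alpha * soft + (1.0 - alpha) * hard

      # Toy usage: batch of 8 examples, 10 classes.
      student_logits = torch.randn(8, 10, requires_grad=True)
      teacher_logits = torch.randn(8, 10)
      labels = torch.randint(0, 10, (8,))
      distillation_loss(student_logits, teacher_logits, labels).backward()
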
  • Pruning · Interview questions to prep (toy comparison below)

    1. Compare structured vs unstructured pruning — which actually speeds up inference?
    2. Why does unstructured pruning rarely move latency on GPU, even when sparsity is high?
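
    A toy NumPy contrast for questions 1 and 2: unstructured pruning keeps the dense shape (and the dense-kernel FLOPs), while structured pruning actually removes rows:

      import numpy as np

      rng = np.random.default_rng(0)
      W = rng.standard_normal((512, 512)).astype(np.float32)    # weights of one linear layer

      # Unstructured: zero out the 90% smallest-magnitude weights. The shape is
      # unchanged, so a dense GPU kernel runs the exact same FLOPs; latency only
      # moves with sparse kernels or hardware patterns such as 2:4 sparsity.
      threshold = np.quantile(np.abs(W), 0.90)
      W_unstructured = np.where(np.abs(W) >= threshold, W, 0.0)
      print(W_unstructured.shape)                 # (512, 512): same matmul as before

      # Structured: drop the half of the output channels (rows) with the smallest
      # L2 norm. The tensor actually shrinks, so every dense matmul gets cheaper.
      row_norms = np.linalg.norm(W, axis=1)
      keep = np.sort(np.argsort(row_norms)[W.shape[0] // 2:])
      W_structured = W[keep, :]
      print(W_structured.shape)                   # (256, 512): half the FLOPs
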
  • Batching · Interview questions to prep (toy scheduler below)

    1. How does continuous batching beat static batching for LLM serving?
    2. What's the trade-off between max batch size and per-request latency under continuous batching?
    3. An internal Llama 3 8B assistant suddenly hits thousands of requests per second. What serving changes would you prioritize before buying more GPUs?
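
    A toy iteration-level scheduler to make question 1 concrete; it ignores prefill, KV memory, and real kernels, which production servers such as vLLM handle on top of this idea:

      from collections import deque
      from dataclasses import dataclass

      @dataclass
      class Request:
          id: int
          tokens_left: int                       # decode steps still needed

      def serve(requests, max_batch_size):
          """Iteration-level scheduling: admit and retire requests every decode step."""
          waiting, active, step = deque(requests), [], 0
          while waiting or active:
              # Key difference from static batching: freed slots are refilled
              # immediately instead of waiting for the whole batch to drain.
              while waiting and len(active) < max_batch_size:
                  active.append(waiting.popleft())
              for req in active:                 # one decode iteration: one token each
                  req.tokens_left -= 1
              finished = [r.id for r in active if r.tokens_left == 0]
              active = [r for r in active if r.tokens_left > 0]
              if finished:
                  print(f"step {step}: finished {finished}")
              step += 1
          return step

      # Mixed output lengths: short requests exit early and free their slots at once.
      reqs = [Request(i, n) for i, n in enumerate([3, 50, 5, 40, 4, 45])]
      print("total decode steps:", serve(reqs, max_batch_size=4))
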
  • Runtimes · Interview questions to prep (benchmark harness below)

    1. Where do OpenVINO, ONNX Runtime, and TensorRT fit relative to model-level quantization?
    2. How would you choose an inference runtime for CPU, edge, and NVIDIA GPU deployments?
    3. What benchmarking would prove that a runtime optimization improved real P95 latency rather than only microbenchmarks?
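
    For question 3, a small benchmarking harness that reports P95 rather than mean latency; the onnxruntime calls are real, but the model path, input name, and shape are placeholders:

      import time
      import numpy as np
      import onnxruntime as ort        # runtime is real; model path / input name below are placeholders

      def latency_stats_ms(session, feed, warmup=50, iters=500):
          """Run the session repeatedly and report mean and P95 wall-clock latency."""
          for _ in range(warmup):                          # let allocators and caches settle
              session.run(None, feed)
          samples = []
          for _ in range(iters):
              t0 = time.perf_counter()
              session.run(None, feed)
              samples.append((time.perf_counter() - t0) * 1e3)
          return float(np.mean(samples)), float(np.percentile(samples, 95))

      # Hypothetical model and production-shaped input; swap in the real ones.
      sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
      feed = {"input": np.random.randn(1, 3, 224, 224).astype(np.float32)}
      mean_ms, p95_ms = latency_stats_ms(sess, feed)
      print(f"mean {mean_ms:.2f} ms | P95 {p95_ms:.2f} ms")
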
  • KV cache · Interview questions to prep (footprint math below)

    1. Why does KV-cache memory become the bottleneck for long-context LLM serving?
    2. How does PagedAttention reduce memory fragmentation compared with a naive KV cache?
    3. What serving knobs would you tune when long prompts cause out-of-memory errors?
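
    Back-of-envelope math for question 1, assuming the commonly cited Llama 3 8B attention shape (verify the exact config before quoting it):

      def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
          """Keys + values, across all layers and KV heads, per prompt or generated token."""
          return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

      # Commonly cited Llama 3 8B shape: 32 layers, 8 KV heads (GQA), head_dim 128, fp16.
      per_token = kv_cache_bytes_per_token(32, 8, 128)
      print(per_token // 1024, "KiB per token")                          # 128 KiB
      print(per_token * 8192 / 2**30, "GiB per 8k-token sequence")       # ~1.0 GiB
      print(per_token * 8192 * 64 / 2**30, "GiB for 64 such sequences")  # ~64 GiB of KV cache alone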

References & further reading