Day 97 of 133

Infra mocks: pick a serving design and defend the cost/SLO trade-offs

30-minute self-mock: 'serve a 70B model at 100 QPS on a $X budget with P99 < Y.'

DSA · NeetCode 2-D DP

  • Interleaving String

    Interview questions to prep (sketch after the list)

    1. State the 2-D DP: indices, recurrence, base case. What's the order of fill?
    2. Can you reduce 2-D to 1-D by reusing rows or columns? Walk through the dependency direction.
    3. Top-down with memoization vs bottom-up — which is easier to reason about, and which is faster in practice?
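
    A minimal sketch for questions 1 and 2 (this is LeetCode 97, Interleaving String): dp[i][j] records whether s3[:i+j] interleaves s1[:i] and s2[:j], and since each cell depends only on the cell above and the cell to the left, one reusable row suffices.

    ```python
    def is_interleave(s1: str, s2: str, s3: str) -> bool:
        """dp[j] = can s3[:i+j] be built by interleaving s1[:i] and s2[:j]?"""
        m, n = len(s1), len(s2)
        if m + n != len(s3):
            return False
        dp = [False] * (n + 1)
        dp[0] = True
        for j in range(1, n + 1):                     # base row: s1 is empty
            dp[j] = dp[j - 1] and s2[j - 1] == s3[j - 1]
        for i in range(1, m + 1):
            dp[0] = dp[0] and s1[i - 1] == s3[i - 1]  # base column: s2 is empty
            for j in range(1, n + 1):
                from_above = dp[j] and s1[i - 1] == s3[i + j - 1]     # old row i-1
                from_left = dp[j - 1] and s2[j - 1] == s3[i + j - 1]  # new row i
                dp[j] = from_above or from_left
        return dp[n]
    ```

    Fill order is row-major: before dp[j] is overwritten it still holds the row i-1 value, which is exactly the "above" dependency, so the 1-D reduction falls out for free.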

MLOps · Serving infrastructure

  • Serving frameworks · interview questions to prep (sketch below)

    1. Compare TorchServe, Triton, and BentoML — when does each fit?
    2. What is dynamic batching and why does it matter for GPU utilization?
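
    A toy version of the idea in question 2: requests that arrive within a short window are fused into one forward pass, so kernel-launch overhead and weight reads are amortized across the batch. The request dict, future, and run_model here are illustrative stand-ins; servers like Triton and TorchServe implement this natively via their batching configs.

    ```python
    import queue
    import time

    def dynamic_batcher(requests: queue.Queue, run_model,
                        max_batch: int = 32, max_wait_ms: float = 5.0):
        """Collect requests until the batch is full or the wait budget expires."""
        while True:
            batch = [requests.get()]               # block for the first request
            deadline = time.monotonic() + max_wait_ms / 1000.0
            while len(batch) < max_batch:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(requests.get(timeout=remaining))
                except queue.Empty:
                    break
            outputs = run_model([r["input"] for r in batch])  # one fused GPU pass
            for req, out in zip(batch, outputs):
                req["future"].set_result(out)      # fan results back per request
    ```

    The two knobs (max_batch, max_wait_ms) are the throughput/latency dial: a longer wait builds bigger batches and better GPU utilization at the cost of queueing delay.
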
  • Kubernetes deployment · interview questions to prep (sketch below)

    1. Walk through deploying a model on Kubernetes with autoscaling.
    2. When do you reach for KServe vs custom Deployment + HPA?
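
    For question 1, the one formula worth stating cold is the HPA scaling rule from the Kubernetes docs; a worked example:

    ```python
    import math

    def hpa_desired_replicas(current_replicas: int, current_metric: float,
                             target_metric: float) -> int:
        """Core HPA rule: desired = ceil(current * currentMetric / targetMetric)."""
        return math.ceil(current_replicas * current_metric / target_metric)

    # 4 pods running at 90% utilization against a 60% target -> scale out to 6.
    assert hpa_desired_replicas(4, 90, 60) == 6
    ```
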
  • Rollbacks & guardrails · interview questions to prep (sketch below)

    1. How would you do a safe rollback when a new model regresses online metrics?
    2. What guardrail metrics would automatically trigger a rollback without a human in the loop?
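
    One shape an answer to question 2 could take, as a sketch with hypothetical metric names and thresholds: hard ceilings on latency and errors, plus a relative quality floor against the baseline arm, checked continuously during the canary.

    ```python
    # Hypothetical guardrails for a canary rollout; names and limits illustrative.
    GUARDRAILS = {
        "p99_latency_ms": {"max": 800.0},              # absolute ceiling
        "error_rate":     {"max": 0.01},               # absolute ceiling
        "quality_proxy":  {"min_vs_baseline": 0.98},   # relative to control arm
    }

    def within_guardrails(canary: dict, baseline: dict) -> bool:
        """True iff the canary stays inside every guardrail."""
        for metric, rule in GUARDRAILS.items():
            if "max" in rule and canary[metric] > rule["max"]:
                return False
            if ("min_vs_baseline" in rule
                    and canary[metric] < rule["min_vs_baseline"] * baseline[metric]):
                return False
        return True

    def evaluate_canary(canary: dict, baseline: dict, rollback) -> None:
        if not within_guardrails(canary, baseline):
            rollback()   # shift all traffic back to the previous model version
    ```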

Infra · Inference optimization

  • Quantization · interview questions to prep (sketch below)

    1. Compare post-training quantization (PTQ) vs quantization-aware training (QAT).
    2. How do GPTQ and AWQ work, and what quality do you lose?
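
    A minimal PTQ sketch for question 1: symmetric per-tensor int8 calibrated only from the weight range, with round-trip error as a crude quality probe. GPTQ and AWQ refine exactly this rounding step (Hessian-aware error compensation and activation-aware channel scaling, respectively); QAT instead simulates the rounding inside training so the weights adapt to it.

    ```python
    import numpy as np

    def ptq_int8(w: np.ndarray):
        """Symmetric per-tensor PTQ: one fp scale, weights rounded to int8."""
        scale = np.abs(w).max() / 127.0
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    w = np.random.randn(1024, 1024).astype(np.float32)
    q, scale = ptq_int8(w)
    w_hat = q.astype(np.float32) * scale          # dequantize
    print("mean abs round-trip error:", np.abs(w - w_hat).mean())
    ```
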
  • Distillation · interview questions to prep (sketch below)

    1. Walk me through knowledge distillation — what is the soft-target loss?
    2. Why does temperature in the soft target matter, and how do you pick it?
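
    The standard Hinton-style soft-target loss for question 1, as a PyTorch sketch. The T² factor is half the answer to question 2: soft-target gradients shrink roughly as 1/T², so scaling by T² keeps them on the same footing as the hard loss while T is tuned.

    ```python
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels,
                          T: float = 4.0, alpha: float = 0.7):
        """alpha * T^2 * KL(teacher_T || student_T) + (1 - alpha) * CE(labels)."""
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),   # student log-probs at T
            F.softmax(teacher_logits / T, dim=-1),       # teacher probs at T
            reduction="batchmean",
        ) * (T * T)
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard
    ```
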
  • Pruning · interview questions to prep (sketch below)

    1. Compare structured vs unstructured pruning — which actually speeds up inference?
    2. Why does unstructured pruning rarely move latency on GPU, even when sparsity is high?
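
    A NumPy sketch of the contrast in questions 1 and 2: the unstructured mask leaves the matrix shape, and therefore the dense GEMM, untouched, while structured row pruning actually shrinks the multiply. Scattered zeros only pay off with sparse kernels or hardware patterns like 2:4 sparsity.

    ```python
    import numpy as np

    w = np.random.randn(1024, 1024).astype(np.float32)

    # Unstructured: zero the smallest 90% of weights by magnitude. Same shape,
    # so a dense matmul does exactly the same work -- no latency win by itself.
    thresh = np.quantile(np.abs(w), 0.90)
    w_unstructured = np.where(np.abs(w) >= thresh, w, 0.0)
    print(w_unstructured.shape)        # (1024, 1024): FLOPs unchanged

    # Structured: keep only the 512 output rows with the largest L2 norm.
    # The matrix shrinks, so every downstream matmul gets cheaper.
    keep = np.sort(np.argsort(-np.linalg.norm(w, axis=1))[:512])
    w_structured = w[keep]
    print(w_structured.shape)          # (512, 1024): half the FLOPs
    ```
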
  • Continuous batching · interview questions to prep (sketch below)

    1. How does continuous batching beat static batching for LLM serving?
    2. What's the trade-off between max batch size and per-request latency under continuous batching?
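
    A toy scheduler for question 1, in the spirit of iteration-level scheduling (Orca, vLLM); the Request class and decode_step are illustrative stand-ins. The batch is re-formed at every decode step, so a finished sequence frees its slot immediately instead of idling until the slowest member of a static batch drains.

    ```python
    from dataclasses import dataclass

    @dataclass
    class Request:
        remaining: int                    # decode steps left for this sequence

        def done(self) -> bool:
            return self.remaining <= 0

    def decode_step(running: list) -> None:
        """Stand-in for one fused decode step over all running sequences."""
        for r in running:
            r.remaining -= 1

    def serve(waiting: list, max_batch: int = 64) -> int:
        running, steps = [], 0
        while waiting or running:
            while waiting and len(running) < max_batch:   # backfill freed slots
                running.append(waiting.pop(0))
            decode_step(running)
            running = [r for r in running if not r.done()]
            steps += 1
        return steps

    # Short requests exit early and new work backfills, rather than everyone
    # waiting on the 100-step sequence as in a static batch.
    print(serve([Request(5), Request(100), Request(5), Request(5)], max_batch=2))
    ```
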
  • KV cache & PagedAttention · interview questions to prep (sketch below)

    1. Why does KV-cache memory become the bottleneck for long-context LLM serving?
    2. How does PagedAttention reduce memory fragmentation compared with a naive KV cache?
    3. What serving knobs would you tune when long prompts cause out-of-memory errors?
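
    Back-of-envelope math for question 1, assuming a Llama-2-70B-like shape (80 layers, GQA with 8 KV heads, head_dim 128, fp16): at roughly 320 KB of cache per token, context length rather than weights is what eats the headroom as prompts grow. PagedAttention attacks the fragmentation half of the problem by allocating this cache in fixed-size blocks on demand instead of reserving max_seq_len up front.

    ```python
    def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                       seq_len: int, batch: int, bytes_per_elem: int = 2) -> int:
        """2 (K and V) * layers * kv_heads * head_dim * tokens * bytes, per batch."""
        return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

    # Llama-2-70B-like shape, fp16, one 4k-token sequence: ~1.25 GiB of cache.
    per_seq = kv_cache_bytes(80, 8, 128, seq_len=4096, batch=1)
    print(per_seq / 2**30, "GiB")
    ```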

References & further reading