Day 97 of 133
Infra mocks: pick a serving design and defend its cost/SLO trade-offs
30-minute self-mock: 'serve a 70B model at 100 QPS within a $X budget and P99 latency under Y'.
DSA · NeetCode 2-D DP
- Interleaving String
Interview questions to prep
- State the 2-D DP: indices, recurrence, base case. What's the order of fill?
- Can you reduce 2-D to 1-D by reusing rows or columns? Walk through the dependency direction (see the sketch after this list).
- Top-down with memoization vs bottom-up — which is easier to reason about, and which is faster in practice?
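A minimal sketch of the Interleaving String DP, assuming the usual LeetCode signature (s1, s2, s3): dp[i][j] means s3[:i+j] can be formed by interleaving s1[:i] and s2[:j]. Each cell depends only on the cell above (take a char from s1) and the cell to the left (take a char from s2), so the 2-D table collapses to a single row filled left to right.

```python
def is_interleave(s1: str, s2: str, s3: str) -> bool:
    # dp[j] after processing row i means: s3[:i+j] is an interleaving
    # of s1[:i] and s2[:j]. Row i reads only row i-1 (above) and the
    # current row (left), so one rolling row suffices.
    m, n = len(s1), len(s2)
    if m + n != len(s3):
        return False

    dp = [False] * (n + 1)
    dp[0] = True
    for j in range(1, n + 1):                        # base row: consume s2 only
        dp[j] = dp[j - 1] and s2[j - 1] == s3[j - 1]

    for i in range(1, m + 1):
        dp[0] = dp[0] and s1[i - 1] == s3[i - 1]     # base column: consume s1 only
        for j in range(1, n + 1):
            from_above = dp[j] and s1[i - 1] == s3[i + j - 1]      # char taken from s1
            from_left = dp[j - 1] and s2[j - 1] == s3[i + j - 1]   # char taken from s2
            dp[j] = from_above or from_left
    return dp[n]
```

Fill order is row by row, left to right, so every dependency is computed before it is read; that ordering is exactly what the first question above asks you to state.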
MLOps · Serving infrastructure
Interview questions to prep
- Compare TorchServe, Triton, and BentoML — when does each fit?
- What is dynamic batching and why does it matter for GPU utilization?
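To make the dynamic-batching answer concrete, here is a toy Python sketch, not any framework's real implementation; the names `max_batch_size` and `max_wait_ms` are illustrative knobs. Requests queue up and are flushed either when a full batch is ready or when the oldest request has waited too long, trading a small latency hit for far better GPU utilization per kernel launch.

```python
import time
from queue import Queue, Empty

def dynamic_batch_loop(request_queue: Queue, run_model, max_batch_size=32, max_wait_ms=5.0):
    """Toy dynamic batcher: gather requests until the batch is full or the
    oldest request has waited max_wait_ms, then run a single forward pass."""
    while True:
        batch = [request_queue.get()]                 # block until at least one request arrives
        deadline = time.monotonic() + max_wait_ms / 1000.0
        while len(batch) < max_batch_size:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(request_queue.get(timeout=remaining))
            except Empty:
                break
        run_model(batch)                              # one kernel launch amortized over the batch
```

The two parameters are the interview answer in miniature: a larger batch bound raises throughput, a longer wait bound raises tail latency.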
Interview questions to prep
- Walk through deploying a model on Kubernetes with autoscaling.
- When do you reach for KServe vs custom Deployment + HPA?
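For the autoscaling walkthrough, the core HorizontalPodAutoscaler scaling rule from the Kubernetes docs is worth having memorized; the metric and numbers below are illustrative, and stabilization windows/tolerances are omitted.

```python
import math

# HPA core rule: desired = ceil(current_replicas * current_metric / target_metric)
current_replicas = 4
current_qps_per_replica = 160     # observed load (illustrative custom metric)
target_qps_per_replica = 100      # target configured on the HPA
desired_replicas = math.ceil(current_replicas * current_qps_per_replica / target_qps_per_replica)
print(desired_replicas)           # 7
```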
Interview questions to prep
- How would you do a safe rollback when a new model regresses online metrics?
- What guardrail metrics would automatically trigger a rollback without a human in the loop?
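A hedged sketch of what "automatic rollback without a human" can look like during a canary rollout; the metric names and thresholds here are hypothetical, chosen only to illustrate comparing the canary against the baseline and tripping a rollback on any breach.

```python
# Illustrative guardrails for a canary'd model rollout; names and thresholds are hypothetical.
GUARDRAILS = {
    "p99_latency_ms": lambda canary, baseline: canary <= 1.2 * baseline,   # <= 20% regression
    "error_rate":     lambda canary, baseline: canary <= baseline + 0.005,
    "ctr":            lambda canary, baseline: canary >= 0.98 * baseline,  # business-metric floor
}

def breached_guardrails(canary_metrics: dict, baseline_metrics: dict) -> list[str]:
    """Return the list of breached guardrails; any breach triggers an automated rollback
    to the previous model version."""
    return [
        name for name, ok in GUARDRAILS.items()
        if not ok(canary_metrics[name], baseline_metrics[name])
    ]
```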
Infra · Inference optimization
Interview questions to prep
- Compare post-training quantization (PTQ) vs quantization-aware training (QAT); a minimal PTQ sketch follows this list.
- How do GPTQ and AWQ work, and what quality do you lose?
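To anchor the PTQ side of the comparison, a minimal NumPy sketch of the naive baseline: symmetric per-tensor int8 round-to-nearest. GPTQ and AWQ improve on exactly this (per-group scales, error-compensating rounding, activation-aware scaling chosen with calibration data), but the storage-and-dequantize scheme is the same idea.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Naive symmetric per-tensor PTQ: one scale, round-to-nearest int8."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)   # stand-in for one weight matrix
q, scale = quantize_int8(w)
print("max abs reconstruction error:", np.abs(w - dequantize(q, scale)).max())
```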
Interview questions to prep
- Walk me through knowledge distillation — what is the soft-target loss?
- Why does temperature in the soft target matter, and how do you pick it?
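A minimal PyTorch sketch of the Hinton-style distillation loss, assuming logits from a frozen teacher: soft targets at temperature T, with the KL term scaled by T² so its gradients stay on the same scale as the hard-label cross-entropy.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """alpha * hard cross-entropy + (1 - alpha) * T^2 * KL(teacher_T || student_T)."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),   # student log-probs at temperature T
        F.softmax(teacher_logits / T, dim=-1),       # teacher soft targets at temperature T
        reduction="batchmean",
    ) * (T * T)                                      # T^2 rescales gradients of the soft loss
    return alpha * hard + (1 - alpha) * soft
```

Higher T flattens the teacher distribution and exposes the "dark knowledge" in non-argmax classes; too high and the targets become uninformative, so values from the low single digits up to roughly 10 are common starting points to tune.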
Interview questions to prep
- Compare structured vs unstructured pruning — which actually speeds up inference?
- Why does unstructured pruning rarely move latency on GPU, even when sparsity is high?
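A small NumPy sketch of the difference, with illustrative shapes: unstructured pruning zeroes individual weights but keeps the dense shape, so a standard GPU GEMM does the same work; structured pruning removes whole rows/channels, so the matmul itself shrinks.

```python
import numpy as np

W = np.random.randn(1024, 4096).astype(np.float32)   # one dense layer's weight

# Unstructured: mask the 90% smallest-magnitude weights. The shape is unchanged,
# so a dense kernel still runs over 1024x4096 -- little latency win on GPU unless
# the kernel exploits the pattern (e.g. hardware-supported 2:4 sparsity).
threshold = np.quantile(np.abs(W), 0.90)
W_unstructured = W * (np.abs(W) >= threshold)

# Structured: drop the half of the output rows (neurons/channels) with the smallest
# L2 norm. The weight really shrinks to 512x4096, so the matmul does half the FLOPs.
row_norms = np.linalg.norm(W, axis=1)
keep = np.sort(np.argsort(row_norms)[W.shape[0] // 2:])
W_structured = W[keep]
```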
Interview questions to prep
- How does continuous batching beat static batching for LLM serving?
- What's the trade-off between max batch size and per-request latency under continuous batching?
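A toy event-loop sketch of continuous (iteration-level) batching, in the spirit of Orca/vLLM but not any real scheduler; `decode_step` and the request objects are stand-ins. Finished sequences leave the batch after every decode step and waiting requests are admitted into the freed slots immediately, instead of the whole static batch draining before new work starts.

```python
def continuous_batching_loop(waiting, decode_step, max_batch_size=64):
    """Toy iteration-level scheduler. `waiting` is a list of new requests;
    `decode_step(batch)` advances every sequence by one token and returns the
    subset that has finished (hit EOS or its max length)."""
    running = []
    while waiting or running:
        # Admit new requests into free slots (a real scheduler would also check
        # KV-cache headroom before admitting).
        while waiting and len(running) < max_batch_size:
            running.append(waiting.pop(0))

        finished = decode_step(running)        # one decode iteration for the whole batch
        running = [r for r in running if r not in finished]
        # Freed slots are reused on the next iteration, so no request waits on the
        # slowest sequence in its batch the way it would under static batching.
```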
Interview questions to prep
- Why does KV-cache memory become the bottleneck for long-context LLM serving? (Sizing arithmetic follows this list.)
- How does PagedAttention reduce memory fragmentation compared with a naive KV cache?
- What serving knobs would you tune when long prompts cause out-of-memory errors?
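Back-of-envelope KV-cache arithmetic is worth rehearsing for the memory question. The defaults below assume a Llama-2-70B-like shape (80 layers, 8 KV heads under GQA, head dimension 128, fp16) and are illustrative rather than exact for any specific deployment.

```python
def kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, dtype_bytes=2,
                   context_len=4096, batch_size=32):
    """2x for K and V; the per-token cost is the same whether the token
    came from the prompt or was generated."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes
    return per_token * context_len * batch_size

gib = kv_cache_bytes() / 2**30
print(f"~{gib:.0f} GiB of KV cache")   # ~40 GiB just for the cache at batch 32, 4K context
```

PagedAttention's contribution is allocating this memory in small fixed-size blocks on demand rather than reserving max-context contiguous buffers per sequence; the vLLM paper reports that internal and external fragmentation drops to a few percent of the pool, which is why it can admit larger batches at the same memory budget.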