Day 97 of 133

Infra mocks: pick a serving design and defend the cost/SLO trade-offs

30-minute self-mock: 'serve a 70B model at 100 QPS on a $X budget with P99 < Y.'

DSA · NeetCode 2-D DP

  • Interleaving String

    Interview questions to prep (sketch after the list)

    1. State the 2-D DP: indices, recurrence, base case. What's the order of fill?
    2. Can you reduce 2-D to 1-D by reusing rows or columns? Walk through the dependency direction.
    3. Top-down with memoization vs bottom-up — which is easier to reason about, and which is faster in practice?
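
    A minimal sketch for questions 1 and 2 (this is LeetCode 97, Interleaving String): dp[i][j] records whether s3[:i+j] interleaves s1[:i] and s2[:j], and since each cell depends only on the cell above and the cell to the left, one reusable row suffices.

    ```python
    def is_interleave(s1: str, s2: str, s3: str) -> bool:
        """dp[j] = can s3[:i+j] be built by interleaving s1[:i] and s2[:j]?"""
        m, n = len(s1), len(s2)
        if m + n != len(s3):
            return False
        dp = [False] * (n + 1)
        dp[0] = True
        for j in range(1, n + 1):                     # base row: s1 is empty
            dp[j] = dp[j - 1] and s2[j - 1] == s3[j - 1]
        for i in range(1, m + 1):
            dp[0] = dp[0] and s1[i - 1] == s3[i - 1]  # base column: s2 is empty
            for j in range(1, n + 1):
                from_above = dp[j] and s1[i - 1] == s3[i + j - 1]     # old row i-1
                from_left = dp[j - 1] and s2[j - 1] == s3[i + j - 1]  # new row i
                dp[j] = from_above or from_left
        return dp[n]
    ```

    Fill order is row-major: before dp[j] is overwritten it still holds the row i-1 value, which is exactly the "above" dependency, so the 1-D reduction falls out for free.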

MLOps · Serving infrastructure

  • Serving frameworks · interview questions to prep (sketch below)

    1. Compare TorchServe, Triton, and BentoML — when does each fit?
    2. What is dynamic batching and why does it matter for GPU utilization?
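
    A toy version of the idea in question 2: requests that arrive within a short window are fused into one forward pass, so kernel-launch overhead and weight reads are amortized across the batch. The request dict, future, and run_model here are illustrative stand-ins; servers like Triton and TorchServe implement this natively via their batching configs.

    ```python
    import queue
    import time

    def dynamic_batcher(requests: queue.Queue, run_model,
                        max_batch: int = 32, max_wait_ms: float = 5.0):
        """Collect requests until the batch is full or the wait budget expires."""
        while True:
            batch = [requests.get()]               # block for the first request
            deadline = time.monotonic() + max_wait_ms / 1000.0
            while len(batch) < max_batch:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(requests.get(timeout=remaining))
                except queue.Empty:
                    break
            outputs = run_model([r["input"] for r in batch])  # one fused GPU pass
            for req, out in zip(batch, outputs):
                req["future"].set_result(out)      # fan results back per request
    ```

    The two knobs (max_batch, max_wait_ms) are the throughput/latency dial: a longer wait builds bigger batches and better GPU utilization at the cost of queueing delay.
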
  • Kubernetes deployment · interview questions to prep (sketch below)

    1. Walk through deploying a model on Kubernetes with autoscaling.
    2. When do you reach for KServe vs custom Deployment + HPA?
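
    For question 1, the one formula worth stating cold is the HPA scaling rule from the Kubernetes docs; a worked example:

    ```python
    import math

    def hpa_desired_replicas(current_replicas: int, current_metric: float,
                             target_metric: float) -> int:
        """Core HPA rule: desired = ceil(current * currentMetric / targetMetric)."""
        return math.ceil(current_replicas * current_metric / target_metric)

    # 4 pods running at 90% utilization against a 60% target -> scale out to 6.
    assert hpa_desired_replicas(4, 90, 60) == 6
    ```
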
  • Rollbacks & guardrails · interview questions to prep (sketch below)

    1. How would you do a safe rollback when a new model regresses online metrics?
    2. What guardrail metrics would automatically trigger a rollback without a human in the loop?
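
    One shape an answer to question 2 could take, as a sketch with hypothetical metric names and thresholds: hard ceilings on latency and errors, plus a relative quality floor against the baseline arm, checked continuously during the canary.

    ```python
    # Hypothetical guardrails for a canary rollout; names and limits illustrative.
    GUARDRAILS = {
        "p99_latency_ms": {"max": 800.0},              # absolute ceiling
        "error_rate":     {"max": 0.01},               # absolute ceiling
        "quality_proxy":  {"min_vs_baseline": 0.98},   # relative to control arm
    }

    def within_guardrails(canary: dict, baseline: dict) -> bool:
        """True iff the canary stays inside every guardrail."""
        for metric, rule in GUARDRAILS.items():
            if "max" in rule and canary[metric] > rule["max"]:
                return False
            if ("min_vs_baseline" in rule
                    and canary[metric] < rule["min_vs_baseline"] * baseline[metric]):
                return False
        return True

    def evaluate_canary(canary: dict, baseline: dict, rollback) -> None:
        if not within_guardrails(canary, baseline):
            rollback()   # shift all traffic back to the previous model version
    ```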

Infra · Inference optimization

  • Quantization · interview questions to prep (sketch below)

    1. Compare post-training quantization (PTQ) vs quantization-aware training (QAT).
    2. How do GPTQ and AWQ work, and what quality do you lose?
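
    A minimal PTQ sketch for question 1: symmetric per-tensor int8 calibrated only from the weight range, with round-trip error as a crude quality probe. GPTQ and AWQ refine exactly this rounding step (Hessian-aware error compensation and activation-aware channel scaling, respectively); QAT instead simulates the rounding inside training so the weights adapt to it.

    ```python
    import numpy as np

    def ptq_int8(w: np.ndarray):
        """Symmetric per-tensor PTQ: one fp scale, weights rounded to int8."""
        scale = np.abs(w).max() / 127.0
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    w = np.random.randn(1024, 1024).astype(np.float32)
    q, scale = ptq_int8(w)
    w_hat = q.astype(np.float32) * scale          # dequantize
    print("mean abs round-trip error:", np.abs(w - w_hat).mean())
    ```
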
  • Distillation · interview questions to prep (sketch below)

    1. Walk me through knowledge distillation — what is the soft-target loss?
    2. Why does temperature in the soft target matter, and how do you pick it?
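
    The standard Hinton-style soft-target loss for question 1, as a PyTorch sketch. The T² factor is half the answer to question 2: soft-target gradients shrink roughly as 1/T², so scaling by T² keeps them on the same footing as the hard loss while T is tuned.

    ```python
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels,
                          T: float = 4.0, alpha: float = 0.7):
        """alpha * T^2 * KL(teacher_T || student_T) + (1 - alpha) * CE(labels)."""
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),   # student log-probs at T
            F.softmax(teacher_logits / T, dim=-1),       # teacher probs at T
            reduction="batchmean",
        ) * (T * T)
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard
    ```
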
  • Pruning · interview questions to prep (sketch below)

    1. Compare structured vs unstructured pruning — which actually speeds up inference?
    2. Why does unstructured pruning rarely move latency on GPU, even when sparsity is high?
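
    A NumPy sketch of the contrast in questions 1 and 2: the unstructured mask leaves the matrix shape, and therefore the dense GEMM, untouched, while structured row pruning actually shrinks the multiply. Scattered zeros only pay off with sparse kernels or hardware patterns like 2:4 sparsity.

    ```python
    import numpy as np

    w = np.random.randn(1024, 1024).astype(np.float32)

    # Unstructured: zero the smallest 90% of weights by magnitude. Same shape,
    # so a dense matmul does exactly the same work -- no latency win by itself.
    thresh = np.quantile(np.abs(w), 0.90)
    w_unstructured = np.where(np.abs(w) >= thresh, w, 0.0)
    print(w_unstructured.shape)        # (1024, 1024): FLOPs unchanged

    # Structured: keep only the 512 output rows with the largest L2 norm.
    # The matrix shrinks, so every downstream matmul gets cheaper.
    keep = np.sort(np.argsort(-np.linalg.norm(w, axis=1))[:512])
    w_structured = w[keep]
    print(w_structured.shape)          # (512, 1024): half the FLOPs
    ```
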
  • Continuous batching · interview questions to prep (sketch below)

    1. How does continuous batching beat static batching for LLM serving?
    2. What's the trade-off between max batch size and per-request latency under continuous batching?
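
    A toy scheduler for question 1, in the spirit of iteration-level scheduling (Orca, vLLM); the Request class and decode_step are illustrative stand-ins. The batch is re-formed at every decode step, so a finished sequence frees its slot immediately instead of idling until the slowest member of a static batch drains.

    ```python
    from dataclasses import dataclass

    @dataclass
    class Request:
        remaining: int                    # decode steps left for this sequence

        def done(self) -> bool:
            return self.remaining <= 0

    def decode_step(running: list) -> None:
        """Stand-in for one fused decode step over all running sequences."""
        for r in running:
            r.remaining -= 1

    def serve(waiting: list, max_batch: int = 64) -> int:
        running, steps = [], 0
        while waiting or running:
            while waiting and len(running) < max_batch:   # backfill freed slots
                running.append(waiting.pop(0))
            decode_step(running)
            running = [r for r in running if not r.done()]
            steps += 1
        return steps

    # Short requests exit early and new work backfills, rather than everyone
    # waiting on the 100-step sequence as in a static batch.
    print(serve([Request(5), Request(100), Request(5), Request(5)], max_batch=2))
    ```
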
  • KV cache & PagedAttention · interview questions to prep (sketch below)

    1. Why does KV-cache memory become the bottleneck for long-context LLM serving?
    2. How does PagedAttention reduce memory fragmentation compared with a naive KV cache?
    3. What serving knobs would you tune when long prompts cause out-of-memory errors?
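
    Back-of-envelope math for question 1, assuming a Llama-2-70B-like shape (80 layers, GQA with 8 KV heads, head_dim 128, fp16): at roughly 320 KB of cache per token, context length rather than weights is what eats the headroom as prompts grow. PagedAttention attacks the fragmentation half of the problem by allocating this cache in fixed-size blocks on demand instead of reserving max_seq_len up front.

    ```python
    def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                       seq_len: int, batch: int, bytes_per_elem: int = 2) -> int:
        """2 (K and V) * layers * kv_heads * head_dim * tokens * bytes, per batch."""
        return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

    # Llama-2-70B-like shape, fp16, one 4k-token sequence: ~1.25 GiB of cache.
    per_seq = kv_cache_bytes(80, 8, 128, seq_len=4096, batch=1)
    print(per_seq / 2**30, "GiB")
    ```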

References & further reading