Day 98 of 133
Infra wrap + LLMOps consolidation
Cross-link distributed training + serving + cost questions.
DSA · NeetCode Greedy
- Merge Triplets to Form Target Triplets
Interview questions to prep
- Prove the greedy choice — why is the locally-optimal pick safe globally? (Exchange argument or staying-ahead.)
- When does greedy fail on a similar-looking problem, and what would you reach for instead (DP, BFS)?
- Walk through edge cases that often break naive greedy: ties, negatives, single element.
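The greedy choice for Merge Triplets can be sketched in a few lines (a minimal sketch; function and variable names are my own):

```python
def merge_triplets(triplets, target):
    """Greedy: a triplet is usable only if no coordinate exceeds the
    target, because max() can never shrink a value back down. Among
    usable triplets, we only need each target coordinate hit once."""
    matched = set()
    for a, b, c in triplets:
        if a > target[0] or b > target[1] or c > target[2]:
            continue  # merging this triplet would overshoot some coordinate
        for i, v in enumerate((a, b, c)):
            if v == target[i]:
                matched.add(i)  # this coordinate of the target is reachable
    return matched == {0, 1, 2}
```

The exchange argument: any solution using a discarded (overshooting) triplet can't exist at all, and among the survivors, order of merging never matters, so collecting per-coordinate matches is safe.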
LLMOps · Caching, routing, cost
Interview questions to prep
- Compare exact-match prompt caching vs semantic caching — when does each fit?
- How would you measure semantic-cache safety — what's the false-hit failure mode?
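A toy illustration of exact-match vs semantic lookup and the false-hit failure mode (the bag-of-words `embed` is a deliberately crude stand-in for a real embedding model, and the 0.9 threshold is an assumption):

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a real system would use an embedding model."""
    return Counter(w.strip("?.,!") for w in text.lower().split())

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class PromptCache:
    """Exact-match lookup first, semantic fallback above a threshold."""
    def __init__(self, threshold=0.9):
        self.entries = {}      # normalized prompt -> response (always-safe hits)
        self.vectors = []      # (embedding, response) for semantic lookup
        self.threshold = threshold

    def put(self, prompt, response):
        self.entries[" ".join(prompt.lower().split())] = response
        self.vectors.append((embed(prompt), response))

    def get(self, prompt):
        key = " ".join(prompt.lower().split())
        if key in self.entries:
            return self.entries[key]      # exact hit: trivially correct
        q = embed(prompt)
        best = max(self.vectors, key=lambda e: cosine(q, e[0]), default=None)
        if best and cosine(q, best[0]) >= self.threshold:
            return best[1]                # semantic hit: correct only if intent matches
        return None                       # miss: call the model
```

Note the failure mode this makes concrete: the inverted question "France is the capital of what?" has the same bag of words as the cached prompt, scores cosine 1.0, and wrongly returns the cached answer. Measuring semantic-cache safety means measuring exactly this false-hit rate against a labeled paraphrase/non-paraphrase set.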
Interview questions to prep
- How would you route requests across GPT-5, Claude 4.5, and a small open-source model?
- Walk through how a verifier model gates the cheap-model output before falling back to the expensive one.
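The cascade can be sketched as below; `small`, `frontier`, and `verifier` are hypothetical callables standing in for a small open model, a frontier model (GPT-5 / Claude 4.5), and a scoring model, and the 0.8 bar and 500-word cutoff are made-up tunables:

```python
def route(prompt, small, frontier, verifier):
    """Cascade router sketch: pre-route obviously heavy prompts, otherwise
    draft with the cheap model and accept only verifier-approved drafts."""
    if len(prompt.split()) > 500:           # heuristic pre-route on prompt size
        return frontier(prompt), "frontier"
    draft = small(prompt)
    if verifier(prompt, draft) >= 0.8:      # verifier gate on the cheap draft
        return draft, "small"
    return frontier(prompt), "frontier"     # fall back and pay the premium
```

The interview-relevant trade-off: every gated request pays small-model plus verifier latency even when it ultimately falls back, so the cascade only wins if the verifier's accept rate is high enough.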
Interview questions to prep
- What does vLLM's PagedAttention do for throughput?
- Compare vLLM vs TensorRT-LLM vs SGLang.
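The core idea behind PagedAttention can be mimicked with a toy block allocator (an illustration of on-demand block mapping only, not vLLM's actual implementation):

```python
class PagedKVCache:
    """Toy page-table allocator in the spirit of vLLM's PagedAttention:
    KV cache is carved into fixed-size blocks mapped on demand, so memory
    use tracks tokens actually generated rather than max_seq_len, and
    freed blocks are immediately reusable by other sequences."""
    def __init__(self, num_blocks, block_size=16):
        self.free = list(range(num_blocks))
        self.block_size = block_size
        self.tables = {}    # seq_id -> list of physical block ids
        self.lengths = {}   # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:      # current block full: map a new one
            if not self.free:
                raise MemoryError("cache full: preempt or swap a sequence")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        self.free.extend(self.tables.pop(seq_id, []))  # blocks recycled at once
        self.lengths.pop(seq_id, None)
```

The throughput win follows directly: without paging, each request reserves worst-case contiguous KV memory up front, capping batch size; with paging, internal fragmentation shrinks to at most one block per sequence, so far more requests fit in the same GPU memory.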
Interview questions to prep
- How would you diagnose high time-to-first-token latency vs slow per-token decode (low tokens/sec)?
- How do rate limits, concurrency limits, queues, and retries interact in an LLM API gateway?
- What metrics tell you whether the bottleneck is prompt length, model compute, KV cache pressure, or downstream tools?
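A sketch of splitting streamed-response latency into the two regimes (the timestamps are hypothetical; in practice they come from your gateway's streaming hooks):

```python
def streaming_latency(token_times, request_time):
    """Split latency into time-to-first-token (prefill-bound: prompt length,
    queueing, admission) and inter-token throughput (decode-bound: model
    compute, KV cache pressure). token_times are arrival timestamps."""
    ttft = token_times[0] - request_time
    if len(token_times) > 1:
        decode = token_times[-1] - token_times[0]
        tps = (len(token_times) - 1) / decode if decode > 0 else float("inf")
    else:
        tps = 0.0
    return ttft, tps
```

Diagnostically: high TTFT with healthy tokens/sec points at queueing or long prompts (prefill); healthy TTFT with low tokens/sec points at decode-side compute or KV-cache pressure; both degrading together suggests saturation upstream of the model.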
MLOps · Cost & scaling
Interview questions to prep
- How would you model the unit cost of a prediction in production?
- What levers reduce inference cost (batching, quantization, caching, distillation)?
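A back-of-the-envelope unit-cost model (the utilization discount is an assumption; real accounting would add egress, storage, and any per-token API fees):

```python
def cost_per_1k_requests(gpu_hourly_usd, requests_per_sec, utilization=0.6):
    """Amortize instance cost over sustained throughput. utilization < 1
    models the idle/off-peak capacity you still pay for; every lever in
    the question (batching, quantization, caching, distillation) works by
    raising requests_per_sec or letting you drop to a cheaper instance."""
    effective_rps = requests_per_sec * utilization
    per_request = gpu_hourly_usd / (effective_rps * 3600)
    return per_request * 1000
```

Example: a $2/hr GPU sustaining 10 req/s at 50% utilization costs about $0.11 per 1k requests; doubling batch throughput halves that directly.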
Interview questions to prep
- Compare CPU-based HPA vs queue-based KEDA scaling for ML inference.
- Why does GPU-pinned inference often defeat HPA, and how do you actually scale GPU pods?
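The sizing arithmetic that queue-based (KEDA-style) scalers apply can be sketched as follows (a simplification; real KEDA adds polling intervals, stabilization windows, and cooldowns):

```python
import math

def desired_replicas(queue_depth, target_per_replica, max_replicas=20, min_replicas=1):
    """Size the fleet by backlog per replica rather than CPU%. On a
    GPU-pinned inference pod the CPU barely moves while the GPU saturates,
    so CPU-based HPA never sees pressure; queue depth does."""
    want = math.ceil(queue_depth / target_per_replica)
    return min(max(want, min_replicas), max_replicas)
```

Note the GPU-specific caveat: because each pod pins a whole GPU, scaling out means provisioning new GPU nodes, so the scaler's responsiveness is bounded by node spin-up time, not by this arithmetic.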
Interview questions to prep
- How would you train safely on spot instances (checkpointing, retries)?
- When does spot training become net more expensive than on-demand — what's the breakeven?
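The breakeven can be estimated with a simple expected-waste model (the assumption that an interruption loses on average half a checkpoint interval, plus a fixed restart cost, is mine):

```python
def spot_effective_hourly(spot_price, interrupts_per_hour, ckpt_interval_min, restart_min):
    """Each interruption wastes ~half a checkpoint interval of compute plus
    a fixed restart overhead, inflating the effective $/useful-hour."""
    wasted_min_per_hour = interrupts_per_hour * (ckpt_interval_min / 2 + restart_min)
    useful_frac = max(1 - wasted_min_per_hour / 60, 1e-9)
    return spot_price / useful_frac

def spot_is_cheaper(spot_price, on_demand_price, interrupts_per_hour,
                    ckpt_interval_min, restart_min):
    """Breakeven test: spot wins only while its waste-inflated effective
    rate stays below the on-demand rate."""
    return spot_effective_hourly(
        spot_price, interrupts_per_hour, ckpt_interval_min, restart_min
    ) < on_demand_price
```

The model also shows why cheap, frequent checkpointing is the main safety lever: shrinking `ckpt_interval_min` directly shrinks the waste term, pushing the breakeven toward higher interruption rates.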
References & further reading
- vLLM documentation
- Eugene Yan, applied ML writing