Day 98 of 133
Infra wrap + LLMOps consolidation
Cross-link distributed training + serving + cost questions.
DSA · NeetCode Greedy
- Merge Triplets to Form Target Triplets
Interview questions to prep
- Prove the greedy choice — why is the locally-optimal pick safe globally? (Exchange argument or staying-ahead.)
- When does greedy fail on a similar-looking problem, and what would you reach for instead (DP, BFS)?
- Walk through edge cases that often break naive greedy: ties, negatives, single element.
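The greedy choice for Merge Triplets can be sketched in a few lines (a minimal sketch; function and variable names are my own):

```python
def merge_triplets(triplets, target):
    """Greedy: a triplet is usable only if no coordinate exceeds the
    target, because max() can never shrink a value back down. Among
    usable triplets, we only need each target coordinate hit once."""
    matched = set()
    for a, b, c in triplets:
        if a > target[0] or b > target[1] or c > target[2]:
            continue  # merging this triplet would overshoot some coordinate
        for i, v in enumerate((a, b, c)):
            if v == target[i]:
                matched.add(i)  # this coordinate of the target is reachable
    return matched == {0, 1, 2}
```

The exchange argument: any solution using a discarded (overshooting) triplet can't exist at all, and among the survivors, order of merging never matters, so collecting per-coordinate matches is safe.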
LLMOps · Caching, routing, cost
Interview questions to prep
- Compare exact-match prompt caching vs semantic caching — when does each fit?
- How would you measure semantic-cache safety — what's the false-hit failure mode?
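A toy illustration of exact-match vs semantic lookup and the false-hit failure mode (the bag-of-words `embed` is a deliberately crude stand-in for a real embedding model, and the 0.9 threshold is an assumption):

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a real system would use an embedding model."""
    return Counter(w.strip("?.,!") for w in text.lower().split())

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class PromptCache:
    """Exact-match lookup first, semantic fallback above a threshold."""
    def __init__(self, threshold=0.9):
        self.entries = {}      # normalized prompt -> response (always-safe hits)
        self.vectors = []      # (embedding, response) for semantic lookup
        self.threshold = threshold

    def put(self, prompt, response):
        self.entries[" ".join(prompt.lower().split())] = response
        self.vectors.append((embed(prompt), response))

    def get(self, prompt):
        key = " ".join(prompt.lower().split())
        if key in self.entries:
            return self.entries[key]      # exact hit: trivially correct
        q = embed(prompt)
        best = max(self.vectors, key=lambda e: cosine(q, e[0]), default=None)
        if best and cosine(q, best[0]) >= self.threshold:
            return best[1]                # semantic hit: correct only if intent matches
        return None                       # miss: call the model
```

Note the failure mode this makes concrete: the inverted question "France is the capital of what?" has the same bag of words as the cached prompt, scores cosine 1.0, and wrongly returns the cached answer. Measuring semantic-cache safety means measuring exactly this false-hit rate against a labeled paraphrase/non-paraphrase set.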
Interview questions to prep
- How would you route requests across GPT-5, Claude 4.5, and a small open-source model?
- Walk through how a verifier model gates the cheap-model output before falling back to the expensive one.
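The cascade can be sketched as below; `small`, `frontier`, and `verifier` are hypothetical callables standing in for a small open model, a frontier model (GPT-5 / Claude 4.5), and a scoring model, and the 0.8 bar and 500-word cutoff are made-up tunables:

```python
def route(prompt, small, frontier, verifier):
    """Cascade router sketch: pre-route obviously heavy prompts, otherwise
    draft with the cheap model and accept only verifier-approved drafts."""
    if len(prompt.split()) > 500:           # heuristic pre-route on prompt size
        return frontier(prompt), "frontier"
    draft = small(prompt)
    if verifier(prompt, draft) >= 0.8:      # verifier gate on the cheap draft
        return draft, "small"
    return frontier(prompt), "frontier"     # fall back and pay the premium
```

The interview-relevant trade-off: every gated request pays small-model plus verifier latency even when it ultimately falls back, so the cascade only wins if the verifier's accept rate is high enough.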
Interview questions to prep
- What does vLLM's PagedAttention do for throughput?
- Compare vLLM vs TensorRT-LLM vs SGLang.
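The core idea behind PagedAttention can be mimicked with a toy block allocator (an illustration of on-demand block mapping only, not vLLM's actual implementation):

```python
class PagedKVCache:
    """Toy page-table allocator in the spirit of vLLM's PagedAttention:
    KV cache is carved into fixed-size blocks mapped on demand, so memory
    use tracks tokens actually generated rather than max_seq_len, and
    freed blocks are immediately reusable by other sequences."""
    def __init__(self, num_blocks, block_size=16):
        self.free = list(range(num_blocks))
        self.block_size = block_size
        self.tables = {}    # seq_id -> list of physical block ids
        self.lengths = {}   # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:      # current block full: map a new one
            if not self.free:
                raise MemoryError("cache full: preempt or swap a sequence")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        self.free.extend(self.tables.pop(seq_id, []))  # blocks recycled at once
        self.lengths.pop(seq_id, None)
```

The throughput win follows directly: without paging, each request reserves worst-case contiguous KV memory up front, capping batch size; with paging, internal fragmentation shrinks to at most one block per sequence, so far more requests fit in the same GPU memory.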
Interview questions to prep
- How would you diagnose high time-to-first-token latency vs slow per-token decode (low tokens/sec)?
- How do rate limits, concurrency limits, queues, and retries interact in an LLM API gateway?
- What metrics tell you whether the bottleneck is prompt length, model compute, KV cache pressure, or downstream tools?
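A sketch of splitting streamed-response latency into the two regimes (the timestamps are hypothetical; in practice they come from your gateway's streaming hooks):

```python
def streaming_latency(token_times, request_time):
    """Split latency into time-to-first-token (prefill-bound: prompt length,
    queueing, admission) and inter-token throughput (decode-bound: model
    compute, KV cache pressure). token_times are arrival timestamps."""
    ttft = token_times[0] - request_time
    if len(token_times) > 1:
        decode = token_times[-1] - token_times[0]
        tps = (len(token_times) - 1) / decode if decode > 0 else float("inf")
    else:
        tps = 0.0
    return ttft, tps
```

Diagnostically: high TTFT with healthy tokens/sec points at queueing or long prompts (prefill); healthy TTFT with low tokens/sec points at decode-side compute or KV-cache pressure; both degrading together suggests saturation upstream of the model.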
MLOps · Cost & scaling
Interview questions to prep
- How would you model the unit cost of a prediction in production?
- What levers reduce inference cost (batching, quantization, caching, distillation)?
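A back-of-the-envelope unit-cost model (the utilization discount is an assumption; real accounting would add egress, storage, and any per-token API fees):

```python
def cost_per_1k_requests(gpu_hourly_usd, requests_per_sec, utilization=0.6):
    """Amortize instance cost over sustained throughput. utilization < 1
    models the idle/off-peak capacity you still pay for; every lever in
    the question (batching, quantization, caching, distillation) works by
    raising requests_per_sec or letting you drop to a cheaper instance."""
    effective_rps = requests_per_sec * utilization
    per_request = gpu_hourly_usd / (effective_rps * 3600)
    return per_request * 1000
```

Example: a $2/hr GPU sustaining 10 req/s at 50% utilization costs about $0.11 per 1k requests; doubling batch throughput halves that directly.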
Interview questions to prep
- Compare CPU-based HPA vs queue-based KEDA scaling for ML inference.
- Why does GPU-pinned inference often defeat HPA, and how do you actually scale GPU pods?
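The sizing arithmetic that queue-based (KEDA-style) scalers apply can be sketched as follows (a simplification; real KEDA adds polling intervals, stabilization windows, and cooldowns):

```python
import math

def desired_replicas(queue_depth, target_per_replica, max_replicas=20, min_replicas=1):
    """Size the fleet by backlog per replica rather than CPU%. On a
    GPU-pinned inference pod the CPU barely moves while the GPU saturates,
    so CPU-based HPA never sees pressure; queue depth does."""
    want = math.ceil(queue_depth / target_per_replica)
    return min(max(want, min_replicas), max_replicas)
```

Note the GPU-specific caveat: because each pod pins a whole GPU, scaling out means provisioning new GPU nodes, so the scaler's responsiveness is bounded by node spin-up time, not by this arithmetic.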
Interview questions to prep
- How would you train safely on spot instances (checkpointing, retries)?
- When does spot training become net more expensive than on-demand — what's the breakeven?
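The breakeven can be estimated with a simple expected-waste model (the assumption that an interruption loses on average half a checkpoint interval, plus a fixed restart cost, is mine):

```python
def spot_effective_hourly(spot_price, interrupts_per_hour, ckpt_interval_min, restart_min):
    """Each interruption wastes ~half a checkpoint interval of compute plus
    a fixed restart overhead, inflating the effective $/useful-hour."""
    wasted_min_per_hour = interrupts_per_hour * (ckpt_interval_min / 2 + restart_min)
    useful_frac = max(1 - wasted_min_per_hour / 60, 1e-9)
    return spot_price / useful_frac

def spot_is_cheaper(spot_price, on_demand_price, interrupts_per_hour,
                    ckpt_interval_min, restart_min):
    """Breakeven test: spot wins only while its waste-inflated effective
    rate stays below the on-demand rate."""
    return spot_effective_hourly(
        spot_price, interrupts_per_hour, ckpt_interval_min, restart_min
    ) < on_demand_price
```

The model also shows why cheap, frequent checkpointing is the main safety lever: shrinking `ckpt_interval_min` directly shrinks the waste term, pushing the breakeven toward higher interruption rates.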
References & further reading
- vLLM documentation
- Eugene Yan, applied ML writing