Day 91 of 133

LLMOps: caching, routing, vLLM serving

Exact vs semantic caching; cascade routing; PagedAttention.

DSA · NeetCode Intervals

  • Meeting Rooms II

    Interview questions to prep

    1. Compare heap-based (O(n log n)) vs sweep-line (start/end events) approaches; both are sketched after this list.
    2. What if you need to assign each meeting to a specific room, not just count rooms?
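
    A minimal sketch of both approaches from question 1 (standard Meeting Rooms II solutions; function names are mine):

    ```python
    import heapq
    from typing import List

    def min_rooms_heap(intervals: List[List[int]]) -> int:
        """Heap of end times, O(n log n): heap size = rooms currently in use."""
        intervals.sort(key=lambda iv: iv[0])
        ends: List[int] = []                   # min-heap of end times, one per occupied room
        for start, end in intervals:
            if ends and ends[0] <= start:
                heapq.heapreplace(ends, end)   # earliest-ending room frees up: reuse it
            else:
                heapq.heappush(ends, end)      # all rooms busy: open a new one
        return len(ends)

    def min_rooms_sweep(intervals: List[List[int]]) -> int:
        """Sweep line over start/end events, O(n log n): answer = max overlap."""
        events = [(s, 1) for s, _ in intervals] + [(e, -1) for _, e in intervals]
        # Ends (-1) sort before starts (+1) at the same timestamp, so
        # back-to-back meetings like [5,10] and [10,15] can share a room.
        events.sort()
        rooms = best = 0
        for _, delta in events:
            rooms += delta
            best = max(best, rooms)
        return best

    assert min_rooms_heap([[0, 30], [5, 10], [15, 20]]) == 2
    assert min_rooms_sweep([[0, 30], [5, 10], [15, 20]]) == 2
    ```
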

LLMOps · Caching, routing, cost

  • Prompt caching: exact vs semantic

    Interview questions to prep

    1. Compare exact-match prompt caching vs semantic caching — when does each fit? (Both are sketched below.)
    2. How would you measure semantic-cache safety — what's the false-hit failure mode?
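
    A toy contrast for questions 1–2. Assumptions: `embed` is some embedding callable returning unit-normalised vectors, and the 0.95 threshold is illustrative. The semantic cache's false-hit mode is two different intents landing above the threshold.

    ```python
    import hashlib

    class ExactCache:
        """Exact-match cache: key = hash of the normalised prompt.
        Zero false hits, but only near-identical prompts ever hit."""
        def __init__(self):
            self._store = {}

        def _key(self, prompt: str) -> str:
            normalised = " ".join(prompt.lower().split())
            return hashlib.sha256(normalised.encode()).hexdigest()

        def get(self, prompt: str):
            return self._store.get(self._key(prompt))

        def put(self, prompt: str, response: str):
            self._store[self._key(prompt)] = response

    class SemanticCache:
        """Nearest-neighbour lookup over prompt embeddings. A looser threshold
        raises the hit rate and the false-hit risk together (question 2)."""
        def __init__(self, embed, threshold: float = 0.95):
            self.embed, self.threshold = embed, threshold
            self.entries = []                                  # (embedding, response) pairs

        def get(self, prompt: str):
            q = self.embed(prompt)
            best_sim, best = -1.0, None
            for vec, response in self.entries:
                sim = sum(a * b for a, b in zip(q, vec))       # cosine (unit vectors)
                if sim > best_sim:
                    best_sim, best = sim, response
            return best if best_sim >= self.threshold else None

        def put(self, prompt: str, response: str):
            self.entries.append((self.embed(prompt), response))
    ```
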
  • Cascade routing across models

    Interview questions to prep

    1. How would you route requests across GPT-5, Claude 4.5, and a small open-source model?
    2. Walk through how a verifier model gates the cheap-model output before falling back to the expensive one (see the sketch after this list).
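
    A compact cascade sketch for question 2; the three callables and the 0.8 acceptance threshold are illustrative assumptions, not a specific library's API:

    ```python
    def cascade(query: str, cheap_llm, verifier, expensive_llm,
                accept_threshold: float = 0.8) -> str:
        """Cheap model answers first; a verifier (e.g. a small judge model
        returning a 0..1 score) gates the draft, so only low-confidence
        queries pay for the expensive model."""
        draft = cheap_llm(query)
        if verifier(query, draft) >= accept_threshold:
            return draft                # accepted: the common, cheap path
        return expensive_llm(query)     # fallback: the hard tail only
    ```
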
  • Production harness & observability

    Interview questions to prep

    1. A demo works perfectly but production fails. What does a robust LLM harness include beyond the prompt? (A skeletal harness follows this list.)
    2. How would you make LLM behavior reproducible, debuggable, and regression-tested across prompt, tool, and model changes?
    3. Where do you place structured evals, tracing, guardrails, retries, and human feedback in a production assistant?
    4. What signals would tell you the issue is orchestration or observability rather than model quality?
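
    A skeletal harness for questions 1–3 (all names illustrative): trace ids, structured logs, output validation, and bounded retries are the minimum that turns "the demo worked" into something debuggable and regression-testable.

    ```python
    import json, time, uuid

    def harness_call(llm, prompt: str, validate, max_retries: int = 2, log=print):
        """Wrap a model call with tracing, validation, and retries.
        `validate` returns (ok, parsed_or_error), e.g. a JSON-schema or
        guardrail check on the raw output."""
        trace_id = uuid.uuid4().hex
        for attempt in range(max_retries + 1):
            t0 = time.monotonic()
            raw = llm(prompt)
            ok, result = validate(raw)
            log(json.dumps({"trace": trace_id, "attempt": attempt,
                            "latency_s": round(time.monotonic() - t0, 3),
                            "valid": ok}))
            if ok:
                return result           # structured evals replay these logs offline
        raise RuntimeError(f"validation failed after retries (trace {trace_id})")
    ```
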
  • vLLM serving & PagedAttention

    Interview questions to prep

    1. What does vLLM's PagedAttention do for throughput? (See the usage sketch after this list.)
    2. Compare vLLM vs TensorRT-LLM vs SGLang.
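
    A minimal offline-batch call for question 1, using vLLM's Python entry point (the model tag and sampling settings are placeholders):

    ```python
    from vllm import LLM, SamplingParams

    # PagedAttention is vLLM's default KV-cache manager: the cache lives in
    # fixed-size blocks, so many concurrent sequences batch together without
    # fragmenting GPU memory, which is where the throughput win comes from.
    llm = LLM(model="facebook/opt-125m")
    params = SamplingParams(temperature=0.8, max_tokens=64)

    for out in llm.generate(["Explain PagedAttention in one sentence."], params):
        print(out.outputs[0].text)
    ```
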
  • Local LLM deployment (Ollama)

    Interview questions to prep

    1. When would you run an LLM locally instead of calling a hosted API?
    2. What changes when serving a quantized local model through Ollama compared with a GPU-backed vLLM service? (See the Ollama call below.)
    3. How would you evaluate privacy, latency, context length, and update cadence for local LLM deployment?
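
    For question 2, a local call through Ollama's default HTTP endpoint (the model tag is whatever you have pulled locally; `"stream": False` returns one JSON body):

    ```python
    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",          # Ollama's default port
        json={"model": "llama3", "prompt": "Summarise PagedAttention.",
              "stream": False},
        timeout=120,
    )
    print(resp.json()["response"])   # prompt and output never leave the machine
    ```
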
  • Latency, throughput & bottlenecks

    Interview questions to prep

    1. How would you diagnose high first-token latency vs high tokens-per-second latency? (A profiling sketch follows this list.)
    2. How do rate limits, concurrency limits, queues, and retries interact in an LLM API gateway?
    3. What metrics tell you whether the bottleneck is prompt length, model compute, KV cache pressure, or downstream tools?
    4. Why do GenAI systems become slow even when the base model is fast?
    5. How would you budget latency across retrieval, filtering, tool calls, validation, retries, and generation?
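
    A profiling sketch for question 1: split end-to-end latency into time-to-first-token and decode throughput. Here `stream` is an assumed iterator that yields tokens from any streaming API.

    ```python
    import time

    def profile_stream(stream):
        t0 = time.monotonic()
        ttft, n = None, 0
        for _ in stream:                       # consume a streamed response
            n += 1
            if ttft is None:
                ttft = time.monotonic() - t0   # time to first token
        total = time.monotonic() - t0
        decode = total - (ttft or 0.0)
        tps = (n - 1) / decode if n > 1 and decode > 0 else float("nan")
        return {"ttft_s": ttft, "total_s": total,
                "tokens": n, "decode_tok_per_s": tps}
    ```

    High TTFT with healthy decode throughput points at queueing or prefill (prompt length); low decode throughput points at model compute or KV-cache pressure.
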
