Day 91 of 133
LLMOps: caching, routing, vLLM serving
Exact vs semantic caching; cascade routing; PagedAttention.
DSA · NeetCode Intervals
- Meeting Rooms II
Interview questions to prep
- Compare the heap-based (O(n log n)) and sweep-line (start/end events) approaches; a heap-based sketch follows this list.
- What if you need to assign each meeting to a specific room, not just count?
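A minimal Python sketch of the heap-based approach, assuming meetings are given as [start, end) pairs (the standard Meeting Rooms II input):

```python
import heapq

def min_meeting_rooms(intervals: list[list[int]]) -> int:
    """Minimum number of rooms so that no two meetings in one room overlap."""
    if not intervals:
        return 0
    intervals.sort(key=lambda iv: iv[0])       # sort by start time
    end_heap: list[int] = []                   # min-heap of end times of rooms in use
    for start, end in intervals:
        # The earliest-ending room is free again if its meeting ended by `start`.
        if end_heap and end_heap[0] <= start:
            heapq.heapreplace(end_heap, end)   # reuse that room
        else:
            heapq.heappush(end_heap, end)      # open a new room
    return len(end_heap)                       # peak number of rooms in use

# Example: min_meeting_rooms([[0, 30], [5, 10], [15, 20]]) -> 2
```

The sweep-line variant sorts start and end times separately and tracks the running count of open meetings; both are O(n log n), but the heap also tells you which room frees up next, which matters when you have to assign each meeting to a specific room.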
LLMOps · Caching, routing, cost
Interview questions to prep
- Compare exact-match prompt caching vs semantic caching — when does each fit? (The two lookups are contrasted in the sketch after this list.)
- How would you measure semantic-cache safety — what's the false-hit failure mode?
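A minimal sketch contrasting the two lookups. The `embed` callable is an assumption (any embedding model returning unit-norm vectors), and the 0.95 cosine threshold is illustrative, not a recommendation:

```python
import hashlib
import numpy as np

exact_cache: dict[str, str] = {}                    # normalized-prompt hash -> response
semantic_cache: list[tuple[np.ndarray, str]] = []   # (prompt embedding, response)

def _key(prompt: str) -> str:
    return hashlib.sha256(prompt.strip().lower().encode()).hexdigest()

def exact_lookup(prompt: str) -> str | None:
    # Hits only when the normalized prompt is byte-identical to a cached one.
    return exact_cache.get(_key(prompt))

def semantic_lookup(prompt: str, embed, threshold: float = 0.95) -> str | None:
    # Hits when some cached prompt's embedding is "close enough" to the query.
    # The threshold is the safety knob: set it too low and unrelated prompts
    # collide, returning a stale or wrong answer (the false-hit failure mode).
    query = embed(prompt)                           # assumed: returns a unit-norm vector
    for vec, cached_response in semantic_cache:
        if float(np.dot(query, vec)) >= threshold:  # cosine similarity for unit vectors
            return cached_response
    return None
```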
Interview questions to prep
- How would you route requests across GPT-5, Claude 4.5, and a small open-source model?
- Walk through how a verifier model gates the cheap-model output before falling back to the expensive one (roughly as in the sketch after this list).
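A hedged sketch of that cascade: try the cheap model, score its draft with a verifier, and escalate only on low confidence. The callables and the 0.8 acceptance threshold are placeholders, not a specific vendor API:

```python
def cascade_answer(prompt: str, cheap_llm, verifier, expensive_llm,
                   accept_threshold: float = 0.8) -> tuple[str, str]:
    """Return (answer, model_used).

    `cheap_llm` / `expensive_llm` are callables prompt -> str;
    `verifier` is a callable (prompt, answer) -> score in [0, 1],
    e.g. a small grader model or rule-based checks.
    """
    draft = cheap_llm(prompt)
    score = verifier(prompt, draft)
    if score >= accept_threshold:
        return draft, "cheap"          # verifier accepted the cheap draft
    # Verifier rejected the draft: pay for the expensive model.
    return expensive_llm(prompt), "expensive"
```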
Interview questions to prep
- A demo works perfectly but production fails. What does a robust LLM harness include beyond the prompt? (See the wrapper sketch after this list.)
- How would you make LLM behavior reproducible, debuggable, and regression-tested across prompt, tool, and model changes?
- Where do you place structured evals, tracing, guardrails, retries, and human feedback in a production assistant?
- What signals would tell you the issue is orchestration or observability rather than model quality?
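One way to frame the "harness beyond the prompt" is as a wrapper that adds output validation, bounded retries, and per-attempt trace logging around every model call. A minimal sketch, where `call_model` and the required JSON keys are assumed placeholders:

```python
import json
import logging
import time
import uuid

logger = logging.getLogger("llm_harness")

def guarded_call(call_model, prompt: str, required_keys: set[str],
                 max_retries: int = 2) -> dict:
    """Call the model, validate its JSON output, retry on failure,
    and emit one structured trace record per attempt."""
    trace_id = str(uuid.uuid4())
    for attempt in range(max_retries + 1):
        start = time.perf_counter()
        raw = call_model(prompt)                    # assumed: returns a string
        latency = time.perf_counter() - start
        try:
            parsed = json.loads(raw)
            ok = isinstance(parsed, dict) and required_keys <= parsed.keys()
        except json.JSONDecodeError:
            parsed, ok = None, False
        logger.info("trace=%s attempt=%d ok=%s latency=%.2fs",
                    trace_id, attempt, ok, latency)
        if ok:
            return parsed
    raise ValueError(f"Output failed validation after {max_retries + 1} attempts")
```

In a real system the trace record would go to a tracing backend (e.g. LangSmith, listed in the references) and feed regression evals; the point is that reproducibility and debuggability live in this layer rather than in the prompt text.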
Interview questions to prep
- What does vLLM's PagedAttention do for throughput? (A minimal serving example follows this list.)
- Compare vLLM vs TensorRT-LLM vs SGLang.
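For grounding, a minimal offline-inference sketch with vLLM's Python API. The model name is only an example; the engine applies PagedAttention and continuous batching on its own rather than exposing them as options here:

```python
from vllm import LLM, SamplingParams

# PagedAttention stores the KV cache in fixed-size blocks, so many concurrent
# sequences share GPU memory without large contiguous pre-allocations.
llm = LLM(model="facebook/opt-125m")       # example model; any HF causal-LM id works
params = SamplingParams(temperature=0.7, max_tokens=64)

prompts = [
    "Explain KV cache paging in one sentence.",
    "Why does continuous batching raise throughput?",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text.strip())
```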
Interview questions to prep
- When would you run an LLM locally instead of calling a hosted API?
- What changes when serving a quantized local model through Ollama compared with a GPU-backed vLLM service? (See the Ollama call after this list.)
- How would you evaluate privacy, latency, context length, and update cadence for local LLM deployment?
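A quick way to feel the local side of that comparison is to call a running Ollama server over its HTTP API. A sketch assuming Ollama is installed, listening on its default port, and the example model ("llama3" here) has already been pulled:

```python
import requests

def ask_local(prompt: str, model: str = "llama3") -> str:
    """Call a locally running Ollama server (default port 11434).
    No data leaves the machine, but throughput is bounded by local hardware."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

print(ask_local("Summarize the trade-offs of local LLM serving in two sentences."))
```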
Interview questions to prep
- How would you diagnose high time-to-first-token versus low tokens-per-second during decode? (An instrumentation sketch follows this list.)
- How do rate limits, concurrency limits, queues, and retries interact in an LLM API gateway?
- What metrics tell you whether the bottleneck is prompt length, model compute, KV cache pressure, or downstream tools?
- Why do GenAI systems become slow even when the base model is fast?
- How would you budget latency across retrieval, filtering, tool calls, validation, retries, and generation?
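Diagnosis usually starts by separating those two numbers with instrumentation around a streaming call. A minimal sketch, where `stream_tokens` is an assumed iterator that yields tokens as the server streams them:

```python
import time

def measure_stream(stream_tokens) -> dict:
    """Time-to-first-token reflects queueing + prefill (and anything before
    generation); tokens/sec afterwards reflects decode throughput."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in stream_tokens:
        now = time.perf_counter()
        if first_token_at is None:
            first_token_at = now       # first token marks the end of prefill/queueing
        count += 1
    end = time.perf_counter()
    decode_time = end - (first_token_at or end)
    return {
        "ttft_s": (first_token_at or end) - start,
        "tokens_per_s": count / decode_time if decode_time > 0 else 0.0,
        "total_s": end - start,
    }
```

High TTFT with healthy tokens/sec points upstream: queueing, long prompts and prefill, or retrieval and tool calls before generation. Low tokens/sec points at the decode side: KV cache pressure, oversubscribed GPUs, or a model too large for the hardware.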
References & further reading
- vLLM — documentation
- LangSmith — LLM tracing & evaluation (LangChain)
- Anthropic — Prompt Engineering Guide
- Ollama — local LLM runtime