Day 91 of 133

LLMOps: caching, routing, vLLM serving

Exact vs semantic caching; cascade routing; PagedAttention.

DSA · NeetCode Intervals

  • Meeting Rooms II

    Interview questions to prep

    1. Compare heap-based (O(n log n)) vs sweep-line (start/end events) approaches; both are sketched after this list.
    2. What if you need to assign each meeting to a specific room, not just count rooms?
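
    A minimal sketch of both approaches from question 1 (standard Meeting Rooms II solutions; function names are mine):

    ```python
    import heapq
    from typing import List

    def min_rooms_heap(intervals: List[List[int]]) -> int:
        """Heap of end times, O(n log n): heap size = rooms currently in use."""
        intervals.sort(key=lambda iv: iv[0])
        ends: List[int] = []                   # min-heap of end times, one per occupied room
        for start, end in intervals:
            if ends and ends[0] <= start:
                heapq.heapreplace(ends, end)   # earliest-ending room frees up: reuse it
            else:
                heapq.heappush(ends, end)      # all rooms busy: open a new one
        return len(ends)

    def min_rooms_sweep(intervals: List[List[int]]) -> int:
        """Sweep line over start/end events, O(n log n): answer = max overlap."""
        events = [(s, 1) for s, _ in intervals] + [(e, -1) for _, e in intervals]
        # Ends (-1) sort before starts (+1) at the same timestamp, so
        # back-to-back meetings like [5,10] and [10,15] can share a room.
        events.sort()
        rooms = best = 0
        for _, delta in events:
            rooms += delta
            best = max(best, rooms)
        return best

    assert min_rooms_heap([[0, 30], [5, 10], [15, 20]]) == 2
    assert min_rooms_sweep([[0, 30], [5, 10], [15, 20]]) == 2
    ```
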

LLMOps · Caching, routing, cost

  • Prompt caching: exact vs semantic

    Interview questions to prep

    1. Compare exact-match prompt caching vs semantic caching — when does each fit? (Both are sketched below.)
    2. How would you measure semantic-cache safety — what's the false-hit failure mode?
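
    A toy contrast for questions 1–2. Assumptions: `embed` is some embedding callable returning unit-normalised vectors, and the 0.95 threshold is illustrative. The semantic cache's false-hit mode is two different intents landing above the threshold.

    ```python
    import hashlib

    class ExactCache:
        """Exact-match cache: key = hash of the normalised prompt.
        Zero false hits, but only near-identical prompts ever hit."""
        def __init__(self):
            self._store = {}

        def _key(self, prompt: str) -> str:
            normalised = " ".join(prompt.lower().split())
            return hashlib.sha256(normalised.encode()).hexdigest()

        def get(self, prompt: str):
            return self._store.get(self._key(prompt))

        def put(self, prompt: str, response: str):
            self._store[self._key(prompt)] = response

    class SemanticCache:
        """Nearest-neighbour lookup over prompt embeddings. A looser threshold
        raises the hit rate and the false-hit risk together (question 2)."""
        def __init__(self, embed, threshold: float = 0.95):
            self.embed, self.threshold = embed, threshold
            self.entries = []                                  # (embedding, response) pairs

        def get(self, prompt: str):
            q = self.embed(prompt)
            best_sim, best = -1.0, None
            for vec, response in self.entries:
                sim = sum(a * b for a, b in zip(q, vec))       # cosine (unit vectors)
                if sim > best_sim:
                    best_sim, best = sim, response
            return best if best_sim >= self.threshold else None

        def put(self, prompt: str, response: str):
            self.entries.append((self.embed(prompt), response))
    ```
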
  • Cascade routing across models

    Interview questions to prep

    1. How would you route requests across GPT-5, Claude 4.5, and a small open-source model?
    2. Walk through how a verifier model gates the cheap-model output before falling back to the expensive one (see the sketch after this list).
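
    A compact cascade sketch for question 2; the three callables and the 0.8 acceptance threshold are illustrative assumptions, not a specific library's API:

    ```python
    def cascade(query: str, cheap_llm, verifier, expensive_llm,
                accept_threshold: float = 0.8) -> str:
        """Cheap model answers first; a verifier (e.g. a small judge model
        returning a 0..1 score) gates the draft, so only low-confidence
        queries pay for the expensive model."""
        draft = cheap_llm(query)
        if verifier(query, draft) >= accept_threshold:
            return draft                # accepted: the common, cheap path
        return expensive_llm(query)     # fallback: the hard tail only
    ```
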
  • Production harness & observability

    Interview questions to prep

    1. A demo works perfectly but production fails. What does a robust LLM harness include beyond the prompt? (A skeletal harness follows this list.)
    2. How would you make LLM behavior reproducible, debuggable, and regression-tested across prompt, tool, and model changes?
    3. Where do you place structured evals, tracing, guardrails, retries, and human feedback in a production assistant?
    4. What signals would tell you the issue is orchestration or observability rather than model quality?
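
    A skeletal harness for questions 1–3 (all names illustrative): trace ids, structured logs, output validation, and bounded retries are the minimum that turns "the demo worked" into something debuggable and regression-testable.

    ```python
    import json, time, uuid

    def harness_call(llm, prompt: str, validate, max_retries: int = 2, log=print):
        """Wrap a model call with tracing, validation, and retries.
        `validate` returns (ok, parsed_or_error), e.g. a JSON-schema or
        guardrail check on the raw output."""
        trace_id = uuid.uuid4().hex
        for attempt in range(max_retries + 1):
            t0 = time.monotonic()
            raw = llm(prompt)
            ok, result = validate(raw)
            log(json.dumps({"trace": trace_id, "attempt": attempt,
                            "latency_s": round(time.monotonic() - t0, 3),
                            "valid": ok}))
            if ok:
                return result           # structured evals replay these logs offline
        raise RuntimeError(f"validation failed after retries (trace {trace_id})")
    ```
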
  • vLLM serving & PagedAttention

    Interview questions to prep

    1. What does vLLM's PagedAttention do for throughput? (See the usage sketch after this list.)
    2. Compare vLLM vs TensorRT-LLM vs SGLang.
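
    A minimal offline-batch call for question 1, using vLLM's Python entry point (the model tag and sampling settings are placeholders):

    ```python
    from vllm import LLM, SamplingParams

    # PagedAttention is vLLM's default KV-cache manager: the cache lives in
    # fixed-size blocks, so many concurrent sequences batch together without
    # fragmenting GPU memory, which is where the throughput win comes from.
    llm = LLM(model="facebook/opt-125m")
    params = SamplingParams(temperature=0.8, max_tokens=64)

    for out in llm.generate(["Explain PagedAttention in one sentence."], params):
        print(out.outputs[0].text)
    ```
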
  • Local LLM deployment (Ollama)

    Interview questions to prep

    1. When would you run an LLM locally instead of calling a hosted API?
    2. What changes when serving a quantized local model through Ollama compared with a GPU-backed vLLM service? (See the Ollama call below.)
    3. How would you evaluate privacy, latency, context length, and update cadence for local LLM deployment?
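
    For question 2, a local call through Ollama's default HTTP endpoint (the model tag is whatever you have pulled locally; `"stream": False` returns one JSON body):

    ```python
    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",          # Ollama's default port
        json={"model": "llama3", "prompt": "Summarise PagedAttention.",
              "stream": False},
        timeout=120,
    )
    print(resp.json()["response"])   # prompt and output never leave the machine
    ```
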
  • Latency, throughput & bottlenecks

    Interview questions to prep

    1. How would you diagnose high first-token latency vs high tokens-per-second latency? (A profiling sketch follows this list.)
    2. How do rate limits, concurrency limits, queues, and retries interact in an LLM API gateway?
    3. What metrics tell you whether the bottleneck is prompt length, model compute, KV cache pressure, or downstream tools?
    4. Why do GenAI systems become slow even when the base model is fast?
    5. How would you budget latency across retrieval, filtering, tool calls, validation, retries, and generation?
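
    A profiling sketch for question 1: split end-to-end latency into time-to-first-token and decode throughput. Here `stream` is an assumed iterator that yields tokens from any streaming API.

    ```python
    import time

    def profile_stream(stream):
        t0 = time.monotonic()
        ttft, n = None, 0
        for _ in stream:                       # consume a streamed response
            n += 1
            if ttft is None:
                ttft = time.monotonic() - t0   # time to first token
        total = time.monotonic() - t0
        decode = total - (ttft or 0.0)
        tps = (n - 1) / decode if n > 1 and decode > 0 else float("nan")
        return {"ttft_s": ttft, "total_s": total,
                "tokens": n, "decode_tok_per_s": tps}
    ```

    High TTFT with healthy decode throughput points at queueing or prefill (prompt length); low decode throughput points at model compute or KV-cache pressure.
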
