Day 121 of 133

Multi-model routing & cascades + DSA review

Cheap-then-expensive cascades; verifier models; budget-aware routing.

DSA · NeetCode Backtracking

  • Word Search · DSA · Backtracking (DFS sketch after the questions)

    Interview questions to prep

    1. Walk through DFS with a 'visited' marker on the board (in-place marking vs an auxiliary visited set). Trade-offs?
    2. How does this scale to Word Search II with a trie of many target words?
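
    For question 1, a minimal Python sketch of the in-place variant: the current cell is overwritten with '#' before recursing and restored on backtrack, trading an O(rows·cols) auxiliary set for temporary board mutation.

    ```python
    def exist(board: list[list[str]], word: str) -> bool:
        """Word Search: DFS with an in-place visited marker."""
        rows, cols = len(board), len(board[0])

        def dfs(r: int, c: int, i: int) -> bool:
            if i == len(word):                      # matched every character
                return True
            if not (0 <= r < rows and 0 <= c < cols) or board[r][c] != word[i]:
                return False
            saved, board[r][c] = board[r][c], "#"   # mark visited in place (O(1) extra space)
            found = (dfs(r + 1, c, i + 1) or dfs(r - 1, c, i + 1) or
                     dfs(r, c + 1, i + 1) or dfs(r, c - 1, i + 1))
            board[r][c] = saved                     # backtrack: restore the cell
            return found

        return any(dfs(r, c, 0) for r in range(rows) for c in range(cols))
    ```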

ML System Design · Cross-cutting trade-offs

  • Interview questions to prep (caching sketch after the questions)

    1. What levers do you pull when accuracy is great but latency misses the budget?
    2. Walk through where you'd add caching in a RAG + LLM pipeline to halve P99.
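
    For question 2, one concrete cache placement is an exact-match response cache in front of the LLM call. `call_llm` below is a hypothetical stand-in for a real API client; a production system would typically back this with Redis plus a TTL, and add similar layers for retrieval and embedding calls.

    ```python
    import hashlib

    def call_llm(prompt: str) -> str:
        # Hypothetical stand-in for a real LLM API client call.
        return f"<response to {prompt!r}>"

    _cache: dict[str, str] = {}

    def cached_llm(prompt: str) -> str:
        # Exact-match cache keyed on a hash of the normalized prompt.
        key = hashlib.sha256(prompt.strip().lower().encode()).hexdigest()
        if key not in _cache:
            _cache[key] = call_llm(prompt)  # pay for true misses only
        return _cache[key]
    ```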
  • Multi-model routing & cascades · ML System Design · Anyscale (cascade sketch after the questions)

    Interview questions to prep

    1. How would you design a cascade: cheap model first, expensive only when needed?
    2. What's the right verifier for the cheap model's output — and when does it dominate cost?
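
    A minimal sketch of the cascade in question 1 and the verifier gate in question 2. `cheap_model`, `expensive_model`, and `verifier` are hypothetical stubs standing in for a small distilled model, a frontier model, and a confidence scorer in [0, 1].

    ```python
    def cheap_model(query: str) -> str:
        return f"<draft answer to {query!r}>"     # hypothetical small model

    def expensive_model(query: str) -> str:
        return f"<careful answer to {query!r}>"   # hypothetical frontier model

    def verifier(query: str, answer: str) -> float:
        return 0.9                                # hypothetical confidence in [0, 1]

    def cascade(query: str, threshold: float = 0.8) -> str:
        draft = cheap_model(query)                # always try the cheap model first
        if verifier(query, draft) >= threshold:   # verifier gates the cheap output
            return draft
        return expensive_model(query)             # escalate only on low confidence
    ```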
  • Cold start: new users & new items

    Interview questions to prep (bandit sketch after the questions)

    1. Walk through cold-start strategies for new users vs new items.
    2. Compare bandit-based exploration vs content-based bridges for cold start — when does each fit?
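
    An illustrative epsilon-greedy bandit for the exploration side of question 2; item names and reward handling are placeholders, not a production recommender.

    ```python
    import random
    from collections import defaultdict

    class EpsilonGreedy:
        """Epsilon-greedy exploration to gather feedback on cold items."""
        def __init__(self, items: list[str], epsilon: float = 0.1):
            self.items, self.epsilon = items, epsilon
            self.pulls: dict[str, int] = defaultdict(int)
            self.rewards: dict[str, float] = defaultdict(float)

        def pick(self) -> str:
            if random.random() < self.epsilon:    # explore: surface a random (possibly cold) item
                return random.choice(self.items)
            # exploit: highest observed mean reward; unseen items score 0
            return max(self.items, key=lambda i: self.rewards[i] / max(self.pulls[i], 1))

        def update(self, item: str, reward: float) -> None:
            self.pulls[item] += 1
            self.rewards[item] += reward
    ```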
  • Privacy-preserving ML: federated learning, DP, on-device inference

    Interview questions to prep (DP-SGD sketch after the questions)

    1. When would you reach for federated learning vs differential privacy vs on-device inference?
    2. What's the accuracy cost of DP-SGD at typical ε values, and how do you decide if it's acceptable?
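
    A minimal NumPy sketch of the DP-SGD aggregation step behind question 2: clip each per-example gradient to a fixed norm, then add Gaussian noise scaled by a noise multiplier. Function and parameter names here are illustrative.

    ```python
    import numpy as np

    def dp_sgd_step(per_example_grads: np.ndarray, clip_norm: float = 1.0,
                    noise_multiplier: float = 1.1) -> np.ndarray:
        """One DP-SGD aggregation step over a (n_examples, n_params) array.
        A larger noise_multiplier buys a smaller epsilon at an accuracy cost."""
        norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
        clipped = per_example_grads * np.minimum(1.0, clip_norm / (norms + 1e-12))
        summed = clipped.sum(axis=0)
        noise = np.random.normal(0.0, noise_multiplier * clip_norm, size=summed.shape)
        return (summed + noise) / len(per_example_grads)   # noisy mean gradient
    ```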

LLMOps · Caching, routing, cost

  • Prompt caching & semantic caching · ML System Design · Anthropic (semantic-cache sketch after the questions)

    Interview questions to prep

    1. Compare exact-match prompt caching vs semantic caching — when does each fit?
    2. How would you measure semantic-cache safety — what's the false-hit failure mode?
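
    A minimal semantic cache for question 1, with the false-hit failure mode of question 2 visible in the threshold: set it too low and a cached answer gets served for a prompt that is merely similar. `embed` is a toy stand-in; a real system would call an embedding model and use an ANN index.

    ```python
    import numpy as np

    def embed(text: str) -> np.ndarray:
        # Toy stand-in with no real semantics; swap in an embedding model.
        rng = np.random.default_rng(abs(hash(text)) % 2**32)
        return rng.standard_normal(64)

    class SemanticCache:
        """Cosine-similarity lookup over cached (prompt embedding, response) pairs."""
        def __init__(self, threshold: float = 0.95):
            self.threshold = threshold
            self.entries: list[tuple[np.ndarray, str]] = []

        def get(self, prompt: str) -> str | None:
            q = embed(prompt)
            q = q / np.linalg.norm(q)
            for vec, response in self.entries:
                if float(q @ vec) >= self.threshold:   # near-duplicate prompt: cache hit
                    return response
            return None

        def put(self, prompt: str, response: str) -> None:
            v = embed(prompt)
            self.entries.append((v / np.linalg.norm(v), response))
    ```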
  • Budget-aware routing across GPT-5, Claude 4.5, and open-source models

    Interview questions to prep (router sketch after the questions)

    1. How would you route requests across GPT-5, Claude 4.5, and a small open-source model?
    2. Walk through how a verifier model gates the cheap-model output before falling back to the expensive one.
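
    An illustrative budget-aware routing table for question 1. The prices and quality scores are made-up placeholders, and a real router would estimate task difficulty from the request, e.g. with a small classifier.

    ```python
    # (name, USD per 1M output tokens, rough quality in [0, 1]); numbers are illustrative.
    MODELS = [
        ("small-oss-model", 0.2, 0.60),
        ("claude-4.5", 15.0, 0.90),
        ("gpt-5", 20.0, 0.95),
    ]

    def route(difficulty: float, budget: float) -> str:
        """Cheapest model whose quality clears the task's estimated difficulty,
        subject to the per-request budget; else the best affordable model."""
        affordable = [m for m in MODELS if m[1] <= budget] or [min(MODELS, key=lambda m: m[1])]
        capable = [m for m in affordable if m[2] >= difficulty]
        chosen = min(capable, key=lambda m: m[1]) if capable else max(affordable, key=lambda m: m[2])
        return chosen[0]
    ```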
  • Serving engines: vLLM, TensorRT-LLM, SGLang

    Interview questions to prep (vLLM sketch after the questions)

    1. What does vLLM's PagedAttention do for throughput?
    2. Compare vLLM vs TensorRT-LLM vs SGLang.
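
    A short offline-inference sketch, assuming vLLM's Python API as commonly documented (verify against current docs): PagedAttention pages the KV cache, so reserved GPU memory turns into usable batch capacity instead of fragmented reservations.

    ```python
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",  # any HF model id
              gpu_memory_utilization=0.90)               # PagedAttention manages the KV cache in pages
    params = SamplingParams(temperature=0.7, max_tokens=128)
    outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
    print(outputs[0].outputs[0].text)
    ```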
  • Latency diagnosis & gateway operations

    Interview questions to prep (measurement sketch after the questions)

    1. How would you diagnose high first-token latency vs high tokens-per-second latency?
    2. How do rate limits, concurrency limits, queues, and retries interact in an LLM API gateway?
    3. What metrics tell you whether the bottleneck is prompt length, model compute, KV cache pressure, or downstream tools?
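
    A small measurement sketch for question 1: time-to-first-token (TTFT) isolates queueing plus prefill, which scales with prompt length, while per-token decode time reflects model compute and KV-cache pressure. `token_stream` is any iterator over a streaming response.

    ```python
    import time

    def measure_stream(token_stream) -> tuple[float, float]:
        """Return (TTFT, per-token decode time) for one streamed response."""
        start = time.perf_counter()
        first = None
        n = 0
        for _ in token_stream:
            n += 1
            if first is None:
                first = time.perf_counter()   # first token ends queueing + prefill
        end = time.perf_counter()
        ttft = (first or end) - start
        tpot = (end - (first or end)) / max(n - 1, 1)   # decode-side latency per token
        return ttft, tpot
    ```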

References & further reading