Day 121 of 133

Multi-model routing & cascades + DSA review

Cheap-then-expensive cascades; verifier models; budget-aware routing.

DSA · NeetCode Backtracking

  • Word Search · DSA · Backtracking (DFS sketch after the questions)

    Interview questions to prep

    1. Walk through DFS with a 'visited' marker on the board (in-place marking vs an auxiliary visited set). Trade-offs?
    2. How does this scale to Word Search II with a trie of many target words?
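
    For question 1, a minimal Python sketch of the in-place variant: the current cell is overwritten with '#' before recursing and restored on backtrack, trading an O(rows·cols) auxiliary set for temporary board mutation.

    ```python
    def exist(board: list[list[str]], word: str) -> bool:
        """Word Search: DFS with an in-place visited marker."""
        rows, cols = len(board), len(board[0])

        def dfs(r: int, c: int, i: int) -> bool:
            if i == len(word):                      # matched every character
                return True
            if not (0 <= r < rows and 0 <= c < cols) or board[r][c] != word[i]:
                return False
            saved, board[r][c] = board[r][c], "#"   # mark visited in place (O(1) extra space)
            found = (dfs(r + 1, c, i + 1) or dfs(r - 1, c, i + 1) or
                     dfs(r, c + 1, i + 1) or dfs(r, c - 1, i + 1))
            board[r][c] = saved                     # backtrack: restore the cell
            return found

        return any(dfs(r, c, 0) for r in range(rows) for c in range(cols))
    ```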

ML System Design · Cross-cutting trade-offs

  • Interview questions to prep (caching sketch after the questions)

    1. What levers do you pull when accuracy is great but latency misses the budget?
    2. Walk through where you'd add caching in a RAG + LLM pipeline to halve P99.
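
    For question 2, one concrete cache placement is an exact-match response cache in front of the LLM call. `call_llm` below is a hypothetical stand-in for a real API client; a production system would typically back this with Redis plus a TTL, and add similar layers for retrieval and embedding calls.

    ```python
    import hashlib

    def call_llm(prompt: str) -> str:
        # Hypothetical stand-in for a real LLM API client call.
        return f"<response to {prompt!r}>"

    _cache: dict[str, str] = {}

    def cached_llm(prompt: str) -> str:
        # Exact-match cache keyed on a hash of the normalized prompt.
        key = hashlib.sha256(prompt.strip().lower().encode()).hexdigest()
        if key not in _cache:
            _cache[key] = call_llm(prompt)  # pay for true misses only
        return _cache[key]
    ```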
  • Multi-model routing & cascades · ML System Design · Anyscale (cascade sketch after the questions)

    Interview questions to prep

    1. How would you design a cascade: cheap model first, expensive only when needed?
    2. What's the right verifier for the cheap model's output — and when does it dominate cost?
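
    A minimal sketch of the cascade in question 1 and the verifier gate in question 2. `cheap_model`, `expensive_model`, and `verifier` are hypothetical stubs standing in for a small distilled model, a frontier model, and a confidence scorer in [0, 1].

    ```python
    def cheap_model(query: str) -> str:
        return f"<draft answer to {query!r}>"     # hypothetical small model

    def expensive_model(query: str) -> str:
        return f"<careful answer to {query!r}>"   # hypothetical frontier model

    def verifier(query: str, answer: str) -> float:
        return 0.9                                # hypothetical confidence in [0, 1]

    def cascade(query: str, threshold: float = 0.8) -> str:
        draft = cheap_model(query)                # always try the cheap model first
        if verifier(query, draft) >= threshold:   # verifier gates the cheap output
            return draft
        return expensive_model(query)             # escalate only on low confidence
    ```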
  • Cold start: new users & new items

    Interview questions to prep (bandit sketch after the questions)

    1. Walk through cold-start strategies for new users vs new items.
    2. Compare bandit-based exploration vs content-based bridges for cold start — when does each fit?
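
    An illustrative epsilon-greedy bandit for the exploration side of question 2; item names and reward handling are placeholders, not a production recommender.

    ```python
    import random
    from collections import defaultdict

    class EpsilonGreedy:
        """Epsilon-greedy exploration to gather feedback on cold items."""
        def __init__(self, items: list[str], epsilon: float = 0.1):
            self.items, self.epsilon = items, epsilon
            self.pulls: dict[str, int] = defaultdict(int)
            self.rewards: dict[str, float] = defaultdict(float)

        def pick(self) -> str:
            if random.random() < self.epsilon:    # explore: surface a random (possibly cold) item
                return random.choice(self.items)
            # exploit: highest observed mean reward; unseen items score 0
            return max(self.items, key=lambda i: self.rewards[i] / max(self.pulls[i], 1))

        def update(self, item: str, reward: float) -> None:
            self.pulls[item] += 1
            self.rewards[item] += reward
    ```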
  • Privacy-preserving ML: federated learning, DP, on-device inference

    Interview questions to prep (DP-SGD sketch after the questions)

    1. When would you reach for federated learning vs differential privacy vs on-device inference?
    2. What's the accuracy cost of DP-SGD at typical ε values, and how do you decide if it's acceptable?
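
    A minimal NumPy sketch of the DP-SGD aggregation step behind question 2: clip each per-example gradient to a fixed norm, then add Gaussian noise scaled by a noise multiplier. Function and parameter names here are illustrative.

    ```python
    import numpy as np

    def dp_sgd_step(per_example_grads: np.ndarray, clip_norm: float = 1.0,
                    noise_multiplier: float = 1.1) -> np.ndarray:
        """One DP-SGD aggregation step over a (n_examples, n_params) array.
        A larger noise_multiplier buys a smaller epsilon at an accuracy cost."""
        norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
        clipped = per_example_grads * np.minimum(1.0, clip_norm / (norms + 1e-12))
        summed = clipped.sum(axis=0)
        noise = np.random.normal(0.0, noise_multiplier * clip_norm, size=summed.shape)
        return (summed + noise) / len(per_example_grads)   # noisy mean gradient
    ```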

LLMOps · Caching, routing, cost

  • Prompt caching & semantic caching · ML System Design · Anthropic (semantic-cache sketch after the questions)

    Interview questions to prep

    1. Compare exact-match prompt caching vs semantic caching — when does each fit?
    2. How would you measure semantic-cache safety — what's the false-hit failure mode?
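
    A minimal semantic cache for question 1, with the false-hit failure mode of question 2 visible in the threshold: set it too low and a cached answer gets served for a prompt that is merely similar. `embed` is a toy stand-in; a real system would call an embedding model and use an ANN index.

    ```python
    import numpy as np

    def embed(text: str) -> np.ndarray:
        # Toy stand-in with no real semantics; swap in an embedding model.
        rng = np.random.default_rng(abs(hash(text)) % 2**32)
        return rng.standard_normal(64)

    class SemanticCache:
        """Cosine-similarity lookup over cached (prompt embedding, response) pairs."""
        def __init__(self, threshold: float = 0.95):
            self.threshold = threshold
            self.entries: list[tuple[np.ndarray, str]] = []

        def get(self, prompt: str) -> str | None:
            q = embed(prompt)
            q = q / np.linalg.norm(q)
            for vec, response in self.entries:
                if float(q @ vec) >= self.threshold:   # near-duplicate prompt: cache hit
                    return response
            return None

        def put(self, prompt: str, response: str) -> None:
            v = embed(prompt)
            self.entries.append((v / np.linalg.norm(v), response))
    ```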
  • Budget-aware routing across GPT-5, Claude 4.5, and open-source models

    Interview questions to prep (router sketch after the questions)

    1. How would you route requests across GPT-5, Claude 4.5, and a small open-source model?
    2. Walk through how a verifier model gates the cheap-model output before falling back to the expensive one.
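
    An illustrative budget-aware routing table for question 1. The prices and quality scores are made-up placeholders, and a real router would estimate task difficulty from the request, e.g. with a small classifier.

    ```python
    # (name, USD per 1M output tokens, rough quality in [0, 1]); numbers are illustrative.
    MODELS = [
        ("small-oss-model", 0.2, 0.60),
        ("claude-4.5", 15.0, 0.90),
        ("gpt-5", 20.0, 0.95),
    ]

    def route(difficulty: float, budget: float) -> str:
        """Cheapest model whose quality clears the task's estimated difficulty,
        subject to the per-request budget; else the best affordable model."""
        affordable = [m for m in MODELS if m[1] <= budget] or [min(MODELS, key=lambda m: m[1])]
        capable = [m for m in affordable if m[2] >= difficulty]
        chosen = min(capable, key=lambda m: m[1]) if capable else max(affordable, key=lambda m: m[2])
        return chosen[0]
    ```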
  • Serving engines: vLLM, TensorRT-LLM, SGLang

    Interview questions to prep (vLLM sketch after the questions)

    1. What does vLLM's PagedAttention do for throughput?
    2. Compare vLLM vs TensorRT-LLM vs SGLang.
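
    A short offline-inference sketch, assuming vLLM's Python API as commonly documented (verify against current docs): PagedAttention pages the KV cache, so reserved GPU memory turns into usable batch capacity instead of fragmented reservations.

    ```python
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",  # any HF model id
              gpu_memory_utilization=0.90)               # PagedAttention manages the KV cache in pages
    params = SamplingParams(temperature=0.7, max_tokens=128)
    outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
    print(outputs[0].outputs[0].text)
    ```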
  • Latency diagnosis & gateway operations

    Interview questions to prep (measurement sketch after the questions)

    1. How would you diagnose high first-token latency vs high tokens-per-second latency?
    2. How do rate limits, concurrency limits, queues, and retries interact in an LLM API gateway?
    3. What metrics tell you whether the bottleneck is prompt length, model compute, KV cache pressure, or downstream tools?
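
    A small measurement sketch for question 1: time-to-first-token (TTFT) isolates queueing plus prefill, which scales with prompt length, while per-token decode time reflects model compute and KV-cache pressure. `token_stream` is any iterator over a streaming response.

    ```python
    import time

    def measure_stream(token_stream) -> tuple[float, float]:
        """Return (TTFT, per-token decode time) for one streamed response."""
        start = time.perf_counter()
        first = None
        n = 0
        for _ in token_stream:
            n += 1
            if first is None:
                first = time.perf_counter()   # first token ends queueing + prefill
        end = time.perf_counter()
        ttft = (first or end) - start
        tpot = (end - (first or end)) / max(n - 1, 1)   # decode-side latency per token
        return ttft, tpot
    ```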

References & further reading