Day 121 of 133
Multi-model routing & cascades + DSA review
Cheap-then-expensive; verifier models; budget-aware routing.
DSA · NeetCode Backtracking
- Word Search
Interview questions to prep
- Walk through DFS with a 'visited' marker on the board (in-place mutation vs an auxiliary set). What are the trade-offs?
- How does this scale to Word Search II, where a trie holds many target words?
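A minimal sketch of the in-place visited-marker approach for Word Search: mark the cell before recursing, restore it on unwind, so no auxiliary set is needed:

```python
def exist(board, word):
    """Backtracking DFS: mark visited cells in place with '#', restore on unwind."""
    rows, cols = len(board), len(board[0])

    def dfs(r, c, i):
        if i == len(word):
            return True
        if r < 0 or r >= rows or c < 0 or c >= cols or board[r][c] != word[i]:
            return False
        saved, board[r][c] = board[r][c], "#"   # in-place visited marker
        found = any(dfs(r + dr, c + dc, i + 1)
                    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)))
        board[r][c] = saved                      # restore for sibling paths
        return found

    return any(dfs(r, c, 0) for r in range(rows) for c in range(cols))
```

The in-place marker saves O(rows × cols) extra memory but mutates the input during the search; an auxiliary visited set keeps the board read-only at the cost of that memory.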
ML System Design · Cross-cutting trade-offs
Interview questions to prep
- What levers do you pull when accuracy is great but latency misses the budget?
- Walk through where you'd add caching in a RAG + LLM pipeline to halve P99.
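One of the cheapest P99 levers is a response cache in front of the generation step. A minimal sketch of an exact-match prompt cache, assuming a hypothetical `call_llm` callable supplied by the pipeline:

```python
import hashlib

_cache = {}

def cached_generate(prompt, call_llm, normalize=str.strip):
    """Exact-match response cache: hash the normalized prompt, skip the model on a hit.

    Returns (response, cache_hit). `call_llm` is a stand-in for the real
    generation call; `normalize` controls how aggressively prompts collide.
    """
    key = hashlib.sha256(normalize(prompt).encode()).hexdigest()
    if key in _cache:
        return _cache[key], True
    response = call_llm(prompt)
    _cache[key] = response
    return response, False
```

Hits skip the model entirely, which is why even a modest hit rate on head queries cuts tail latency; retrieval results can be cached the same way, keyed on the normalized query.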
Interview questions to prep
- How would you design a cascade: cheap model first, expensive only when needed?
- What's the right verifier for the cheap model's output — and when does it dominate cost?
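The control flow of a verifier-gated cascade fits in a few lines. A sketch, with `cheap`, `verify`, and `expensive` as hypothetical callables you would bind to real models:

```python
def cascade(prompt, cheap, verify, expensive):
    """Cheap-first cascade: accept the cheap draft only if the verifier passes it,
    otherwise escalate to the expensive model. Returns (answer, tier_used)."""
    draft = cheap(prompt)
    if verify(prompt, draft):
        return draft, "cheap"
    return expensive(prompt), "expensive"
```

The economics hinge on the verifier: if its cost approaches the expensive model's, or its false-accept rate is high, the cascade stops paying for itself.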
Interview questions to prep
- Walk through cold-start strategies for new users vs new items.
- Compare bandit-based exploration vs content-based bridges for cold start — when does each fit?
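For the bandit side of that comparison, a minimal epsilon-greedy sketch (names and the `stats` shape are assumptions for illustration): unpulled arms are forced first, which is exactly the cold-start exploration behavior.

```python
import random

def epsilon_greedy(stats, epsilon=0.1, rng=random):
    """Pick an arm: explore uniformly with prob. epsilon, else exploit the best mean.

    `stats` maps arm -> (total_reward, pulls); arms with zero pulls are tried first,
    which is how the bandit handles cold-start items.
    """
    unpulled = [arm for arm, (_, pulls) in stats.items() if pulls == 0]
    if unpulled:
        return unpulled[0]
    if rng.random() < epsilon:
        return rng.choice(list(stats))
    return max(stats, key=lambda arm: stats[arm][0] / stats[arm][1])
```

Content-based bridges instead score a new item from its features, so they need no exploration traffic but inherit the feature model's biases.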
Interview questions to prep
- When would you reach for federated learning vs differential privacy vs on-device inference?
- What's the accuracy cost of DP-SGD at typical ε values, and how do you decide if it's acceptable?
LLMOps · Caching, routing, cost
Interview questions to prep
- Compare exact-match prompt caching vs semantic caching — when does each fit?
- How would you measure semantic-cache safety — what's the false-hit failure mode?
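The false-hit failure mode is easy to demonstrate with toy vectors: a near-but-different query clears a loose similarity threshold and gets someone else's cached answer. A sketch, assuming embeddings are provided externally:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def semantic_lookup(query_vec, cache, threshold):
    """Return the cached answer for the nearest stored query vector if its
    cosine similarity clears `threshold`; otherwise None (cache miss).
    `cache` is a list of (vector, answer) pairs."""
    best = max(cache, key=lambda entry: cosine(query_vec, entry[0]), default=None)
    if best is not None and cosine(query_vec, best[0]) >= threshold:
        return best[1]
    return None
```

Measuring safety then means sweeping the threshold over labeled query pairs and tracking the false-hit rate (semantically different queries that still match) against the hit rate you give up.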
Interview questions to prep
- How would you route requests across GPT-5, Claude 4.5, and a small open-source model?
- Walk through how a verifier model gates the cheap-model output before falling back to the expensive one.
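One routing policy to have in hand: estimate request difficulty, then take the cheapest tier that can handle it within budget. A sketch with hypothetical tier names and a `classify_difficulty` callable you would back with a small classifier:

```python
def route(prompt, classify_difficulty, budget_cents, tiers):
    """Budget-aware router: pick the cheapest tier whose capability covers the
    estimated difficulty and whose price fits the budget.

    `tiers` is a list of (name, capability, cost_cents) sorted by cost ascending;
    if nothing qualifies, fall back to the cheapest tier rather than fail.
    """
    difficulty = classify_difficulty(prompt)
    for name, capability, cost in tiers:
        if capability >= difficulty and cost <= budget_cents:
            return name
    return tiers[0][0]
```

The verifier-gated cascade is the dynamic complement: route optimistically cheap, then let the verifier trigger escalation per request.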
Interview questions to prep
- What does vLLM's PagedAttention do for throughput?
- Compare vLLM vs TensorRT-LLM vs SGLang.
Interview questions to prep
- How would you diagnose high first-token latency vs high tokens-per-second latency?
- How do rate limits, concurrency limits, queues, and retries interact in an LLM API gateway?
- What metrics tell you whether the bottleneck is prompt length, model compute, KV cache pressure, or downstream tools?
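A useful habit when diagnosing: split every streamed response into time-to-first-token (prefill plus queueing) and decode tokens-per-second (compute plus KV-cache pressure), since the two bottlenecks have different fixes. A minimal sketch over request timestamps:

```python
def latency_breakdown(request_start, token_timestamps):
    """Split a streamed response into (TTFT, decode tokens/sec).

    High TTFT points at queueing, long prompts, or prefill; low tokens/sec
    points at decode compute or KV-cache pressure.
    """
    ttft = token_timestamps[0] - request_start
    decode_time = token_timestamps[-1] - token_timestamps[0]
    n_decoded = len(token_timestamps) - 1
    tps = n_decoded / decode_time if decode_time > 0 else float("inf")
    return ttft, tps
```

Aggregating these two numbers per route (plus prompt length and queue depth) usually answers the "which bottleneck" question before any profiling.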
References & further reading
- vLLM docs
- Eugene Yan — applied ML writing