Day 99 of 133
Speech: ASR (Whisper) + TTS + streaming
Encoder-decoder vs CTC; Tacotron / FastSpeech; streaming latency.
DSA · NeetCode Trees
- Binary Tree Level Order Traversal (DSA · Trees)
Interview questions to prep
- Walk through BFS with queue. How do you cleanly separate one level from the next?
- Can you do this DFS-recursively while still grouping by level?
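Both questions above can be worked through with a short sketch. The `TreeNode` class and the function names here are illustrative assumptions, not taken from a specific solution. The BFS version separates levels by snapshotting the queue length before draining it. The DFS version passes the depth down the recursion and uses it as an index into the result list.

```python
from collections import deque
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TreeNode:
    # Hypothetical minimal node type for illustration.
    val: int
    left: Optional["TreeNode"] = None
    right: Optional["TreeNode"] = None


def level_order_bfs(root: Optional[TreeNode]) -> List[List[int]]:
    if root is None:
        return []
    levels: List[List[int]] = []
    queue = deque([root])
    while queue:
        level = []
        # len(queue) is evaluated once here, so the loop drains exactly
        # the nodes of the current level; children queued during the
        # loop wait for the next pass.
        for _ in range(len(queue)):
            node = queue.popleft()
            level.append(node.val)
            if node.left:
                queue.append(node.left)
            if node.right:
                queue.append(node.right)
        levels.append(level)
    return levels


def level_order_dfs(root: Optional[TreeNode]) -> List[List[int]]:
    levels: List[List[int]] = []

    def visit(node: Optional[TreeNode], depth: int) -> None:
        if node is None:
            return
        # First visit to this depth: open a new bucket for it.
        if depth == len(levels):
            levels.append([])
        levels[depth].append(node.val)
        # Pre-order, left before right, keeps each level left-to-right.
        visit(node.left, depth + 1)
        visit(node.right, depth + 1)

    visit(root, 0)
    return levels
```

Both run in O(n) time. BFS holds at most one level's worth of nodes in the queue, while DFS uses stack space proportional to the tree height.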
Specialization · Speech (ASR, TTS)
Interview questions to prep
- Walk through Whisper's encoder-decoder design and weakly-supervised training data.
- Compare CTC vs attention-based seq2seq for ASR.
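For the CTC side of the comparison, it can help to see the collapse rule in code. A CTC model emits one label per audio frame, including a special blank. Decoding merges consecutive repeats and then drops blanks. An attention-based decoder instead generates tokens one at a time, each conditioned on the previous ones. The following is a minimal greedy-decoding sketch; the `_` blank symbol and the function name are assumptions chosen for illustration, and the input stands in for the per-frame argmax of a real model.

```python
from itertools import groupby
from typing import Iterable

BLANK = "_"  # assumed blank symbol, for illustration only


def ctc_greedy_decode(frame_labels: Iterable[str], blank: str = BLANK) -> str:
    # Step 1: merge runs of identical consecutive labels.
    collapsed = [label for label, _run in groupby(frame_labels)]
    # Step 2: remove blanks. A blank between two identical labels is
    # what lets a genuinely doubled letter survive step 1.
    return "".join(label for label in collapsed if label != blank)


print(ctc_greedy_decode("hh_e_ll_llo"))  # hello
print(ctc_greedy_decode("hhelllo"))      # helo (no blank between the l's)
```

Because each frame's label is predicted without conditioning on earlier outputs, CTC decoding is monotonic and suits streaming, but it often leans on an external language model. Attention-based decoding models output dependencies directly, at the cost of autoregressive decoding.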
Interview questions to prep
- Compare two-stage (text → mel → wav) vs end-to-end TTS.
- Why does prosody / expressive control remain hard in TTS, and how do modern systems handle it?
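One concrete handle on prosody in non-autoregressive two-stage systems is explicit duration. FastSpeech-style models predict a frame count per phoneme and expand the phoneme sequence to mel-frame length with a length regulator. The sketch below shows only that expansion step, using strings as stand-ins for hidden-state vectors. The function name and the `alpha` parameter name are assumptions for illustration, and the durations would come from a learned duration predictor in a real system.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def length_regulate(
    phoneme_states: Sequence[T],
    durations: Sequence[float],
    alpha: float = 1.0,
) -> List[T]:
    """Repeat each phoneme state for its (scaled) number of mel frames."""
    if len(phoneme_states) != len(durations):
        raise ValueError("need exactly one duration per phoneme state")
    frames: List[T] = []
    for state, duration in zip(phoneme_states, durations):
        # alpha > 1 lengthens every phoneme (slower speech),
        # alpha < 1 shortens it (faster speech).
        n_frames = max(0, round(duration * alpha))
        frames.extend([state] * n_frames)
    return frames


print(length_regulate(["h", "i"], [2, 3]))  # ['h', 'h', 'i', 'i', 'i']
```

Scaling durations gives coarse control over speaking rate. Pitch, energy, and emphasis need additional predictors or conditioning, which is part of why expressive control remains hard.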
Interview questions to prep
- What's the trade-off between latency and accuracy in streaming ASR?
- How would you design a chunked / look-ahead streaming ASR to keep latency under 300ms?
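The 300 ms design question is largely a budgeting exercise. A rough worst-case figure is chunk duration plus look-ahead plus per-chunk compute, because the first frame of a chunk must wait for the rest of the chunk and the look-ahead context to arrive before inference can start. The sketch below is a simplified model under those assumptions. `StreamConfig`, `iter_chunks`, and the specific millisecond values are hypothetical, and network and endpointing delays are ignored.

```python
from dataclasses import dataclass
from typing import Iterator, Sequence, Tuple


@dataclass(frozen=True)
class StreamConfig:
    chunk_ms: int      # audio emitted per decoding step
    lookahead_ms: int  # future context the encoder may see
    compute_ms: int    # assumed inference time per chunk

    def worst_case_latency_ms(self) -> int:
        # The first frame of a chunk waits for the chunk to fill,
        # then for the look-ahead, then for the model to run.
        return self.chunk_ms + self.lookahead_ms + self.compute_ms


def iter_chunks(
    samples: Sequence[float], sample_rate: int, cfg: StreamConfig
) -> Iterator[Tuple[Sequence[float], Sequence[float]]]:
    """Yield (chunk, lookahead) pairs; look-ahead overlaps the next chunk."""
    chunk_len = sample_rate * cfg.chunk_ms // 1000
    look_len = sample_rate * cfg.lookahead_ms // 1000
    for start in range(0, len(samples), chunk_len):
        end = start + chunk_len
        yield samples[start:end], samples[end:end + look_len]


cfg = StreamConfig(chunk_ms=160, lookahead_ms=80, compute_ms=40)
print(cfg.worst_case_latency_ms())  # 280, inside a 300 ms budget

audio = [0.0] * 16000  # one second of silence at 16 kHz
for chunk, lookahead in iter_chunks(audio, 16000, cfg):
    pass  # a real system would run the encoder on chunk + lookahead here
```

Shrinking the chunk or the look-ahead lowers latency but gives the encoder less context per step, which is the accuracy side of the trade-off.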
References & further reading
- Whisper paper (ASR), OpenAI
- Papers with Code: SOTA leaderboards