Day 99 of 133

Speech: ASR (Whisper) + TTS + streaming

Encoder-decoder vs CTC; Tacotron / FastSpeech; streaming latency.

DSA · NeetCode Trees

  • Interview questions to prep

    1. Walk through level-order BFS with a queue. How do you cleanly separate one level from the next?
    2. Can you do the same traversal with recursive DFS while still grouping nodes by level?
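
A minimal sketch of both answers, assuming a LeetCode-style binary tree. The `TreeNode` class and the function names are illustrative, not taken from any particular solution. In the BFS version, the queue length is read once at the start of each pass, so exactly one level is drained before the next begins. In the DFS version, the recursion depth serves as the index into the result list.

```python
from collections import deque
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TreeNode:
    # Hypothetical node type, shaped like the usual interview definition.
    val: int
    left: Optional["TreeNode"] = None
    right: Optional["TreeNode"] = None


def level_order_bfs(root: Optional[TreeNode]) -> List[List[int]]:
    levels: List[List[int]] = []
    queue = deque([root] if root else [])
    while queue:
        level = []
        # len(queue) is evaluated once here, so this loop pops only the
        # nodes that belong to the current level.
        for _ in range(len(queue)):
            node = queue.popleft()
            level.append(node.val)
            if node.left:
                queue.append(node.left)
            if node.right:
                queue.append(node.right)
        levels.append(level)
    return levels


def level_order_dfs(root: Optional[TreeNode]) -> List[List[int]]:
    levels: List[List[int]] = []

    def visit(node: Optional[TreeNode], depth: int) -> None:
        if node is None:
            return
        # First time we reach this depth: open a new bucket for it.
        if depth == len(levels):
            levels.append([])
        levels[depth].append(node.val)
        # Left before right keeps each level in left-to-right order.
        visit(node.left, depth + 1)
        visit(node.right, depth + 1)

    visit(root, 0)
    return levels
```

Both versions visit each node once. The BFS queue holds at most one level's worth of nodes, while the DFS call stack grows with the tree's height, which is worth mentioning when asked about space on skewed trees.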

Specialization · Speech (ASR, TTS)

  • ASR: from HMMs to Whisper · Deep Learning · OpenAI Whisper

    Interview questions to prep

    1. Walk through Whisper's encoder-decoder design and weakly-supervised training data.
    2. Compare CTC vs attention-based seq2seq for ASR.
  • Interview questions to prep

    1. Compare two-stage (text → mel → wav) vs end-to-end TTS.
    2. Why does prosody / expressive control remain hard in TTS, and how do modern systems handle it?
  • Interview questions to prep

    1. What's the trade-off between latency and accuracy in streaming ASR?
    2. How would you design a chunked / look-ahead streaming ASR to keep latency under 300ms?
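
For the 300 ms design question, a back-of-the-envelope latency budget can help structure the answer. The sketch below uses a deliberately simplified model, and every name in it (`StreamingConfig`, `rtf`, `network_ms`) is an assumption made for illustration. The worst-case delay for a frame at the start of a chunk is taken to be the chunk duration, plus the look-ahead the encoder waits for, plus compute time (modeled as a real-time factor times the chunk duration), plus any network delay. Real systems add further terms such as endpointing and decoder search.

```python
from dataclasses import dataclass


@dataclass
class StreamingConfig:
    chunk_ms: float          # new audio consumed per decoding step
    lookahead_ms: float      # future context the encoder waits for
    rtf: float               # assumed real-time factor: compute time / audio duration
    network_ms: float = 0.0  # assumed transport delay, if any


def worst_case_latency_ms(cfg: StreamingConfig) -> float:
    # A frame at the start of a chunk waits for the rest of the chunk,
    # then the look-ahead window, then the model's compute for that chunk.
    compute_ms = cfg.rtf * cfg.chunk_ms
    return cfg.chunk_ms + cfg.lookahead_ms + compute_ms + cfg.network_ms


def max_chunk_ms(budget_ms: float, lookahead_ms: float, rtf: float,
                 network_ms: float = 0.0) -> float:
    # Solve chunk * (1 + rtf) + lookahead + network <= budget for chunk.
    return max(0.0, (budget_ms - lookahead_ms - network_ms) / (1.0 + rtf))


if __name__ == "__main__":
    cfg = StreamingConfig(chunk_ms=160, lookahead_ms=80, rtf=0.2)
    print(f"worst case: {worst_case_latency_ms(cfg):.0f} ms")
    print(f"largest chunk under 300 ms: {max_chunk_ms(300, 80, 0.2):.0f} ms")
```

Under these assumptions, the trade-off in the first question becomes visible: shrinking the chunk or the look-ahead lowers latency but gives the encoder less context per step, which typically costs accuracy.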

References & further reading