Day 72 of 133

Tokenization deep dive (BPE step-by-step) + DSA Greedy

Why tokenization breaks math; vocab size trade-offs.

DSA · NeetCode Greedy

  • Maximum Subarray

    Interview questions to prep

    1. Walk through Kadane's algorithm. State the loop invariant.
    2. How does maximum-product-subarray differ — what extra state do you carry?
  • Jump Game

    Interview questions to prep

    1. Why does greedy (track max-reach) work? Where would DP be unnecessary?
    2. How does jump-game-ii (min jumps) differ in approach?
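
The Kadane's questions above can be answered with a short sketch; the loop invariant asked for in question 1 is stated in the comments (max-sum variant only — the product variant would additionally carry the minimum product ending at `i`, since a negative factor can flip it into the maximum):

```python
def max_subarray(nums):
    # Kadane's algorithm.
    # Invariant: after processing nums[i],
    #   cur  = maximum sum of a subarray ending exactly at index i
    #   best = maximum sum over all subarrays of nums[:i+1]
    cur = best = nums[0]
    for x in nums[1:]:
        cur = max(x, cur + x)    # extend the running subarray, or restart at x
        best = max(best, cur)
    return best
```

Note the restart rule `max(x, cur + x)`: a negative running sum can never help a subarray that extends past it, which is exactly why the greedy choice is safe.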

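For Jump Game, the greedy max-reach idea can be sketched as follows; DP is unnecessary because reachability is monotone (if index `i` is reachable, so is every index up to the farthest point any reachable index can jump to), so one running maximum suffices:

```python
def can_jump(nums):
    # Greedy: track the farthest index reachable so far.
    reach = 0
    for i, step in enumerate(nums):
        if i > reach:               # index i can never be reached
            return False
        reach = max(reach, i + step)
    return True
```

Jump Game II changes the question from reachability to a minimum count, so the greedy instead advances in "levels" (the window reachable with k jumps) rather than tracking a single max-reach flag.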
GenAI · Tokenization deep-dive

  • BPE training algorithm

    Interview questions to prep

    1. Walk through the BPE training algorithm.
    2. Why does BPE result in different tokenizations for similar words across languages?
  • Vocabulary size trade-offs

    Interview questions to prep

    1. Why is vocabulary size a critical design choice — what does increasing it cost?
    2. How does vocab size affect throughput and memory of the embedding + LM-head layers?
  • Tokenization failure modes

    Interview questions to prep

    1. Why do LLMs struggle with arithmetic, and how does tokenization contribute?
    2. Why are non-Latin-script languages disproportionately expensive to serve, and how do you fix it?
    3. Before fine-tuning on domain documents, how would you decide whether domain-specific terms need tokenizer vocabulary changes?
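
The BPE training walk-through can be sketched as a toy loop (a minimal sketch, assuming word-level pre-tokenization with frequencies and a `</w>` end-of-word marker; real tokenizers work on bytes and cache pair counts incrementally): count all adjacent symbol pairs, merge the most frequent pair into one new symbol, repeat.

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    # Toy BPE trainer: words are tuples of symbols, starting as characters.
    vocab = Counter(tuple(word) + ("</w>",) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # 1. Count all adjacent pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # 2. Pick the most frequent pair and record the merge rule.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # 3. Apply the merge to every word in the vocabulary.
        merged = {}
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges

# Tiny illustrative corpus (assumed, not from any real dataset):
merges = bpe_train(["low", "low", "lower", "newest", "newest", "newest"], num_merges=3)
```

Because merges are ranked purely by corpus frequency, languages or scripts that are rare in the training corpus accumulate few merges and fragment into many short tokens — which is the mechanism behind question 2.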

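One way to quantify the vocabulary-size cost asked about above: with model dimension `d` and vocab size `V`, the embedding table alone is `V × d` parameters, and an untied LM head adds another `d × V`; the per-token output softmax is also O(V). A back-of-envelope sketch (the example sizes are assumptions for illustration, not any specific model):

```python
def vocab_param_cost(vocab_size, d_model, tied=False):
    # Parameters in the token embedding (V x d) plus, when the LM head
    # is untied, the output projection (d x V).
    emb = vocab_size * d_model
    return emb if tied else 2 * emb

# Hypothetical comparison: 32k vs 128k vocab at d_model = 4096.
small = vocab_param_cost(32_000, 4096)    # untied: 2 * 32k * 4096
large = vocab_param_cost(128_000, 4096)   # untied: 2 * 128k * 4096
```

The trade-off in question 1 follows directly: a larger vocabulary shortens sequences (fewer tokens per text, better throughput per character) but grows embedding/LM-head memory and softmax compute linearly in `V`, and spreads training signal thinner across rare tokens.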
References & further reading