Day 72 of 133
Tokenization deep dive (BPE step-by-step) + DSA Greedy
Why tokenization breaks math; vocab size trade-offs.
DSA · NeetCode Greedy
- Maximum Subarray (DSA · Greedy)
Interview questions to prep
- Walk through Kadane's algorithm. State the loop invariant.
- How does maximum-product-subarray differ — what extra state do you carry?
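The two questions above can be sketched together. A minimal sketch of Kadane's algorithm with its loop invariant stated, plus the maximum-product variant showing the extra state (a running minimum) it carries:

```python
def max_subarray(nums):
    """Kadane's algorithm: O(n) time, O(1) space.

    Loop invariant: after processing nums[i], `best_ending_here` is the
    maximum sum of a subarray ending exactly at index i, and `best` is
    the maximum over all subarrays seen so far.
    """
    best = best_ending_here = nums[0]
    for x in nums[1:]:
        # Either extend the previous subarray or start fresh at x.
        best_ending_here = max(x, best_ending_here + x)
        best = max(best, best_ending_here)
    return best


def max_product_subarray(nums):
    """Variant: carry both the max AND min product ending here, because
    multiplying by a negative flips the smallest product into the largest."""
    best = hi = lo = nums[0]
    for x in nums[1:]:
        candidates = (x, hi * x, lo * x)
        hi, lo = max(candidates), min(candidates)
        best = max(best, hi)
    return best
```

The product variant is the answer to "what extra state do you carry": the running minimum, which the sum version never needs.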
- Jump Game (DSA · Greedy)
Interview questions to prep
- Why does greedy (track max-reach) work? Where would DP be unnecessary?
- How does Jump Game II (minimum jumps) differ in approach?
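A sketch of both greedy solutions. Tracking the farthest reachable index is a complete summary of every prefix, which is why DP is unnecessary here; the min-jumps variant layers a "level boundary" on top:

```python
def can_jump(nums):
    """Greedy: track the farthest index reachable so far.
    If the loop ever stands past that reach, no sequence of jumps
    could have gotten here, so the answer is False."""
    reach = 0
    for i, step in enumerate(nums):
        if i > reach:
            return False
        reach = max(reach, i + step)
    return True


def min_jumps(nums):
    """Jump Game II: implicit BFS over jump 'levels'. `end` is the edge
    of the range the current jump count covers; when i reaches it,
    commit one more jump out to the farthest point seen so far."""
    jumps = end = farthest = 0
    for i in range(len(nums) - 1):
        farthest = max(farthest, i + nums[i])
        if i == end:
            jumps += 1
            end = farthest
    return jumps
```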
GenAI · Tokenization deep-dive
Interview questions to prep
- Walk through the BPE training algorithm.
- Why does BPE result in different tokenizations for similar words across languages?
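To answer the first question concretely, here is a toy BPE trainer (whitespace pre-tokenization assumed, no special tokens): count adjacent symbol pairs weighted by word frequency, merge the most frequent pair, repeat. The second question falls out of the same loop: merges are driven by corpus frequency, so the same morpheme merges differently across languages with different statistics.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Toy BPE training loop:
    1. Split each word into characters.
    2. Count adjacent symbol pairs, weighted by word frequency.
    3. Merge the most frequent pair everywhere; record the merge rule.
    4. Repeat until the merge budget is spent or no pairs remain.
    """
    word_freqs = Counter(corpus.split())
    # Each word is a tuple of symbols, initially single characters.
    words = {w: tuple(w) for w in word_freqs}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w, freq in word_freqs.items():
            syms = words[w]
            for a, b in zip(syms, syms[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        for w, syms in words.items():
            out, i = [], 0
            while i < len(syms):
                if i + 1 < len(syms) and (syms[i], syms[i + 1]) == best:
                    out.append(syms[i] + syms[i + 1])
                    i += 2
                else:
                    out.append(syms[i])
                    i += 1
            words[w] = tuple(out)
    return merges
```

On `"low low low lower lowest"` with 2 merges, `l+o` wins first (frequency 5), then `lo+w`: the learned merges follow corpus statistics, not morphology.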
Interview questions to prep
- Why is vocabulary size a critical design choice — what does increasing it cost?
- How does vocab size affect throughput and memory of the embedding + LM-head layers?
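The cost side of the vocab-size trade-off is easy to quantify: the embedding table and (if untied) the LM head are each `vocab_size × d_model` parameters, and the output softmax costs O(vocab_size × d_model) FLOPs per generated token. A back-of-envelope sketch (the model dimensions below are illustrative, not from any specific LLM):

```python
def embedding_params(vocab_size, d_model, tied=True):
    """Parameter count of the token embedding table, doubled when the
    LM head is a separate (untied) vocab_size x d_model matrix."""
    table = vocab_size * d_model
    return table if tied else 2 * table

# Hypothetical example: growing the vocab from 32k to 128k at
# d_model = 4096 adds 96_000 * 4096 ~= 393M parameters per matrix.
# The benefit bought with those parameters is lower fertility
# (fewer tokens per word), i.e. shorter sequences.
growth = embedding_params(128_000, 4096) - embedding_params(32_000, 4096)
```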
Interview questions to prep
- Why do LLMs struggle with arithmetic, and how does tokenization contribute?
- Why are non-Latin-script languages disproportionately expensive to serve, and how do you fix it?
- Before fine-tuning on domain documents, how would you decide whether domain-specific terms need tokenizer vocabulary changes?
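One concrete mechanism behind the non-Latin-script question: byte-level tokenizers fall back to one token per UTF-8 byte for scripts with few learned merges, and UTF-8 itself is uneven across scripts. A quick check of that floor:

```python
def utf8_bytes_per_char(text):
    """Worst-case fertility for a byte-level tokenizer with no merges
    covering a script: one token per UTF-8 byte."""
    return len(text.encode("utf-8")) / len(text)

# ASCII letters are 1 byte each in UTF-8; Devanagari (and CJK)
# characters are 3 bytes each, so an under-merged script can cost
# 3x+ the tokens, and hence 3x+ the serving cost, per character.
english = utf8_bytes_per_char("hello")   # 1.0
hindi = utf8_bytes_per_char("नमस्ते")      # 3.0
```

The fixes the question points at all attack this ratio: a more multilingually balanced training corpus for the tokenizer, a larger vocab with script-specific merges, or (for the fine-tuning question) adding domain/script tokens and training their embeddings.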
References & further reading
- Hugging Face — LLM Course
- Maxime Labonne — LLM Course (GitHub)