Day 72 of 133
Tokenization deep dive (BPE step-by-step) + DSA Greedy
Why tokenization breaks math; vocab size trade-offs.
DSA · NeetCode Greedy
- Maximum Subarray (DSA · Greedy)
Interview questions to prep
- Walk through Kadane's algorithm. State the loop invariant.
- How does maximum-product-subarray differ — what extra state do you carry?
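The two questions above can be sketched together. A minimal sketch of Kadane's algorithm with its loop invariant stated, plus the maximum-product variant showing the extra state (a running minimum) it carries:

```python
def max_subarray(nums):
    """Kadane's algorithm: O(n) time, O(1) space.

    Loop invariant: after processing nums[i], `best_ending_here` is the
    maximum sum of a subarray ending exactly at index i, and `best` is
    the maximum over all subarrays seen so far.
    """
    best = best_ending_here = nums[0]
    for x in nums[1:]:
        # Either extend the previous subarray or start fresh at x.
        best_ending_here = max(x, best_ending_here + x)
        best = max(best, best_ending_here)
    return best


def max_product_subarray(nums):
    """Variant: carry both the max AND min product ending here, because
    multiplying by a negative flips the smallest product into the largest."""
    best = hi = lo = nums[0]
    for x in nums[1:]:
        candidates = (x, hi * x, lo * x)
        hi, lo = max(candidates), min(candidates)
        best = max(best, hi)
    return best
```

The product variant is the answer to "what extra state do you carry": the running minimum, which the sum version never needs.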
- Jump Game (DSA · Greedy)
Interview questions to prep
- Why does greedy (track max-reach) work? Where would DP be unnecessary?
- How does Jump Game II (minimum jumps) differ in approach?
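A sketch of both greedy solutions. Tracking the farthest reachable index is a complete summary of every prefix, which is why DP is unnecessary here; the min-jumps variant layers a "level boundary" on top:

```python
def can_jump(nums):
    """Greedy: track the farthest index reachable so far.
    If the loop ever stands past that reach, no sequence of jumps
    could have gotten here, so the answer is False."""
    reach = 0
    for i, step in enumerate(nums):
        if i > reach:
            return False
        reach = max(reach, i + step)
    return True


def min_jumps(nums):
    """Jump Game II: implicit BFS over jump 'levels'. `end` is the edge
    of the range the current jump count covers; when i reaches it,
    commit one more jump out to the farthest point seen so far."""
    jumps = end = farthest = 0
    for i in range(len(nums) - 1):
        farthest = max(farthest, i + nums[i])
        if i == end:
            jumps += 1
            end = farthest
    return jumps
```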
GenAI · Tokenization deep-dive
Interview questions to prep
- Walk through the BPE training algorithm.
- Why does BPE result in different tokenizations for similar words across languages?
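To answer the first question concretely, here is a toy BPE trainer (whitespace pre-tokenization assumed, no special tokens): count adjacent symbol pairs weighted by word frequency, merge the most frequent pair, repeat. The second question falls out of the same loop: merges are driven by corpus frequency, so the same morpheme merges differently across languages with different statistics.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Toy BPE training loop:
    1. Split each word into characters.
    2. Count adjacent symbol pairs, weighted by word frequency.
    3. Merge the most frequent pair everywhere; record the merge rule.
    4. Repeat until the merge budget is spent or no pairs remain.
    """
    word_freqs = Counter(corpus.split())
    # Each word is a tuple of symbols, initially single characters.
    words = {w: tuple(w) for w in word_freqs}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w, freq in word_freqs.items():
            syms = words[w]
            for a, b in zip(syms, syms[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        for w, syms in words.items():
            out, i = [], 0
            while i < len(syms):
                if i + 1 < len(syms) and (syms[i], syms[i + 1]) == best:
                    out.append(syms[i] + syms[i + 1])
                    i += 2
                else:
                    out.append(syms[i])
                    i += 1
            words[w] = tuple(out)
    return merges
```

On `"low low low lower lowest"` with 2 merges, `l+o` wins first (frequency 5), then `lo+w`: the learned merges follow corpus statistics, not morphology.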
Interview questions to prep
- Why is vocabulary size a critical design choice — what does increasing it cost?
- How does vocab size affect throughput and memory of the embedding + LM-head layers?
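The cost side of the vocab-size trade-off is easy to quantify: the embedding table and (if untied) the LM head are each `vocab_size × d_model` parameters, and the output softmax costs O(vocab_size × d_model) FLOPs per generated token. A back-of-envelope sketch (the model dimensions below are illustrative, not from any specific LLM):

```python
def embedding_params(vocab_size, d_model, tied=True):
    """Parameter count of the token embedding table, doubled when the
    LM head is a separate (untied) vocab_size x d_model matrix."""
    table = vocab_size * d_model
    return table if tied else 2 * table

# Hypothetical example: growing the vocab from 32k to 128k at
# d_model = 4096 adds 96_000 * 4096 ~= 393M parameters per matrix.
# The benefit bought with those parameters is lower fertility
# (fewer tokens per word), i.e. shorter sequences.
growth = embedding_params(128_000, 4096) - embedding_params(32_000, 4096)
```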
Interview questions to prep
- Why do LLMs struggle with arithmetic, and how does tokenization contribute?
- Why are non-Latin-script languages disproportionately expensive to serve, and how do you fix it?
- Before fine-tuning on domain documents, how would you decide whether domain-specific terms need tokenizer vocabulary changes?
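One concrete mechanism behind the non-Latin-script question: byte-level tokenizers fall back to one token per UTF-8 byte for scripts with few learned merges, and UTF-8 itself is uneven across scripts. A quick check of that floor:

```python
def utf8_bytes_per_char(text):
    """Worst-case fertility for a byte-level tokenizer with no merges
    covering a script: one token per UTF-8 byte."""
    return len(text.encode("utf-8")) / len(text)

# ASCII letters are 1 byte each in UTF-8; Devanagari (and CJK)
# characters are 3 bytes each, so an under-merged script can cost
# 3x+ the tokens, and hence 3x+ the serving cost, per character.
english = utf8_bytes_per_char("hello")   # 1.0
hindi = utf8_bytes_per_char("नमस्ते")      # 3.0
```

The fixes the question points at all attack this ratio: a more multilingually balanced training corpus for the tokenizer, a larger vocab with script-specific merges, or (for the fine-tuning question) adding domain/script tokens and training their embeddings.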
References & further reading
- Hugging Face — LLM Course
- Maxime Labonne — LLM Course (GitHub)