Day 71 of 133
LLM foundations: pretraining, scaling laws, KV cache + DSA 2-D DP
Chinchilla scaling, decoder-only at inference, emergent abilities debate.
DSA · NeetCode 2-D DP
- Regular Expression Matching
Interview questions to prep
- State the 2-D DP: indices, recurrence, base case. What's the order of fill?
- Can you reduce 2-D to 1-D by reusing rows or columns? Walk through the dependency direction.
- Top-down with memoization vs bottom-up — which is easier to reason about, and which is faster in practice?
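A minimal bottom-up sketch of the standard formulation for Regular Expression Matching, where `dp[i][j]` means `s[:i]` matches `p[:j]` (the function name and test strings are illustrative, and the pattern is assumed well-formed, i.e. `*` never appears first):

```python
def is_match(s: str, p: str) -> bool:
    """2-D DP: '.' matches any char, 'x*' matches zero or more of 'x'.
    dp[i][j] is True iff s[:i] matches p[:j]."""
    m, n = len(s), len(p)
    dp = [[False] * (n + 1) for _ in range(m + 1)]
    dp[0][0] = True  # empty pattern matches empty string

    # Base case: patterns like a* or a*b* can match the empty string.
    for j in range(2, n + 1):
        if p[j - 1] == '*':
            dp[0][j] = dp[0][j - 2]

    # Fill row by row; every cell only reads cells above or to its left.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if p[j - 1] == '*':
                # Either zero copies of p[j-2], or one more copy if it
                # matches the current character s[i-1].
                dp[i][j] = dp[i][j - 2] or (
                    dp[i - 1][j] and p[j - 2] in (s[i - 1], '.')
                )
            else:
                dp[i][j] = dp[i - 1][j - 1] and p[j - 1] in (s[i - 1], '.')
    return dp[m][n]

assert is_match("aab", "c*a*b")
assert not is_match("mississippi", "mis*is*p*.")
```

Note the dependency direction: each cell reads `dp[i-1][j-1]`, `dp[i-1][j]`, and `dp[i][j-2]`, so a row-by-row fill is valid and keeping only the previous and current rows reduces the table to O(n) space.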
GenAI · LLM foundations
Interview questions to prep · scaling laws
- Walk me through the Chinchilla scaling laws: what is the compute-optimal tokens-to-parameters ratio?
- Why has compute-optimal training (balancing parameters against training tokens for a fixed FLOP budget) overtaken pure parameter scaling as the design target?
- How would you turn noisy API or Wikipedia data into a pretraining corpus without contaminating evaluation?
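As a back-of-envelope check on the questions above: the commonly cited Chinchilla rule of thumb is roughly 20 training tokens per parameter, and training compute is often approximated as C ≈ 6·N·D FLOPs. The helper names below are illustrative:

```python
# Rule-of-thumb Chinchilla arithmetic: tokens ≈ 20 × parameters,
# training compute ≈ 6 · N_params · N_tokens FLOPs.
def chinchilla_tokens(n_params: float) -> float:
    return 20.0 * n_params

def train_flops(n_params: float, n_tokens: float) -> float:
    return 6.0 * n_params * n_tokens

N = 70e9                    # Chinchilla itself: 70B parameters
D = chinchilla_tokens(N)    # ≈ 1.4T tokens, matching the paper's setup
print(f"tokens: {D:.2e}, FLOPs: {train_flops(N, D):.2e}")
# tokens: 1.40e+12, FLOPs: 5.88e+23
```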
Interview questions to prep · data & tokenization
- How do you turn raw text into input-target pairs for GPT next-token prediction?
- What are block size, context window, and stride in a language-model dataset?
- How do train/validation splits prevent contamination in LLM pretraining or fine-tuning?
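A minimal sketch of the sliding-window slicing these questions describe, assuming a flat list of token IDs; the `make_pairs` helper is hypothetical:

```python
from typing import List, Tuple

def make_pairs(token_ids: List[int], block_size: int,
               stride: int) -> List[Tuple[List[int], List[int]]]:
    """Slice a token stream into (input, target) pairs for next-token
    prediction. Targets are the inputs shifted left by one token;
    a stride smaller than block_size gives overlapping windows."""
    pairs = []
    for start in range(0, len(token_ids) - block_size, stride):
        x = token_ids[start : start + block_size]
        y = token_ids[start + 1 : start + block_size + 1]
        pairs.append((x, y))
    return pairs

ids = list(range(10))  # stand-in for tokenizer output
for x, y in make_pairs(ids, block_size=4, stride=4):
    print(x, "->", y)
# [0, 1, 2, 3] -> [1, 2, 3, 4]
# [4, 5, 6, 7] -> [5, 6, 7, 8]
```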
Interview questions to prep · inference & scale
- Walk me through one forward pass of a decoder-only LLM at inference time.
- What is the KV cache and why is it so important?
- Why do modern LLMs need billions of parameters — what capacity, memorization, and generalization trade-offs are involved?
- What changes architecturally or operationally when you move from a small language model to a frontier-scale LLM?
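A toy decode step showing what the KV cache actually holds, with random single-head weights and no positional encoding, MLP, or sampling; this is a sketch of the bookkeeping, not a real model:

```python
import numpy as np

def attend(q, K, V):
    """Scaled dot-product attention for a single query vector."""
    scores = K @ q / np.sqrt(q.shape[-1])  # (t,) similarity to each cached key
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V                           # weighted mix of cached values

d = 8
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

K_cache, V_cache = [], []   # the KV cache: grows by one entry per step
x = rng.normal(size=d)      # embedding of the current token

for step in range(5):
    # Each step projects only the NEW token to a key/value and appends;
    # without the cache, every step would re-project the whole prefix.
    K_cache.append(Wk @ x)
    V_cache.append(Wv @ x)
    out = attend(Wq @ x, np.stack(K_cache), np.stack(V_cache))
    x = out                 # stand-in for the next token's embedding
```

Attention still reads all t cached entries at step t, but nothing earlier is recomputed; real stacks keep one such cache per layer and per head, which is why cache memory, not compute, often bounds batch size and context length at inference.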
Interview questions to prep · emergent abilities
- Define 'emergent abilities' in LLMs, and explain why some researchers argue they are a measurement artifact.
- What does the 'mirage' paper claim, and how does the choice of metric drive apparent emergence?
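A hypothetical illustration of the metric argument made by the "mirage" paper (Schaeffer et al.): a smoothly improving per-token accuracy looks like a sudden jump once scored with exact match over a multi-token answer. All numbers below are made up:

```python
import numpy as np

# Per-token accuracy improves smoothly with scale (a made-up curve)...
scales = np.logspace(7, 11, 9)                         # hypothetical param counts
per_token = 1 - 0.5 * (scales.min() / scales) ** 0.25  # 0.50 -> 0.95, smooth

# ...but exact match on a 20-token answer requires all 20 tokens right,
# so the nonlinear metric, not the model, creates the apparent jump.
exact_match = per_token ** 20

for n, p, em in zip(scales, per_token, exact_match):
    print(f"{n:9.1e} params | per-token acc {p:.3f} | exact-match {em:.6f}")
```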
References & further reading
- Karpathy — Intro to LLMs (YouTube)
- Maxime Labonne — LLM Course (GitHub)
- Hugging Face — LLM Course
- Jay Alammar — The Illustrated Transformer