Day 71 of 133
LLM foundations: pretraining, scaling laws, KV cache + DSA 2-D DP
Chinchilla scaling, decoder-only at inference, emergent abilities debate.
DSA · NeetCode 2-D DP
- Regular Expression Matching
Interview questions to prep
- State the 2-D DP: indices, recurrence, base case. What's the order of fill?
- Can you reduce 2-D to 1-D by reusing rows or columns? Walk through the dependency direction.
- Top-down with memoization vs bottom-up — which is easier to reason about, and which is faster in practice?
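A minimal bottom-up sketch of the standard formulation for Regular Expression Matching, where `dp[i][j]` means `s[:i]` matches `p[:j]` (the function name and test strings are illustrative, and the pattern is assumed well-formed, i.e. `*` never appears first):

```python
def is_match(s: str, p: str) -> bool:
    """2-D DP: '.' matches any char, 'x*' matches zero or more of 'x'.
    dp[i][j] is True iff s[:i] matches p[:j]."""
    m, n = len(s), len(p)
    dp = [[False] * (n + 1) for _ in range(m + 1)]
    dp[0][0] = True  # empty pattern matches empty string

    # Base case: patterns like a* or a*b* can match the empty string.
    for j in range(2, n + 1):
        if p[j - 1] == '*':
            dp[0][j] = dp[0][j - 2]

    # Fill row by row; every cell only reads cells above or to its left.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if p[j - 1] == '*':
                # Either zero copies of p[j-2], or one more copy if it
                # matches the current character s[i-1].
                dp[i][j] = dp[i][j - 2] or (
                    dp[i - 1][j] and p[j - 2] in (s[i - 1], '.')
                )
            else:
                dp[i][j] = dp[i - 1][j - 1] and p[j - 1] in (s[i - 1], '.')
    return dp[m][n]

assert is_match("aab", "c*a*b")
assert not is_match("mississippi", "mis*is*p*.")
```

Note the dependency direction: each cell reads `dp[i-1][j-1]`, `dp[i-1][j]`, and `dp[i][j-2]`, so a row-by-row fill is valid and keeping only the previous and current rows reduces the table to O(n) space.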
GenAI · LLM foundations
Interview questions to prep · scaling laws
- Walk me through the Chinchilla scaling laws: what is the compute-optimal tokens-to-parameters ratio?
- Why has compute-optimal training (balancing parameters against training tokens for a fixed FLOP budget) overtaken pure parameter scaling as the design target?
- How would you turn noisy API or Wikipedia data into a pretraining corpus without contaminating evaluation?
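As a back-of-envelope check on the questions above: the commonly cited Chinchilla rule of thumb is roughly 20 training tokens per parameter, and training compute is often approximated as C ≈ 6·N·D FLOPs. The helper names below are illustrative:

```python
# Rule-of-thumb Chinchilla arithmetic: tokens ≈ 20 × parameters,
# training compute ≈ 6 · N_params · N_tokens FLOPs.
def chinchilla_tokens(n_params: float) -> float:
    return 20.0 * n_params

def train_flops(n_params: float, n_tokens: float) -> float:
    return 6.0 * n_params * n_tokens

N = 70e9                    # Chinchilla itself: 70B parameters
D = chinchilla_tokens(N)    # ≈ 1.4T tokens, matching the paper's setup
print(f"tokens: {D:.2e}, FLOPs: {train_flops(N, D):.2e}")
# tokens: 1.40e+12, FLOPs: 5.88e+23
```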
Interview questions to prep · data & tokenization
- How do you turn raw text into input-target pairs for GPT next-token prediction?
- What are block size, context window, and stride in a language-model dataset?
- How do train/validation splits prevent contamination in LLM pretraining or fine-tuning?
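A minimal sketch of the sliding-window slicing these questions describe, assuming a flat list of token IDs; the `make_pairs` helper is hypothetical:

```python
from typing import List, Tuple

def make_pairs(token_ids: List[int], block_size: int,
               stride: int) -> List[Tuple[List[int], List[int]]]:
    """Slice a token stream into (input, target) pairs for next-token
    prediction. Targets are the inputs shifted left by one token;
    a stride smaller than block_size gives overlapping windows."""
    pairs = []
    for start in range(0, len(token_ids) - block_size, stride):
        x = token_ids[start : start + block_size]
        y = token_ids[start + 1 : start + block_size + 1]
        pairs.append((x, y))
    return pairs

ids = list(range(10))  # stand-in for tokenizer output
for x, y in make_pairs(ids, block_size=4, stride=4):
    print(x, "->", y)
# [0, 1, 2, 3] -> [1, 2, 3, 4]
# [4, 5, 6, 7] -> [5, 6, 7, 8]
```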
Interview questions to prep · inference & scale
- Walk me through one forward pass of a decoder-only LLM at inference time.
- What is the KV cache and why is it so important?
- Why do modern LLMs need billions of parameters — what capacity, memorization, and generalization trade-offs are involved?
- What changes architecturally or operationally when you move from a small language model to a frontier-scale LLM?
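A toy decode step showing what the KV cache actually holds, with random single-head weights and no positional encoding, MLP, or sampling; this is a sketch of the bookkeeping, not a real model:

```python
import numpy as np

def attend(q, K, V):
    """Scaled dot-product attention for a single query vector."""
    scores = K @ q / np.sqrt(q.shape[-1])  # (t,) similarity to each cached key
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V                           # weighted mix of cached values

d = 8
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

K_cache, V_cache = [], []   # the KV cache: grows by one entry per step
x = rng.normal(size=d)      # embedding of the current token

for step in range(5):
    # Each step projects only the NEW token to a key/value and appends;
    # without the cache, every step would re-project the whole prefix.
    K_cache.append(Wk @ x)
    V_cache.append(Wv @ x)
    out = attend(Wq @ x, np.stack(K_cache), np.stack(V_cache))
    x = out                 # stand-in for the next token's embedding
```

Attention still reads all t cached entries at step t, but nothing earlier is recomputed; real stacks keep one such cache per layer and per head, which is why cache memory, not compute, often bounds batch size and context length at inference.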
Interview questions to prep · emergent abilities
- Define 'emergent abilities' in LLMs, and explain why some researchers argue they are a measurement artifact.
- What does the 'mirage' paper claim, and how does the choice of metric drive apparent emergence?
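A hypothetical illustration of the metric argument made by the "mirage" paper (Schaeffer et al.): a smoothly improving per-token accuracy looks like a sudden jump once scored with exact match over a multi-token answer. All numbers below are made up:

```python
import numpy as np

# Per-token accuracy improves smoothly with scale (a made-up curve)...
scales = np.logspace(7, 11, 9)                         # hypothetical param counts
per_token = 1 - 0.5 * (scales.min() / scales) ** 0.25  # 0.50 -> 0.95, smooth

# ...but exact match on a 20-token answer requires all 20 tokens right,
# so the nonlinear metric, not the model, creates the apparent jump.
exact_match = per_token ** 20

for n, p, em in zip(scales, per_token, exact_match):
    print(f"{n:9.1e} params | per-token acc {p:.3f} | exact-match {em:.6f}")
```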
References & further reading
- Karpathy — Intro to LLMs (YouTube)
- Maxime Labonne — LLM Course (GitHub)
- Hugging Face — LLM Course
- Jay Alammar — The Illustrated Transformer