Day 93 of 133
Training efficiency: mixed precision, accumulation, checkpointing
fp16 vs bf16 vs fp8; activation checkpointing math.
DSA · NeetCode Heap / Priority Queue
- Kth Largest Element in an Array
Interview questions to prep
- Compare heap (O(n log k)), sort (O(n log n)), and quickselect (O(n) average): when does each fit? See the sketch after this list.
- What's quickselect's worst case, and how do you avoid it (median-of-medians, randomization)?
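A minimal sketch of both approaches, assuming the usual LeetCode-style signature (`nums: List[int]`, `k: int`); the function names are my own:

```python
import heapq
import random
from typing import List

def kth_largest_heap(nums: List[int], k: int) -> int:
    # Keep a min-heap of the k largest seen so far: O(n log k) time, O(k) space.
    heap = nums[:k]
    heapq.heapify(heap)
    for x in nums[k:]:
        if x > heap[0]:
            heapq.heapreplace(heap, x)  # pop smallest, push x
    return heap[0]

def kth_largest_quickselect(nums: List[int], k: int) -> int:
    # Randomized quickselect: O(n) average. Random pivots make the O(n^2)
    # adversarial worst case vanishingly unlikely; median-of-medians pivot
    # selection would guarantee O(n) worst case at a higher constant.
    nums = list(nums)        # avoid mutating the caller's list
    target = len(nums) - k   # index of the kth largest in sorted order
    lo, hi = 0, len(nums) - 1
    while True:
        # Move a random pivot to the end, then Lomuto-partition around it.
        p = random.randint(lo, hi)
        nums[p], nums[hi] = nums[hi], nums[p]
        store = lo
        for i in range(lo, hi):
            if nums[i] < nums[hi]:
                nums[store], nums[i] = nums[i], nums[store]
                store += 1
        nums[store], nums[hi] = nums[hi], nums[store]
        if store == target:
            return nums[store]
        elif store < target:
            lo = store + 1
        else:
            hi = store - 1

print(kth_largest_heap([3, 2, 1, 5, 6, 4], 2))         # 5
print(kth_largest_quickselect([3, 2, 1, 5, 6, 4], 2))  # 5
```

Rule of thumb: the heap wins for streaming input or small k, quickselect wins for a one-shot in-memory query, and a full sort wins when you need the whole order anyway.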
Infra · Training efficiency
Interview questions to prep
- Compare fp16 vs bf16 vs fp8 training — when does each fail or shine?
- What does dynamic loss scaling do, and on which hardware do you still need it? (See the AMP sketch after this list.)
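A minimal PyTorch AMP sketch of the fp16 loss-scaling loop; the model, optimizer, and synthetic batches are placeholders, and it assumes a CUDA GPU:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()  # needed for fp16, not for bf16

# Synthetic stand-in for a real DataLoader.
batches = [(torch.randn(32, 1024, device="cuda"),
            torch.randn(32, 1024, device="cuda")) for _ in range(4)]

for x, y in batches:
    optimizer.zero_grad(set_to_none=True)
    # fp16 has a narrow exponent range (max ~65504), so small gradients
    # underflow to zero; scaling the loss up keeps them representable.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()  # backward on the scaled loss
    scaler.step(optimizer)         # unscales grads; skips the step on inf/nan
    scaler.update()                # grows/shrinks the scale dynamically
```

bf16 shares fp32's exponent range, so on hardware that supports it (Ampere+ GPUs, TPUs) you can drop the scaler entirely: `torch.autocast(device_type="cuda", dtype=torch.bfloat16)` plus a plain `loss.backward()`.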
Interview questions to prep
- Why does gradient accumulation let you simulate a larger batch size? (See the loop after this list.)
- When does gradient accumulation NOT match true large-batch training (e.g., BatchNorm)?
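A minimal accumulation loop, with synthetic micro-batches standing in for a real DataLoader:

```python
import torch

model = torch.nn.Linear(512, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 8  # micro-batch 16 x 8 steps ~ effective batch 128

micro_batches = [(torch.randn(16, 512, device="cuda"),
                  torch.randint(0, 10, (16,), device="cuda"))
                 for _ in range(32)]

optimizer.zero_grad(set_to_none=True)
for step, (x, y) in enumerate(micro_batches, start=1):
    loss = torch.nn.functional.cross_entropy(model(x), y)
    # Divide so the summed gradients equal the mean over the effective batch.
    (loss / accum_steps).backward()
    if step % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```

The equivalence holds only when per-example losses are independent; BatchNorm computes statistics per micro-batch, so accumulated steps do not match true large-batch statistics.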
Interview questions to prep
- A 1B model trains well but a 70B version explodes at the same learning rate. What changed?
- Explain standard parameterization vs maximum update parameterization in interview terms.
- How does muP let you tune hyperparameters on a smaller model and transfer them to a larger one? (A toy transfer rule is sketched after this list.)
- When would gradient clipping hide the symptom but not fix the scaling problem?
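A toy sketch of one muP transfer rule; this is a simplification (full muP also changes initialization and treats embedding/output layers differently), and the helper name is mine:

```python
def mup_hidden_lr(base_lr: float, base_width: int, width: int) -> float:
    """Transfer an Adam LR for hidden weight matrices across widths.

    Simplified muP rule of thumb: under Adam, the stable learning rate
    for hidden weights shrinks roughly like 1/width, so an LR tuned on
    a narrow proxy model transfers as base_lr * (base_width / width).
    """
    return base_lr * base_width / width

# Tune lr=1e-3 on a width-256 proxy, then transfer to a width-2048 model.
print(mup_hidden_lr(1e-3, 256, 2048))  # 0.000125
```

Loosely, this is why clipping can mask the symptom: a standard-parameterized large model run at the small model's LR takes per-layer updates that are too big, and clipping caps the blow-ups without restoring the right per-layer update scale.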
Interview questions to prep
- How does activation checkpointing trade compute for memory? (See the sketch after this list.)
- When would you offload optimizer state to CPU/NVMe?
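A minimal checkpointing sketch using `torch.utils.checkpoint.checkpoint_sequential` (layer sizes are arbitrary; assumes a CUDA GPU and a recent PyTorch for the `use_reentrant` flag). The math: placing a checkpoint every √L of L layers cuts stored activations from O(L) to O(√L), at the cost of roughly one extra forward pass (~33% more FLOPs if backward costs about twice a forward):

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

# 64 blocks; without checkpointing, every intermediate activation is kept.
layers = torch.nn.Sequential(*[
    torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU())
    for _ in range(64)
]).cuda()

x = torch.randn(32, 1024, device="cuda", requires_grad=True)

# 8 segments ~ sqrt(64): only segment-boundary activations are stored;
# activations inside a segment are recomputed during the backward pass.
out = checkpoint_sequential(layers, 8, x, use_reentrant=False)
out.sum().backward()
```

Offloading optimizer state to CPU/NVMe (e.g., DeepSpeed ZeRO-Offload) is the next lever once checkpointing alone can't fit Adam's roughly 2x-parameters of optimizer state in GPU memory.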