Day 93 of 133

Training efficiency: mixed precision, accumulation, checkpointing

fp16 vs bf16 vs fp8; activation checkpointing math.

DSA · NeetCode Heap / Priority Queue

  • Kth Largest Element in an Array

    Interview questions to prep (see the heap sketch after this list)

    1. Compare heap (O(n log k)), sort (O(n log n)), quickselect (O(n) avg) — when does each fit?
    2. What's quickselect's worst case, and how do you avoid it (median-of-medians, randomization)?
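
A minimal sketch of the size-k min-heap approach from question 1, using the standard-library heapq; the function name and example input are my own.

```python
import heapq

def kth_largest(nums: list[int], k: int) -> int:
    """Keep a min-heap of the k largest elements seen so far: O(n log k) time, O(k) space."""
    heap = nums[:k]
    heapq.heapify(heap)                   # O(k)
    for x in nums[k:]:
        if x > heap[0]:                   # only consider x if it beats the current kth largest
            heapq.heapreplace(heap, x)    # pop min + push in one O(log k) step
    return heap[0]                        # root of the min-heap is the kth largest

# Example: kth_largest([3, 2, 1, 5, 6, 4], k=2) -> 5
```

Sorting is simpler but O(n log n); quickselect averages O(n) but needs randomized pivots (or median-of-medians) to avoid the O(n^2) worst case, which is exactly question 2.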

Infra · Training efficiency

  • Interview questions to prep · mixed precision (see the fp16/bf16 sketch after this list)

    1. Compare fp16 vs bf16 vs fp8 training — when does each fail or shine?
    2. What does dynamic loss scaling do, and on which hardware do you still need it?
  • Interview questions to prep · gradient accumulation (see the accumulation sketch after this list)

    1. Why does gradient accumulation let you simulate a larger batch size?
    2. When does gradient accumulation NOT match true large-batch training (e.g., BatchNorm)?
  • Interview questions to prep · scaling & muP (see the muP sketch after this list)

    1. A 1B model trains well but a 70B version explodes at the same learning rate. What changed?
    2. Explain standard parameterization vs maximum update parameterization in interview terms.
    3. How does muP let you tune hyperparameters on a smaller model and transfer them to a larger one?
    4. When would gradient clipping hide the symptom but not fix the scaling problem?
  • Interview questions to prep · activation checkpointing & offloading (see the checkpointing sketch after this list)

    1. How does activation checkpointing trade compute for memory?
    2. When would you offload optimizer state to CPU/NVMe?
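
Mixed precision: a minimal PyTorch AMP sketch for the fp16/bf16 questions. The model, data, and loop are toy stand-ins (assumes a CUDA device); the autocast/GradScaler calls are standard PyTorch. The point it illustrates: fp16's 5-bit exponent underflows small gradients, so it needs dynamic loss scaling, while bf16 keeps fp32's exponent range and usually doesn't.

```python
import torch

device = "cuda"
model = torch.nn.Linear(1024, 1024, device=device)   # toy stand-in for the real network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

use_bf16 = torch.cuda.is_bf16_supported()
amp_dtype = torch.bfloat16 if use_bf16 else torch.float16
# Dynamic loss scaling only matters on the fp16 path; disable the scaler for bf16.
scaler = torch.cuda.amp.GradScaler(enabled=not use_bf16)

for _ in range(10):
    x = torch.randn(32, 1024, device=device)
    y = torch.randn(32, 1024, device=device)
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=amp_dtype):
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()   # scaling is a no-op when the scaler is disabled
    scaler.step(optimizer)          # unscales grads, skips the step on inf/nan (fp16 path)
    scaler.update()
```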
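
Gradient accumulation: a sketch of why dividing the loss by the number of micro-steps simulates a larger batch, and where it stops matching (BatchNorm). Shapes, step counts, and the toy model are illustrative.

```python
import torch

model = torch.nn.Linear(512, 10, device="cuda")
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 8                            # effective batch = micro_batch (16) * accum_steps

optimizer.zero_grad(set_to_none=True)
for step in range(64):
    x = torch.randn(16, 512, device="cuda")
    y = torch.randint(0, 10, (16,), device="cuda")
    loss = torch.nn.functional.cross_entropy(model(x), y)
    (loss / accum_steps).backward()        # gradients sum across micro-batches; dividing
                                           # makes the sum equal the large-batch mean
    if (step + 1) % accum_steps == 0:
        optimizer.step()                   # one optimizer update per 8 micro-batches
        optimizer.zero_grad(set_to_none=True)

# Caveat (question 2): BatchNorm still computes statistics per micro-batch of 16,
# so the result is not identical to a true batch of 128.
```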
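
muP: a rough illustration of the headline rule for the muP questions, not the mup library API or the full recipe (which also changes init scale and output multipliers). The sketch scales the Adam learning rate of matrix-like parameters by base_width/width, which is the mechanism that lets hyperparameters tuned at a small width transfer to a larger one; the grouping heuristic and function name are my own.

```python
import torch

def mup_param_groups(model: torch.nn.Module, base_lr: float, width: int, base_width: int):
    """Rough muP-style grouping: hidden weight matrices get LR scaled by base_width/width,
    so a base_lr tuned at base_width transfers to wider models; biases/norms keep base_lr.
    Illustrative heuristic only, not the complete muP parameterization."""
    matrix_like, vector_like = [], []
    for p in model.parameters():
        (matrix_like if p.ndim >= 2 else vector_like).append(p)
    return [
        {"params": matrix_like, "lr": base_lr * base_width / width},
        {"params": vector_like, "lr": base_lr},
    ]

width, base_width = 4096, 256
model = torch.nn.Sequential(
    torch.nn.Linear(width, width), torch.nn.ReLU(), torch.nn.Linear(width, width)
)
optimizer = torch.optim.AdamW(mup_param_groups(model, base_lr=3e-4, width=width, base_width=base_width))
```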
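
Activation checkpointing: the compute-for-memory trade in question 1. With n layers and a checkpoint every ~sqrt(n) layers, stored activations drop from O(n) to O(sqrt(n)) layers' worth, paid for with roughly one extra forward pass during backward. A minimal sketch with PyTorch's checkpoint_sequential (recent PyTorch; block sizes are toy values). Offloading optimizer state to CPU/NVMe (question 2) is a separate lever, typically via a framework such as DeepSpeed ZeRO-Offload, and is not shown here.

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

n_layers = 16
blocks = torch.nn.Sequential(*[
    torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.GELU())
    for _ in range(n_layers)
])
segments = 4   # ~sqrt(n_layers): only segment-boundary activations are kept,
               # everything inside a segment is recomputed during backward

x = torch.randn(8, 1024, requires_grad=True)
out = checkpoint_sequential(blocks, segments, x, use_reentrant=False)
out.sum().backward()
```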

References & further reading