Day 93 of 133

Training efficiency: mixed precision, accumulation, checkpointing

fp16 vs bf16 vs fp8; activation checkpointing math.

DSA · NeetCode Heap / Priority Queue

  • Kth Largest Element in an Array

    Interview questions to prep (see the heap sketch after this list)

    1. Compare heap (O(n log k)), sort (O(n log n)), quickselect (O(n) avg) — when does each fit?
    2. What's quickselect's worst case, and how do you avoid it (median-of-medians, randomization)?
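
A minimal sketch of the size-k min-heap approach from question 1, using the standard-library heapq; the function name and example input are my own.

```python
import heapq

def kth_largest(nums: list[int], k: int) -> int:
    """Keep a min-heap of the k largest elements seen so far: O(n log k) time, O(k) space."""
    heap = nums[:k]
    heapq.heapify(heap)                   # O(k)
    for x in nums[k:]:
        if x > heap[0]:                   # only consider x if it beats the current kth largest
            heapq.heapreplace(heap, x)    # pop min + push in one O(log k) step
    return heap[0]                        # root of the min-heap is the kth largest

# Example: kth_largest([3, 2, 1, 5, 6, 4], k=2) -> 5
```

Sorting is simpler but O(n log n); quickselect averages O(n) but needs randomized pivots (or median-of-medians) to avoid the O(n^2) worst case, which is exactly question 2.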

Infra · Training efficiency

  • Interview questions to prep · mixed precision (see the fp16/bf16 sketch after this list)

    1. Compare fp16 vs bf16 vs fp8 training — when does each fail or shine?
    2. What does dynamic loss scaling do, and on which hardware do you still need it?
  • Interview questions to prep · gradient accumulation (see the accumulation sketch after this list)

    1. Why does gradient accumulation let you simulate a larger batch size?
    2. When does gradient accumulation NOT match true large-batch training (e.g., BatchNorm)?
  • Interview questions to prep · scaling & muP (see the muP sketch after this list)

    1. A 1B model trains well but a 70B version explodes at the same learning rate. What changed?
    2. Explain standard parameterization vs maximum update parameterization in interview terms.
    3. How does muP let you tune hyperparameters on a smaller model and transfer them to a larger one?
    4. When would gradient clipping hide the symptom but not fix the scaling problem?
  • Interview questions to prep · activation checkpointing & offloading (see the checkpointing sketch after this list)

    1. How does activation checkpointing trade compute for memory?
    2. When would you offload optimizer state to CPU/NVMe?
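
Mixed precision: a minimal PyTorch AMP sketch for the fp16/bf16 questions. The model, data, and loop are toy stand-ins (assumes a CUDA device); the autocast/GradScaler calls are standard PyTorch. The point it illustrates: fp16's 5-bit exponent underflows small gradients, so it needs dynamic loss scaling, while bf16 keeps fp32's exponent range and usually doesn't.

```python
import torch

device = "cuda"
model = torch.nn.Linear(1024, 1024, device=device)   # toy stand-in for the real network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

use_bf16 = torch.cuda.is_bf16_supported()
amp_dtype = torch.bfloat16 if use_bf16 else torch.float16
# Dynamic loss scaling only matters on the fp16 path; disable the scaler for bf16.
scaler = torch.cuda.amp.GradScaler(enabled=not use_bf16)

for _ in range(10):
    x = torch.randn(32, 1024, device=device)
    y = torch.randn(32, 1024, device=device)
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=amp_dtype):
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()   # scaling is a no-op when the scaler is disabled
    scaler.step(optimizer)          # unscales grads, skips the step on inf/nan (fp16 path)
    scaler.update()
```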
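
Gradient accumulation: a sketch of why dividing the loss by the number of micro-steps simulates a larger batch, and where it stops matching (BatchNorm). Shapes, step counts, and the toy model are illustrative.

```python
import torch

model = torch.nn.Linear(512, 10, device="cuda")
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 8                            # effective batch = micro_batch (16) * accum_steps

optimizer.zero_grad(set_to_none=True)
for step in range(64):
    x = torch.randn(16, 512, device="cuda")
    y = torch.randint(0, 10, (16,), device="cuda")
    loss = torch.nn.functional.cross_entropy(model(x), y)
    (loss / accum_steps).backward()        # gradients sum across micro-batches; dividing
                                           # makes the sum equal the large-batch mean
    if (step + 1) % accum_steps == 0:
        optimizer.step()                   # one optimizer update per 8 micro-batches
        optimizer.zero_grad(set_to_none=True)

# Caveat (question 2): BatchNorm still computes statistics per micro-batch of 16,
# so the result is not identical to a true batch of 128.
```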
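
muP: a rough illustration of the headline rule for the muP questions, not the mup library API or the full recipe (which also changes init scale and output multipliers). The sketch scales the Adam learning rate of matrix-like parameters by base_width/width, which is the mechanism that lets hyperparameters tuned at a small width transfer to a larger one; the grouping heuristic and function name are my own.

```python
import torch

def mup_param_groups(model: torch.nn.Module, base_lr: float, width: int, base_width: int):
    """Rough muP-style grouping: hidden weight matrices get LR scaled by base_width/width,
    so a base_lr tuned at base_width transfers to wider models; biases/norms keep base_lr.
    Illustrative heuristic only, not the complete muP parameterization."""
    matrix_like, vector_like = [], []
    for p in model.parameters():
        (matrix_like if p.ndim >= 2 else vector_like).append(p)
    return [
        {"params": matrix_like, "lr": base_lr * base_width / width},
        {"params": vector_like, "lr": base_lr},
    ]

width, base_width = 4096, 256
model = torch.nn.Sequential(
    torch.nn.Linear(width, width), torch.nn.ReLU(), torch.nn.Linear(width, width)
)
optimizer = torch.optim.AdamW(mup_param_groups(model, base_lr=3e-4, width=width, base_width=base_width))
```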
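
Activation checkpointing: the compute-for-memory trade in question 1. With n layers and a checkpoint every ~sqrt(n) layers, stored activations drop from O(n) to O(sqrt(n)) layers' worth, paid for with roughly one extra forward pass during backward. A minimal sketch with PyTorch's checkpoint_sequential (recent PyTorch; block sizes are toy values). Offloading optimizer state to CPU/NVMe (question 2) is a separate lever, typically via a framework such as DeepSpeed ZeRO-Offload, and is not shown here.

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

n_layers = 16
blocks = torch.nn.Sequential(*[
    torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.GELU())
    for _ in range(n_layers)
])
segments = 4   # ~sqrt(n_layers): only segment-boundary activations are kept,
               # everything inside a segment is recomputed during backward

x = torch.randn(8, 1024, requires_grad=True)
out = checkpoint_sequential(blocks, segments, x, use_reentrant=False)
out.sum().backward()
```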

References & further reading