Day 96 of 133

ML Infra consolidation + 3D parallelism review

Compose DP + TP + PP. Walk through Megatron-style training.

DSA · NeetCode 1-D DP

  • Word Break

    Interview questions to prep

    1. State the DP: define the state, the transition, and the base case explicitly.
    2. Top-down (memoized recursion) vs bottom-up (tabulation) — which is more natural here, and why?
    3. Can you cut the space below O(n)? For Fibonacci-style 1-D DPs (Climbing Stairs, House Robber) a rolling window gives O(1); for Word Break the lookback is bounded by the longest dictionary word, not a constant.
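
A minimal bottom-up sketch of the Word Break DP, assuming the usual LeetCode-style signature; function and variable names here are illustrative:

```python
def word_break(s: str, word_dict: list[str]) -> bool:
    """dp[i] is True iff the prefix s[:i] can be segmented into dictionary words."""
    words = set(word_dict)
    max_len = max(map(len, words), default=0)   # bound on how far back a word can start
    dp = [False] * (len(s) + 1)
    dp[0] = True                                # base case: the empty prefix
    for i in range(1, len(s) + 1):
        # transition: dp[i] = any(dp[j] and s[j:i] in words),
        # scanning back no further than the longest dictionary word
        for j in range(max(0, i - max_len), i):
            if dp[j] and s[j:i] in words:
                dp[i] = True
                break
    return dp[len(s)]


assert word_break("leetcode", ["leet", "code"])
assert not word_break("catsandog", ["cats", "dog", "sand", "and", "cat"])
```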

Infra · Distributed training

  • DDP & NCCL (a launch sketch follows this list)

    Interview questions to prep

    1. Walk me through how DDP synchronizes gradients across GPUs.
    2. What does NCCL do in this picture?
  • FSDP & ZeRO sharding (sketch after this list)

    Interview questions to prep

    1. How does FSDP shard parameters, gradients, and optimizer state?
    2. Compare ZeRO-1, ZeRO-2, ZeRO-3 stages.
  • Tensor, pipeline, and 3D parallelism (bubble math after this list)

    Interview questions to prep

    1. When do you need tensor parallelism vs pipeline parallelism, and how do they compose with data parallelism (3D parallelism)?
    2. What is pipeline-parallel bubble overhead, and how do you minimize it?
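
For the DDP/NCCL bullet: a minimal one-process-per-GPU sketch, assuming a torchrun launch that sets RANK, LOCAL_RANK, and WORLD_SIZE; the toy model, data, and hyperparameters are placeholders.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    dist.init_process_group(backend="nccl")   # NCCL provides the GPU all-reduce collectives
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()
    # DDP registers autograd hooks; during backward it all-reduces gradient buckets
    # asynchronously, overlapping communication with the remaining computation.
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(32, 1024, device="cuda")   # each rank feeds its own shard of data
    loss = model(x).pow(2).mean()
    loss.backward()                            # gradients are averaged across ranks here
    opt.step()                                 # every rank applies the same averaged gradient

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launch with something like `torchrun --nproc_per_node=8 ddp_sketch.py`.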
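For the FSDP/ZeRO bullet: a sketch of fully sharded wrapping with PyTorch FSDP (same torchrun assumption as above); the model is a stand-in and the ZeRO-stage analogies are approximate.

```python
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 4096)
).cuda()

# FULL_SHARD ~ ZeRO-3: parameters, gradients, and optimizer state are all sharded;
# parameters are all-gathered just in time for each forward/backward and freed after.
# SHARD_GRAD_OP ~ ZeRO-2: shard gradients and optimizer state, keep full parameters.
# (ZeRO-1 shards the optimizer state only.)
model = FSDP(model, sharding_strategy=ShardingStrategy.FULL_SHARD)

opt = torch.optim.AdamW(model.parameters(), lr=1e-4)   # each rank sees only its local shards
```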
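For the pipeline-parallelism bullet: a back-of-the-envelope check of bubble overhead for a GPipe/1F1B-style schedule with p stages and m microbatches (the stage and microbatch counts below are made up).

```python
def bubble_fraction(p: int, m: int) -> float:
    """Fraction of total pipeline time spent idle: (p - 1) / (m + p - 1)."""
    return (p - 1) / (m + p - 1)


for m in (4, 16, 64):
    print(f"p=8 stages, m={m:>2} microbatches -> bubble ~ {bubble_fraction(8, m):.1%}")
# More microbatches per step (or interleaved virtual stages, as in Megatron-LM) shrink the bubble.
```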

Infra · Training efficiency

  • Mixed precision: fp16 / bf16 / fp8 (loss-scaling sketch after this list)

    Interview questions to prep

    1. Compare fp16 vs bf16 vs fp8 training — when does each fail or shine?
    2. What does dynamic loss scaling do, and on which hardware do you still need it?
  • Gradient accumulation (loop sketch after this list)

    Interview questions to prep

    1. Why does gradient accumulation let you simulate a larger batch size?
    2. When does gradient accumulation NOT match true large-batch training (e.g., BatchNorm)?
  • Activation checkpointing & offload (sketch after this list)

    Interview questions to prep

    1. How does activation checkpointing trade compute for memory?
    2. When would you offload optimizer state to CPU/NVMe?
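
For the mixed-precision bullet: a sketch of fp16 autocast with dynamic loss scaling; model, data, and step count are placeholders. bf16 shares fp32's exponent range, so a scaler is generally unnecessary there; fp16's narrow range is why small gradients underflow without scaling.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()   # dynamic loss scaling: grows the scale, backs off on overflow

for _ in range(100):
    x = torch.randn(32, 1024, device="cuda")
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()      # backward on the scaled loss so tiny grads don't underflow
    scaler.step(opt)                   # unscales grads; skips the step if inf/nan is found
    scaler.update()                    # adjusts the scale for the next iteration
    opt.zero_grad(set_to_none=True)
```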
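For the gradient-accumulation bullet: the standard loop. Gradients sum linearly, so N micro-batches of size b approximate one optimizer step at batch size N*b once the loss is divided by N; it is not identical when layers keep per-batch statistics (BatchNorm still normalizes over b). Batch sizes and the accumulation factor here are illustrative.

```python
import torch

model = torch.nn.Linear(512, 512).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
accum_steps = 8                                   # effective batch = 8 * 16 = 128

for step in range(80):
    x = torch.randn(16, 512, device="cuda")       # micro-batch of 16
    loss = model(x).pow(2).mean() / accum_steps   # scale so the accumulated grads form a mean
    loss.backward()                               # grads accumulate into .grad across calls
    if (step + 1) % accum_steps == 0:
        opt.step()
        opt.zero_grad(set_to_none=True)
```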
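For the checkpointing/offload bullet: a sketch with torch.utils.checkpoint, which discards a block's internal activations in the forward pass and recomputes them during backward, trading roughly one extra forward per block for a large cut in activation memory. The block layout is illustrative. (Optimizer-state offload to CPU/NVMe is typically a config knob, e.g. FSDP's CPUOffload or DeepSpeed ZeRO-Offload, rather than model code, so it is not shown.)

```python
import torch
from torch.utils.checkpoint import checkpoint


class Block(torch.nn.Module):
    def __init__(self, d: int = 1024):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(d, 4 * d), torch.nn.GELU(), torch.nn.Linear(4 * d, d)
        )

    def forward(self, x):
        # Only the block input is kept; the MLP's activations are recomputed in backward.
        return x + checkpoint(self.ff, x, use_reentrant=False)


model = torch.nn.Sequential(*[Block() for _ in range(12)]).cuda()
x = torch.randn(8, 1024, device="cuda", requires_grad=True)
model(x).mean().backward()
```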

References & further reading