Day 92 of 133

Distributed training: DDP, FSDP, ZeRO + DSA review

Gradient sync; sharding strategies; NCCL.

DSA · NeetCode Trees

  • Interview questions to prep

    1. Compare BFS vs DFS for this problem — which fits, and what's the iterative version? (See the traversal sketch after this list.)
    2. What's the recursion's space cost on the call stack, and how would you go iterative to avoid recursion-depth limits? Note an explicit stack is still O(h), which is O(log n) only when the tree is balanced.
    3. What's the relationship between this problem's invariant and the BST property (if any)?
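
A minimal sketch of both traversals on a generic binary tree (the TreeNode class and function names here are illustrative, not tied to any particular NeetCode problem). BFS costs O(w) extra space for the widest level; iterative DFS with an explicit stack costs O(h) for tree height h.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

@dataclass
class TreeNode:
    val: int
    left: Optional["TreeNode"] = None
    right: Optional["TreeNode"] = None

def bfs(root: Optional[TreeNode]) -> list[int]:
    """Level-order traversal: O(n) time, O(w) space for max level width w."""
    out, queue = [], deque([root] if root else [])
    while queue:
        node = queue.popleft()
        out.append(node.val)
        if node.left:
            queue.append(node.left)
        if node.right:
            queue.append(node.right)
    return out

def dfs(root: Optional[TreeNode]) -> list[int]:
    """Preorder traversal with an explicit stack: O(h) space, no recursion."""
    out, stack = [], [root] if root else []
    while stack:
        node = stack.pop()
        out.append(node.val)
        if node.right:          # push right first so left pops first
            stack.append(node.right)
        if node.left:
            stack.append(node.left)
    return out
```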

Infra · Distributed training

  • Interview questions to prep: DDP & NCCL (sketch below)

    1. Walk me through how DDP synchronizes gradients across GPUs.
    2. What does NCCL do in this picture?
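
A minimal DDP sketch, assuming a single-node run launched with torchrun (so global rank equals local rank) and NCCL as the backend; the file name is hypothetical. DDP registers autograd hooks that all-reduce gradient buckets during backward(), overlapping communication with the remaining computation; NCCL supplies the GPU all-reduce primitive.

```python
# Launch (hypothetical file name): torchrun --nproc_per_node=4 ddp_demo.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")   # NCCL provides the all-reduce
    rank = dist.get_rank()                    # == local rank on a single node
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(128, 10).cuda(rank)
    # DDP hooks into autograd: gradient buckets are all-reduced (averaged)
    # across ranks while the rest of backward() is still running.
    ddp_model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

    x = torch.randn(32, 128, device=rank)
    y = torch.randint(0, 10, (32,), device=rank)
    loss = torch.nn.functional.cross_entropy(ddp_model(x), y)
    loss.backward()      # after this, grads are identical on every rank
    optimizer.step()     # so each rank takes the same parameter update

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```
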
  • Interview questions to prep: FSDP & ZeRO (sketch below)

    1. How does FSDP shard parameters, gradients, and optimizer state?
    2. Compare ZeRO-1, ZeRO-2, ZeRO-3 stages.
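
A minimal FSDP sketch, assuming the process group is already initialized as in the DDP example above. FSDP's ShardingStrategy maps roughly onto ZeRO stages: NO_SHARD is plain DDP-style replication, SHARD_GRAD_OP shards gradients and optimizer state (≈ ZeRO-2), and FULL_SHARD also shards parameters, gathering each layer on the fly for forward/backward (≈ ZeRO-3).

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
).cuda()

# FULL_SHARD ~ ZeRO-3: each rank stores 1/world_size of params, grads, and
# optimizer state; full parameters are all-gathered per layer, then freed.
fsdp_model = FSDP(model, sharding_strategy=ShardingStrategy.FULL_SHARD)

# The optimizer only sees this rank's shard of the parameters, so optimizer
# state (e.g. AdamW moments) ends up sharded as well.
optimizer = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-4)
```
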
  • Interview questions to prep: tensor & pipeline parallelism (sketch below)

    1. When do you need tensor parallelism vs pipeline parallelism, and how do they compose with data parallelism (3D parallelism)?
    2. What is pipeline-parallel bubble overhead, and how do you minimize it?
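
A back-of-envelope sketch of the bubble: with p pipeline stages and m microbatches, a GPipe-style schedule idles during the fill and drain phases, so the bubble is (p - 1) / (m + p - 1) of total time. Raising the microbatch count m is the main lever for shrinking it; non-interleaved 1F1B keeps the same bubble fraction but cuts activation memory, while interleaved schedules shrink the bubble further.

```python
def bubble_fraction(p: int, m: int) -> float:
    """Idle fraction of a GPipe-style pipeline: fill + drain take p - 1
    slots out of m + p - 1 total slots per stage."""
    return (p - 1) / (m + p - 1)

for m in (1, 4, 16, 64):
    print(f"p=8 stages, m={m:>2} microbatches: "
          f"bubble = {bubble_fraction(8, m):.1%}")
# p=8 stages, m= 1 microbatches: bubble = 87.5%
# p=8 stages, m= 4 microbatches: bubble = 63.6%
# p=8 stages, m=16 microbatches: bubble = 30.4%
# p=8 stages, m=64 microbatches: bubble = 9.9%
```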
