Day 92 of 133

Distributed training: DDP, FSDP, ZeRO + DSA review

Gradient sync; sharding strategies; NCCL.

DSA · NeetCode Trees

  • Interview questions to prep

    1. Compare BFS vs DFS for this problem — which fits, and what's the iterative version? (See the traversal sketch after this list.)
    2. What's the recursion's space cost on the call stack, and how would you go iterative to avoid recursion-depth limits? Note an explicit stack is still O(h), which is O(log n) only when the tree is balanced.
    3. What's the relationship between this problem's invariant and the BST property (if any)?
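
A minimal sketch of both traversals on a generic binary tree (the TreeNode class and function names here are illustrative, not tied to any particular NeetCode problem). BFS costs O(w) extra space for the widest level; iterative DFS with an explicit stack costs O(h) for tree height h.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional

@dataclass
class TreeNode:
    val: int
    left: Optional["TreeNode"] = None
    right: Optional["TreeNode"] = None

def bfs(root: Optional[TreeNode]) -> list[int]:
    """Level-order traversal: O(n) time, O(w) space for max level width w."""
    out, queue = [], deque([root] if root else [])
    while queue:
        node = queue.popleft()
        out.append(node.val)
        if node.left:
            queue.append(node.left)
        if node.right:
            queue.append(node.right)
    return out

def dfs(root: Optional[TreeNode]) -> list[int]:
    """Preorder traversal with an explicit stack: O(h) space, no recursion."""
    out, stack = [], [root] if root else []
    while stack:
        node = stack.pop()
        out.append(node.val)
        if node.right:          # push right first so left pops first
            stack.append(node.right)
        if node.left:
            stack.append(node.left)
    return out
```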

Infra · Distributed training

  • Interview questions to prep: DDP & NCCL (sketch below)

    1. Walk me through how DDP synchronizes gradients across GPUs.
    2. What does NCCL do in this picture?
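
A minimal DDP sketch, assuming a single-node run launched with torchrun (so global rank equals local rank) and NCCL as the backend; the file name is hypothetical. DDP registers autograd hooks that all-reduce gradient buckets during backward(), overlapping communication with the remaining computation; NCCL supplies the GPU all-reduce primitive.

```python
# Launch (hypothetical file name): torchrun --nproc_per_node=4 ddp_demo.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")   # NCCL provides the all-reduce
    rank = dist.get_rank()                    # == local rank on a single node
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(128, 10).cuda(rank)
    # DDP hooks into autograd: gradient buckets are all-reduced (averaged)
    # across ranks while the rest of backward() is still running.
    ddp_model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

    x = torch.randn(32, 128, device=rank)
    y = torch.randint(0, 10, (32,), device=rank)
    loss = torch.nn.functional.cross_entropy(ddp_model(x), y)
    loss.backward()      # after this, grads are identical on every rank
    optimizer.step()     # so each rank takes the same parameter update

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```
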
  • Interview questions to prep: FSDP & ZeRO (sketch below)

    1. How does FSDP shard parameters, gradients, and optimizer state?
    2. Compare ZeRO-1, ZeRO-2, ZeRO-3 stages.
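
A minimal FSDP sketch, assuming the process group is already initialized as in the DDP example above. FSDP's ShardingStrategy maps roughly onto ZeRO stages: NO_SHARD is plain DDP-style replication, SHARD_GRAD_OP shards gradients and optimizer state (≈ ZeRO-2), and FULL_SHARD also shards parameters, gathering each layer on the fly for forward/backward (≈ ZeRO-3).

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
).cuda()

# FULL_SHARD ~ ZeRO-3: each rank stores 1/world_size of params, grads, and
# optimizer state; full parameters are all-gathered per layer, then freed.
fsdp_model = FSDP(model, sharding_strategy=ShardingStrategy.FULL_SHARD)

# The optimizer only sees this rank's shard of the parameters, so optimizer
# state (e.g. AdamW moments) ends up sharded as well.
optimizer = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-4)
```
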
  • Interview questions to prep: tensor & pipeline parallelism (sketch below)

    1. When do you need tensor parallelism vs pipeline parallelism, and how do they compose with data parallelism (3D parallelism)?
    2. What is pipeline-parallel bubble overhead, and how do you minimize it?
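
A back-of-envelope sketch of the bubble: with p pipeline stages and m microbatches, a GPipe-style schedule idles during the fill and drain phases, so the bubble is (p - 1) / (m + p - 1) of total time. Raising the microbatch count m is the main lever for shrinking it; non-interleaved 1F1B keeps the same bubble fraction but cuts activation memory, while interleaved schedules shrink the bubble further.

```python
def bubble_fraction(p: int, m: int) -> float:
    """Idle fraction of a GPipe-style pipeline: fill + drain take p - 1
    slots out of m + p - 1 total slots per stage."""
    return (p - 1) / (m + p - 1)

for m in (1, 4, 16, 64):
    print(f"p=8 stages, m={m:>2} microbatches: "
          f"bubble = {bubble_fraction(8, m):.1%}")
# p=8 stages, m= 1 microbatches: bubble = 87.5%
# p=8 stages, m= 4 microbatches: bubble = 63.6%
# p=8 stages, m=16 microbatches: bubble = 30.4%
# p=8 stages, m=64 microbatches: bubble = 9.9%
```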
