Day 92 of 133
Distributed training: DDP, FSDP, ZeRO + DSA review
Gradient sync; sharding strategies; NCCL.
DSA · NeetCode Trees
- Kth Smallest Element in a BST
Interview questions to prep
- Compare BFS vs DFS for this problem — which fits, and what's the iterative version?
- What's the recursion's space cost on the call stack (O(h)), and how would you go iterative with an explicit stack — or get O(1) extra space with Morris traversal?
- What's the relationship between this problem's invariant and the BST property (if any)?
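A minimal sketch (my own, not from a solution set) of the iterative approach these questions point at: in-order traversal with an explicit stack, which visits BST values in ascending order, so the k-th node popped is the answer. Extra space is O(h), the tree height.

```python
class TreeNode:
    def __init__(self, val, left=None, right=None):
        self.val = val
        self.left = left
        self.right = right

def kth_smallest(root, k):
    """Iterative in-order traversal. The BST invariant (left < node < right)
    means in-order visits values in sorted order; stack depth is bounded
    by tree height, so extra space is O(h)."""
    stack = []
    node = root
    while stack or node:
        while node:             # descend to the leftmost unvisited node
            stack.append(node)
            node = node.left
        node = stack.pop()      # next value in ascending order
        k -= 1
        if k == 0:
            return node.val
        node = node.right       # continue with the right subtree
    return None                 # k exceeds the number of nodes
```

Note DFS (in-order) fits here, not BFS: level order has no relationship to value order in a BST.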
Infra · Distributed training
Interview questions to prep
- Walk me through how DDP synchronizes gradients across GPUs.
- What does NCCL do in this picture?
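To have a concrete picture for the DDP question: after backward, every rank holds a local gradient, and an all-reduce (issued by NCCL on GPU clusters) leaves every rank with the mean. The toy function below (my own illustration, not PyTorch code) shows only that arithmetic; real DDP buckets gradients and overlaps the NCCL all-reduces with backprop.

```python
def allreduce_mean(grads_per_rank):
    """Toy model of the all-reduce DDP performs on each gradient bucket:
    element-wise sum across ranks, divided by world size, so every rank
    ends up with identical averaged gradients and takes the same
    optimizer step. NCCL implements this collective efficiently
    (e.g. ring/tree algorithms over NVLink/InfiniBand)."""
    world_size = len(grads_per_rank)
    n = len(grads_per_rank[0])
    summed = [sum(rank[i] for rank in grads_per_rank) for i in range(n)]
    return [s / world_size for s in summed]
```

For example, two ranks holding gradients `[1.0, 2.0]` and `[3.0, 4.0]` both receive `[2.0, 3.0]`.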
- How does FSDP shard parameters, gradients, and optimizer state?
- Compare ZeRO-1, ZeRO-2, ZeRO-3 stages.
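A quick memory-arithmetic sketch for the ZeRO comparison, following the accounting in the ZeRO paper for mixed-precision Adam: 2 bytes/param for fp16 weights, 2 for fp16 gradients, 12 for optimizer state (fp32 master copy, momentum, variance). The function name and interface are mine, for illustration only.

```python
def zero_bytes_per_gpu(params, world_size, stage):
    """Per-GPU bytes for model states under mixed-precision Adam,
    per the ZeRO paper's accounting. stage 0 = plain DDP (everything
    replicated); each ZeRO stage shards one more component across
    world_size ranks."""
    p_mem = 2 * params    # fp16 parameters
    g_mem = 2 * params    # fp16 gradients
    o_mem = 12 * params   # fp32 master weights + Adam momentum/variance
    if stage >= 1:
        o_mem /= world_size   # ZeRO-1: shard optimizer states
    if stage >= 2:
        g_mem /= world_size   # ZeRO-2: also shard gradients
    if stage >= 3:
        p_mem /= world_size   # ZeRO-3: also shard parameters
    return p_mem + g_mem + o_mem
```

E.g. a 7.5B-param model on 64 GPUs needs 16 × 7.5e9 = 120 GB per GPU under plain DDP, but only about 1.9 GB per GPU for model states under ZeRO-3 (activations excluded). FSDP implements roughly the ZeRO-3 point: parameters, gradients, and optimizer state are all sharded, with parameters all-gathered per layer for compute.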
- When do you need tensor parallelism vs pipeline parallelism, and how do they compose with data parallelism (3D parallelism)?
- What is pipeline-parallel bubble overhead, and how do you minimize it?
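For the bubble question, the standard GPipe-style result: with p pipeline stages and m microbatches, the idle fraction of the schedule is (p − 1) / (m + p − 1), which is why you want many more microbatches than stages. A one-liner to reason with:

```python
def bubble_fraction(stages, microbatches):
    """GPipe-style pipeline bubble: fraction of device-time spent idle
    is (p - 1) / (m + p - 1) for p stages and m microbatches.
    Grows with pipeline depth, shrinks as microbatch count increases."""
    p, m = stages, microbatches
    return (p - 1) / (m + p - 1)
```

E.g. 4 stages with 16 microbatches gives 3/19 ≈ 16% idle time; interleaved schedules (as in Megatron-LM) reduce this further at the cost of more communication.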
References & further reading