Day 96 of 133
ML Infra consolidation + 3D parallelism review
Compose data parallelism (DP), tensor parallelism (TP), and pipeline parallelism (PP); walk through Megatron-style training.
DSA · NeetCode 1-D DP
- Word Break
Interview questions to prep
- State the DP: define the state, the transition, and the base case explicitly (see the Word Break sketch after this list).
- Top-down (memoized recursion) vs bottom-up (tabulation) — which is more natural here, and why?
- Can you space-optimize from O(n) to O(1)? Show the rolling-window trick.
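A minimal bottom-up Word Break (LeetCode 139) sketch; the function name and the `max_len` lookback cap are my own choices, not from the notes:

```python
def word_break(s: str, word_dict: list[str]) -> bool:
    """Return True if s can be split into a sequence of dictionary words."""
    words = set(word_dict)
    max_len = max(map(len, words), default=0)  # longest useful lookback

    # State: dp[i] is True iff the prefix s[:i] is segmentable.
    dp = [False] * (len(s) + 1)
    dp[0] = True  # Base case: the empty prefix.

    for i in range(1, len(s) + 1):
        # Transition: dp[i] = any(dp[j] and s[j:i] in words).
        for j in range(max(0, i - max_len), i):
            if dp[j] and s[j:i] in words:
                dp[i] = True
                break
    return dp[len(s)]

assert word_break("leetcode", ["leet", "code"])
assert not word_break("catsandog", ["cats", "dog", "sand", "and", "cat"])
```

On space: the O(1) rolling-window trick applies when the transition looks back a constant number of states (Climbing Stairs, House Robber). Here the lookback is bounded by the longest word, so the best reduction is shrinking dp to a window of `max_len + 1` entries rather than O(1).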
Infra · Distributed training
Interview questions to prep
- Walk me through how DDP synchronizes gradients across GPUs.
- What does NCCL do in this picture? (A minimal DDP sketch follows this list.)
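A minimal single-node DDP sketch, assuming a `torchrun` launch (which sets RANK/WORLD_SIZE/LOCAL_RANK in the environment); the toy model and filename are placeholders:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # NCCL supplies the GPU collectives (all-reduce, broadcast, ...).
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).to(f"cuda:{local_rank}")
    # DDP registers autograd hooks; as each bucket of gradients becomes
    # ready during backward, it launches an async all-reduce and averages,
    # overlapping communication with the remaining backward compute.
    model = DDP(model, device_ids=[local_rank])

    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
    x = torch.randn(32, 1024, device=f"cuda:{local_rank}")
    loss = model(x).square().mean()
    loss.backward()  # gradient all-reduce happens inside this call
    opt.step()       # every rank applies the identical averaged gradient

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Run with e.g. `torchrun --nproc_per_node=8 ddp_min.py` (filename hypothetical).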
Interview questions to prep
- How does FSDP shard parameters, gradients, and optimizer state?
- Compare the ZeRO-1, ZeRO-2, and ZeRO-3 stages (an FSDP sketch follows this list).
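A minimal FSDP sketch in its default full-shard (ZeRO-3-like) mode, assuming a single node where rank equals local rank; the toy model is illustrative:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank())  # single-node assumption

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
).cuda()

# Full-shard FSDP: parameters live sharded across ranks; they are
# all-gathered just before a wrapped unit's forward/backward and freed
# right after, gradients are reduce-scattered so each rank keeps only
# its shard, and optimizer state is built per shard.
model = FSDP(model)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
```

For the ZeRO comparison: ZeRO-1 shards only optimizer state, ZeRO-2 additionally shards gradients, and ZeRO-3 (what full-shard FSDP mirrors) also shards the parameters themselves.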
Interview questions to prep
- When do you need tensor parallelism vs pipeline parallelism, and how do they compose with data parallelism (3D parallelism)?
- What is pipeline-parallel bubble overhead, and how do you minimize it? (The bubble-fraction arithmetic is sketched below.)
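Back-of-envelope bubble arithmetic for a GPipe-style schedule, using the standard estimate bubble ≈ (p - 1) / (m + p - 1) for p pipeline stages and m microbatches:

```python
def bubble_fraction(stages: int, microbatches: int) -> float:
    """GPipe-style idle fraction: (p - 1) / (m + p - 1)."""
    return (stages - 1) / (microbatches + stages - 1)

print(bubble_fraction(8, 8))   # ~0.47: nearly half the schedule is idle
print(bubble_fraction(8, 64))  # ~0.10: many microbatches amortize fill/drain
```

So the main levers are more microbatches per step and interleaved 1F1B schedules (as in Megatron-LM), which give each rank several smaller stage chunks to shrink the fill and drain phases.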
Infra · Training efficiency
Interview questions to prep
- Compare fp16 vs bf16 vs fp8 training — when does each fail or shine?
- What does dynamic loss scaling do, and on which hardware do you still need it? (A mixed-precision sketch follows this list.)
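A minimal sketch contrasting bf16 autocast with fp16 plus dynamic loss scaling; the toy model and loss are placeholders:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
x = torch.randn(32, 1024, device="cuda")

# bf16 keeps fp32's exponent range, so gradients rarely underflow and no
# loss scaling is needed (fast on Ampere or newer).
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).square().mean()
loss.backward()
opt.step()
opt.zero_grad()

# fp16 has a narrow exponent range: small gradients underflow to zero,
# so the loss is scaled up before backward and grads unscaled before step.
scaler = torch.cuda.amp.GradScaler()
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = model(x).square().mean()
scaler.scale(loss).backward()  # backward on the scaled loss
scaler.step(opt)               # unscale; skip the step on inf/nan grads
scaler.update()                # grow/shrink the scale dynamically
opt.zero_grad()
```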
Interview questions to prep
- Why does gradient accumulation let you simulate a larger batch size?
- When does gradient accumulation NOT match true large-batch training (e.g., BatchNorm)? (Sketch below.)
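A minimal gradient-accumulation sketch: dividing each microbatch loss by the accumulation count makes the summed gradients equal the mean over the effective large batch. Dummy data stands in for a real loader:

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(1024, 10).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
accum_steps = 8  # effective batch = accum_steps * microbatch size

loader = [(torch.randn(32, 1024), torch.randint(0, 10, (32,)))
          for _ in range(16)]  # dummy stand-in for a DataLoader

for step, (x, y) in enumerate(loader):
    loss = F.cross_entropy(model(x.cuda()), y.cuda())
    (loss / accum_steps).backward()  # grads sum across backward() calls
    if (step + 1) % accum_steps == 0:
        opt.step()      # one optimizer step per effective large batch
        opt.zero_grad()
```

The equivalence holds when the loss averages over samples and every layer acts per sample; it breaks for batch-statistics layers like BatchNorm, which normalize over the microbatch of 32 rather than the effective batch of 256.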
Interview questions to prep
- How does activation checkpointing trade compute for memory?
- When would you offload optimizer state to CPU/NVMe? (An activation-checkpointing sketch follows.)
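A minimal activation-checkpointing sketch with `torch.utils.checkpoint`: the wrapped blocks drop their intermediate activations in forward and recompute them during backward, costing roughly one extra forward pass:

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim),
            torch.nn.GELU(),
            torch.nn.Linear(4 * dim, dim),
        )

    def forward(self, x):
        return self.net(x)

blocks = torch.nn.ModuleList([Block() for _ in range(12)]).cuda()
x = torch.randn(32, 1024, device="cuda", requires_grad=True)

h = x
for blk in blocks:
    # Forward runs without saving intermediates; inputs are stashed and
    # the block is re-run during backward to regenerate activations.
    h = checkpoint(blk, h, use_reentrant=False)
h.sum().backward()
```

CPU/NVMe offload is the next rung: systems like DeepSpeed's ZeRO-Offload and ZeRO-Infinity park optimizer state (and optionally parameters) off-GPU when even sharded state doesn't fit, paying PCIe/NVMe bandwidth on every step.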