Day 96 of 133

ML Infra consolidation + 3D parallelism review

Compose DP + TP + PP. Walk through Megatron-style training.

DSA · NeetCode 1-D DP

  • Word Break

    Interview questions to prep

    1. State the DP: define the state, the transition, and the base case explicitly.
    2. Top-down (memoized recursion) vs bottom-up (tabulation) — which is more natural here, and why?
    3. Can you cut the space below O(n)? For Fibonacci-style 1-D DPs (Climbing Stairs, House Robber) a rolling window gives O(1); for Word Break the lookback is bounded by the longest dictionary word, not a constant.
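
A minimal bottom-up sketch of the Word Break DP, assuming the usual LeetCode-style signature; function and variable names here are illustrative:

```python
def word_break(s: str, word_dict: list[str]) -> bool:
    """dp[i] is True iff the prefix s[:i] can be segmented into dictionary words."""
    words = set(word_dict)
    max_len = max(map(len, words), default=0)   # bound on how far back a word can start
    dp = [False] * (len(s) + 1)
    dp[0] = True                                # base case: the empty prefix
    for i in range(1, len(s) + 1):
        # transition: dp[i] = any(dp[j] and s[j:i] in words),
        # scanning back no further than the longest dictionary word
        for j in range(max(0, i - max_len), i):
            if dp[j] and s[j:i] in words:
                dp[i] = True
                break
    return dp[len(s)]


assert word_break("leetcode", ["leet", "code"])
assert not word_break("catsandog", ["cats", "dog", "sand", "and", "cat"])
```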

Infra · Distributed training

  • DDP & NCCL (a launch sketch follows this list)

    Interview questions to prep

    1. Walk me through how DDP synchronizes gradients across GPUs.
    2. What does NCCL do in this picture?
  • FSDP & ZeRO sharding (sketch after this list)

    Interview questions to prep

    1. How does FSDP shard parameters, gradients, and optimizer state?
    2. Compare ZeRO-1, ZeRO-2, ZeRO-3 stages.
  • Tensor, pipeline, and 3D parallelism (bubble math after this list)

    Interview questions to prep

    1. When do you need tensor parallelism vs pipeline parallelism, and how do they compose with data parallelism (3D parallelism)?
    2. What is pipeline-parallel bubble overhead, and how do you minimize it?
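
For the DDP/NCCL bullet: a minimal one-process-per-GPU sketch, assuming a torchrun launch that sets RANK, LOCAL_RANK, and WORLD_SIZE; the toy model, data, and hyperparameters are placeholders.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    dist.init_process_group(backend="nccl")   # NCCL provides the GPU all-reduce collectives
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()
    # DDP registers autograd hooks; during backward it all-reduces gradient buckets
    # asynchronously, overlapping communication with the remaining computation.
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(32, 1024, device="cuda")   # each rank feeds its own shard of data
    loss = model(x).pow(2).mean()
    loss.backward()                            # gradients are averaged across ranks here
    opt.step()                                 # every rank applies the same averaged gradient

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launch with something like `torchrun --nproc_per_node=8 ddp_sketch.py`.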
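For the FSDP/ZeRO bullet: a sketch of fully sharded wrapping with PyTorch FSDP (same torchrun assumption as above); the model is a stand-in and the ZeRO-stage analogies are approximate.

```python
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 4096)
).cuda()

# FULL_SHARD ~ ZeRO-3: parameters, gradients, and optimizer state are all sharded;
# parameters are all-gathered just in time for each forward/backward and freed after.
# SHARD_GRAD_OP ~ ZeRO-2: shard gradients and optimizer state, keep full parameters.
# (ZeRO-1 shards the optimizer state only.)
model = FSDP(model, sharding_strategy=ShardingStrategy.FULL_SHARD)

opt = torch.optim.AdamW(model.parameters(), lr=1e-4)   # each rank sees only its local shards
```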
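For the pipeline-parallelism bullet: a back-of-the-envelope check of bubble overhead for a GPipe/1F1B-style schedule with p stages and m microbatches (the stage and microbatch counts below are made up).

```python
def bubble_fraction(p: int, m: int) -> float:
    """Fraction of total pipeline time spent idle: (p - 1) / (m + p - 1)."""
    return (p - 1) / (m + p - 1)


for m in (4, 16, 64):
    print(f"p=8 stages, m={m:>2} microbatches -> bubble ~ {bubble_fraction(8, m):.1%}")
# More microbatches per step (or interleaved virtual stages, as in Megatron-LM) shrink the bubble.
```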

Infra · Training efficiency

  • Mixed precision: fp16 / bf16 / fp8 (loss-scaling sketch after this list)

    Interview questions to prep

    1. Compare fp16 vs bf16 vs fp8 training — when does each fail or shine?
    2. What does dynamic loss scaling do, and on which hardware do you still need it?
  • Gradient accumulation (loop sketch after this list)

    Interview questions to prep

    1. Why does gradient accumulation let you simulate a larger batch size?
    2. When does gradient accumulation NOT match true large-batch training (e.g., BatchNorm)?
  • Activation checkpointing & offload (sketch after this list)

    Interview questions to prep

    1. How does activation checkpointing trade compute for memory?
    2. When would you offload optimizer state to CPU/NVMe?
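
For the mixed-precision bullet: a sketch of fp16 autocast with dynamic loss scaling; model, data, and step count are placeholders. bf16 shares fp32's exponent range, so a scaler is generally unnecessary there; fp16's narrow range is why small gradients underflow without scaling.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()   # dynamic loss scaling: grows the scale, backs off on overflow

for _ in range(100):
    x = torch.randn(32, 1024, device="cuda")
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()      # backward on the scaled loss so tiny grads don't underflow
    scaler.step(opt)                   # unscales grads; skips the step if inf/nan is found
    scaler.update()                    # adjusts the scale for the next iteration
    opt.zero_grad(set_to_none=True)
```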
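For the gradient-accumulation bullet: the standard loop. Gradients sum linearly, so N micro-batches of size b approximate one optimizer step at batch size N*b once the loss is divided by N; it is not identical when layers keep per-batch statistics (BatchNorm still normalizes over b). Batch sizes and the accumulation factor here are illustrative.

```python
import torch

model = torch.nn.Linear(512, 512).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
accum_steps = 8                                   # effective batch = 8 * 16 = 128

for step in range(80):
    x = torch.randn(16, 512, device="cuda")       # micro-batch of 16
    loss = model(x).pow(2).mean() / accum_steps   # scale so the accumulated grads form a mean
    loss.backward()                               # grads accumulate into .grad across calls
    if (step + 1) % accum_steps == 0:
        opt.step()
        opt.zero_grad(set_to_none=True)
```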
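For the checkpointing/offload bullet: a sketch with torch.utils.checkpoint, which discards a block's internal activations in the forward pass and recomputes them during backward, trading roughly one extra forward per block for a large cut in activation memory. The block layout is illustrative. (Optimizer-state offload to CPU/NVMe is typically a config knob, e.g. FSDP's CPUOffload or DeepSpeed ZeRO-Offload, rather than model code, so it is not shown.)

```python
import torch
from torch.utils.checkpoint import checkpoint


class Block(torch.nn.Module):
    def __init__(self, d: int = 1024):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(d, 4 * d), torch.nn.GELU(), torch.nn.Linear(4 * d, d)
        )

    def forward(self, x):
        # Only the block input is kept; the MLP's activations are recomputed in backward.
        return x + checkpoint(self.ff, x, use_reentrant=False)


model = torch.nn.Sequential(*[Block() for _ in range(12)]).cuda()
x = torch.randn(8, 1024, device="cuda", requires_grad=True)
model(x).mean().backward()
```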

References & further reading