Day 41 of 133

DL training tricks (gradient clipping, gradient accumulation, mixed precision, activation checkpointing) + DSA Trees

How real teams fit big models on small GPUs.

DSA · NeetCode Trees

  • Interview questions to prep — traversal & the BST property (see the traversal sketch after this list)

    1. Compare BFS vs DFS for this problem — which fits, and what's the iterative version?
    2. What's the recursion's space cost on the stack, and how would you go iterative if you needed O(log n)?
    3. What's the relationship between this problem's invariant and the BST property (if any)?
  • Interview questions to prep — recursion structure & complexity (see the diameter sketch after this list)

    1. What does the recursion return vs what it updates globally? Why those two different things?
    2. What's the time and space complexity, and where does the space go?
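
A minimal sketch of the BFS-vs-DFS question: BFS with a queue and an iterative preorder DFS with an explicit stack. The `TreeNode` class is the usual LeetCode-style node; the function names and structure are illustrative, not tied to one specific problem.

```python
from collections import deque

class TreeNode:
    def __init__(self, val=0, left=None, right=None):
        self.val = val
        self.left = left
        self.right = right

def bfs(root):
    """Level-order traversal: O(n) time, O(width) extra space for the queue."""
    if not root:
        return []
    out, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        out.append(node.val)
        if node.left:
            queue.append(node.left)
        if node.right:
            queue.append(node.right)
    return out

def dfs_iterative(root):
    """Preorder traversal with an explicit stack instead of the call stack:
    O(n) time, O(h) extra space, where h is the tree height
    (about O(log n) for a balanced tree, O(n) worst case for a skewed one)."""
    out, stack = [], [root] if root else []
    while stack:
        node = stack.pop()
        out.append(node.val)
        # Push right first so the left child is processed first (preorder).
        if node.right:
            stack.append(node.right)
        if node.left:
            stack.append(node.left)
    return out
```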
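The return-vs-global-update pattern from the second question group, sketched on diameter-of-binary-tree (my choice of example, not necessarily the course's problem): the helper returns the subtree height because the parent needs it, while `best` accumulates the answer as a side effect. It reuses the `TreeNode` class from the sketch above.

```python
def diameter(root):
    """Time O(n): each node is visited once. Space O(h) for the recursion stack."""
    best = 0

    def height(node):
        nonlocal best
        if not node:
            return 0
        left = height(node.left)
        right = height(node.right)
        # The longest path THROUGH this node uses both subtree heights...
        best = max(best, left + right)
        # ...but the parent only ever needs this subtree's height back.
        return 1 + max(left, right)

    height(root)
    return best
```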

DL · Training tricks that matter

  • Interview questions to prep — gradient clipping & accumulation (sketch after this list)

    1. When would you reach for gradient clipping?
    2. Why does gradient accumulation let you simulate a larger batch size?
  • Interview questions to prep — mixed precision (sketch after this list)

    1. Compare fp16 vs bf16 — why does bf16 matter for training stability?
    2. What is loss scaling and when do you still need it under bf16/fp8?
  • Interview questions to prep — activation checkpointing (sketch after this list)

    1. How does activation checkpointing trade compute for memory?
    2. When does activation checkpointing become NOT worth it — what's the typical compute overhead?
  • Interview questions to prep — PyTorch basics (sketch after this list)

    1. Implement a minimal PyTorch training loop for MNIST-style handwritten digits and name each required step.
    2. Why do PyTorch image tensors usually use channel-first shape NCHW instead of NHWC?
    3. What is the practical difference between a PyTorch tensor and a NumPy array during training?
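
A hedged sketch of where clipping and accumulation sit in a plain PyTorch loop. The tiny linear model and random loader are placeholders; the point is the order of operations: scale each micro-batch loss by the number of accumulation steps, clip once per effective batch, then step.

```python
import torch

# Placeholder model and data; only the structure of the loop matters here.
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()
loader = [(torch.randn(8, 10), torch.randint(0, 2, (8,))) for _ in range(16)]

accum_steps = 4  # effective batch = 8 * 4 = 32 at the memory cost of batch 8

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y)
    # Divide so the accumulated gradient equals the big-batch average.
    (loss / accum_steps).backward()

    if (step + 1) % accum_steps == 0:
        # Clip the global gradient norm once, after all micro-batches
        # have contributed and just before the optimizer step.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        optimizer.zero_grad()
```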
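A sketch of mixed precision with `torch.autocast`, assuming a CUDA GPU is available. fp16 has a narrow exponent range, so small gradients underflow to zero and a `GradScaler` multiplies the loss before backward; bf16 keeps fp32's exponent range, so the scaler is usually unnecessary.

```python
import torch

model = torch.nn.Linear(10, 2).cuda()   # assumes a CUDA device
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()    # loss scaling: needed for fp16, not bf16

x = torch.randn(8, 10, device="cuda")
y = torch.randint(0, 2, (8,), device="cuda")

# fp16: scale the loss up so small gradients survive, unscale before the step.
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = loss_fn(model(x), y)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()

# bf16: same exponent range as fp32, so no scaler in the common case.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```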
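A sketch of activation checkpointing with `torch.utils.checkpoint`: activations inside each wrapped block are dropped during the forward pass and recomputed during backward, so memory no longer grows with every layer's activations, at the cost of roughly one extra forward pass through each checkpointed block. The `Block` module is a made-up stand-in for a transformer layer.

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    """Stand-in for an expensive layer (e.g. a transformer block)."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim),
            torch.nn.GELU(),
            torch.nn.Linear(4 * dim, dim),
        )

    def forward(self, x):
        return x + self.net(x)

blocks = torch.nn.ModuleList([Block() for _ in range(4)])
x = torch.randn(8, 256, requires_grad=True)

h = x
for blk in blocks:
    # Intermediate activations inside `blk` are not stored; they are
    # recomputed during backward (compute traded for memory).
    h = checkpoint(blk, h, use_reentrant=False)

h.sum().backward()
```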
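A minimal training loop for question 1, with each required step named in a comment. Random tensors shaped like MNIST (NCHW: batch, channel, height, width) stand in for the real dataset so the sketch runs without downloading anything.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Fake MNIST-shaped data (NCHW: 512 images, 1 channel, 28x28), runs offline.
images = torch.randn(512, 1, 28, 28)
labels = torch.randint(0, 10, (512,))
loader = DataLoader(TensorDataset(images, labels), batch_size=64, shuffle=True)

model = torch.nn.Sequential(
    torch.nn.Flatten(),
    torch.nn.Linear(28 * 28, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 10),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(2):
    for x, y in loader:
        optimizer.zero_grad()      # 1. clear gradients from the last step
        logits = model(x)          # 2. forward pass
        loss = loss_fn(logits, y)  # 3. compute the loss
        loss.backward()            # 4. backward pass: fill .grad on parameters
        optimizer.step()           # 5. update the weights
    print(f"epoch {epoch}: last batch loss {loss.item():.4f}")
```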

References & further reading