Day 12 of 133

Optimization for ML: SGD → Adam → AdamW

Convexity, momentum, adaptive optimizers, schedulers. The interview classic.

DSA · NeetCode Bit Manipulation

  • Reverse Bits · DSA · Bit Manipulation (sketch after this list)

    Interview questions to prep

    1. Walk me through the bit trick used here, bit by bit on a small input.
    2. Why XOR / AND / shift specifically — what property of that operation does the problem exploit?
    3. What's the complexity in terms of bits (often O(32) → O(1)), and where could that break for big-int?
  • Missing Number · DSA · Bit Manipulation (sketch after this list)

    Interview questions to prep

    1. Walk me through the bit trick used here, bit by bit on a small input.
    2. Why XOR / AND / shift specifically — what property of that operation does the problem exploit?
    3. What's the complexity in terms of bits (often O(32) → O(1)), and where could that break for big-int?
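
A minimal sketch of the shift-and-mask loop for Reverse Bits, assuming the usual 32-bit unsigned input; the function name and the small test value are illustrative:

```python
def reverse_bits(n: int) -> int:
    result = 0
    for _ in range(32):                     # fixed 32 iterations: O(32), i.e. O(1) for 32-bit ints
        result = (result << 1) | (n & 1)    # shift the answer left, append n's lowest bit
        n >>= 1                             # drop the bit just consumed
    return result

# Walkthrough on a small input: 0b1011 (11) becomes 1101 followed by 28 zeros.
print(bin(reverse_bits(0b1011)))            # 0b11010000000000000000000000000000
```

For arbitrary-precision integers (Python big-ints) a fixed 32-step loop only reverses the low 32 bits, which is exactly where the "O(32) → O(1)" argument breaks down.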
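
A sketch of the XOR pairing trick for Missing Number: XOR every index 0..n with every value; since x ^ x = 0 and XOR is commutative, each present number cancels its own index and only the missing one survives. Names here are illustrative:

```python
def missing_number(nums: list[int]) -> int:
    missing = len(nums)               # seed with n, the one index the loop below never produces
    for i, x in enumerate(nums):
        missing ^= i ^ x              # pair index i with value x; matching pairs cancel to 0
    return missing

print(missing_number([3, 0, 1]))      # 2
```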

Math · Optimization for ML

  • Interview questions to prep (a convexity demo follows this list)

    1. What does convexity guarantee for optimization?
    2. Are deep neural network losses convex? Why does SGD still work?
  • Gradient descent, SGD, mini-batch SGD · Statistics · Sebastian Ruder (momentum sketch after this list)

    Interview questions to prep

    1. Compare batch GD, SGD, and mini-batch SGD — trade-offs in compute, noise, and convergence.
    2. Why does SGD with momentum converge faster than vanilla SGD?
  • Interview questions to prep (gradient-descent implementation sketch after this list)

    1. Implement gradient descent for a simple squared-error objective and explain the update rule line by line.
    2. How would you debug a training run where gradient descent diverges after a few steps?
  • Interview questions to prep (AdamW and schedule sketch after this list)

    1. Compare Adam, AdamW, and SGD with momentum — which would you reach for first and why?
    2. Why is the AdamW correction important when using weight decay with adaptive optimizers?
    3. What's the role of learning-rate warmup and cosine schedules?
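
A small demo for the convexity questions: on a convex 1-D objective, gradient descent reaches the same global minimum from every starting point, while a non-convex objective (deep-net losses are non-convex, just in millions of dimensions) settles in different local minima depending on the start. The toy functions and step sizes below are made up for illustration:

```python
def gd(grad, x0, lr, steps=500):
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)                            # plain gradient descent on a scalar
    return x

convex_grad = lambda x: 2 * (x - 3)                  # f(x) = (x - 3)^2, unique global minimum at 3
nonconvex_grad = lambda x: 4 * x**3 - 6 * x + 1      # f(x) = x^4 - 3x^2 + x, two local minima

for x0 in (-5.0, 0.0, 5.0):
    print(x0, round(gd(convex_grad, x0, lr=0.1), 3),        # always ~3.0
              round(gd(nonconvex_grad, x0, lr=0.01), 3))     # answer depends on the start
```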
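
A sketch of mini-batch SGD with momentum on a least-squares objective, for the batch-GD vs. SGD vs. mini-batch comparison: batch GD is the special case batch_size = n, and vanilla SGD is batch_size = 1 with the velocity term removed. All hyperparameters here are illustrative:

```python
import numpy as np

def grad(X, y, w):
    return X.T @ (X @ w - y) / len(y)          # gradient of (1/2n)||Xw - y||^2

def sgd_momentum(X, y, lr=0.05, beta=0.9, batch_size=8, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, v = np.zeros(d), np.zeros(d)            # parameters and velocity
    for _ in range(epochs):
        idx = rng.permutation(n)               # reshuffle each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            g = grad(X[batch], y[batch], w)    # noisy but cheap mini-batch gradient
            v = beta * v + g                   # velocity: decayed running sum of past gradients,
            w = w - lr * v                     # smoothing noise and accelerating along directions
    return w                                   # that persist; vanilla SGD would be w -= lr * g
```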
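
A line-by-line sketch for the "implement gradient descent" prompt, assuming a squared-error objective L(w) = (1/2n)||Xw - y||^2 over a linear model. This objective is convex, so any initialization reaches the global minimum; if a run like this diverges after a few steps, the first suspect is a learning rate above 2 / lambda_max(X^T X / n):

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, steps=500):
    n, d = X.shape
    w = np.zeros(d)                  # any starting point works: L is convex in w
    for _ in range(steps):
        residual = X @ w - y         # prediction error on every sample
        g = X.T @ residual / n       # gradient of (1/2n)||Xw - y||^2 with respect to w
        w = w - lr * g               # move against the gradient, scaled by the learning rate
    return w

# Usage on synthetic data with a known answer.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=200)
print(gradient_descent(X, y))        # close to [2.0, -1.0, 0.5]
```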
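
A sketch of one AdamW parameter update plus a warmup-then-cosine learning-rate schedule, written from the standard update equations; the function names and hyperparameter defaults are illustrative, not tied to any particular library:

```python
import math
import numpy as np

def lr_at(step, base_lr=1e-3, warmup=100, total=1000):
    if step < warmup:                              # linear warmup tames early steps, when Adam's
        return base_lr * (step + 1) / warmup       # second-moment estimate is still unreliable
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))   # cosine decay toward 0

def adamw_step(w, g, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    m = beta1 * m + (1 - beta1) * g                # first moment: EMA of gradients
    v = beta2 * v + (1 - beta2) * g * g            # second moment: EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)                   # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)    # Adam's per-coordinate adaptive step
    w = w - lr * weight_decay * w                  # AdamW: decay applied directly to the weights,
    return w, m, v                                 # not folded into g, so it isn't rescaled by v_hat
```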

References & further reading