Day 12 of 133
Optimization for ML: SGD → Adam → AdamW
Convexity, momentum, adaptive optimizers, schedulers. The interview classic.
DSA · NeetCode Bit Manipulation
- Reverse Bits (DSA · Bit Manipulation)
Interview questions to prep
- Walk me through the bit trick used here, bit by bit on a small input.
- Why XOR / AND / shift specifically — what property of that operation does the problem exploit?
- What's the complexity in terms of bits (often O(32) → O(1)), and where could that break for big-int?
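A minimal sketch of the usual answer, assuming the LeetCode-style signature where n is treated as an unsigned 32-bit integer: peel off the lowest bit of n and push it onto the result, 32 times.

```python
def reverse_bits(n: int) -> int:
    """Reverse the 32 bits of n (assumed to fit in an unsigned 32-bit word)."""
    result = 0
    for _ in range(32):
        result = (result << 1) | (n & 1)  # shift what we have left, append n's lowest bit
        n >>= 1                           # discard the bit just consumed
    return result
```

The loop is a fixed 32 iterations, which is why the complexity is usually quoted as O(32) = O(1); with Python's arbitrary-precision ints the same trick only makes sense once you fix an explicit bit width.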
- Missing Number (DSA · Bit Manipulation)
Interview questions to prep
- Walk me through the bit trick used here, bit by bit on a small input.
- Why XOR / AND / shift specifically — what property of that operation does the problem exploit?
- What's the complexity in terms of bits (often O(32) → O(1)), and where could that break for big-int?
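A sketch of the XOR approach, assuming the standard setup where nums holds n distinct values from 0..n with exactly one missing: XOR every index and every value together, paired occurrences cancel (x ^ x == 0), and the lone survivor is the answer.

```python
def missing_number(nums: list[int]) -> int:
    """Return the one value in 0..n absent from nums, using XOR cancellation."""
    acc = len(nums)                # n itself never appears as a loop index, so seed with it
    for i, v in enumerate(nums):
        acc ^= i ^ v               # every present value eventually cancels against its index
    return acc
```

This is single-pass with O(1) extra space, and it works because XOR is associative, commutative, and self-inverse, which is the short answer to the "why XOR specifically" question.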
Math · Optimization for ML
Interview questions to prep
- What does convexity guarantee for optimization?
- Are deep neural network losses convex? Why does SGD still work?
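For the first question, the property worth writing down is the defining inequality plus its main consequence; a minimal statement:

```latex
% Convexity: for all x, y and t \in [0, 1]
f\big(t x + (1 - t) y\big) \le t\, f(x) + (1 - t)\, f(y)

% Consequence for differentiable convex f: any stationary point is a global minimum
\nabla f(x^{*}) = 0 \;\Longrightarrow\; f(x^{*}) \le f(y) \quad \forall y
```

Deep-net losses do not satisfy this (saddle points, and permutations of hidden units produce many equivalent minima), which is exactly the tension the second question is probing.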
Interview questions to prep
- Compare batch GD, SGD, and mini-batch SGD — trade-offs in compute, noise, and convergence.
- Why does SGD with momentum converge faster than vanilla SGD?
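A minimal NumPy sketch of SGD with heavy-ball momentum, assuming a hypothetical grad(w) callback that returns a mini-batch gradient estimate; the batch-GD vs SGD vs mini-batch trade-off is just how much data that callback touches per call.

```python
import numpy as np

def sgd_momentum(grad, w0, lr=0.1, beta=0.9, steps=100):
    """Heavy-ball SGD: v keeps an exponentially weighted sum of past gradients."""
    w = np.asarray(w0, dtype=float).copy()
    v = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)          # noisy mini-batch gradient estimate
        v = beta * v + g     # momentum buffer: averages out noise, builds speed along consistent directions
        w = w - lr * v
    return w
```

With beta = 0 this collapses to vanilla SGD; the buffer is what damps oscillations across high-curvature directions and accelerates progress along low-curvature ones, which is the short answer to the momentum question.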
Interview questions to prep
- Implement gradient descent for a simple squared-error objective and explain the update rule line by line.
- How would you debug a training run where gradient descent diverges after a few steps?
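The first question asks for the implementation outright, so here is a sketch on least-squares linear regression with hypothetical toy data; the gradient of L(w) = (1/2n) * ||Xw - y||^2 is X^T (Xw - y) / n.

```python
import numpy as np

def gd_least_squares(X, y, lr=0.1, steps=500):
    """Full-batch gradient descent on L(w) = (1/2n) * ||Xw - y||^2."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        residual = X @ w - y        # predictions minus targets
        grad = X.T @ residual / n   # gradient of the squared-error loss
        w -= lr * grad              # step opposite the gradient, scaled by the learning rate
    return w

# Hypothetical usage: should recover weights close to [2.0, -3.0]
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -3.0])
print(gd_least_squares(X, y))
```

For the divergence question, the usual first suspects are a learning rate too large for the curvature of the loss and unscaled features; halving lr and standardizing X is the quickest experiment.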
Interview questions to prep
- Compare Adam, AdamW, and SGD with momentum — which would you reach for first and why?
- Why is the AdamW correction important when using weight decay with adaptive optimizers?
- What's the role of learning-rate warmup and cosine schedules?
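A sketch of one AdamW step plus a warmup-cosine schedule to anchor the comparison; the detail worth calling out is that weight decay multiplies the weights directly instead of being added to the gradient, so it is not rescaled by Adam's per-coordinate denominator (that rescaling is what plain Adam with L2 regularization gets wrong).

```python
import math
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, wd=0.01):
    """One AdamW update; m, v are first/second moment estimates, t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * g            # EMA of gradients
    v = beta2 * v + (1 - beta2) * g * g        # EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias corrections for the zero-initialized EMAs
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)  # decoupled weight decay
    return w, m, v

def warmup_cosine_lr(step, total_steps, base_lr=1e-3, warmup_steps=1000):
    """Linear warmup to base_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

Warmup keeps the early steps small while the moment estimates are still unreliable; the cosine tail then anneals the step size so the run settles into a minimum instead of bouncing around it.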