Day 12 of 133

Optimization for ML: SGD → Adam → AdamW

Convexity, momentum, adaptive optimizers, schedulers. The interview classic.

DSA · NeetCode Bit Manipulation

  • Reverse Bits · DSA · Bit Manipulation (sketch after this list)

    Interview questions to prep

    1. Walk me through the bit trick used here, bit by bit on a small input.
    2. Why XOR / AND / shift specifically — what property of that operation does the problem exploit?
    3. What's the complexity in terms of bits (often O(32) → O(1)), and where could that break for big-int?
  • Missing Number · DSA · Bit Manipulation (sketch after this list)

    Interview questions to prep

    1. Walk me through the bit trick used here, bit by bit on a small input.
    2. Why XOR / AND / shift specifically — what property of that operation does the problem exploit?
    3. What's the complexity in terms of bits (often O(32) → O(1)), and where could that break for big-int?
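
A minimal sketch of the shift-and-mask loop for Reverse Bits, assuming the usual 32-bit unsigned input; the function name and the small test value are illustrative:

```python
def reverse_bits(n: int) -> int:
    result = 0
    for _ in range(32):                     # fixed 32 iterations: O(32), i.e. O(1) for 32-bit ints
        result = (result << 1) | (n & 1)    # shift the answer left, append n's lowest bit
        n >>= 1                             # drop the bit just consumed
    return result

# Walkthrough on a small input: 0b1011 (11) becomes 1101 followed by 28 zeros.
print(bin(reverse_bits(0b1011)))            # 0b11010000000000000000000000000000
```

For arbitrary-precision integers (Python big-ints) a fixed 32-step loop only reverses the low 32 bits, which is exactly where the "O(32) → O(1)" argument breaks down.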
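
A sketch of the XOR pairing trick for Missing Number: XOR every index 0..n with every value; since x ^ x = 0 and XOR is commutative, each present number cancels its own index and only the missing one survives. Names here are illustrative:

```python
def missing_number(nums: list[int]) -> int:
    missing = len(nums)               # seed with n, the one index the loop below never produces
    for i, x in enumerate(nums):
        missing ^= i ^ x              # pair index i with value x; matching pairs cancel to 0
    return missing

print(missing_number([3, 0, 1]))      # 2
```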

Math · Optimization for ML

  • Interview questions to prep (a convexity demo follows this list)

    1. What does convexity guarantee for optimization?
    2. Are deep neural network losses convex? Why does SGD still work?
  • Gradient descent, SGD, mini-batch SGD · Statistics · Sebastian Ruder (momentum sketch after this list)

    Interview questions to prep

    1. Compare batch GD, SGD, and mini-batch SGD — trade-offs in compute, noise, and convergence.
    2. Why does SGD with momentum converge faster than vanilla SGD?
  • Interview questions to prep (gradient-descent implementation sketch after this list)

    1. Implement gradient descent for a simple squared-error objective and explain the update rule line by line.
    2. How would you debug a training run where gradient descent diverges after a few steps?
  • Interview questions to prep (AdamW and schedule sketch after this list)

    1. Compare Adam, AdamW, and SGD with momentum — which would you reach for first and why?
    2. Why is the AdamW correction important when using weight decay with adaptive optimizers?
    3. What's the role of learning-rate warmup and cosine schedules?
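
A small demo for the convexity questions: on a convex 1-D objective, gradient descent reaches the same global minimum from every starting point, while a non-convex objective (deep-net losses are non-convex, just in millions of dimensions) settles in different local minima depending on the start. The toy functions and step sizes below are made up for illustration:

```python
def gd(grad, x0, lr, steps=500):
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)                            # plain gradient descent on a scalar
    return x

convex_grad = lambda x: 2 * (x - 3)                  # f(x) = (x - 3)^2, unique global minimum at 3
nonconvex_grad = lambda x: 4 * x**3 - 6 * x + 1      # f(x) = x^4 - 3x^2 + x, two local minima

for x0 in (-5.0, 0.0, 5.0):
    print(x0, round(gd(convex_grad, x0, lr=0.1), 3),        # always ~3.0
              round(gd(nonconvex_grad, x0, lr=0.01), 3))     # answer depends on the start
```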
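
A sketch of mini-batch SGD with momentum on a least-squares objective, for the batch-GD vs. SGD vs. mini-batch comparison: batch GD is the special case batch_size = n, and vanilla SGD is batch_size = 1 with the velocity term removed. All hyperparameters here are illustrative:

```python
import numpy as np

def grad(X, y, w):
    return X.T @ (X @ w - y) / len(y)          # gradient of (1/2n)||Xw - y||^2

def sgd_momentum(X, y, lr=0.05, beta=0.9, batch_size=8, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, v = np.zeros(d), np.zeros(d)            # parameters and velocity
    for _ in range(epochs):
        idx = rng.permutation(n)               # reshuffle each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            g = grad(X[batch], y[batch], w)    # noisy but cheap mini-batch gradient
            v = beta * v + g                   # velocity: decayed running sum of past gradients,
            w = w - lr * v                     # smoothing noise and accelerating along directions
    return w                                   # that persist; vanilla SGD would be w -= lr * g
```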
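
A line-by-line sketch for the "implement gradient descent" prompt, assuming a squared-error objective L(w) = (1/2n)||Xw - y||^2 over a linear model. This objective is convex, so any initialization reaches the global minimum; if a run like this diverges after a few steps, the first suspect is a learning rate above 2 / lambda_max(X^T X / n):

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, steps=500):
    n, d = X.shape
    w = np.zeros(d)                  # any starting point works: L is convex in w
    for _ in range(steps):
        residual = X @ w - y         # prediction error on every sample
        g = X.T @ residual / n       # gradient of (1/2n)||Xw - y||^2 with respect to w
        w = w - lr * g               # move against the gradient, scaled by the learning rate
    return w

# Usage on synthetic data with a known answer.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=200)
print(gradient_descent(X, y))        # close to [2.0, -1.0, 0.5]
```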
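
A sketch of one AdamW parameter update plus a warmup-then-cosine learning-rate schedule, written from the standard update equations; the function names and hyperparameter defaults are illustrative, not tied to any particular library:

```python
import math
import numpy as np

def lr_at(step, base_lr=1e-3, warmup=100, total=1000):
    if step < warmup:                              # linear warmup tames early steps, when Adam's
        return base_lr * (step + 1) / warmup       # second-moment estimate is still unreliable
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))   # cosine decay toward 0

def adamw_step(w, g, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.01):
    m = beta1 * m + (1 - beta1) * g                # first moment: EMA of gradients
    v = beta2 * v + (1 - beta2) * g * g            # second moment: EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)                   # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)    # Adam's per-coordinate adaptive step
    w = w - lr * weight_decay * w                  # AdamW: decay applied directly to the weights,
    return w, m, v                                 # not folded into g, so it isn't rescaled by v_hat
```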

References & further reading