Day 103 of 133

Deep RL: DQN, policy gradients, PPO, GRPO

Replay buffer; clipped surrogate; DeepSeek-R1's GRPO.

DSA · NeetCode Trees

  • Interview questions to prep

    1. Compare BFS vs DFS for this problem — which fits, and what's the iterative version? (See the sketch after these questions.)
    2. What's the recursion's space cost on the call stack, and how would you switch to an explicit-stack iteration if deep recursion is a concern?
    3. What's the relationship between this problem's invariant and the BST property (if any)?
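
    A minimal sketch contrasting the two traversals on a generic binary tree (the TreeNode shape and the level-order vs. inorder choice are assumptions, since the specific problem isn't named). BFS costs O(w) extra space for the widest level; an explicit-stack DFS costs O(h), which is O(log n) only when the tree is balanced.

    ```python
    from collections import deque

    class TreeNode:
        def __init__(self, val=0, left=None, right=None):
            self.val = val
            self.left = left
            self.right = right

    def bfs_level_order(root):
        """Iterative BFS: the queue holds one level at a time -> O(w) extra space."""
        if not root:
            return []
        order, queue = [], deque([root])
        while queue:
            node = queue.popleft()
            order.append(node.val)
            if node.left:
                queue.append(node.left)
            if node.right:
                queue.append(node.right)
        return order

    def dfs_inorder_iterative(root):
        """Iterative DFS with an explicit stack -> O(h) extra space."""
        order, stack, node = [], [], root
        while node or stack:
            while node:
                stack.append(node)
                node = node.left
            node = stack.pop()
            order.append(node.val)
            node = node.right
        return order
    ```

    Inorder DFS also connects to question 3: on a BST, this traversal visits keys in sorted order, which is the invariant many BST problems lean on.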

Specialization · Deep RL

  • Interview questions to prep

    1. Why does DQN need experience replay and a target network? (See the sketch after these questions.)
    2. What does Double DQN fix that vanilla DQN gets wrong, and why does it matter in practice?
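
    A minimal PyTorch sketch of the two pieces question 1 asks about: a uniform replay buffer and a periodically synced target network (network sizes, hyperparameters, and the environment loop are assumptions and omitted). The Double DQN change from question 2 is noted inline.

    ```python
    import random
    from collections import deque

    import numpy as np
    import torch
    import torch.nn as nn

    class ReplayBuffer:
        """Uniform experience replay: sampling old, shuffled transitions breaks the
        temporal correlation in the data stream and lets experience be reused."""
        def __init__(self, capacity=100_000):
            self.buffer = deque(maxlen=capacity)

        def push(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size):
            states, actions, rewards, next_states, dones = zip(*random.sample(self.buffer, batch_size))
            return (torch.as_tensor(np.array(states), dtype=torch.float32),
                    torch.as_tensor(actions, dtype=torch.int64),
                    torch.as_tensor(rewards, dtype=torch.float32),
                    torch.as_tensor(np.array(next_states), dtype=torch.float32),
                    torch.as_tensor(dones, dtype=torch.float32))

    def dqn_update(online_net, target_net, optimizer, buffer, batch_size=64, gamma=0.99):
        """One TD update. The frozen target_net keeps the bootstrap target from
        chasing the network being trained, which is what stabilizes learning."""
        states, actions, rewards, next_states, dones = buffer.sample(batch_size)
        q = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            # Vanilla DQN: max over the target network's Q-values (prone to overestimation).
            next_q = target_net(next_states).max(dim=1).values
            # Double DQN: pick the argmax action with online_net, evaluate it with target_net:
            #   best = online_net(next_states).argmax(dim=1, keepdim=True)
            #   next_q = target_net(next_states).gather(1, best).squeeze(1)
            target = rewards + gamma * (1.0 - dones) * next_q
        loss = nn.functional.mse_loss(q, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Every N steps, sync: target_net.load_state_dict(online_net.state_dict())
    ```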
  • Policy gradients & REINFORCE · Deep Learning · Lilian Weng

    Interview questions to prep

    1. Walk through the policy gradient theorem.
    2. Why are baselines used (e.g., advantage estimation)? (See the sketch below.)
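
    A minimal sketch of REINFORCE with a baseline (the policy network, a discrete action space, and the choice of a mean-return baseline rather than a learned V(s) are assumptions):

    ```python
    import torch
    from torch.distributions import Categorical

    def reinforce_loss(policy_net, states, actions, rewards, gamma=0.99):
        """Policy gradient for one episode: grad J = E[ sum_t grad log pi(a_t|s_t) * G_t ].
        Subtracting a baseline from G_t leaves the gradient unbiased but cuts its variance."""
        # Discounted returns G_t, computed backward through the episode.
        returns, g = [], 0.0
        for r in reversed(rewards):
            g = r + gamma * g
            returns.append(g)
        returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)

        # Simplest baseline: the mean return. A learned value function V(s) gives
        # the advantage A_t = G_t - V(s_t) used in actor-critic methods.
        advantages = returns - returns.mean()

        logits = policy_net(states)                               # [T, n_actions]
        log_probs = Categorical(logits=logits).log_prob(actions)  # [T]
        # Negative sign: optimizers minimize, while the policy gradient ascends J.
        return -(log_probs * advantages).sum()
    ```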
  • Actor-critic, PPO, GRPO · Deep Learning · Schulman et al.

    Interview questions to prep

    1. Walk me through PPO — what's the clipped surrogate objective and why does it stabilize training?
    2. How is GRPO (used in DeepSeek-R1) different from PPO? (Both are sketched below.)
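
    A minimal sketch of the clipped surrogate, plus GRPO's group-relative advantage for contrast (epsilon, the group layout, and the omission of PPO's value/entropy terms and GRPO's KL penalty to a reference policy are simplifying assumptions):

    ```python
    import torch

    def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
        """Clipped surrogate: clipping the probability ratio removes the incentive to move
        far from the behavior policy in one update, which is what stabilizes training."""
        ratio = torch.exp(new_log_probs - old_log_probs)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
        return -torch.min(unclipped, clipped).mean()

    def grpo_advantages(rewards, group_size):
        """GRPO (DeepSeek-R1): sample a group of responses per prompt and normalize each
        reward against the group mean/std, replacing PPO's learned critic as the baseline."""
        r = rewards.view(-1, group_size)                          # [num_prompts, group_size]
        return ((r - r.mean(dim=1, keepdim=True)) /
                (r.std(dim=1, keepdim=True) + 1e-8)).view(-1)
    ```

    GRPO then plugs these group-relative advantages into the same ratio-clipping objective, so the headline difference from PPO is dropping the value network (critic) in favor of a per-prompt group baseline.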

References & further reading