Day 103 of 133
Deep RL: DQN, policy gradients, PPO, GRPO
Replay buffer; clipped surrogate; DeepSeek-R1's GRPO.
DSA · NeetCode Trees
- Count Good Nodes in Binary Tree
Interview questions to prep
- Compare BFS vs DFS for this problem — which fits, and what's the iterative version?
- What's the recursion's space cost on the stack (O(h)), and how would you go iterative with an explicit stack to sidestep recursion-depth limits? (Sketched after this list.)
- What's the relationship between this problem's invariant and the BST property (if any)?
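A minimal Python sketch of the standard approach to this problem (LeetCode 1448): iterative DFS carrying the running maximum along the root-to-node path. The `TreeNode` class and function names are my own, and the invariant here is a path-prefix maximum, which is unrelated to the BST ordering property.

```python
# Count Good Nodes in Binary Tree: a node is "good" if no node on the
# path from the root to it has a strictly greater value.
from typing import Optional


class TreeNode:
    def __init__(self, val: int = 0,
                 left: "Optional[TreeNode]" = None,
                 right: "Optional[TreeNode]" = None):
        self.val = val
        self.left = left
        self.right = right


def good_nodes(root: Optional[TreeNode]) -> int:
    # Iterative DFS with an explicit stack of (node, path_max) pairs,
    # so space is bounded by a stack we manage ourselves rather than
    # the Python recursion limit. Space is O(h) either way.
    if root is None:
        return 0
    count = 0
    stack = [(root, root.val)]
    while stack:
        node, path_max = stack.pop()
        if node.val >= path_max:  # no ancestor was larger
            count += 1
        path_max = max(path_max, node.val)
        if node.left:
            stack.append((node.left, path_max))
        if node.right:
            stack.append((node.right, path_max))
    return count


if __name__ == "__main__":
    #       3
    #      / \
    #     1   4
    #    /   / \
    #   3   1   5
    tree = TreeNode(3,
                    TreeNode(1, TreeNode(3)),
                    TreeNode(4, TreeNode(1), TreeNode(5)))
    print(good_nodes(tree))  # 4 -> the root 3, the left 3, 4, and 5
```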
Specialization · Deep RL
Interview questions to prep · DQN
- Why does DQN need experience replay and a target network?
- What does Double DQN fix that vanilla DQN gets wrong, and why does it matter in practice? (See the sketch below.)
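A minimal sketch of the two DQN stabilizers, assuming a small PyTorch Q-network: a uniform replay buffer to decorrelate consecutive transitions, and a lagged target network so the bootstrap target doesn't chase the online net's own updates. The Double DQN change (argmax with the online net, evaluation with the target net) is shown inline. Class and function names and the hyperparameters are illustrative, not a specific library's API.

```python
import random
from collections import deque

import torch
import torch.nn as nn


class ReplayBuffer:
    """Uniform experience replay: sampling i.i.d. minibatches breaks
    the temporal correlation of consecutive transitions."""
    def __init__(self, capacity: int = 100_000):
        self.buf = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buf.append((s, a, r, s_next, done))

    def sample(self, batch_size: int):
        batch = random.sample(self.buf, batch_size)
        s, a, r, s2, d = zip(*batch)
        return (torch.stack(s), torch.tensor(a), torch.tensor(r),
                torch.stack(s2), torch.tensor(d, dtype=torch.float32))


def td_loss(online: nn.Module, target: nn.Module, batch, gamma=0.99):
    s, a, r, s2, done = batch
    q = online(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Vanilla DQN takes max over the target net, which couples
        # action selection and evaluation -> overestimation bias.
        # Double DQN decouples them: select with online, eval with target.
        best_a = online(s2).argmax(dim=1, keepdim=True)
        q_next = target(s2).gather(1, best_a).squeeze(1)
        y = r + gamma * (1.0 - done) * q_next
    return nn.functional.mse_loss(q, y)


# Target network: a lagged copy of the online net, synced every K steps:
# target.load_state_dict(online.state_dict())
```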
Interview questions to prep · Policy gradients
- Walk through the policy gradient theorem.
- Why are baselines used (e.g., advantage estimation)? (The theorem with a baseline is written out below.)
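The policy gradient theorem with a baseline, in standard notation. A sketch, not the only formulation; some versions fold in a discount factor per timestep.

```latex
% Score-function trick: the gradient of an expectation becomes an
% expectation of a gradient.
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}
    \Big[ \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,
          \big( Q^{\pi_\theta}(s_t, a_t) - b(s_t) \big) \Big]
% Any state-only baseline b(s_t) leaves the estimator unbiased, since
% \mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s)] = 0,
% but it reduces variance. Choosing b(s_t) = V^{\pi_\theta}(s_t) gives
% the advantage A(s_t, a_t) = Q^{\pi_\theta}(s_t, a_t) - V^{\pi_\theta}(s_t).
```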
Interview questions to prep · PPO & GRPO
- Walk me through PPO — what's the clipped surrogate objective and why does it stabilize training?
- How is GRPO (used in DeepSeek-R1) different from PPO? (Both objectives are sketched below.)
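A minimal sketch contrasting PPO's clipped surrogate with a GRPO-style advantage, per the DeepSeek-R1 recipe: GRPO drops PPO's learned value network (critic) and instead normalizes rewards within a group of sampled responses per prompt. Shapes, the eps value, and the toy numbers are illustrative; real GRPO also adds a KL penalty against a reference policy, omitted here.

```python
import torch


def ppo_clip_loss(logp_new, logp_old, adv, eps: float = 0.2):
    """PPO: clip the importance ratio so a single update can't move
    the policy far from the policy that collected the data."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
    # Pessimistic min of clipped/unclipped, negated for gradient descent.
    return -torch.min(unclipped, clipped).mean()


def grpo_advantages(group_rewards: torch.Tensor) -> torch.Tensor:
    """GRPO: the baseline is the group's own mean reward, so no value
    network is needed; advantages are z-scored within each group."""
    mean = group_rewards.mean(dim=-1, keepdim=True)
    std = group_rewards.std(dim=-1, keepdim=True)
    return (group_rewards - mean) / (std + 1e-8)


if __name__ == "__main__":
    # Toy numbers: 2 prompts x 4 sampled responses each.
    rewards = torch.tensor([[1.0, 0.0, 0.5, 0.2],
                            [0.9, 0.9, 0.1, 0.4]])
    adv = grpo_advantages(rewards)
    logp_old = torch.randn(2, 4)
    logp_new = logp_old + 0.05 * torch.randn(2, 4)
    print(ppo_clip_loss(logp_new, logp_old, adv))
```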
References & further reading
- OpenAI — Spinning Up in Deep RL
- Papers with Code — SOTA leaderboards