Day 102 of 133
RL basics: MDP, value, policy, exploration
Bellman; on-policy vs off-policy; Thompson sampling for bandits.
DSA · NeetCode Linked List
- Merge Two Sorted Lists
Interview questions to prep
- How would you generalize this to merging k sorted lists efficiently?
- Can you do it in-place without a dummy node? What do you gain and lose? (A baseline dummy-node merge is sketched below.)
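A minimal sketch of the standard dummy-node merge, assuming the usual LeetCode-style `ListNode` shape (the class definition here is illustrative; the platform normally supplies it):

```python
class ListNode:
    def __init__(self, val=0, next=None):
        self.val = val
        self.next = next

def merge_two_lists(l1, l2):
    """Merge two sorted linked lists into one sorted list: O(n + m) time, O(1) extra space."""
    dummy = ListNode()            # sentinel removes head-of-list edge cases
    tail = dummy
    while l1 and l2:
        if l1.val <= l2.val:      # <= keeps the merge stable
            tail.next, l1 = l1, l1.next
        else:
            tail.next, l2 = l2, l2.next
        tail = tail.next
    tail.next = l1 or l2          # splice in whichever list remains
    return dummy.next
```

For the k-lists follow-up, the same merge step generalizes via a min-heap over the k current heads, giving O(N log k) total.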
Specialization · Reinforcement learning basics
Interview questions to prep
- Walk through the Bellman equation for the value function.
- What's the difference between value iteration and policy iteration? (A tiny value-iteration sketch follows.)
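To make the Bellman optimality backup concrete, here is value iteration on a toy two-state MDP. The transition and reward numbers are made up purely for illustration:

```python
import numpy as np

# Toy 2-state, 2-action MDP (illustrative numbers only).
# P[s][a] is a list of (prob, next_state, reward) transitions.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.7, 1, 5.0), (0.3, 0, 0.0)]},
    1: {0: [(1.0, 0, 1.0)], 1: [(1.0, 1, 2.0)]},
}
gamma = 0.9

def value_iteration(P, gamma, tol=1e-8):
    """Iterate the Bellman optimality backup:
    V(s) <- max_a sum_{s'} p(s'|s,a) * (r + gamma * V(s'))."""
    V = np.zeros(len(P))
    while True:
        V_new = np.array([
            max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in P[s])
            for s in P
        ])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

print(value_iteration(P, gamma))
```

Policy iteration differs in that it alternates full policy evaluation with greedy policy improvement, rather than folding the max into every sweep.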
Interview questions to prep
- Compare on-policy vs off-policy RL — examples of each algorithm class.
- When is sample efficiency a deciding factor between on-policy and off-policy methods? (The two update rules are contrasted below.)
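The on/off-policy distinction shows up most clearly in the TD targets. A sketch contrasting SARSA (on-policy) with Q-learning (off-policy), assuming a plain dict-of-dicts Q-table and externally chosen `alpha`, `gamma`, and behavior policy:

```python
def sarsa_update(Q, s, a, r, s2, a2, alpha, gamma):
    """On-policy: a2 is the action the behavior policy actually took in s2."""
    Q[s][a] += alpha * (r + gamma * Q[s2][a2] - Q[s][a])

def q_learning_update(Q, s, a, r, s2, alpha, gamma):
    """Off-policy: bootstrap from the greedy action in s2, whatever the policy does."""
    Q[s][a] += alpha * (r + gamma * max(Q[s2].values()) - Q[s][a])

# Toy usage with a hypothetical 2-state, 2-action table.
Q = {0: {0: 0.0, 1: 0.0}, 1: {0: 0.0, 1: 0.0}}
sarsa_update(Q, s=0, a=1, r=1.0, s2=1, a2=0, alpha=0.1, gamma=0.9)
q_learning_update(Q, s=0, a=1, r=1.0, s2=1, alpha=0.1, gamma=0.9)
```

Because Q-learning's target doesn't depend on how data was gathered, it can reuse replayed or logged experience, which is exactly where sample efficiency tilts the choice toward off-policy methods.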
Interview questions to prep
- Compare epsilon-greedy, UCB, and Thompson sampling in a contextual bandit.
- Why does Thompson sampling usually explore more efficiently than epsilon-greedy, and what modifications (e.g., discounting old evidence) does it need for non-stationary rewards? (A Bernoulli-bandit sketch follows.)
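A minimal Thompson sampling sketch for a Bernoulli bandit with Beta(1, 1) priors. The arm probabilities in the simulation are made up for illustration:

```python
import random

class BernoulliThompson:
    """Thompson sampling: keep a Beta posterior per arm, sample, play the argmax."""
    def __init__(self, n_arms):
        self.alpha = [1.0] * n_arms  # 1 + observed successes
        self.beta = [1.0] * n_arms   # 1 + observed failures

    def select_arm(self):
        # Draw one plausible success rate per arm from its posterior.
        samples = [random.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, reward):   # reward in {0, 1}
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward

# Toy simulation with hypothetical arm probabilities.
true_p = [0.3, 0.5, 0.7]
agent = BernoulliThompson(len(true_p))
for _ in range(1000):
    arm = agent.select_arm()
    agent.update(arm, 1 if random.random() < true_p[arm] else 0)
print(agent.alpha)  # counts should concentrate on the 0.7 arm
```

Unlike epsilon-greedy's flat exploration rate, the posterior sampling here explores in proportion to remaining uncertainty; for non-stationary rewards, one common adjustment is to decay `alpha`/`beta` so old evidence fades.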
References & further reading
- OpenAI — Spinning Up in Deep RL
- Papers with Code — SOTA leaderboards