Day 102 of 133
RL basics: MDP, value, policy, exploration
Bellman; on-policy vs off-policy; Thompson sampling for bandits.
DSA · NeetCode Linked List
- Merge Two Sorted Lists
Interview questions to prep
- How would you generalize this to merging k sorted lists efficiently?
- Can you do it in-place without a dummy node? What do you gain and lose? (A baseline dummy-node merge is sketched below.)
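A minimal sketch of the standard dummy-node merge, assuming the usual LeetCode-style `ListNode` shape (the class definition here is illustrative; the platform normally supplies it):

```python
class ListNode:
    def __init__(self, val=0, next=None):
        self.val = val
        self.next = next

def merge_two_lists(l1, l2):
    """Merge two sorted linked lists into one sorted list: O(n + m) time, O(1) extra space."""
    dummy = ListNode()            # sentinel removes head-of-list edge cases
    tail = dummy
    while l1 and l2:
        if l1.val <= l2.val:      # <= keeps the merge stable
            tail.next, l1 = l1, l1.next
        else:
            tail.next, l2 = l2, l2.next
        tail = tail.next
    tail.next = l1 or l2          # splice in whichever list remains
    return dummy.next
```

For the k-lists follow-up, the same merge step generalizes via a min-heap over the k current heads, giving O(N log k) total.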
Specialization · Reinforcement learning basics
Interview questions to prep
- Walk through the Bellman equation for the value function.
- What's the difference between value iteration and policy iteration? (A tiny value-iteration sketch follows.)
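To make the Bellman optimality backup concrete, here is value iteration on a toy two-state MDP. The transition and reward numbers are made up purely for illustration:

```python
import numpy as np

# Toy 2-state, 2-action MDP (illustrative numbers only).
# P[s][a] is a list of (prob, next_state, reward) transitions.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.7, 1, 5.0), (0.3, 0, 0.0)]},
    1: {0: [(1.0, 0, 1.0)], 1: [(1.0, 1, 2.0)]},
}
gamma = 0.9

def value_iteration(P, gamma, tol=1e-8):
    """Iterate the Bellman optimality backup:
    V(s) <- max_a sum_{s'} p(s'|s,a) * (r + gamma * V(s'))."""
    V = np.zeros(len(P))
    while True:
        V_new = np.array([
            max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in P[s])
            for s in P
        ])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

print(value_iteration(P, gamma))
```

Policy iteration differs in that it alternates full policy evaluation with greedy policy improvement, rather than folding the max into every sweep.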
Interview questions to prep
- Compare on-policy vs off-policy RL — examples of each algorithm class.
- When is sample efficiency a deciding factor between on-policy and off-policy methods? (The two update rules are contrasted below.)
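The on/off-policy distinction shows up most clearly in the TD targets. A sketch contrasting SARSA (on-policy) with Q-learning (off-policy), assuming a plain dict-of-dicts Q-table and externally chosen `alpha`, `gamma`, and behavior policy:

```python
def sarsa_update(Q, s, a, r, s2, a2, alpha, gamma):
    """On-policy: a2 is the action the behavior policy actually took in s2."""
    Q[s][a] += alpha * (r + gamma * Q[s2][a2] - Q[s][a])

def q_learning_update(Q, s, a, r, s2, alpha, gamma):
    """Off-policy: bootstrap from the greedy action in s2, whatever the policy does."""
    Q[s][a] += alpha * (r + gamma * max(Q[s2].values()) - Q[s][a])

# Toy usage with a hypothetical 2-state, 2-action table.
Q = {0: {0: 0.0, 1: 0.0}, 1: {0: 0.0, 1: 0.0}}
sarsa_update(Q, s=0, a=1, r=1.0, s2=1, a2=0, alpha=0.1, gamma=0.9)
q_learning_update(Q, s=0, a=1, r=1.0, s2=1, alpha=0.1, gamma=0.9)
```

Because Q-learning's target doesn't depend on how data was gathered, it can reuse replayed or logged experience, which is exactly where sample efficiency tilts the choice toward off-policy methods.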
Interview questions to prep
- Compare epsilon-greedy, UCB, and Thompson sampling in a contextual bandit.
- Why does Thompson sampling usually explore more efficiently than epsilon-greedy, and what modifications (e.g., discounting old evidence) does it need for non-stationary rewards? (A Bernoulli-bandit sketch follows.)
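A minimal Thompson sampling sketch for a Bernoulli bandit with Beta(1, 1) priors. The arm probabilities in the simulation are made up for illustration:

```python
import random

class BernoulliThompson:
    """Thompson sampling: keep a Beta posterior per arm, sample, play the argmax."""
    def __init__(self, n_arms):
        self.alpha = [1.0] * n_arms  # 1 + observed successes
        self.beta = [1.0] * n_arms   # 1 + observed failures

    def select_arm(self):
        # Draw one plausible success rate per arm from its posterior.
        samples = [random.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, reward):   # reward in {0, 1}
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward

# Toy simulation with hypothetical arm probabilities.
true_p = [0.3, 0.5, 0.7]
agent = BernoulliThompson(len(true_p))
for _ in range(1000):
    arm = agent.select_arm()
    agent.update(arm, 1 if random.random() < true_p[arm] else 0)
print(agent.alpha)  # counts should concentrate on the 0.7 arm
```

Unlike epsilon-greedy's flat exploration rate, the posterior sampling here explores in proportion to remaining uncertainty; for non-stationary rewards, one common adjustment is to decay `alpha`/`beta` so old evidence fades.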
References & further reading
- OpenAI — Spinning Up in Deep RL
- Papers with Code — SOTA leaderboards