Day 102 of 133

RL basics: MDP, value, policy, exploration

Bellman; on-policy vs off-policy; Thompson sampling for bandits.

DSA · NeetCode Linked List

  • Merge Two Sorted Lists

    Interview questions to prep

    1. How would you generalize this to merging k sorted lists efficiently?
    2. Can you do it in-place without a dummy node? What's gained / lost?
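    Both prep questions can be sketched in one place: a minimal merge of two sorted lists with a dummy node, plus the k-list generalization via a min-heap over the current heads (O(N log k)). The `ListNode` class and helper names are illustrative, not from any particular submission.

    ```python
    import heapq

    class ListNode:
        def __init__(self, val=0, next=None):
            self.val = val
            self.next = next

    def merge_two(l1, l2):
        # Dummy node avoids special-casing the head; splice the smaller node each step.
        dummy = tail = ListNode()
        while l1 and l2:
            if l1.val <= l2.val:
                tail.next, l1 = l1, l1.next
            else:
                tail.next, l2 = l2, l2.next
            tail = tail.next
        tail.next = l1 or l2  # at most one list has leftovers
        return dummy.next

    def merge_k(lists):
        # Min-heap of (value, tiebreak index, node) over the k current heads.
        heap = [(node.val, i, node) for i, node in enumerate(lists) if node]
        heapq.heapify(heap)
        dummy = tail = ListNode()
        while heap:
            _, i, node = heapq.heappop(heap)
            tail.next = tail = node
            if node.next:
                heapq.heappush(heap, (node.next.val, i, node.next))
        return dummy.next
    ```

    Dropping the dummy node saves one allocation but forces a separate branch to pick the initial head, which is the usual trade-off the second question is probing.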

Specialization · Reinforcement learning basics

  • Value functions and the Bellman equation

    Interview questions to prep

    1. Walk through the Bellman equation for the value function.
    2. What's the difference between value iteration and policy iteration?
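    A compact way to internalize the Bellman optimality backup is value iteration on a toy MDP. The transition probabilities and rewards below are made-up numbers purely for illustration; `P[a, s, s']` is the chance of landing in `s'` after taking `a` in `s`.

    ```python
    import numpy as np

    # Hypothetical 2-state, 2-action MDP.
    P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # action 0: P[0, s, s']
                  [[0.5, 0.5], [0.1, 0.9]]])  # action 1: P[1, s, s']
    R = np.array([[1.0, 0.0],                 # R[s, a]
                  [0.0, 2.0]])
    gamma = 0.9

    def value_iteration(P, R, gamma, tol=1e-8):
        # Bellman optimality backup:
        #   V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
        V = np.zeros(P.shape[1])
        while True:
            Q = R + gamma * np.einsum('asn,n->sa', P, V)  # Q[s, a]
            V_new = Q.max(axis=1)
            if np.max(np.abs(V_new - V)) < tol:
                return V_new, Q.argmax(axis=1)  # values and greedy policy
            V = V_new
    ```

    Policy iteration differs in that it fully evaluates the current policy (solving a linear system or iterating to convergence) before each greedy improvement step, whereas value iteration folds a single backup and the max into one sweep.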
  • On-policy vs off-policy

    Interview questions to prep

    1. Compare on-policy vs off-policy RL — examples of each algorithm class.
    2. When is sample efficiency a deciding factor between on-policy and off-policy methods?
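    The on-policy/off-policy distinction is easiest to see side by side in the tabular TD updates: SARSA (on-policy) bootstraps from the action the behavior policy actually took, while Q-learning (off-policy) bootstraps from the greedy action regardless. A minimal sketch, with hyperparameters chosen arbitrarily:

    ```python
    import numpy as np

    def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
        # On-policy: target uses a_next, the action actually selected next.
        td_target = r + gamma * Q[s_next, a_next]
        Q[s, a] += alpha * (td_target - Q[s, a])

    def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
        # Off-policy: target uses max over actions, not the behavior policy's choice.
        td_target = r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (td_target - Q[s, a])
    ```

    The off-policy target is what lets Q-learning reuse experience from replay buffers or older policies, which is why sample efficiency often tips the choice toward off-policy methods when environment interaction is expensive.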
  • Exploration: epsilon-greedy, UCB, Thompson sampling

    Interview questions to prep

    1. Compare epsilon-greedy, UCB, and Thompson sampling in a contextual bandit.
    2. Why does Thompson sampling adapt better to non-stationary rewards than epsilon-greedy?
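    For the (non-contextual) Bernoulli case, Thompson sampling is just a Beta-Bernoulli conjugate update: sample a plausible mean from each arm's posterior, play the argmax, update that arm. A minimal sketch with made-up arm probabilities:

    ```python
    import numpy as np

    def thompson_bernoulli(true_probs, n_rounds=5000, seed=0):
        # Beta(1, 1) uniform priors over each arm's success probability.
        rng = np.random.default_rng(seed)
        k = len(true_probs)
        alpha = np.ones(k)              # 1 + observed successes per arm
        beta = np.ones(k)               # 1 + observed failures per arm
        pulls = np.zeros(k, dtype=int)
        for _ in range(n_rounds):
            theta = rng.beta(alpha, beta)       # one posterior sample per arm
            arm = int(theta.argmax())           # play the sampled-best arm
            reward = rng.random() < true_probs[arm]
            alpha[arm] += reward
            beta[arm] += 1 - reward
            pulls[arm] += 1
        return pulls
    ```

    Because exploration is driven by posterior width rather than a fixed epsilon, an arm whose reward distribution drifts regains posterior uncertainty (relative to the others) and gets re-tried, which is the intuition behind the second question; a common non-stationary variant additionally discounts old counts.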
