Day 88 of 133

Alignment: SFT, RLHF, DPO/IPO/KTO/ORPO + DSA review

RLHF three stages; reward hacking; why DPO is simpler.

DSA · NeetCode Heap / Priority Queue

  • Task Scheduler

    Interview questions to prep

    1. Why is a heap the right structure? Could a balanced BST or sorted list work — why is heap better?
    2. Explain the heap-of-k pattern: keep size k, push new, pop if over k. What's the resulting complexity?
    3. What does the comparator look like, and how would you tweak it to flip min/max behaviour?
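The heap-of-k pattern from question 2 can be sketched in Python with the standard `heapq` module (the function name `k_largest` is illustrative):

```python
import heapq

def k_largest(stream, k):
    """Keep a min-heap of size k; its root is the k-th largest seen so far.

    Push each item, pop when the heap exceeds k: O(n log k) time and
    O(k) extra space, versus O(n log n) for fully sorting. heapq is a
    min-heap only, so to flip to "k smallest" you would push negated
    values — Python's way of tweaking the comparator.
    """
    heap = []
    for x in stream:
        heapq.heappush(heap, x)
        if len(heap) > k:
            heapq.heappop(heap)  # discard the smallest of the k+1
    return sorted(heap, reverse=True)

print(k_largest([5, 1, 9, 3, 7, 2], 3))  # → [9, 7, 5]
```

This is also why a heap beats a balanced BST or sorted list for question 1: you only ever need the extreme element, not full ordering, and the constant factors of an array-backed heap are much lower.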

GenAI · SFT, DPO, RLHF

  • SFT

    Interview questions to prep

    1. Walk through the SFT objective — how is it different from pretraining?
    2. How do you choose the SFT learning rate and number of epochs without overfitting on a small dataset?
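For question 1, the SFT objective can be sketched as masked next-token cross-entropy (a minimal NumPy sketch; the function name `sft_loss` and the toy shapes are illustrative):

```python
import numpy as np

def sft_loss(logits, targets, loss_mask):
    """Masked next-token cross-entropy, the usual SFT objective.

    Same loss as pretraining (predict the next token), but averaged only
    over response tokens (loss_mask == 1); prompt tokens are masked out,
    so the model is tuned to produce answers rather than to re-predict
    the instruction.

    logits:    (seq_len, vocab) unnormalized scores
    targets:   (seq_len,) next-token ids
    loss_mask: (seq_len,) 1 for response positions, 0 for prompt positions
    """
    logits = logits - logits.max(axis=-1, keepdims=True)  # stable log-softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    token_nll = -log_probs[np.arange(len(targets)), targets]
    return float((token_nll * loss_mask).sum() / loss_mask.sum())

# toy example: 4 positions, vocab of 3; only the last two positions count
rng = np.random.default_rng(0)
loss = sft_loss(rng.normal(size=(4, 3)), np.array([2, 0, 1, 1]),
                np.array([0, 0, 1, 1]))
print(loss)
```

The prompt mask is the key structural difference from pretraining; the loss itself is unchanged.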
  • RLHF

    Interview questions to prep

    1. Walk through the three stages of RLHF: SFT → reward model → PPO.
    2. Why is reward hacking a problem in RLHF?
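Stage 2 (the reward model) trains on human preference pairs with a Bradley-Terry pairwise loss, which also explains why reward hacking appears in stage 3 (a minimal sketch; `reward_model_loss` is an illustrative name):

```python
import numpy as np

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss for RLHF stage 2 (reward modeling).

    Given scalar rewards for the human-preferred (chosen) and dispreferred
    (rejected) completions, minimize -log sigmoid(r_chosen - r_rejected),
    pushing the model to score chosen above rejected. Reward hacking then
    arises in stage 3 because PPO maximizes this learned, imperfect proxy,
    and can find outputs that score high without actually being better.
    """
    margin = np.asarray(r_chosen) - np.asarray(r_rejected)
    return float(np.mean(np.log1p(np.exp(-margin))))  # = -log sigmoid(margin)

print(reward_model_loss([2.0], [0.0]))  # small loss: chosen already ahead
```

A zero margin gives exactly log 2; the loss shrinks as the reward model separates the pair correctly.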
  • DPO and friends (IPO, KTO, ORPO) · Rafailov et al.

    Interview questions to prep

    1. Why is DPO simpler than PPO-based RLHF, and what does it sacrifice?
    2. Compare DPO, IPO, KTO, and ORPO — when does each fit?
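For question 1, the DPO loss itself shows why it is simpler than PPO-based RLHF (a sketch for a single preference pair; `dpo_loss` is an illustrative name):

```python
import numpy as np

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss (Rafailov et al.) on one preference pair.

    Inputs are sequence log-probs of the chosen/rejected completions under
    the policy (pi_*) and a frozen reference model (ref_*). DPO folds the
    reward model and the RL stage into one classification-style loss:
    -log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))).
    No reward model and no sampling loop, but it sacrifices on-policy
    exploration: it can only reweight the fixed preference dataset.
    """
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return float(np.log1p(np.exp(-logits)))  # = -log sigmoid(logits)

# toy check: if the policy already prefers chosen more than the reference
# does, the loss drops below log(2)
print(dpo_loss(-5.0, -9.0, -6.0, -8.0) < np.log(2))  # → True
```

IPO, KTO, and ORPO vary this recipe: IPO replaces the sigmoid with a squared objective to resist overfitting, KTO works from unpaired good/bad labels, and ORPO drops the reference model entirely by adding an odds-ratio penalty to SFT.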

References & further reading