Day 88 of 133

Alignment: SFT, RLHF, DPO/IPO/KTO/ORPO + DSA review

RLHF three stages; reward hacking; why DPO is simpler.

DSA · NeetCode Heap / Priority Queue

  • Task Scheduler

    Interview questions to prep

    1. Why is a heap the right structure? Could a balanced BST or sorted list work — why is heap better?
    2. Explain the heap-of-k pattern: keep size k, push new, pop if over k. What's the resulting complexity?
    3. What does the comparator look like, and how would you tweak it to flip min/max behaviour?
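The heap-of-k pattern from question 2 can be sketched in Python with the standard `heapq` module (the function name `k_largest` is illustrative):

```python
import heapq

def k_largest(stream, k):
    """Keep a min-heap of size k; its root is the k-th largest seen so far.

    Push each item, pop when the heap exceeds k: O(n log k) time and
    O(k) extra space, versus O(n log n) for fully sorting. heapq is a
    min-heap only, so to flip to "k smallest" you would push negated
    values — Python's way of tweaking the comparator.
    """
    heap = []
    for x in stream:
        heapq.heappush(heap, x)
        if len(heap) > k:
            heapq.heappop(heap)  # discard the smallest of the k+1
    return sorted(heap, reverse=True)

print(k_largest([5, 1, 9, 3, 7, 2], 3))  # → [9, 7, 5]
```

This is also why a heap beats a balanced BST or sorted list for question 1: you only ever need the extreme element, not full ordering, and the constant factors of an array-backed heap are much lower.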

GenAI · SFT, DPO, RLHF

  • SFT

    Interview questions to prep

    1. Walk through the SFT objective — how is it different from pretraining?
    2. How do you choose the SFT learning rate and number of epochs without overfitting on a small dataset?
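For question 1, the SFT objective can be sketched as masked next-token cross-entropy (a minimal NumPy sketch; the function name `sft_loss` and the toy shapes are illustrative):

```python
import numpy as np

def sft_loss(logits, targets, loss_mask):
    """Masked next-token cross-entropy, the usual SFT objective.

    Same loss as pretraining (predict the next token), but averaged only
    over response tokens (loss_mask == 1); prompt tokens are masked out,
    so the model is tuned to produce answers rather than to re-predict
    the instruction.

    logits:    (seq_len, vocab) unnormalized scores
    targets:   (seq_len,) next-token ids
    loss_mask: (seq_len,) 1 for response positions, 0 for prompt positions
    """
    logits = logits - logits.max(axis=-1, keepdims=True)  # stable log-softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    token_nll = -log_probs[np.arange(len(targets)), targets]
    return float((token_nll * loss_mask).sum() / loss_mask.sum())

# toy example: 4 positions, vocab of 3; only the last two positions count
rng = np.random.default_rng(0)
loss = sft_loss(rng.normal(size=(4, 3)), np.array([2, 0, 1, 1]),
                np.array([0, 0, 1, 1]))
print(loss)
```

The prompt mask is the key structural difference from pretraining; the loss itself is unchanged.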
  • RLHF

    Interview questions to prep

    1. Walk through the three stages of RLHF: SFT → reward model → PPO.
    2. Why is reward hacking a problem in RLHF?
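Stage 2 (the reward model) trains on human preference pairs with a Bradley-Terry pairwise loss, which also explains why reward hacking appears in stage 3 (a minimal sketch; `reward_model_loss` is an illustrative name):

```python
import numpy as np

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss for RLHF stage 2 (reward modeling).

    Given scalar rewards for the human-preferred (chosen) and dispreferred
    (rejected) completions, minimize -log sigmoid(r_chosen - r_rejected),
    pushing the model to score chosen above rejected. Reward hacking then
    arises in stage 3 because PPO maximizes this learned, imperfect proxy,
    and can find outputs that score high without actually being better.
    """
    margin = np.asarray(r_chosen) - np.asarray(r_rejected)
    return float(np.mean(np.log1p(np.exp(-margin))))  # = -log sigmoid(margin)

print(reward_model_loss([2.0], [0.0]))  # small loss: chosen already ahead
```

A zero margin gives exactly log 2; the loss shrinks as the reward model separates the pair correctly.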
  • DPO and friends (IPO, KTO, ORPO) · Rafailov et al.

    Interview questions to prep

    1. Why is DPO simpler than PPO-based RLHF, and what does it sacrifice?
    2. Compare DPO, IPO, KTO, and ORPO — when does each fit?
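For question 1, the DPO loss itself shows why it is simpler than PPO-based RLHF (a sketch for a single preference pair; `dpo_loss` is an illustrative name):

```python
import numpy as np

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss (Rafailov et al.) on one preference pair.

    Inputs are sequence log-probs of the chosen/rejected completions under
    the policy (pi_*) and a frozen reference model (ref_*). DPO folds the
    reward model and the RL stage into one classification-style loss:
    -log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))).
    No reward model and no sampling loop, but it sacrifices on-policy
    exploration: it can only reweight the fixed preference dataset.
    """
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return float(np.log1p(np.exp(-logits)))  # = -log sigmoid(logits)

# toy check: if the policy already prefers chosen more than the reference
# does, the loss drops below log(2)
print(dpo_loss(-5.0, -9.0, -6.0, -8.0) < np.log(2))  # → True
```

IPO, KTO, and ORPO vary this recipe: IPO replaces the sigmoid with a squared objective to resist overfitting, KTO works from unpaired good/bad labels, and ORPO drops the reference model entirely by adding an odds-ratio penalty to SFT.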

References & further reading