Day 88 of 133
Alignment: SFT, RLHF, DPO/IPO/KTO/ORPO + DSA review
RLHF three stages; reward hacking; why DPO is simpler.
DSA · NeetCode Heap / Priority Queue
- Task Scheduler
Interview questions to prep
- Why is a heap the right structure? Could a balanced BST or sorted list work — why is heap better?
- Explain the heap-of-k pattern: keep size k, push new, pop if over k. What's the resulting complexity?
- What does the comparator look like, and how would you tweak it to flip min/max behaviour?
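The heap-of-k pattern from the second question can be sketched like this (a minimal illustration using Python's `heapq`, which is a min-heap; the function name `top_k_largest` is my own):

```python
import heapq

def top_k_largest(nums, k):
    """Keep a min-heap of size k; its root is the k-th largest seen so far.
    O(n log k) time, O(k) extra space — vs O(n log n) for full sorting."""
    heap = []
    for x in nums:
        heapq.heappush(heap, x)
        if len(heap) > k:
            heapq.heappop(heap)  # evict the smallest; the k largest survive
    # To flip min/max behaviour with heapq, push negated values (-x)
    # instead of writing a comparator.
    return sorted(heap, reverse=True)

print(top_k_largest([3, 1, 5, 12, 2, 11], 3))  # → [12, 11, 5]
```

A balanced BST would also give O(log k) per operation, but the heap wins on constant factors, cache behaviour, and simplicity; a sorted list pays O(k) per insertion.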
GenAI · SFT, DPO, RLHF
Interview questions to prep
- Walk through the SFT objective — how is it different from pretraining?
- How do you choose the SFT learning rate and number of epochs without overfitting on a small dataset?
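For the first question: the SFT objective is the same next-token cross-entropy as pretraining, except the loss is computed only over response tokens, with prompt tokens masked out. A minimal sketch of that masked loss (plain Python, my own function name; frameworks typically implement this by setting prompt labels to -100):

```python
import math

def sft_loss(token_logprobs, loss_mask):
    """Masked negative log-likelihood: average -log p(token) over response
    tokens only (mask = 1). Prompt tokens (mask = 0) are excluded — the key
    difference from pretraining, where every token contributes to the loss."""
    total = sum(-lp for lp, m in zip(token_logprobs, loss_mask) if m)
    return total / sum(loss_mask)

# Four tokens, each with probability 0.5; only the last two are response tokens.
loss = sft_loss([math.log(0.5)] * 4, [0, 0, 1, 1])
print(round(loss, 4))  # → 0.6931, i.e. log 2
```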
Interview questions to prep
- Walk through the three stages of RLHF: SFT → reward model → PPO.
- Why is reward hacking a problem in RLHF?
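The reward-model stage in the middle of the pipeline is trained on preference pairs with a Bradley-Terry pairwise loss; a one-pair sketch (my own function name):

```python
import math

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss for reward-model training:
    -log sigmoid(r_chosen - r_rejected). Minimised when the chosen
    completion scores higher than the rejected one."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(round(reward_model_loss(1.0, 1.0), 4))  # → 0.6931: no margin, log-2 loss
```

Reward hacking is then the failure mode where PPO finds inputs that score well under this learned proxy without actually being preferred by humans — the reward model is only accurate on-distribution.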
Interview questions to prep
- Why is DPO simpler than PPO-based RLHF, and what does it sacrifice?
- Compare DPO, IPO, KTO, and ORPO — when does each fit?
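DPO's simplicity comes from collapsing the reward model and PPO loop into a single supervised loss over preference pairs; a per-pair sketch (inputs are sequence log-probs; function name and the default `beta` are illustrative):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO objective on one preference pair:
    -log sigmoid(beta * [(pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)]).
    The reference-model terms play the role of PPO's explicit KL penalty;
    beta controls how far the policy may drift from the reference."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy identical to reference → zero margin → log-2 loss.
print(round(dpo_loss(-5.0, -5.0, -5.0, -5.0), 4))  # → 0.6931
```

What it sacrifices: no on-policy exploration and no reusable reward model — it fits the fixed offline preference set, which is roughly where the IPO/KTO/ORPO variants differ (IPO regularises against overfitting the sigmoid, KTO uses unpaired good/bad labels, ORPO folds the preference term into SFT without a reference model).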
References & further reading
- Hugging Face — LoRA & PEFT
- Maxime Labonne — LLM Course
- Anthropic — Building Effective Agents