Day 77 of 133
LLM evaluation + benchmarks + LLM-as-judge + DSA Intervals
Perplexity vs MMLU vs GPQA vs SWE-bench; judge calibration.
DSA · NeetCode Intervals
- Non-Overlapping Intervals
Interview questions to prep
- Do you sort by start or by end? Defend the choice based on the invariant you need.
- Walk through merge / overlap detection: what's your condition for 'overlapping'?
- How does complexity break down: O(n log n) sort + O(n) sweep — can you do better in any case?
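A minimal Python sketch of the greedy answer for Non-Overlapping Intervals: sort by end time, keep every interval that starts at or after the end of the last kept one, and count the rest as removals. Function and variable names are my own.

```python
def erase_overlap_intervals(intervals):
    """Minimum number of intervals to remove so the rest don't overlap.

    Greedy invariant: sorting by END time means the kept interval always
    leaves the most room for later ones. Touching endpoints (end == next
    start) do not count as overlapping here.
    """
    intervals.sort(key=lambda iv: iv[1])
    removed = 0
    prev_end = float("-inf")
    for start, end in intervals:
        if start >= prev_end:   # compatible: keep it, advance the boundary
            prev_end = end
        else:                   # overlaps the kept interval: remove it
            removed += 1
    return removed
```

Complexity is O(n log n) for the sort plus O(n) for the sweep; this is why the "sort by end" choice matters, since sorting by start does not give the exchange argument the greedy proof needs.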
- Meeting Rooms
Interview questions to prep
- Do you sort by start or by end? Defend the choice based on the invariant you need.
- Walk through merge / overlap detection: what's your condition for 'overlapping'?
- How does complexity break down: O(n log n) sort + O(n) sweep — can you do better in any case?
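For Meeting Rooms the overlap condition is the whole interview: after sorting by start, a conflict exists exactly when some meeting starts before the previous one ends. A short sketch (names are mine):

```python
def can_attend_all(meetings):
    """Meeting Rooms: can one person attend every meeting?

    Sort by start; adjacent pairs are the only ones that can conflict
    after sorting, so one linear pass suffices.
    """
    meetings.sort(key=lambda m: m[0])
    for (s1, e1), (s2, e2) in zip(meetings, meetings[1:]):
        if s2 < e1:   # next meeting starts before the current one ends
            return False
    return True
```

Same O(n log n) + O(n) breakdown as above; you can beat the sort only if the input is already sorted or the time range is small enough to bucket.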
GenAI · LLM evaluation
- Perplexity
Interview questions to prep
- Define perplexity precisely — and explain why it's a poor proxy for downstream task quality.
- Two models with the same perplexity perform very differently on a benchmark. What's going on?
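Being able to write the perplexity formula down helps with the "define it precisely" question: it is the exponential of the mean negative log-likelihood per token. A minimal sketch, assuming you already have the model's per-token natural-log probabilities:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token).

    token_logprobs: natural-log probabilities the model assigned to each
    observed token. Lower perplexity = the model found the text less
    surprising; it says nothing directly about downstream task quality.
    """
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)
```

Sanity check: a model that spreads probability uniformly over 4 choices assigns each token log(1/4), giving perplexity exactly 4, i.e. "as confused as a 4-way coin flip" per token.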
- Benchmarks (MMLU, GPQA, SWE-bench)
Interview questions to prep
- Compare MMLU vs GPQA vs SWE-bench — what does each measure?
- Why has benchmark contamination become a serious issue?
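One concrete way to talk about contamination: measure what fraction of a benchmark item's n-grams appear verbatim in the training text. This is a crude sketch with whitespace tokenization; real contamination reports use tokenizer-level n-grams (sizes around 8 to 13 are common) over the full pretraining corpus, and all names here are mine.

```python
def ngram_overlap(candidate, corpus_text, n=8):
    """Fraction of the candidate's n-grams found verbatim in corpus_text.

    A high fraction suggests the benchmark item may have leaked into
    training data, inflating the model's measured score.
    """
    def ngrams(tokens, n):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    cand = ngrams(candidate.split(), n)
    corp = ngrams(corpus_text.split(), n)
    if not cand:
        return 0.0
    return len(cand & corp) / len(cand)
```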
- LLM-as-judge
Interview questions to prep
- When is LLM-as-judge reliable, and what biases does it introduce?
- How would you calibrate an LLM judge to human preferences?
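"Calibrate the judge" usually starts with measuring chance-corrected agreement against a human-labeled set. A minimal sketch of Cohen's kappa for binary pass/fail labels (a hand-rolled version for illustration; in practice a library routine such as scikit-learn's would do):

```python
def cohens_kappa(judge_labels, human_labels):
    """Agreement between an LLM judge and human raters, corrected for
    chance (binary 0/1 labels).

    kappa near 1: well calibrated; near 0: no better than guessing from
    label frequencies. Low kappa on a slice can expose judge biases
    (position bias, verbosity bias, self-preference).
    """
    assert len(judge_labels) == len(human_labels)
    n = len(judge_labels)
    agree = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    # Chance agreement from each rater's marginal label frequencies
    p_j = sum(judge_labels) / n
    p_h = sum(human_labels) / n
    chance = p_j * p_h + (1 - p_j) * (1 - p_h)
    return (agree - chance) / (1 - chance)
```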
- Automatic metrics (BLEU, ROUGE, BERTScore, exact match)
Interview questions to prep
- Compare BLEU, ROUGE, BERTScore, exact match, and perplexity — what does each miss?
- Why are BLEU and ROUGE weak signals for open-ended assistant answers?
- How would you combine automatic metrics, human review, and LLM-as-judge in one eval harness?
- Where do HELM-style holistic evaluations fit alongside BLEU, ROUGE, MMLU, and task-specific business metrics?
- How would you explain to an interviewer why a high ROUGE score can still produce a bad user-facing answer?
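The "high ROUGE, bad answer" point is easy to demonstrate with a tiny unigram ROUGE-1 F1 (no stemming, whitespace tokens; a sketch, not the official scorer): a factually wrong answer that reuses the reference's wording still scores high.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap ROUGE-1 F1: rewards lexical overlap only, so a
    fluent but wrong answer can still score well."""
    c, r = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((c & r).values())   # clipped unigram matches
    if overlap == 0:
        return 0.0
    prec = overlap / sum(c.values())
    rec = overlap / sum(r.values())
    return 2 * prec * rec / (prec + rec)
```

Example: against the reference "the capital of france is paris", the wrong answer "the capital of france is london" shares 5 of 6 unigrams and scores about 0.83, which is exactly the failure mode to raise in an interview.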
References & further reading
- Papers with Code — SOTA leaderboards
- Anthropic — Testing & Evaluation
- 75Hard GenAI/LLM Challenge — LLM evaluation metrics