Day 77 of 133

LLM evaluation + benchmarks + LLM-as-judge + DSA Intervals

Perplexity vs MMLU vs GPQA vs SWE-bench; judge calibration.

DSA · NeetCode Intervals

  • Interview questions to prep

    1. Do you sort by start or by end? Defend the choice based on the invariant you need.
    2. Walk through merge / overlap detection: what's your condition for 'overlapping'?
    3. How does complexity break down: O(n log n) sort + O(n) sweep — can you do better in any case?
  • Meeting Rooms (a worked merge / meeting-rooms sketch follows this list)

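A minimal sketch of the patterns these questions probe, assuming intervals come as [start, end] pairs with start <= end. Sorting by start gives the invariant that each new interval can only overlap the most recently merged one, so an O(n log n) sort plus an O(n) sweep suffices; a min-heap of end times answers the Meeting Rooms follow-up. Function names are illustrative.

```python
import heapq
from typing import List


def merge_intervals(intervals: List[List[int]]) -> List[List[int]]:
    """Merge overlapping intervals after sorting by start."""
    merged: List[List[int]] = []
    for start, end in sorted(intervals):
        # Overlap condition: this interval starts before (or exactly when)
        # the last merged interval ends.
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def min_meeting_rooms(intervals: List[List[int]]) -> int:
    """Fewest rooms so that no two meetings in the same room overlap."""
    end_times: List[int] = []  # min-heap of end times for meetings currently in a room
    for start, end in sorted(intervals):
        if end_times and end_times[0] <= start:
            heapq.heapreplace(end_times, end)  # earliest meeting is over, reuse its room
        else:
            heapq.heappush(end_times, end)     # every room is busy, open a new one
    return len(end_times)


if __name__ == "__main__":
    print(merge_intervals([[1, 3], [2, 6], [8, 10], [15, 18]]))  # [[1, 6], [8, 10], [15, 18]]
    print(min_meeting_rooms([[0, 30], [5, 10], [15, 20]]))       # 2
```

Whether touching intervals such as [1, 2] and [2, 3] count as overlapping is a choice worth stating out loud: the start <= merged[-1][1] test merges them, a strict < would keep them separate.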

GenAI · LLM evaluation

  • Perplexity: interview questions to prep (worked example after this list)

    1. Define perplexity precisely — and explain why it's a poor proxy for downstream task quality.
    2. Two models with the same perplexity perform very differently on a benchmark. What's going on?
  • Benchmarks (MMLU, GPQA, SWE-bench): interview questions to prep (contamination-check sketch after this list)

    1. Compare MMLU vs GPQA vs SWE-bench — what does each measure?
    2. Why has benchmark contamination become a serious issue?
  • LLM-as-judge: interview questions to prep (calibration sketch after this list)

    1. When is LLM-as-judge reliable, and what biases does it introduce?
    2. How would you calibrate an LLM judge to human preferences?
  • Automatic metrics (BLEU, ROUGE, BERTScore): interview questions to prep (simplified metric sketch after this list)

    1. Compare BLEU, ROUGE, BERTScore, exact match, and perplexity — what does each miss?
    2. Why are BLEU and ROUGE weak signals for open-ended assistant answers?
    3. How would you combine automatic metrics, human review, and LLM-as-judge in one eval harness?
    4. Where do HELM-style holistic evaluations fit alongside BLEU, ROUGE, MMLU, and task-specific business metrics?
    5. How would you explain to an interviewer why an answer with a high ROUGE score can still be a bad user-facing answer?
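
For the perplexity questions, a worked sketch of the definition: perplexity is the exponential of the average negative log-likelihood per token, so it rewards assigning probability mass to the observed text and says nothing directly about downstream usefulness. The log-probabilities below are invented to show how two models can share a perplexity yet differ where it matters.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp( -(1/N) * sum(log p(token_i | context)) ).

    token_logprobs are natural-log probabilities of each token under the model.
    """
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)


# Hypothetical per-token log-probs for two models on the same 4-token text.
# Both average to the same negative log-likelihood, so perplexity is identical,
# but model B is confident on easy tokens and poor on the one token that matters,
# which is one way equal perplexity can hide very different task behaviour.
model_a = [-1.0, -1.0, -1.0, -1.0]
model_b = [-0.1, -0.1, -0.1, -3.7]

print(perplexity(model_a))  # exp(1.0) ≈ 2.718
print(perplexity(model_b))  # exp(1.0) ≈ 2.718
```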
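
For the contamination question, a crude sketch of the n-gram overlap check often used to flag possible benchmark leakage into training data; a real audit would handle tokenization, near-duplicates, and corpus scale far more carefully, and the strings below are made up.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """All word n-grams in a text (empty set if the text is shorter than n words)."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}


def contamination_rate(benchmark_items: Iterable[str], training_text: str, n: int = 8) -> float:
    """Fraction of benchmark items sharing at least one n-gram with the training text."""
    corpus_grams = ngrams(training_text, n)
    items = list(benchmark_items)
    hits = sum(1 for item in items if ngrams(item, n) & corpus_grams)
    return hits / len(items) if items else 0.0


train = "background text the quick brown fox jumps over the lazy dog near the quiet river bank"
items = ["the quick brown fox jumps over the lazy dog near the quiet", "an unrelated benchmark question"]
print(contamination_rate(items, train, n=8))  # 0.5: the first item appears verbatim in the training text
```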
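
For the judge-calibration question, a minimal sketch: collect pairwise preferences from humans and from the LLM judge on the same comparisons, report raw agreement and chance-corrected Cohen's kappa, and repeat with response order swapped to surface position bias. The preference labels below are invented.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(human: Sequence[str], judge: Sequence[str]) -> float:
    """Chance-corrected agreement between two binary labelers ('A' or 'B')."""
    n = len(human)
    po = sum(h == j for h, j in zip(human, judge)) / n                 # observed agreement
    h_counts, j_counts = Counter(human), Counter(judge)
    pe = sum(h_counts[c] * j_counts[c] for c in ("A", "B")) / n ** 2   # agreement expected by chance
    return (po - pe) / (1 - pe) if pe < 1 else 1.0


# Hypothetical pairwise preferences on the same 8 comparisons.
human_prefs = ["A", "A", "B", "A", "B", "B", "A", "A"]
judge_prefs = ["A", "A", "B", "B", "B", "B", "A", "A"]

raw = sum(h == j for h, j in zip(human_prefs, judge_prefs)) / len(human_prefs)
print(f"raw agreement: {raw:.2f}, kappa: {cohens_kappa(human_prefs, judge_prefs):.2f}")
```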
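
For the metrics list, a deliberately simplified sketch of exact match and a unigram F1 in the ROUGE-1 spirit (a real harness would use libraries such as sacrebleu, rouge-score, or bert-score). It also shows the failure mode behind question 5: an answer can share most tokens with the reference and still give the opposite advice.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> bool:
    return prediction.strip().lower() == reference.strip().lower()


def unigram_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1, roughly the idea behind ROUGE-1 / SQuAD-style F1."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "the patient should take the medication with food"
good = "take the medication together with food"
bad = "the patient should not take the medication with food"  # one added word flips the meaning

print(unigram_f1(good, reference), exact_match(good, reference))  # lower overlap, correct advice
print(unigram_f1(bad, reference), exact_match(bad, reference))    # higher overlap, wrong advice
```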

References & further reading