Day 77 of 133

LLM evaluation + benchmarks + LLM-as-judge + DSA Intervals

Perplexity vs MMLU vs GPQA vs SWE-bench; judge calibration.

DSA · NeetCode Intervals

  • Interview questions to prep

    1. Do you sort by start or by end? Defend the choice based on the invariant you need.
    2. Walk through merge / overlap detection: what's your condition for 'overlapping'?
    3. How does complexity break down: O(n log n) sort + O(n) sweep — can you do better in any case?
  • Meeting Rooms (a worked merge / meeting-rooms sketch follows this list)

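A minimal sketch of the patterns these questions probe, assuming intervals come as [start, end] pairs with start <= end. Sorting by start gives the invariant that each new interval can only overlap the most recently merged one, so an O(n log n) sort plus an O(n) sweep suffices; a min-heap of end times answers the Meeting Rooms follow-up. Function names are illustrative.

```python
import heapq
from typing import List


def merge_intervals(intervals: List[List[int]]) -> List[List[int]]:
    """Merge overlapping intervals after sorting by start."""
    merged: List[List[int]] = []
    for start, end in sorted(intervals):
        # Overlap condition: this interval starts before (or exactly when)
        # the last merged interval ends.
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def min_meeting_rooms(intervals: List[List[int]]) -> int:
    """Fewest rooms so that no two meetings in the same room overlap."""
    end_times: List[int] = []  # min-heap of end times for meetings currently in a room
    for start, end in sorted(intervals):
        if end_times and end_times[0] <= start:
            heapq.heapreplace(end_times, end)  # earliest meeting is over, reuse its room
        else:
            heapq.heappush(end_times, end)     # every room is busy, open a new one
    return len(end_times)


if __name__ == "__main__":
    print(merge_intervals([[1, 3], [2, 6], [8, 10], [15, 18]]))  # [[1, 6], [8, 10], [15, 18]]
    print(min_meeting_rooms([[0, 30], [5, 10], [15, 20]]))       # 2
```

Whether touching intervals such as [1, 2] and [2, 3] count as overlapping is a choice worth stating out loud: the start <= merged[-1][1] test merges them, a strict < would keep them separate.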

GenAI · LLM evaluation

  • Perplexity: interview questions to prep (worked example after this list)

    1. Define perplexity precisely — and explain why it's a poor proxy for downstream task quality.
    2. Two models with the same perplexity perform very differently on a benchmark. What's going on?
  • Benchmarks (MMLU, GPQA, SWE-bench): interview questions to prep (contamination-check sketch after this list)

    1. Compare MMLU vs GPQA vs SWE-bench — what does each measure?
    2. Why has benchmark contamination become a serious issue?
  • LLM-as-judge: interview questions to prep (calibration sketch after this list)

    1. When is LLM-as-judge reliable, and what biases does it introduce?
    2. How would you calibrate an LLM judge to human preferences?
  • Automatic metrics (BLEU, ROUGE, BERTScore): interview questions to prep (simplified metric sketch after this list)

    1. Compare BLEU, ROUGE, BERTScore, exact match, and perplexity — what does each miss?
    2. Why are BLEU and ROUGE weak signals for open-ended assistant answers?
    3. How would you combine automatic metrics, human review, and LLM-as-judge in one eval harness?
    4. Where do HELM-style holistic evaluations fit alongside BLEU, ROUGE, MMLU, and task-specific business metrics?
    5. How would you explain to an interviewer why an answer with a high ROUGE score can still be a bad user-facing answer?
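
For the perplexity questions, a worked sketch of the definition: perplexity is the exponential of the average negative log-likelihood per token, so it rewards assigning probability mass to the observed text and says nothing directly about downstream usefulness. The log-probabilities below are invented to show how two models can share a perplexity yet differ where it matters.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp( -(1/N) * sum(log p(token_i | context)) ).

    token_logprobs are natural-log probabilities of each token under the model.
    """
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)


# Hypothetical per-token log-probs for two models on the same 4-token text.
# Both average to the same negative log-likelihood, so perplexity is identical,
# but model B is confident on easy tokens and poor on the one token that matters,
# which is one way equal perplexity can hide very different task behaviour.
model_a = [-1.0, -1.0, -1.0, -1.0]
model_b = [-0.1, -0.1, -0.1, -3.7]

print(perplexity(model_a))  # exp(1.0) ≈ 2.718
print(perplexity(model_b))  # exp(1.0) ≈ 2.718
```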
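
For the contamination question, a crude sketch of the n-gram overlap check often used to flag possible benchmark leakage into training data; a real audit would handle tokenization, near-duplicates, and corpus scale far more carefully, and the strings below are made up.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """All word n-grams in a text (empty set if the text is shorter than n words)."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}


def contamination_rate(benchmark_items: Iterable[str], training_text: str, n: int = 8) -> float:
    """Fraction of benchmark items sharing at least one n-gram with the training text."""
    corpus_grams = ngrams(training_text, n)
    items = list(benchmark_items)
    hits = sum(1 for item in items if ngrams(item, n) & corpus_grams)
    return hits / len(items) if items else 0.0


train = "background text the quick brown fox jumps over the lazy dog near the quiet river bank"
items = ["the quick brown fox jumps over the lazy dog near the quiet", "an unrelated benchmark question"]
print(contamination_rate(items, train, n=8))  # 0.5: the first item appears verbatim in the training text
```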
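
For the judge-calibration question, a minimal sketch: collect pairwise preferences from humans and from the LLM judge on the same comparisons, report raw agreement and chance-corrected Cohen's kappa, and repeat with response order swapped to surface position bias. The preference labels below are invented.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(human: Sequence[str], judge: Sequence[str]) -> float:
    """Chance-corrected agreement between two binary labelers ('A' or 'B')."""
    n = len(human)
    po = sum(h == j for h, j in zip(human, judge)) / n                 # observed agreement
    h_counts, j_counts = Counter(human), Counter(judge)
    pe = sum(h_counts[c] * j_counts[c] for c in ("A", "B")) / n ** 2   # agreement expected by chance
    return (po - pe) / (1 - pe) if pe < 1 else 1.0


# Hypothetical pairwise preferences on the same 8 comparisons.
human_prefs = ["A", "A", "B", "A", "B", "B", "A", "A"]
judge_prefs = ["A", "A", "B", "B", "B", "B", "A", "A"]

raw = sum(h == j for h, j in zip(human_prefs, judge_prefs)) / len(human_prefs)
print(f"raw agreement: {raw:.2f}, kappa: {cohens_kappa(human_prefs, judge_prefs):.2f}")
```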
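
For the metrics list, a deliberately simplified sketch of exact match and a unigram F1 in the ROUGE-1 spirit (a real harness would use libraries such as sacrebleu, rouge-score, or bert-score). It also shows the failure mode behind question 5: an answer can share most tokens with the reference and still give the opposite advice.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> bool:
    return prediction.strip().lower() == reference.strip().lower()


def unigram_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1, roughly the idea behind ROUGE-1 / SQuAD-style F1."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "the patient should take the medication with food"
good = "take the medication together with food"
bad = "the patient should not take the medication with food"  # one added word flips the meaning

print(unigram_f1(good, reference), exact_match(good, reference))  # lower overlap, correct advice
print(unigram_f1(bad, reference), exact_match(bad, reference))    # higher overlap, wrong advice
```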

References & further reading