Why this exists
This system stops silent regressions. State the failure modes it prevents:
- A prompt change "to fix one issue" tanks unrelated tasks.
- A model upgrade improves average quality but breaks safety-sensitive cases.
- A retrieval change improves recall but degrades faithfulness.
Without a platform, every team rolls their own evals, no one trusts the numbers, and the release gate is "the PM said it looked fine."
The platform's contract: every LLM-powered feature ships only after passing a versioned evaluation suite, and every regression is attributable to a specific change.
Core abstractions
A defensible design names four primary objects and their relationships:
- Suite — a named collection of test cases scoped to a product (support agent, RAG chatbot, summarization).
- Test case — (input, expected behavior, scoring rubric, tags). Behavior may be exact-match, semantic-match, or graded by rubric.
- Run — an execution of a suite against a (model, prompt-version, retrieval-config) triplet, with results stored.
- Gate — a policy attached to a release pipeline that compares a new run against a baseline and blocks promotion on regression.
Mention versioning explicitly: prompts, models, retrieval configs, and rubrics are all immutable artifacts with content hashes. A run identifies them by hash, not by name.
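A minimal sketch of these objects, assuming a SHA-256 content hash over the canonical JSON form of each artifact; the field names are illustrative, not a prescribed schema:

```python
import hashlib
import json
from dataclasses import dataclass, field


def content_hash(artifact: dict) -> str:
    """Hash the canonical JSON form of an artifact, so identical content gets
    the same identifier regardless of its display name."""
    canonical = json.dumps(artifact, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


@dataclass(frozen=True)
class TestCase:
    input: str
    expected: str            # gold answer, pattern, or rubric reference
    rubric_hash: str         # content hash of the immutable rubric artifact
    tags: tuple[str, ...]    # e.g. ("billing", "safety")


@dataclass(frozen=True)
class Suite:
    name: str
    cases: tuple[TestCase, ...]


@dataclass(frozen=True)
class Run:
    suite_name: str
    model_hash: str              # model + decoding params, hashed
    prompt_hash: str             # prompt version, hashed
    retrieval_config_hash: str   # retrieval config, hashed
    scores: dict = field(default_factory=dict)  # case id -> score record


@dataclass(frozen=True)
class Gate:
    baseline_run_id: str
    max_pass_rate_drop: float    # e.g. 0.02 absolute
    min_rubric_delta: float      # e.g. -0.10 on the rubric scale
```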
Architecture
- Suite registry — UI + API for creating, editing, and tagging test cases. Backed by a database.
- Run executor — pulls test cases, calls the target LLM (or product surface), stores raw outputs. Parallelized and rate-limited per provider.
- Scoring layer — applies deterministic checks (regex, JSON schema, exact match) plus rubric-based judges (LLM-as-judge with calibrated prompts) plus retrieval-quality scores.
- Human review queue — sampled runs surface to reviewers for adjudication; reviewer scores feed back to calibrate the LLM-judge.
- Comparison dashboard — side-by-side diff of two runs, regression highlights, attribution back to the change that caused them.
- Release-gate API — CI calls this with the candidate's run ID; returns pass / fail and the regression details.
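As an illustration of the gate call from CI, a sketch assuming a hypothetical `/v1/gates/check` endpoint and response shape:

```python
import sys

import requests


def check_gate(base_url: str, candidate_run_id: str, baseline_run_id: str) -> None:
    """Block the pipeline if the candidate run regresses against the baseline."""
    resp = requests.post(
        f"{base_url}/v1/gates/check",
        json={"candidate_run_id": candidate_run_id,
              "baseline_run_id": baseline_run_id},
        timeout=60,
    )
    resp.raise_for_status()
    result = resp.json()
    if not result["passed"]:
        # Print regression details so the CI log is actionable on its own.
        for reg in result.get("regressions", []):
            print(f"REGRESSION [{reg['slice']}] {reg['metric']}: "
                  f"{reg['baseline']:.3f} -> {reg['candidate']:.3f}")
        sys.exit(1)
    print("Gate passed: no significant regressions against baseline.")
```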
Scoring — the real complexity
LLM evaluation is hard because correctness is fuzzy. Walk through the scoring stack:
- Deterministic checks: schema validation, format compliance, banned phrases, citation presence. Fast and cheap; use these first (see the sketch after this list).
- Reference-based: BLEU / ROUGE / exact match against a gold answer. Useful only for narrow tasks.
- Rubric-based LLM-judge: the most flexible. Score each output on dimensions (correctness, helpfulness, safety) using a graded rubric. Calibrate the judge against human ratings on a held-out set.
- Pairwise preference: judge prefers A over B. Often more reliable than absolute scoring.
- Human-in-the-loop: required for safety-sensitive tasks and for periodic calibration audits.
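To make the deterministic layer concrete, a minimal sketch; the banned-phrase patterns, schema keys, and the `[n]` citation convention are assumptions:

```python
import json
import re

BANNED = [re.compile(p, re.I) for p in (r"as an ai language model", r"guaranteed refund")]


def deterministic_checks(output: str, schema_keys: set[str]) -> dict[str, bool]:
    """Cheap first-pass checks applied to a single model output."""
    results = {}
    # Format compliance: output must parse as JSON and contain the expected keys.
    try:
        parsed = json.loads(output)
        results["valid_json"] = True
        results["schema_keys_present"] = schema_keys <= parsed.keys()
    except (json.JSONDecodeError, AttributeError):
        results["valid_json"] = False
        results["schema_keys_present"] = False
    # Banned phrases and citation presence on the raw text.
    results["no_banned_phrases"] = not any(p.search(output) for p in BANNED)
    results["has_citation"] = bool(re.search(r"\[\d+\]", output))
    return results
```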
The architect signal is admitting that LLM-judge agreement with humans is typically 70-85% — meaningful, but not a substitute for sampled human review.
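One way to run that calibration audit is to compare judge verdicts against human labels on the held-out set and report raw agreement plus Cohen's kappa; this sketch assumes categorical verdicts such as pass/fail:

```python
from collections import Counter


def judge_agreement(judge: list[str], human: list[str]) -> tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) for paired categorical verdicts."""
    assert judge and len(judge) == len(human)
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Chance agreement from the two label distributions.
    jc, hc = Counter(judge), Counter(human)
    expected = sum((jc[label] / n) * (hc[label] / n) for label in jc.keys() | hc.keys())
    kappa = 1.0 if expected == 1 else (observed - expected) / (1 - expected)
    return observed, kappa
```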
Regression detection
A regression is "the new run is statistically worse than baseline." Several flavors:
- Pass rate drop: more cases failing the deterministic checks.
- Rubric score drop: average rubric score is lower, with a confidence interval.
- Slice regression: overall metrics fine, but a tagged subset (e.g., "billing questions") got worse — often the most important kind.
- Cost / latency regression: not just quality. A prompt that doubles tokens for marginal quality is still a regression.
Show comparison with significance testing (paired bootstrap or McNemar) — a 2% difference on 50 cases is noise.
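A minimal paired-bootstrap sketch over per-case score deltas; the resample count and alpha are illustrative defaults:

```python
import random


def is_regression(baseline: list[float], candidate: list[float],
                  n_resamples: int = 10_000, alpha: float = 0.05) -> bool:
    """Paired bootstrap: is the candidate's mean score significantly lower?"""
    assert len(baseline) == len(candidate)
    deltas = [c - b for b, c in zip(baseline, candidate)]
    worse = 0
    for _ in range(n_resamples):
        sample = random.choices(deltas, k=len(deltas))  # resample cases with replacement
        if sum(sample) / len(sample) < 0:
            worse += 1
    # Flag a regression only if almost every resampled mean shows a drop.
    return worse / n_resamples > 1 - alpha
```

Running the same test per tag gives the slice-regression check from the list above.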
Multi-tenant — why this is a platform
Each product team has its own:
- suite of test cases
- prompt versions and models
- gate thresholds
- review queue for their reviewers
Mention namespace isolation, per-team budgets, and a shared LLM-judge pool that all teams calibrate against the same human-rated calibration set.
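A per-team tenancy record might look like the sketch below; the field names and limits are assumptions, not a fixed schema:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TeamTenant:
    namespace: str                  # isolates suites, runs, and review queues
    monthly_judge_budget_usd: float # hard cap on judge spend for this team
    max_pass_rate_drop: float       # team-specific gate threshold
    judge_model_hash: str           # pinned shared judge, calibrated on the common human-rated set
```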
Failure modes worth naming
- Goodhart's law: teams over-fit to the eval suite. Counter by rotating a "hidden" subset of cases that aren't visible to the prompt authors.
- Stale gold: golden answers become wrong as products evolve. Schedule periodic re-validation by humans, not just versioning.
- Judge drift: when the judge model is updated, all historical scores become incomparable. Pin the judge model version per gate, upgrade explicitly.
- Cost explosion: rubric-based scoring with a frontier model is expensive. Use a cheaper judge for first-pass and escalate only on borderline cases (see the sketch after this list).
- Slow feedback loop: a 4-hour eval suite kills iteration speed. Provide a fast "smoke" suite (~5 minutes) for PR-time feedback and the full suite for release gates.
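The escalation pattern from the cost-explosion bullet, sketched with placeholder judge callables and illustrative borderline thresholds:

```python
def tiered_score(output: str, rubric: str, cheap_judge, frontier_judge,
                 low: float = 0.35, high: float = 0.75) -> float:
    """Score with the cheap judge; escalate only borderline cases."""
    score = cheap_judge(output, rubric)
    if low <= score <= high:
        # Borderline verdict: pay for the stronger judge only here.
        score = frontier_judge(output, rubric)
    return score
```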
Rollout
- Pilot with one team — usually the team with the most release pain. Build the full loop for their suite first.
- Onboard 2-3 more teams — each adds requirements that stress the platform. Resist the temptation to special-case; generalize instead.
- Mandatory gate — leadership commits to "no LLM feature ships without an eval gate." This is a policy decision, not a tooling one.
- Cost budgets — per-team budgets on judge calls so a runaway suite doesn't blow up the platform's bill.
What the architect signal looks like
Close with the meta point: the platform's job is to make the cost of not evaluating higher than the cost of evaluating. Name the one cultural shift you would defend (no LLM feature ships without a regression-gated eval), and the one technical investment that has compounding value (a calibrated LLM-judge with quarterly human re-validation).