Why this exists
This system stops silent regressions. State the failure modes it prevents:
- A prompt change "to fix one issue" tanks unrelated tasks.
- A model upgrade improves average quality but breaks safety-sensitive cases.
- A retrieval change improves recall but degrades faithfulness.
Without a platform, every team rolls their own evals, no one trusts the numbers, and the release gate is "the PM said it looked fine."
The platform's contract: every LLM-powered feature ships only after passing a versioned evaluation suite, and every regression is attributable to a specific change.
Core abstractions
A defensible design names four primary objects and their relationships:
- Suite — a named collection of test cases scoped to a product (support agent, RAG chatbot, summarization).
- Test case — (input, expected behavior, scoring rubric, tags). Behavior may be exact-match, semantic-match, or graded by rubric.
- Run — an execution of a suite against a (model, prompt-version, retrieval-config) triplet, with results stored.
- Gate — a policy attached to a release pipeline that compares a new run against a baseline and blocks promotion on regression.
Mention versioning explicitly: prompts, models, retrieval configs, and rubrics are all immutable artifacts with content hashes. A run identifies them by hash, not by name.
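A minimal sketch of these objects, assuming a SHA-256 content hash over the canonical JSON form of each artifact; the field names are illustrative, not a prescribed schema:

```python
import hashlib
import json
from dataclasses import dataclass, field


def content_hash(artifact: dict) -> str:
    """Hash the canonical JSON form of an artifact, so identical content gets
    the same identifier regardless of its display name."""
    canonical = json.dumps(artifact, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


@dataclass(frozen=True)
class TestCase:
    input: str
    expected: str            # gold answer, pattern, or rubric reference
    rubric_hash: str         # content hash of the immutable rubric artifact
    tags: tuple[str, ...]    # e.g. ("billing", "safety")


@dataclass(frozen=True)
class Suite:
    name: str
    cases: tuple[TestCase, ...]


@dataclass(frozen=True)
class Run:
    suite_name: str
    model_hash: str              # model + decoding params, hashed
    prompt_hash: str             # prompt version, hashed
    retrieval_config_hash: str   # retrieval config, hashed
    scores: dict = field(default_factory=dict)  # case id -> score record


@dataclass(frozen=True)
class Gate:
    baseline_run_id: str
    max_pass_rate_drop: float    # e.g. 0.02 absolute
    min_rubric_delta: float      # e.g. -0.10 on the rubric scale
```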
Architecture
- Suite registry — UI + API for creating, editing, and tagging test cases. Backed by a database.
- Run executor — pulls test cases, calls the target LLM (or product surface), stores raw outputs. Parallelized and rate-limited per provider.
- Scoring layer — applies deterministic checks (regex, JSON schema, exact match) plus rubric-based judges (LLM-as-judge with calibrated prompts) plus retrieval-quality scores.
- Human review queue — sampled runs surface to reviewers for adjudication; reviewer scores feed back to calibrate the LLM-judge.
- Comparison dashboard — side-by-side diff of two runs, regression highlights, attribution back to the change that caused them.
- Release-gate API — CI calls this with the candidate's run ID; returns pass / fail and the regression details.
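As an illustration of the gate call from CI, a sketch assuming a hypothetical `/v1/gates/check` endpoint and response shape:

```python
import sys

import requests


def check_gate(base_url: str, candidate_run_id: str, baseline_run_id: str) -> None:
    """Block the pipeline if the candidate run regresses against the baseline."""
    resp = requests.post(
        f"{base_url}/v1/gates/check",
        json={"candidate_run_id": candidate_run_id,
              "baseline_run_id": baseline_run_id},
        timeout=60,
    )
    resp.raise_for_status()
    result = resp.json()
    if not result["passed"]:
        # Print regression details so the CI log is actionable on its own.
        for reg in result.get("regressions", []):
            print(f"REGRESSION [{reg['slice']}] {reg['metric']}: "
                  f"{reg['baseline']:.3f} -> {reg['candidate']:.3f}")
        sys.exit(1)
    print("Gate passed: no significant regressions against baseline.")
```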
Scoring — the real complexity
LLM evaluation is hard because correctness is fuzzy. Walk through the scoring stack:
- Deterministic checks: schema validation, format compliance, banned phrases, citation presence. Fast and cheap; use these first (see the sketch after this list).
- Reference-based: BLEU / ROUGE / exact match against a gold answer. Useful only for narrow tasks.
- Rubric-based LLM-judge: the most flexible. Score each output on dimensions (correctness, helpfulness, safety) using a graded rubric. Calibrate the judge against human ratings on a held-out set.
- Pairwise preference: judge prefers A over B. Often more reliable than absolute scoring.
- Human-in-the-loop: required for safety-sensitive tasks and for periodic calibration audits.
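To make the deterministic layer concrete, a minimal sketch; the banned-phrase patterns, schema keys, and the `[n]` citation convention are assumptions:

```python
import json
import re

BANNED = [re.compile(p, re.I) for p in (r"as an ai language model", r"guaranteed refund")]


def deterministic_checks(output: str, schema_keys: set[str]) -> dict[str, bool]:
    """Cheap first-pass checks applied to a single model output."""
    results = {}
    # Format compliance: output must parse as JSON and contain the expected keys.
    try:
        parsed = json.loads(output)
        results["valid_json"] = True
        results["schema_keys_present"] = schema_keys <= parsed.keys()
    except (json.JSONDecodeError, AttributeError):
        results["valid_json"] = False
        results["schema_keys_present"] = False
    # Banned phrases and citation presence on the raw text.
    results["no_banned_phrases"] = not any(p.search(output) for p in BANNED)
    results["has_citation"] = bool(re.search(r"\[\d+\]", output))
    return results
```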
The architect signal is admitting that LLM-judge agreement with humans is typically 70-85% — meaningful, but not a substitute for sampled human review.
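One way to run that calibration audit is to compare judge verdicts against human labels on the held-out set and report raw agreement plus Cohen's kappa; this sketch assumes categorical verdicts such as pass/fail:

```python
from collections import Counter


def judge_agreement(judge: list[str], human: list[str]) -> tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) for paired categorical verdicts."""
    assert judge and len(judge) == len(human)
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Chance agreement from the two label distributions.
    jc, hc = Counter(judge), Counter(human)
    expected = sum((jc[label] / n) * (hc[label] / n) for label in jc.keys() | hc.keys())
    kappa = 1.0 if expected == 1 else (observed - expected) / (1 - expected)
    return observed, kappa
```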
Regression detection
A regression is "the new run is statistically worse than baseline." Several flavors:
- Pass rate drop: more cases failing the deterministic checks.
- Rubric score drop: average rubric score is lower, with a confidence interval.
- Slice regression: overall metrics fine, but a tagged subset (e.g., "billing questions") got worse — often the most important kind.
- Cost / latency regression: not just quality. A prompt that doubles tokens for marginal quality is still a regression.
Show comparison with significance testing (paired bootstrap or McNemar) — a 2% difference on 50 cases is noise.
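A minimal paired-bootstrap sketch over per-case score deltas; the resample count and alpha are illustrative defaults:

```python
import random


def is_regression(baseline: list[float], candidate: list[float],
                  n_resamples: int = 10_000, alpha: float = 0.05) -> bool:
    """Paired bootstrap: is the candidate's mean score significantly lower?"""
    assert len(baseline) == len(candidate)
    deltas = [c - b for b, c in zip(baseline, candidate)]
    worse = 0
    for _ in range(n_resamples):
        sample = random.choices(deltas, k=len(deltas))  # resample cases with replacement
        if sum(sample) / len(sample) < 0:
            worse += 1
    # Flag a regression only if almost every resampled mean shows a drop.
    return worse / n_resamples > 1 - alpha
```

Running the same test per tag gives the slice-regression check from the list above.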
Multi-tenant — why this is a platform
Each product team has its own:
- suite of test cases
- prompt versions and models
- gate thresholds
- review queue for their reviewers
Mention namespace isolation, per-team budgets, and a shared LLM-judge pool that all teams calibrate against the same human-rated calibration set.
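A per-team tenancy record might look like the sketch below; the field names and limits are assumptions, not a fixed schema:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TeamTenant:
    namespace: str                  # isolates suites, runs, and review queues
    monthly_judge_budget_usd: float # hard cap on judge spend for this team
    max_pass_rate_drop: float       # team-specific gate threshold
    judge_model_hash: str           # pinned shared judge, calibrated on the common human-rated set
```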
Failure modes worth naming
- Goodhart's law: teams over-fit to the eval suite. Counter by rotating a "hidden" subset of cases that aren't visible to the prompt authors.
- Stale gold: golden answers become wrong as products evolve. Schedule periodic re-validation by humans, not just versioning.
- Judge drift: when the judge model is updated, all historical scores become incomparable. Pin the judge model version per gate, upgrade explicitly.
- Cost explosion: rubric-based scoring with a frontier model is expensive. Use a cheaper judge for first-pass and escalate only on borderline cases (see the sketch after this list).
- Slow feedback loop: a 4-hour eval suite kills iteration speed. Provide a fast "smoke" suite (~5 minutes) for PR-time feedback and the full suite for release gates.
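The escalation pattern from the cost-explosion bullet, sketched with placeholder judge callables and illustrative borderline thresholds:

```python
def tiered_score(output: str, rubric: str, cheap_judge, frontier_judge,
                 low: float = 0.35, high: float = 0.75) -> float:
    """Score with the cheap judge; escalate only borderline cases."""
    score = cheap_judge(output, rubric)
    if low <= score <= high:
        # Borderline verdict: pay for the stronger judge only here.
        score = frontier_judge(output, rubric)
    return score
```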
Rollout
- Pilot with one team — usually the team with the most release pain. Build the full loop for their suite first.
- Onboard 2-3 more teams — each adds requirements that stress the platform. Resist the temptation to special-case; generalize instead.
- Mandatory gate — leadership commits to "no LLM feature ships without an eval gate." This is a policy decision, not a tooling one.
- Cost budgets — per-team budgets on judge calls so a runaway suite doesn't blow up the platform's bill.
What the architect signal looks like
Close with the meta point: the platform's job is to make the cost of not evaluating higher than the cost of evaluating. Name the one cultural shift you would defend (no LLM feature ships without a regression-gated eval), and the one technical investment that has compounding value (a calibrated LLM-judge with quarterly human re-validation).