Building an eval harness that actually catches regressions
2025-05-22 / 2 min / llm / evals / rag / production
Retrieval and prompt evaluation pipelines that drove an 18% relative lift in rubric pass rate over the prior eval harness, measured on production-derived canary sets. Plus why most eval setups silently lie to you.
The problem with most LLM evals
Teams ship an LLM feature. Their offline benchmark numbers go up. Then production users start complaining. The usual culprit is an eval suite that grades on the wrong axis: average correctness on a static set, not the long-tail regressions that appear when retrieval, prompt, or model versions drift. (This is one of the five things that have to be built before an AI feature stops being a demo.)
If your eval pipeline gives you a single green number per release, it is almost certainly hiding the regressions you most need to see.
What we built
A harness that scored every release across three independent axes. Retrieval grounding: did the model use evidence that was actually retrieved? Prompt-following: did it obey constraints on out-of-distribution queries? Rubric-based correctness: did it pass a versioned gold set graded by a stronger model, with sampled human spot-checks?
Each axis ran against the same canary set so regressions could not hide behind a flattering mean. Scoring traces were written to a queryable store with the inputs, outputs, retrieved documents, and grader rationale for every example.
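A minimal sketch of what one scoring trace could look like in a queryable store, using SQLite from the Python standard library. The table name, column names, and the helpers `open_store` and `write_trace` are assumptions for illustration, not the schema the harness actually used.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a hypothetical trace store with one row per release, query, and axis."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS traces (
               release TEXT,
               query_id TEXT,
               axis TEXT,
               passed INTEGER,
               input TEXT,
               output TEXT,
               retrieved_docs TEXT,
               grader_rationale TEXT,
               PRIMARY KEY (release, query_id, axis))"""
    )
    return conn


def write_trace(conn, release, query_id, axis, passed,
                input_text, output_text, retrieved_docs, rationale):
    """Store everything needed to explain a score later, not just the score."""
    conn.execute(
        "INSERT OR REPLACE INTO traces VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (release, query_id, axis, int(passed), input_text, output_text,
         json.dumps(retrieved_docs), rationale),
    )
    conn.commit()
```

Keeping the retrieved documents and grader rationale next to the pass/fail bit is what makes a failure debuggable from the store alone.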
Critically, the queryable store let us ask "which queries got worse between v23 and v24" instead of "did the average go up". That single shift in question was the biggest unlock.
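The shift from "did the average go up" to "which queries got worse" can be sketched as a per-query diff between two releases. The `Score` record and the functions `pass_rate` and `regressions` are hypothetical names for this illustration, and the example data in the usage note is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    query_id: str
    version: str
    axis: str      # e.g. "retrieval_grounding", "prompt_following", "rubric"
    passed: bool


def pass_rate(scores, version):
    """The single aggregate number: fraction of scored examples that passed."""
    results = [s.passed for s in scores if s.version == version]
    return sum(results) / len(results)


def regressions(scores, old, new):
    """Return (query_id, axis) pairs that passed on `old` and failed on `new`."""
    by_key = {(s.query_id, s.axis, s.version): s.passed for s in scores}
    worse = []
    for (query_id, axis, version), passed in by_key.items():
        if version != old or not passed:
            continue
        # Only count pairs that were actually scored, and failed, on the new release.
        if by_key.get((query_id, axis, new)) is False:
            worse.append((query_id, axis))
    return sorted(worse)
```

With two queries where one flips from pass to fail and the other flips from fail to pass, `pass_rate` is identical for both releases while `regressions` still surfaces the query that broke.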
Why it moved the needle
The 18% lift is a relative improvement in rubric pass rate, measured on production-derived canary sets rather than on a universal benchmark. It did not come from a smarter prompt. It came from being able to see, in minutes, which kinds of queries each iteration was breaking. Roughly half the prompt and retrieval changes we had previously shipped would have been reverted if we had been measuring this way from the start.
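For readers unsure what "relative" means here, a small sketch of the arithmetic. The function name `relative_lift` is an assumption, and the pass rates in the usage note are hypothetical numbers chosen only to illustrate the calculation, not the project's actual rates.

```python
def relative_lift(baseline_pass_rate, new_pass_rate):
    """Relative improvement of a pass rate over a baseline, as a fraction."""
    if baseline_pass_rate <= 0:
        raise ValueError("baseline pass rate must be positive")
    return (new_pass_rate - baseline_pass_rate) / baseline_pass_rate
```

For example, moving from a hypothetical 50% pass rate to 59% gives `relative_lift(0.50, 0.59)` of roughly 0.18, an 18% relative lift, even though the absolute gain is 9 percentage points.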
We also caught a class of silent regression where retrieval quality dropped but generation papered over it with plausible-sounding output. Without scoring retrieval and generation separately, that kind of failure is invisible.
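One way to surface this class of failure, sketched under assumptions: each example carries a numeric retrieval grounding score and a boolean rubric result per release. The field names `retrieval` and `rubric_pass`, the function name, and the `min_drop` threshold are all hypothetical.

```python
def silent_retrieval_regressions(old, new, min_drop=0.2):
    """Flag queries whose retrieval score fell while the rubric still passed.

    `old` and `new` map query_id -> {"retrieval": float, "rubric_pass": bool}.
    `min_drop` is an assumed threshold, to be tuned per canary set.
    """
    flagged = []
    for query_id, before in old.items():
        after = new.get(query_id)
        if after is None:
            continue  # query not scored on the new release
        drop = before["retrieval"] - after["retrieval"]
        # Retrieval got worse, yet generation still looks fine to the rubric grader.
        if drop >= min_drop and after["rubric_pass"]:
            flagged.append(query_id)
    return sorted(flagged)
```

A combined score would show these examples as passing; only the per-axis comparison exposes them.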
What I would do differently
Build the eval pipeline before the first feature ships. Otherwise every team rationalises "we will add evals once the prompt stabilises". It never stabilises.
Treat the gold set as a living artefact. Stale gold sets reward overfitting to the past. We added a weekly process where a small number of new examples were curated from production failures, with a 60-day rotation on the oldest examples.
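A rough sketch of the weekly refresh, assuming "60-day rotation" means retiring examples that have been in the set for more than 60 days. That reading, the function name `rotate_gold_set`, and the `id` and `added` fields are assumptions for illustration.

```python
from datetime import date, timedelta


def rotate_gold_set(gold, new_examples, today, max_age_days=60):
    """Retire examples older than `max_age_days` and add freshly curated ones.

    `gold` is a list of dicts with "id" and "added" (a date).
    `new_examples` is a list of dicts with at least "id", curated from production failures.
    """
    cutoff = today - timedelta(days=max_age_days)
    kept = [ex for ex in gold if ex["added"] >= cutoff]
    known_ids = {ex["id"] for ex in kept}
    # Stamp new examples with today's date so they age out on the same schedule.
    fresh = [dict(ex, added=today) for ex in new_examples if ex["id"] not in known_ids]
    return kept + fresh
```

Because the set changes over time, pass rates are only comparable across releases scored against the same gold set version, which is one reason to version it.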
If your team is shipping LLM features and your offline numbers look green while production users complain, this is the kind of work I take on. Send a brief.
