The Eval Bus Factor: When the Person Who Defined 'Correct' Walks Out the Door
A team I worked with recently lost their senior ML engineer. Two weeks later, the eval suite was still green on every PR — 847 cases, all passing, judge agreement at 92%. Six weeks later, a customer found a regression that should have been caught by the very first eval case in the support-quality bucket. When the team went to debug, nobody could explain why that case had been written, what failure mode it was supposed to catch, or why the judge prompt graded it on a 1–4 scale instead of binary. The case was still passing. It just wasn't testing anything anyone could name.
This is the eval bus factor: the silent failure mode where the person who decided what "correct" means for your AI feature was also the person who curated the test cases, calibrated the judge, and absorbed every implicit labeling tradeoff in their head. When they leave, the suite remains green but stops generating reliable promote/reject signal — because nobody else can extend it, debug a flaky judge, or evaluate whether a new failure mode belongs in the test set.
The reason this is so easy to miss is that traditional bus-factor metrics look at code authorship. The eval suite is code, so a Sourcegraph query says "two engineers have committed; bus factor is 2." But the load-bearing artifact is not the code that runs the eval — it's the judgment encoded into the test cases and the judge prompt. That judgment lives in someone's head, and the commit history doesn't capture it.
Why eval ownership concentrates in one person
Evals concentrate ownership because grading AI output is one of the few engineering tasks where the correctness criterion itself is unstable. When you write a unit test for a function that adds two integers, the answer is in the spec; whoever wrote the spec, whoever writes the test, and whoever reviews the test will all agree on what passes. When you write an eval case for "the support agent gave a helpful response to a billing question," there is no spec. There is a person who decided, after looking at 200 real interactions, what "helpful" should mean for this product, this customer segment, and this regulatory context.
That decision rarely gets written down. It gets lived — the same person curates the test cases that exemplify the rule, picks the judge prompt that operationalizes it, sets the threshold above which a model is "good enough," and resolves edge cases by gut feel during weekly eval review. By the time the suite has a few hundred cases, the rule exists nowhere else in the organization. It is the cumulative residue of one person's grading.
This is doubly true for LLM-as-a-judge setups. The judge prompt looks like documentation, but it is not. It is a compressed approximation of the labeler's intent, and it only stays calibrated as long as someone is watching its outputs and noticing when it starts drifting. Researchers have shown that natural rubric refinement systematically biases judge preferences over time — what they call rubric-induced preference drift — which means even a well-written judge prompt is a moving target that needs an owner.
Three failure modes after the owner leaves
The damage is not immediate. Evals fail in slow, hard-to-detect ways:
- Frozen coverage. The suite stops growing. New product surfaces ship without corresponding eval cases because nobody on the team feels qualified to decide what a representative case looks like, or what the right grading rubric would be. Coverage, measured as the share of shipped features that have corresponding eval cases, can silently decay from 80% to 30% over a quarter.
- Judge rot. The LLM-as-a-judge prompt was tuned against the original labeler's intuitions. As the upstream model changes, as the input distribution shifts, or as the labeler's own taste would have evolved, the judge produces verdicts that diverge from what the team would now consider correct. Nobody notices because there's no human re-grading sample to catch it.
- Capability blindness. A new failure mode emerges — say, the model starts hallucinating product SKUs in a way it didn't six months ago. The eval suite has no case for it, because the rule for "what failure modes deserve a case" was the property of the person who left. New cases get added defensively after each customer complaint, but the suite is now reactive rather than predictive.
The pattern engineers describe most often is the third one: the team realizes, usually after a public incident, that they have been treating "the eval suite is green" as evidence the model is safe to ship, when in reality the suite was only ever evidence that yesterday's failure modes weren't recurring.
Why this is harder than a normal bus-factor problem
Standard bus-factor mitigations — pair programming, doc rotations, code reviews — don't fully translate. Code review on an eval-case PR can verify that the test runs, but it can't verify that the case represents a real failure mode worth catching, because the reviewer hasn't seen the underlying production data the labeler was reacting to. Documentation written by the original owner reads as obvious truth to them and as inscrutable trivia to everyone else.
There's a deeper issue. Eval ownership is what an academic might call taste-laden. The judgment "this output is too curt for our brand voice" or "this response is technically correct but legally risky" is not a rule that can be exhaustively codified. It's a calibrated intuition built from hundreds of small decisions. Succession-planning literature for subject matter experts is blunt about this: you cannot transfer tacit expertise through documentation alone. You transfer it by having the successor work alongside the expert long enough to make the same micro-decisions and get corrected.
Most AI teams discover this when the original owner has already left and a new hire is staring at 847 eval cases trying to reverse-engineer the philosophy that produced them. By then the cost is paid.
Practices that build redundancy before you need it
The teams that survive a key departure have usually adopted some version of the following four practices. None of them are surprising; what's notable is that they are usually only adopted after the first painful loss.
Codified label rubrics with worked examples. Every label dimension — "helpful," "safe," "compliant," "on-brand" — gets a written rubric with at least three positive and three negative worked examples drawn from real production data. The rubric is updated whenever the labeler resolves a hard edge case during eval review, and the resolved case goes into the worked-examples list with a one-line note explaining why. This is the only artifact that converts tacit grading judgment into something a successor can study.
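In practice, the rubric works best as a small structured artifact versioned in the eval repo next to the cases it governs, rather than as a wiki page that rots separately. Here is a minimal sketch, assuming a plain dataclass layout; the names (LabelRubric, WorkedExample), the "helpful" definition, and the example content are illustrative rather than any particular framework's API, and a real rubric would carry the full three-plus examples per side.

```python
# Minimal sketch of a codified label rubric kept in version control next to the
# eval cases it governs. All names and example content are illustrative.
from dataclasses import dataclass, field

@dataclass
class WorkedExample:
    model_output: str   # real (redacted) production output
    verdict: str        # "pass" or "fail" under this rubric
    note: str           # one line: why the labeler graded it this way

@dataclass
class LabelRubric:
    dimension: str      # e.g. "helpful", "safe", "compliant", "on-brand"
    definition: str     # the written rule a successor can study
    worked_examples: list[WorkedExample] = field(default_factory=list)

helpful = LabelRubric(
    dimension="helpful",
    definition=(
        "Resolves the customer's billing question in one reply, cites the "
        "relevant plan or invoice, and never asks for information we already hold."
    ),
    worked_examples=[
        WorkedExample(
            model_output="Your March invoice was prorated because you upgraded mid-cycle...",
            verdict="pass",
            note="Explains the charge and names the invoice; no unnecessary back-and-forth.",
        ),
        WorkedExample(
            model_output="Please contact billing support for help with this charge.",
            verdict="fail",
            note="Deflects instead of resolving; resolved as a fail in a weekly eval review.",
        ),
    ],
)
```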
Judge-prompt rationale notes. The LLM-as-a-judge prompt is checked into version control alongside a separate file explaining why each instruction in the prompt is there: which past failure mode it prevents, which rubric clause it operationalizes, what the alternative phrasings were and why they were rejected. This file is the difference between a successor being able to safely modify the judge and a successor being terrified to touch it.
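The exact format matters less than the fact that the rationale lives next to the prompt and is updated in the same commit. A minimal sketch, assuming a plain mapping from prompt instruction to rationale; the field names and the incident details are invented for illustration:

```python
# Minimal sketch of a judge-prompt rationale file, checked in alongside the
# prompt itself. Field names and incident details are illustrative.
JUDGE_PROMPT_RATIONALE = {
    "Fail any refund answer that quotes a specific dollar amount": {
        "prevents": "incident where the model invented refund amounts",
        "rubric_clause": "safe: no unverifiable financial commitments",
        "alternatives_rejected": (
            "Softer wording ('be cautious about amounts') was tried first; "
            "judge agreement with human grades dropped because 'cautious' "
            "was applied inconsistently."
        ),
    },
    "Grade pass/fail, not on a 1-4 scale": {
        "prevents": "threshold drift: mid-scale scores let borderline outputs through",
        "rubric_clause": "all dimensions",
        "alternatives_rejected": (
            "the original 1-4 scale, retired once binary grading proved easier to calibrate"
        ),
    },
}
```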
- Eval-review rotation. Weekly eval review — the meeting where someone looks at recent failures, decides whether they represent real bugs or judge errors, and updates cases accordingly — rotates among at least three engineers. For the first month the original labeler sits in as a silent observer; after that they act as a consultant the rotating engineer can ask questions of, and eventually the rotation runs without them. This is the equivalent of a residency program for eval taste, and it is the practice that most directly shortens the recovery time after a departure.
Judge-agreement tracking with mandatory human re-grading. Five to ten percent of judge verdicts are re-graded by a human every week. The agreement rate is plotted over time. If it drifts downward, that is a leading indicator that either the judge is broken, the task has shifted, or the rubric has aged out of its original calibration. This works as a tripwire: even if the original owner left and nobody noticed the suite going stale, the agreement-drift chart will eventually scream.
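A minimal sketch of that tripwire, assuming you can pull the week's judge verdicts out of your eval store; the names (Verdict, sample_for_regrade), the sampling rate, and the alert threshold are assumptions you would calibrate against your own history:

```python
# Minimal sketch of weekly judge-agreement tracking with human re-grading.
# Thresholds and names are assumptions; wire it to your own eval store and alerting.
import random
from dataclasses import dataclass

SAMPLE_RATE = 0.08       # re-grade roughly 5-10% of judge verdicts each week
AGREEMENT_FLOOR = 0.85   # below this, alert the eval owner (calibrate to your own baseline)

@dataclass
class Verdict:
    case_id: str
    judge_pass: bool
    human_pass: bool | None = None   # filled in during human re-grading

def sample_for_regrade(verdicts: list[Verdict], rate: float = SAMPLE_RATE) -> list[Verdict]:
    """Pick a random slice of this week's judge verdicts for a human to re-grade."""
    k = max(1, int(len(verdicts) * rate))
    return random.sample(verdicts, k)

def agreement_rate(regraded: list[Verdict]) -> float:
    """Fraction of re-graded cases where the human agreed with the judge."""
    scored = [v for v in regraded if v.human_pass is not None]
    if not scored:
        return 1.0
    return sum(v.judge_pass == v.human_pass for v in scored) / len(scored)

def should_alert(weekly_rates: list[float]) -> bool:
    """True when the latest weekly agreement rate has drifted below the floor."""
    return bool(weekly_rates) and weekly_rates[-1] < AGREEMENT_FLOOR
```

Plot the weekly agreement rate over time; the chart, not any single number, is the artifact that keeps sounding the alarm after the original owner is gone.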
Staffing it like the load-bearing asset it is
The deeper change is organizational. Most teams treat the eval set as something engineers maintain on the margin — twenty percent of one person's time, usually that of the most ML-fluent engineer, who also happens to be working on the model itself. This is the staffing pattern that produces a single point of failure.
The teams that recover quickly from a departure treat the eval set the way infrastructure teams treat a load-bearing service: it has a named primary owner, a named secondary owner, and an explicit succession plan. The secondary is not "in the loop" by happenstance — they actively co-author cases, attend every review, and are expected to be able to take over on a week's notice. When the primary takes a vacation, the secondary runs the weekly review. When the primary changes teams, the secondary becomes the primary and a new secondary is appointed before the handoff completes.
This sounds heavy until you compare it to the cost of the alternative. A green-but-meaningless eval suite is worse than no eval suite at all, because it gives the team false confidence to ship. The implicit budget for the eval set should be set against the cost of the next undetected regression, not against the engineer-hours it takes to maintain coverage.
What to do this quarter if you can't restructure your team
Most readers won't be able to immediately appoint co-owners and rotate reviews. The minimum viable intervention is smaller and worth doing this week:
- Pick the five most-cited eval cases — the ones that have caught the most regressions or are most often referenced in promote/reject decisions. For each, write a one-paragraph rationale: what failure mode it tests, what real production incident motivated it, why the grading rubric is set the way it is. Check those rationales into the repo next to the cases; a small CI check that keeps this habit honest is sketched below.
- Find the most opinionated section of your judge prompt — usually the part that distinguishes "good" from "great" or "acceptable" from "risky." Have the original author dictate, in plain language, what they were trying to capture. Save it as a comment in the prompt.
- Schedule one hour with a second engineer where the eval owner walks through their last week's grading decisions out loud. Record it. The recording is not for compliance; it's for the next person who has to learn this taste.
These are the eval-suite equivalent of writing down the wifi password before going on vacation. They will not save you in a true emergency, but they will dramatically shorten the recovery if the emergency comes.
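To make the first of these stick beyond one afternoon, make it mechanical. Here is a minimal sketch of a CI check that fails when an eval case lacks a sibling rationale note; the directory layout (evals/cases/*.json, evals/rationales/*.md) and the required sections are assumptions to adapt to your own repo:

```python
# Minimal sketch of a CI check: every eval case file must have a sibling
# rationale note containing a few required sections. Paths and section names
# are assumptions; adapt them to your repo.
import sys
from pathlib import Path

CASES_DIR = Path("evals/cases")
RATIONALES_DIR = Path("evals/rationales")
REQUIRED_SECTIONS = ("Failure mode:", "Motivating incident:", "Why this rubric:")

def missing_rationales() -> list[str]:
    problems = []
    for case_file in sorted(CASES_DIR.glob("*.json")):
        rationale = RATIONALES_DIR / f"{case_file.stem}.md"
        if not rationale.exists():
            problems.append(f"{case_file.name}: no rationale file")
            continue
        text = rationale.read_text()
        for section in REQUIRED_SECTIONS:
            if section not in text:
                problems.append(f"{rationale.name}: missing '{section}' section")
    return problems

if __name__ == "__main__":
    problems = missing_rationales()
    for p in problems:
        print(p)
    sys.exit(1 if problems else 0)
```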
The forward-looking shift
The interesting structural change happening across AI engineering organizations in 2026 is that eval ownership is finally being treated as a discipline rather than a side effect of being the most ML-fluent person on the team. Some teams now have dedicated "evaluation engineers" whose entire job is to maintain the rubrics, train successors, and audit judge calibration. The role looks a lot like an SRE for measurement: a specialist who owns the reliability of the system that tells you whether your other systems are reliable.
If the bus factor of your eval suite is currently 1, you have an outage waiting to happen. The fix is not to write more cases. It is to make sure that the next time the person who knows what "correct" means walks out the door, the suite still has someone who can extend it, debug it, and recognize when it has stopped telling the truth.
- https://hamel.dev/blog/posts/evals-faq/
- https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
- https://amankhan1.substack.com/p/how-ai-pms-and-ai-engineers-collaborate
- https://newsletter.pragmaticengineer.com/p/evals
- https://langfuse.com/docs/evaluation/evaluation-methods/llm-as-a-judge
- https://www.evidentlyai.com/llm-guide/llm-as-a-judge
- https://swimm.io/learn/developer-experience/what-is-the-bus-factor-why-it-matters-and-how-to-increase-it
- https://arxiv.org/pdf/2202.01523
- https://www.cornerstoneondemand.com/resources/article/succession-planning-organizational-subject-matter-experts/
- https://medium.com/@adnanmasood/rubric-based-evals-llm-as-a-judge-methodologies-and-empirical-validation-in-domain-context-71936b989e80
