Eval-as-Code: When Your Release Gate Is a Notebook on Someone's Laptop
The number that decides whether a model goes to production is being produced by a Jupyter notebook running on a single engineer's MacBook, against a CSV that lives in a Slack DM, scored by a judge model that nobody pinned. Two weeks later, after the engineer has touched the notebook three more times and the API provider has silently shipped a minor model update, nobody on the team can reproduce the number — including the engineer who originally generated it. And yet that number is the gate. It decided that GPT-4o-mini was good enough to replace GPT-4 in the customer support flow. It decided the new prompt template shipped. It decided the fine-tune was promoted. The team is treating it like a load-bearing artifact and storing it like a sticky note.
This is the eval gap. The industry has spent five years writing about evaluation as a methodology problem — which scoring technique, which judge model, which rubric, which dataset — and almost no time writing about evaluation as an engineering problem. But the moment your eval suite starts gating production releases, it inherits every requirement that the rest of your production stack lives by: reproducibility, version control, ownership, observability, dependency management, latency and reliability budgets, and a pipeline that survives the engineer who built it leaving the team. Most teams skip this layer entirely and discover its absence only after a major incident, usually one where the eval score said green and the customer experience said red.
The fix is not a better eval framework. The fix is treating the eval suite like the production system it has quietly become — applying the same engineering rigor to the pipeline that you apply to the service it gates. Call it eval-as-code. The phrase is intentionally borrowed from infrastructure-as-code: not because evals are infrastructure, but because the failure mode is the same. The thing that decides what gets deployed is itself an undocumented, unreproducible, manually-operated artifact, and the org has structurally underinvested in it because nobody put it on a roadmap.
The Notebook Problem Is Worse Than the Notebook
Engineers who hear "your eval is a notebook" reach for the obvious fix: rewrite it as a Python module, put it in the main repo, and call it done. That misses the deeper problem. The notebook is a symptom of a class of decisions the team made — implicitly, through neglect — about what the eval suite is for. Each of those decisions has to be made explicit before the rewrite means anything.
Who owns the eval? In most teams, the answer is "whoever wrote it" — usually the ML engineer or applied scientist who first proposed the feature. There is no CODEOWNERS entry. When that person rotates off the project, the eval becomes orphan code. New team members are afraid to touch it because they don't know which thresholds are load-bearing. The judge prompts are uncommented. Three months later, the eval still runs, but nobody on the current team can answer why a particular threshold is 0.72 and not 0.75.
Where does the eval live in the SDLC? Most teams answer this as "we run it before merging," which sounds like CI but isn't. CI means it runs automatically on every change, in a sandboxed environment, with reproducible inputs, against a known dataset, and produces a result the merge button reads. A laptop run is a manual ritual; the engineer remembers to do it most of the time. That's not a quality gate, that's a vibes-based release process with extra steps.
What changes when the eval changes? In production code, a behavior change goes through review, semantic versioning, and a deploy. Eval suites get edited in place. Yesterday the dataset had 200 examples; today it has 230 because someone added their favorite edge cases. The threshold that used to mean "we shipped" now means something different, and the historical scores in the dashboard are no longer comparable to the current ones. There's no eval v1.4.0 tag. There's no migration note. The dashboard shows a smooth line that's actually three different metrics in a trench coat.
These are the questions a serious eval-as-code investment answers. They are not technical questions; they are organizational ones, and the rewrite-the-notebook approach skips them.
Versioning Everything That Touches the Score
A reproducible eval is a function, in the mathematical sense: same inputs produce the same outputs. The reason eval suites lose this property is that the inputs are scattered across systems with different versioning models, and only some of them get pinned. The complete input list is longer than most teams expect.
The dataset itself needs a version. Not just a filename — a content hash, a commit, an immutable snapshot. "We re-run on the latest dataset" is a phrase that hides a regression: the new failures might not be model regressions at all, they might be the dataset getting harder. Without a version, you can't tell. The cheapest fix is committing the dataset to git as a versioned artifact and treating each addition as a PR. That feels heavyweight for the first few iterations and then pays for itself the first time someone asks "did the score drop because the model got worse, or because we added the new edge cases?"
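A minimal sketch of what that pinning can look like, assuming the dataset is a JSONL file committed alongside the harness; the helper name here is illustrative, not from any particular framework:

```python
import hashlib
import json
from pathlib import Path

def dataset_fingerprint(path: Path) -> dict:
    """Compute an immutable identifier for an eval dataset snapshot.

    The content hash changes whenever an example is added, removed, or
    edited, so two runs are comparable only if their fingerprints match.
    """
    raw = path.read_bytes()
    examples = [json.loads(line) for line in raw.splitlines() if line.strip()]
    return {
        "path": str(path),
        "sha256": hashlib.sha256(raw).hexdigest(),
        "num_examples": len(examples),
    }

# Written into every run record, so "did the dataset change between these
# two runs?" becomes a field comparison instead of an archaeology project.
fingerprint = dataset_fingerprint(Path("evals/datasets/support_flow.jsonl"))
```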
The prompts under test need versions, with the same discipline. A prompt change is a behavior change, and a behavior change without a version is a Heisenbug factory. Several teams have written about treating prompts like code, and the consensus is converging: store them in git, give them semantic versions, reference them by version in deployment metadata, and never edit them in a UI that doesn't write back to source control. A prompt management UI is fine; a prompt management UI that is the source of truth is not.
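One lightweight way to make the git-backed store concrete; the directory layout (prompts/&lt;name&gt;/&lt;version&gt;.txt) is an assumption for illustration, not a convention any particular tool enforces:

```python
from pathlib import Path

PROMPT_ROOT = Path("prompts")  # lives in the same repo as the service

def load_prompt(name: str, version: str) -> str:
    """Resolve a prompt strictly by name and semantic version.

    There is deliberately no "latest" fallback: deployment metadata has to
    name an exact version, so a prompt edit cannot reach production or the
    eval suite without a new versioned file appearing in review.
    """
    return (PROMPT_ROOT / name / f"{version}.txt").read_text()

# The deployed config pins the prompt version the same way it pins the model.
system_prompt = load_prompt("support_reply", "1.4.0")
```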
The judge model needs a pin. This is the one teams forget most often, because the judge feels like infrastructure rather than a parameter. It isn't. The judge is a model with weights, and provider APIs change those weights without notice — sometimes silently, sometimes under a "minor update" label that doesn't roll the version string. Pinning to a dated model alias (claude-sonnet-4-6-20250122, not claude-sonnet-latest) is the minimum bar. The next bar up is running a small calibration set against the judge on a schedule and alerting when the judge's scores on a frozen reference set drift, because even pinned providers occasionally re-route inference.
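A sketch of what that calibration check might look like; judge_score stands in for whatever call the harness already makes to the judge, and the drift threshold is an assumption to be tuned against the judge's observed variance:

```python
import statistics

JUDGE_MODEL = "claude-sonnet-4-6-20250122"  # dated alias, never "-latest"
DRIFT_THRESHOLD = 0.05  # illustrative; set from the judge's observed variance

def check_judge_drift(reference_set, judge_score, baseline_mean: float) -> None:
    """Score a frozen reference set with the pinned judge and flag drift.

    Run this on a schedule, not just at release time: the point is to notice
    that the judge itself moved before a release run silently inherits it.
    """
    scores = [judge_score(JUDGE_MODEL, example) for example in reference_set]
    current_mean = statistics.mean(scores)
    if abs(current_mean - baseline_mean) > DRIFT_THRESHOLD:
        # In a real pipeline this would page or open an alert, not raise.
        raise RuntimeError(
            f"Judge drift: reference-set mean moved from "
            f"{baseline_mean:.3f} to {current_mean:.3f}"
        )
```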
The harness needs dependency pinning the same way any production service does — a lockfile, a fixed Python version, a containerized runtime. Eval pipelines that use whatever pandas the developer has installed will produce results that depend on whatever pandas the developer had installed. You will discover this the day the engineer upgrades and a tokenizer change shifts every score by 1%.
And the model parameters need to be in the version. Temperature, top-p, seed, max tokens, system prompt — all of it. These are not eval-pipeline configuration; they are part of the eval contract. A team that runs the same prompts at temperature 0.2 today and 0.0 tomorrow is producing two different metrics with the same name.
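All of those inputs fit in a single record that the harness writes next to the score; a minimal sketch, with field names and values that are illustrative rather than prescriptive:

```python
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class EvalRunRecord:
    """Everything needed to reproduce this score, stored next to the score."""
    harness_commit: str   # git SHA of the eval harness itself
    dataset_sha256: str   # content hash of the dataset snapshot
    prompt_version: str   # semantic version of the prompt under test
    model: str            # dated alias of the model under test
    judge_model: str      # dated alias of the judge
    temperature: float
    top_p: float
    seed: int
    max_tokens: int
    score: float

record = EvalRunRecord(
    harness_commit="9f3c2ab",
    dataset_sha256="4be0a91d",  # truncated for the example
    prompt_version="1.4.0",
    model="gpt-4o-mini-2024-07-18",
    judge_model="claude-sonnet-4-6-20250122",
    temperature=0.0,
    top_p=1.0,
    seed=12345,
    max_tokens=1024,
    score=0.78,
)
print(json.dumps(asdict(record)))
```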
SLOs for the Pipeline That Decides What Ships
This is the part that most teams skip even when they have done everything else right. They have versioned the dataset and pinned the judge and CI-gated the eval — and then they treat the pipeline itself as best-effort infrastructure. It runs in a corner of GitHub Actions on a free tier. When it times out, the dev team retries it. When it fails, someone manually re-kicks it. When the judge API is rate-limited at 3 a.m. before the morning release, the eval just doesn't run, and somebody approves the release based on what it scored last week.
A release gate without an SLO is not a release gate. It's a suggestion. The eval pipeline needs the same uptime, latency, and reliability targets that the service it gates has. A practical first cut (with a sketch of computing these from the pipeline's own run log after the list):
- Latency: P95 wall-clock time from PR open to eval result, with an explicit budget. If the suite takes 90 minutes, engineers will route around it.
- Reliability: Success rate of the pipeline itself (eval ran end-to-end and produced a score), separate from the eval result. A pipeline that fails to complete 15% of the time means 15% of changes ship on a stale score or no score at all, and that is where silent regressions get through.
- Throughput: How many eval runs per hour the pipeline can sustain at peak. Release windows produce bursts. The pipeline that handles steady state breaks during launch week.
- Freshness: How quickly a production failure can become a new test case in the suite. This one is squishy but real: a six-week lag means the gate has spent six weeks blind to failure modes production has already surfaced.
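The first three of those are computable from the pipeline's own run log. A sketch under the assumption that each run records its duration, outcome, and start time; freshness needs incident timestamps that live outside the log, so it is omitted here:

```python
from datetime import timedelta

def pipeline_slo_report(runs: list[dict]) -> dict:
    """Summarize the eval pipeline's own health, independent of eval scores.

    Each entry in `runs` is assumed to look like
    {"duration_s": 1860, "succeeded": True, "started": <datetime>}.
    """
    durations = sorted(r["duration_s"] for r in runs)
    p95 = durations[max(0, int(round(0.95 * len(durations))) - 1)]
    span_days = (
        max(r["started"] for r in runs) - min(r["started"] for r in runs)
    ) / timedelta(days=1)
    return {
        "p95_latency_min": p95 / 60,                                    # latency
        "success_rate": sum(r["succeeded"] for r in runs) / len(runs),  # reliability
        "runs_per_day": len(runs) / max(span_days, 1.0),                # throughput
    }
```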
These are SLOs, and they need monitoring and alerting like any other SLO. A dashboard that shows the eval run history, the pipeline duration trend, and the judge cost per run is not a nice-to-have. It's how you discover that your eval suite has been slowly degrading for two months while everyone was looking at model metrics.
The corollary: the eval pipeline needs an on-call story. Not a 24/7 rotation, but a clear answer to "the eval pipeline is broken and a release is blocked, who fixes it?" If the answer is "the one engineer who wrote it," the system is one vacation away from a release freeze.
Flaky Evals Are a Pipeline Bug, Not an Eval Result
Stochastic models produce stochastic outputs, which means a single eval run is a sample, not a measurement. Teams that treat it as a measurement build flake into their gate. The pattern is: a PR that didn't change anything fails the eval; the engineer re-runs it; it passes; the PR ships. The team learns that a failing eval is sometimes noise, and the next time it fails for a real reason, somebody re-runs it twice and ships anyway.
The fix is at the pipeline level, not the eval level. Eval-as-code treats run-level variance as a measurement problem with known solutions. Sample size: run each eval multiple times and aggregate. Variance reporting: surface the standard deviation alongside the mean, and gate on confidence intervals rather than point estimates. Seeded determinism where the model API supports it (OpenAI's seed parameter and similar) gets you most of the way to reproducibility for unit-level evals, even if it isn't a perfect lock — research has shown that GPU count, hardware, and batch size can still shift outputs by several percent across runs of the same seeded prompt. That's a real ceiling: you can drive eval variance down, but you cannot drive it to zero, and your gate logic needs to accept that.
The right gate is a statistical one: "the score regressed by more than X with confidence Y." The wrong gate is "the score went down by 0.01 from yesterday." Teams that ship the wrong gate spend their time chasing noise; teams that ship the right one spend their time chasing signal. The difference is a few hours of work in the harness and the willingness to commit to a number that has error bars on it.
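A sketch of that gate, assuming run_eval_suite is a placeholder for the existing harness returning one aggregate score per run, and using a plain normal-approximation bound on the candidate-minus-baseline delta:

```python
import statistics

def release_gate(run_eval_suite, n_runs: int = 5,
                 max_regression: float = 0.02, z: float = 1.64) -> bool:
    """Gate on "regressed by more than X with confidence Y", not a point estimate.

    Returns True (ship) unless we are confident, at roughly the one-sided 95%
    level for z = 1.64, that the candidate regressed by more than the budget.
    """
    baseline = [run_eval_suite("baseline") for _ in range(n_runs)]
    candidate = [run_eval_suite("candidate") for _ in range(n_runs)]

    delta = statistics.mean(candidate) - statistics.mean(baseline)
    # Standard error of the difference of two independent sample means.
    se = (statistics.variance(baseline) / n_runs
          + statistics.variance(candidate) / n_runs) ** 0.5
    upper_bound = delta + z * se  # one-sided upper bound on the true delta

    # Block only when even the optimistic end of the interval is a regression
    # larger than the budget; otherwise pass and keep the noise out of the gate.
    return upper_bound > -max_regression
```

Whether the gate should fail open like this under uncertainty, or fail closed and demand more samples, is a team decision; the point is that the decision is a written-down statistical condition rather than something rediscovered on every flaky run.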
The Cost Asymmetry That Sinks Eval Investment
There is one budget conversation that derails almost every eval-as-code initiative. The team proposes investing two weeks of engineering in eval infrastructure. Leadership asks what the return is. The honest answer — "we will be able to trust our quality gate, ship faster with fewer regressions, and reproduce historical results" — sounds soft compared to "we will ship the new feature." The team trades the eval work for feature work, and the eval suite stays a notebook for another quarter.
This pattern repeats until an incident reorders the priority list. The shape of that incident is consistent across teams: a production regression that the eval suite did not catch, traced after the fact to a judge model the team didn't realize had changed, or a prompt edit that didn't get a version bump, or a dataset that had drifted away from production traffic. Post-incident, the eval-as-code budget appears immediately. The cost of that incident is, in nearly every case, a multiple of what the proactive investment would have been.
The budget framing that gets the work funded earlier is not "eval infrastructure investment." It's "release-gate reliability." The leadership that won't fund the first will fund the second, because release-gate reliability is a phrase they have heard for traditional services and they know what happens when it fails. Use the language of the SRE function, not the language of the ML function. Eval pipelines are SRE-grade infrastructure that happens to be operated by ML engineers.
What Good Looks Like
A team that has done this well has a few visible markers. The eval suite is a directory in the main service repo, not a separate project, and it has a CODEOWNERS entry naming a real team. The dataset is versioned in git, with a changelog of additions. Every eval run is reproducible from a commit hash; the harness writes the commit, the model version, the judge version, the dataset version, the seed, and the temperature into the result record. Eval failures show up as comments on PRs, with a link to the run that produced them. The pipeline has a status page. There is a quarterly review where the team looks at the eval suite the same way they review their service SLOs, and the review answers questions like "is our eval throughput keeping up with our PR volume" and "has our judge cost per run shifted in a way we should address."
None of this is exotic. All of it is what production teams already do for the services they ship. The only thing distinctive about applying it to evals is that most teams haven't yet, because the eval suite started its life as someone's notebook and never graduated. Eval-as-code is the graduation ceremony. The work is unglamorous, and the result is that the number on the release gate becomes a number you can actually defend — to your team, to your auditors, and to the customer whose support ticket your model just answered. The alternative is a release process where the most consequential decision in the pipeline is made by an artifact that nobody owns, nobody has versioned, and nobody can re-run.
