AI Agents in Your CI Pipeline: How to Gate Deployments That Can't Be Unit Tested

· 10 min read
Tian Pan
Software Engineer

Shipping a feature that calls an LLM is easy. Knowing whether the next version of that feature is better or worse than the one in production is hard. Traditional CI/CD gives you a pass/fail signal on deterministic behavior: either the function returns the right value or it doesn't. But when the function wraps a language model, the output is probabilistic — the same input produces different outputs across runs, across model versions, and across days.

Most teams respond to this by skipping the problem. They run their unit tests, do a quick manual check on a few prompts, and ship. That works until it doesn't — until a model provider silently updates the underlying weights, or a prompt change that looked fine in isolation shifts the output distribution in ways that only become obvious in production at 3 AM.

The better answer isn't to pretend LLM outputs are deterministic. It's to build CI gates that operate on distributions, thresholds, and rubrics rather than exact matches.

Why Your Existing CI Doesn't Catch LLM Regressions

The unit test trap runs deep. Teams write evaluations that check whether a given prompt produces a response that scores above some threshold on some metric — helpfulness, faithfulness, relevance — and they run those evals in CI. If the score is above the threshold, the build passes. This feels rigorous. It mostly isn't.

The first problem is that individual evals are point estimates, not distributions. A single run of an LLM eval is noisy — the same prompt produces different outputs each time, and the judge model scoring those outputs adds another layer of noise. A single pass/fail on a single sample tells you almost nothing about whether behavior has degraded across the population of real inputs.

The second problem is that most eval suites test happy paths. Engineers write test cases based on what they expect the model to handle well. The failure modes they care about — malformed inputs, adversarial queries, edge cases the training data didn't anticipate — get systematically underrepresented. The eval suite passes confidently while the edge cases that will actually break in production remain untested.

The third problem is integration coverage. An agent that passes all its unit evals can still fail catastrophically when it interacts with real external systems: APIs that return unexpected schema, databases with messy real-world data, timeouts that cause the agent to spiral. Mocking those dependencies gives you a clean green build and a false sense of security. The data is the control flow — when you mock perfect responses, you only test the happy path.

The Minimal Viable Deployment Gate

Before adding agentic CI steps, you need a baseline that actually catches regressions. That baseline has three components:

A golden dataset curated from production traffic. Not synthetic examples you made up — real inputs your system has already handled, annotated with what "good" looks like. The sample should include edge cases and failure-adjacent inputs, not just the inputs where your model shines. Fifty real examples curated carefully beats five hundred synthetic ones generated from a prompt.

A rubric, not a vague question. Asking a judge "Was this a good answer?" produces noisy, unreliable scores. A useful rubric specifies dimensions (accuracy, completeness, tone adherence, citation correctness) and defines what each score level means concretely. The rubric is a product decision, not an engineering one — get the team that defines "done" to sign off on it.
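One way to keep the judge prompt and the product team's definitions in sync is to store the rubric as data in the repo and render it into the judge prompt. A minimal sketch — the dimension names and level wording below are illustrative, not from any particular product:

```python
# Illustrative rubric-as-data: each dimension maps score levels to
# concrete definitions the product team has signed off on.
RUBRIC = {
    "accuracy": {
        1: "Contains factual errors that change the answer's meaning.",
        2: "Mostly correct but includes an unverified or imprecise claim.",
        3: "Every claim is correct and verifiable against the source.",
    },
    "citation_correctness": {
        1: "Citations missing or pointing to the wrong source.",
        2: "Citations present but partially incorrect.",
        3: "Every claim that needs a citation has the right one.",
    },
}

def render_rubric_prompt(rubric: dict) -> str:
    """Render the rubric into judge-prompt text, one dimension per section."""
    lines = []
    for dimension, levels in rubric.items():
        lines.append(f"## {dimension}")
        for score, meaning in sorted(levels.items()):
            lines.append(f"{score}: {meaning}")
    return "\n".join(lines)
```

Because the rubric is versioned alongside the code, a change to what "accuracy: 3" means shows up in code review like any other change.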

A comparison baseline, not an absolute threshold. Instead of asking "did this output score above 0.8?", ask "did this output score meaningfully lower than the current production prompt?" A regression is defined relative to what you're already shipping, which is a much more stable signal than an absolute threshold calibrated when your dataset was different and your rubric was newer.

With these three in place, you can gate a deployment: run the eval on both the candidate version and the current production version, compute the delta, and fail the build if the new version scores statistically worse across the distribution.
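A minimal sketch of that delta gate, comparing score distributions via a bootstrap rather than a single-run comparison. The bootstrap size, the allowed drop, and the confidence level are assumptions to tune per team:

```python
import random
import statistics

def regression_gate(prod_scores, candidate_scores,
                    n_boot=2000, max_drop=0.02, alpha=0.05, seed=0):
    """Block the build if the candidate scores meaningfully below production.

    Bootstraps the difference in mean score (prod - candidate); blocks when
    even the lower end of the confidence interval shows a drop larger than
    `max_drop`, i.e. the regression is unlikely to be eval noise.
    """
    rng = random.Random(seed)
    deltas = []
    for _ in range(n_boot):
        prod_sample = [rng.choice(prod_scores) for _ in prod_scores]
        cand_sample = [rng.choice(candidate_scores) for _ in candidate_scores]
        deltas.append(statistics.mean(prod_sample) - statistics.mean(cand_sample))
    deltas.sort()
    lower = deltas[int((alpha / 2) * n_boot)]
    return lower <= max_drop  # True = pass, False = block deployment
```

The key property is that a single noisy eval run can't flip the gate: a block requires the whole resampled distribution of deltas to sit below the production baseline.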

Adding Agentic Steps: Where They Actually Help

Once the baseline eval gate is working, there's real value in adding agentic CI steps for specific problem classes. The key is being specific about what the agent is supposed to do — "have an AI look at this" is not a CI step.

Exploratory regression hunting. A human writing a test dataset is limited by their imagination. An agent can generate adversarial inputs targeting specific behaviors: boundary conditions, prompt injection attempts, inputs designed to elicit refusals, inputs that historically caused errors. The agent doesn't run in isolation — it generates test cases that feed into the same eval pipeline you already have. The value is expanding coverage into areas humans don't think to test.

Prompt change diff analysis. When a developer modifies a system prompt, the change is often structurally small but semantically significant — rewording a constraint, adding an instruction, changing an example. An agent step can analyze the diff between prompt versions, identify which behaviors might shift, and suggest targeted test inputs that exercise the changed region. This turns a generic eval run into one focused on what actually changed.
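The first half of that step — isolating what changed — needs no LLM at all. A hedged sketch that extracts the changed hunks of a system prompt with a plain text diff, so a downstream agent only sees the regions worth targeting:

```python
import difflib

def prompt_diff_hunks(old_prompt: str, new_prompt: str):
    """Extract the changed regions of a system prompt so a test-generation
    step can focus on what actually changed, not the whole prompt."""
    old_lines = old_prompt.splitlines()
    new_lines = new_prompt.splitlines()
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines)
    hunks = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            hunks.append({
                "change": tag,  # 'replace', 'delete', or 'insert'
                "before": old_lines[i1:i2],
                "after": new_lines[j1:j2],
            })
    return hunks
```

Each hunk can then be handed to the agent with an instruction like "generate inputs that exercise this constraint" — that second, LLM-backed half depends on your provider's API and is omitted here.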

Output distribution shift detection. Beyond pass/fail on individual test cases, you can run an agent step that clusters outputs from the current and candidate versions and flags cases where the distributions diverge significantly — even if both versions score acceptably on the standard rubric. A prompt change that preserves average quality but increases variance in tone or length is a problem that aggregate scores won't catch.
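Even without clustering, a cheap version of this check can flag shifts in surface statistics such as output length across the two versions. A sketch with illustrative tolerances:

```python
import statistics

def length_shift_flags(prod_outputs, cand_outputs,
                       mean_tol=0.25, var_ratio_tol=2.0):
    """Flag distribution-level shifts in output length between the
    production and candidate versions, even when per-item rubric
    scores look acceptable. Tolerances are illustrative."""
    prod_lens = [len(o.split()) for o in prod_outputs]
    cand_lens = [len(o.split()) for o in cand_outputs]
    flags = []
    prod_mean = statistics.mean(prod_lens)
    cand_mean = statistics.mean(cand_lens)
    if abs(cand_mean - prod_mean) / max(prod_mean, 1) > mean_tol:
        flags.append("mean_length_shift")
    prod_var = statistics.pvariance(prod_lens)
    cand_var = statistics.pvariance(cand_lens)
    if prod_var > 0 and cand_var / prod_var > var_ratio_tol:
        flags.append("length_variance_increase")
    return flags
```

The same pattern extends to other cheap surface signals (refusal markers, formatting tokens) before reaching for embedding-based clustering.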

Integration smoke tests with real dependencies. For agentic features that call external APIs or query databases, a CI step that runs the agent against a sandboxed copy of real infrastructure catches an entire class of failures that unit evals miss. The sandbox doesn't need to be a full production replica — it needs to be realistic enough that the agent encounters real response structures, real schema variations, and real error conditions.

Cost Controls You Cannot Skip

Agentic CI steps have a failure mode that doesn't exist in traditional pipelines: they can be expensive to run, and "expensive" scales with every merge. A CI step that costs $0.50 per run sounds trivial until your team is merging twenty times a day. That's roughly $300 a month in CI costs for one eval step.

The practical controls:

Sample, don't exhaust. Your golden dataset might have five hundred examples, but you don't need to run all five hundred on every commit. Run the full suite on scheduled nightly builds and on merges to main; run a stratified sample of fifty on every PR. The stratified sample should include your hardest cases — if a regression is real, it will show up in the hard cases first.
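A sketch of that PR-time stratified sample, assuming each golden example carries a `difficulty` tag (the tag name and the 40% hard-case ratio are assumptions):

```python
import random

def stratified_pr_sample(golden_dataset, n=50, hard_fraction=0.4, seed=None):
    """Pick a PR-time subset of the golden dataset, oversampling the
    examples tagged hard so real regressions surface early.
    The `difficulty` tag is an assumed dataset convention."""
    hard = [ex for ex in golden_dataset if ex.get("difficulty") == "hard"]
    easy = [ex for ex in golden_dataset if ex.get("difficulty") != "hard"]
    rng = random.Random(seed)
    n_hard = min(len(hard), int(n * hard_fraction))
    sample = rng.sample(hard, n_hard) + rng.sample(easy, min(len(easy), n - n_hard))
    rng.shuffle(sample)
    return sample
```

Seeding the sampler per PR (e.g. from the commit hash) keeps reruns of the same PR deterministic while still varying coverage across PRs.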

Tiered judge models. Use a cheap, fast model for the first pass and only escalate to expensive judges when the cheap judge flags a potential issue. A GPT-3.5-equivalent model can screen for obvious regressions — coherence failures, format violations, length outliers — at a fraction of the cost of a frontier model. Only run the expensive judge on the flagged subset.
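The escalation logic itself is small. A sketch, assuming both judges are callables returning a score, with an illustrative screening threshold:

```python
def tiered_judge(output, cheap_judge, expensive_judge, screen_threshold=0.6):
    """Screen with a cheap judge; escalate to the expensive judge only
    when the cheap score falls below `screen_threshold`.
    Returns (score, which_judge_produced_it). Threshold is illustrative."""
    cheap_score = cheap_judge(output)
    if cheap_score >= screen_threshold:
        return cheap_score, "cheap"
    return expensive_judge(output), "expensive"
```

If the cheap judge clears 80–90% of outputs, the expensive judge's cost applies only to the remaining flagged subset.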

Budget hard limits at the pipeline level. Every LLM API call in CI should route through a layer that enforces a per-run cost ceiling. If a CI step hits the ceiling, it fails the step with a cost-overrun error rather than silently continuing. This prevents runaway eval loops from an agent that generates inputs aggressively or retries failed calls indefinitely.
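A minimal sketch of such a budget layer, assuming a fixed per-call price and a wrapped `call_fn`; a real client would meter actual token usage instead:

```python
class CostCeilingExceeded(RuntimeError):
    """Raised when a CI run would exceed its per-run cost ceiling."""

class BudgetedLLMClient:
    """Routes every eval-time LLM call through a per-run cost ceiling.
    The flat per-call price and `call_fn` interface are illustrative."""

    def __init__(self, call_fn, ceiling_usd, cost_per_call_usd):
        self.call_fn = call_fn
        self.ceiling = ceiling_usd
        self.cost_per_call = cost_per_call_usd
        self.spent = 0.0

    def call(self, prompt):
        if self.spent + self.cost_per_call > self.ceiling:
            # Fail the step loudly instead of silently continuing.
            raise CostCeilingExceeded(
                f"spent ${self.spent:.2f} of ${self.ceiling:.2f} ceiling")
        self.spent += self.cost_per_call
        return self.call_fn(prompt)
```

The important property is that the ceiling lives at the client layer, so no individual eval step — including an agent retrying aggressively — can spend past it.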

Cache aggressively. Eval steps run against the same golden dataset repeatedly. The results for a prompt version that hasn't changed can be cached. If neither the model, the prompt, nor the test input changed since the last run, the eval result from the previous run is still valid. Even a simple file-based cache keyed on a hash of (prompt_version, model_id, input) cuts costs significantly.
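A sketch of that file-based cache, keyed exactly as described; the on-disk layout is an assumption:

```python
import hashlib
import json
from pathlib import Path

def cached_eval(prompt_version, model_id, test_input, run_eval,
                cache_dir=Path(".eval_cache")):
    """Return a cached eval result when none of (prompt_version, model_id,
    input) has changed; otherwise run the eval and store the result.
    One JSON file per cache key; layout is illustrative."""
    cache_dir.mkdir(exist_ok=True)
    key = hashlib.sha256(
        json.dumps([prompt_version, model_id, test_input]).encode()
    ).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    result = run_eval(test_input)
    path.write_text(json.dumps(result))
    return result
```

Because the key covers all three inputs, bumping the prompt version or switching models invalidates exactly the entries that need re-running and nothing else.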

Failure-Mode Guardrails

The failure modes in agentic CI are different from traditional test failures and require different handling.

Flaky evals are not flaky tests. A traditional test that fails intermittently is a bug to fix. An LLM eval that fails intermittently might be signal — the model is near the threshold, output quality is inconsistent, and you should investigate rather than retry until it passes. Don't treat eval variance as flakiness to suppress. Treat it as information.

The judge can be wrong. LLM-as-a-judge evaluations are better than nothing and worse than human review. The judge has its own biases — it tends to favor longer answers, outputs from the same model family as itself, and responses that sound confident regardless of accuracy. Build sanity checks into the pipeline: if your judge scores every output above 0.9, the rubric is too easy or the judge is miscalibrated.
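Two of those sanity checks can be expressed directly as batch-level assertions on judge scores; the thresholds below are illustrative:

```python
import statistics

def judge_sanity_flags(scores, ceiling=0.9, ceiling_fraction=0.95,
                       min_spread=0.05):
    """Cheap calibration checks on a batch of judge scores.
    Thresholds are illustrative and should be tuned per rubric."""
    flags = []
    above = sum(1 for s in scores if s > ceiling) / len(scores)
    if above >= ceiling_fraction:
        # Nearly everything scores near the top: rubric too easy
        # or judge miscalibrated.
        flags.append("judge_too_generous")
    if statistics.pstdev(scores) < min_spread:
        # Scores barely vary across outputs: the judge isn't
        # discriminating between good and bad answers.
        flags.append("no_discrimination")
    return flags
```

Running this on every eval batch turns "the judge might be wrong" from a vague worry into a tripwire that fails the pipeline when scoring stops carrying signal.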

Gate on what you can measure, not what you wish you could measure. It's tempting to try to gate on abstract qualities like "reasoning quality" or "user satisfaction." These don't translate into reliable CI metrics. Gate on things you can measure consistently: format compliance, refusal rate on adversarial inputs, factual accuracy on questions with known ground truth, response latency distribution. Abstract qualities belong in periodic human review cycles, not automated deployment gates.

Soft blocks over hard blocks for new signals. When you introduce a new eval metric into the pipeline, run it in observation mode for a few weeks before making it a hard gate. This lets you calibrate thresholds against real-world merge patterns and catch cases where the metric fires on legitimate changes. Rushing a new metric straight to a hard gate often results in the team disabling it after the first false positive.

The Integration Pattern That Works

The architecture that holds up in practice looks like this: standard linting and type-checking run first as they always have. Then a lightweight LLM smoke test runs on a small sample — ten to twenty inputs from the golden dataset — to catch obvious catastrophic failures cheaply. If the smoke test passes, a more comprehensive eval runs on a stratified sample with a budget ceiling. The comprehensive eval computes a delta against the production baseline and fails the build if the delta exceeds a threshold. Finally, for PRs that change system prompts, an agent step runs a targeted analysis of the prompt diff and generates a small batch of focused test inputs that feed into the eval.
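The staging above can be sketched as a fail-fast orchestrator, assuming each stage is a callable returning a pass flag and a detail string (the stage names and `pr` shape are illustrative):

```python
def run_pipeline(pr, lint, smoke_eval, full_eval, prompt_diff_eval):
    """Run stages cheapest-first and stop at the first failure, so the
    expensive comprehensive eval only runs when earlier gates pass.
    Each stage callable returns (passed: bool, detail: str)."""
    stages = [("lint", lint), ("smoke", smoke_eval), ("full_eval", full_eval)]
    if pr.get("changes_system_prompt"):
        # Prompt-diff analysis only runs for PRs touching system prompts.
        stages.append(("prompt_diff", prompt_diff_eval))
    results = []
    for name, stage in stages:
        passed, detail = stage(pr)
        results.append((name, passed, detail))
        if not passed:
            break  # fail fast: cheaper stages gate the expensive ones
    return results
```

Ordering stages by cost is what keeps the per-merge bill predictable: a lint failure costs nothing in LLM calls, and the comprehensive eval never runs behind a failed smoke test.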

The total cost for this pipeline, with tiered judges and sampling, runs around $0.15–0.30 per merge for a mid-sized production LLM feature. That's comparable to running a moderately complex traditional integration test suite.

What this pattern doesn't solve: it doesn't replace observability on production traffic, it doesn't catch model provider drift that affects all versions equally, and it doesn't substitute for human review of significant prompt changes. It catches the regressions you cause yourself, which is most of them.

The discipline required to build this pipeline is smaller than the discipline required to recover from a regression that made it to production because nothing was checking.
