Non-Deterministic CI for Agentic Systems: Why Binary Pass/Fail Breaks and What Replaces It

· 9 min read
Tian Pan
Software Engineer

Your CI pipeline assumes something that hasn't been true since you added an LLM call: that running the same code twice produces the same result. Traditional CI was built for deterministic software: compile, run tests, get a green or red light. Traditional ML evaluation was built for fixed input-output mappings: run inference on a test set, compute accuracy. Agentic AI breaks both assumptions simultaneously, and the result is a CI system that either lies to you or blocks merges with false failures.

The core problem isn't that agents are hard to test. It's that the testing infrastructure you already have was designed for a world where non-determinism is a bug, not a feature. When your agent takes a different tool-call path to the same correct answer on consecutive runs, a deterministic assertion fails. When it produces a semantically equivalent but lexically different response, string comparison flags a regression. The testing framework itself becomes the source of noise.
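To make the failure mode concrete, here is a minimal sketch of two runs that reach the same answer through a different tool-call order. The tool names and the `same_tools_any_order` helper are hypothetical, and the order-insensitive comparison is only one possible relaxation, not a recommendation for every agent.

```python
from collections import Counter

def same_tools_any_order(expected, actual):
    """Compare two tool-call traces as multisets, ignoring call order."""
    return Counter(expected) == Counter(actual)

# Hypothetical traces from two consecutive runs of the same agent.
run_a = ["search_docs", "read_file", "summarize"]
run_b = ["read_file", "search_docs", "summarize"]

# A deterministic assertion treats the reordering as a failure.
strict_match = run_a == run_b

# An order-insensitive check accepts it, but still rejects missing calls.
tolerant_match = same_tools_any_order(run_a, run_b)

print(f"strict: {strict_match}, tolerant: {tolerant_match}")
```

Whether call order should matter depends on the agent; for some workflows a reordering really is a regression, which is why the relaxation has to be chosen per test.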

The Three Assumptions CI Makes That Agents Violate

Traditional CI pipelines rest on three implicit assumptions that agentic systems violate on every run.

Assumption 1: Same input, same output. Even at temperature zero, LLM outputs vary. Research from 2025 demonstrated accuracy fluctuations of up to 10% across repeated identical inference runs with deterministic configurations enforced. Floating-point non-determinism in GPU operations, batching effects, and provider-side infrastructure differences all contribute. Your agent doesn't just produce different text — it might choose different tools, take different reasoning paths, and arrive at different intermediate states.

Assumption 2: Tests are binary. A traditional unit test either passes or fails. But when an agent produces a response that's 85% correct — right conclusion, wrong intermediate reasoning — what's the verdict? When it calls the right tools in a different order and still arrives at the correct answer, is that a pass or a regression? Binary assertions force you to collapse a spectrum of quality into two buckets, and both choices are wrong. Too strict and you're drowning in false failures. Too lenient and real regressions sail through.

Assumption 3: Individual test results are independent. In deterministic code, if test A passes and test B passes, the system works. But agentic systems have emergent behavior. Individual agents can pass every capability test while the composed system fails in ways that no single-agent analysis predicts. The state space when agents interact grows exponentially, and CI's assumption that you can test components in isolation breaks down.

Statistical Verdicts Replace Binary Assertions

The replacement for binary pass/fail is a three-valued probabilistic verdict system: Pass, Fail, and Inconclusive. This isn't a compromise — it's an acknowledgment that some test runs genuinely don't contain enough information to decide.

The framework borrows from clinical trial methodology. You define two error rates upfront: alpha (the probability of declaring a regression when none exists) and beta (the probability of missing a real regression). Then instead of running a test once, you run it multiple times and accumulate evidence.

The practical question is: how many times? Running your evaluation pipeline 10 times on identical data with all randomness fixed is a reasonable baseline. Calculate the coefficient of variation (CV), which is the standard deviation divided by the mean, and target a CV below 0.05. If your mean score is 80%, that means the standard deviation across runs should stay within 4 percentage points (0.05 × 80).
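The stability check can be sketched in a few lines. This is a minimal illustration using the sample standard deviation; the function names and the ten scores are made up for the example.

```python
import statistics

def coefficient_of_variation(scores):
    """Sample standard deviation divided by the mean of repeated-run scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variation")
    mean = statistics.mean(scores)
    if mean == 0:
        raise ValueError("mean score is zero; CV is undefined")
    return statistics.stdev(scores) / mean

def is_stable(scores, max_cv=0.05):
    """True when run-to-run variation is below the target CV."""
    return coefficient_of_variation(scores) < max_cv

# Hypothetical scores from 10 runs of the same evaluation on identical data.
run_scores = [0.80, 0.82, 0.78, 0.81, 0.79, 0.80, 0.83, 0.77, 0.80, 0.80]
cv = coefficient_of_variation(run_scores)
print(f"CV = {cv:.3f}, stable = {is_stable(run_scores)}")
```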

For production-grade confidence, the math gets more demanding. At 95% confidence with a 5% margin of error on a metric expected to be around 80%, you need approximately 246 samples. Cut the margin of error to 2.5% and you need 984. Most teams can't afford this on every PR.
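These figures are consistent with the standard normal-approximation sample size for a proportion, n = z² · p(1 − p) / E². The sketch below assumes that formula is the one behind the numbers; `required_samples` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

def required_samples(expected_rate, margin, confidence=0.95):
    """Sample size for estimating a proportion (normal approximation)."""
    # Two-sided critical value, about 1.96 at 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z**2 * expected_rate * (1 - expected_rate) / margin**2)

print(required_samples(0.80, 0.05))   # 246
print(required_samples(0.80, 0.025))  # 984
```

Halving the margin of error quadruples the sample count, which is why fixed-sample testing gets expensive so quickly.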

The practical solution is Wald's Sequential Probability Ratio Test (SPRT), adapted from manufacturing quality control. Instead of pre-committing to a fixed number of trials, testing terminates as soon as sufficient evidence accumulates — either for or against regression. Research on this approach shows SPRT consistently reduces the number of required trials by 78% across agent testing scenarios. For detecting a 10% regression at standard error rates, SPRT reduces expected trials from roughly 109 to 34–52, depending on whether the agent has actually regressed.

This means your CI pipeline runs fewer trials when the answer is clear (the agent is obviously fine or obviously broken) and more trials only when behavior is genuinely borderline. Fast merges for clean changes, deliberate investigation for ambiguous ones.
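A minimal sketch of a Bernoulli SPRT that yields the three verdicts might look like this. It assumes each trial is reduced to a pass/fail outcome, and the baseline pass rate `p0`, regressed rate `p1`, error rates, and `max_trials` cap are illustrative values rather than anything prescribed here.

```python
import math

def sprt_verdict(outcomes, p0=0.90, p1=0.80, alpha=0.05, beta=0.10,
                 max_trials=200):
    """Sequential test of H0: pass rate = p0 against H1: pass rate = p1 < p0.

    `outcomes` is any iterable of booleans (True = trial passed). It can be
    a generator, so trials only run while the verdict is still open.
    Returns (verdict, trials_used).
    """
    if not 0 < p1 < p0 < 1:
        raise ValueError("expected 0 < p1 < p0 < 1")
    upper = math.log((1 - beta) / alpha)   # crossing it accepts H1: regression
    lower = math.log(beta / (1 - alpha))   # crossing it accepts H0: no regression
    llr = 0.0                              # running log-likelihood ratio
    n = 0
    for n, passed in enumerate(outcomes, start=1):
        if passed:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "fail", n
        if llr <= lower:
            return "pass", n
        if n >= max_trials:
            break
    return "inconclusive", n

print(sprt_verdict([True] * 50))    # clean run: stops early with "pass"
print(sprt_verdict([False] * 50))   # broken run: stops early with "fail"
print(sprt_verdict([True, False]))  # too little evidence: "inconclusive"
```

With these assumed parameters, an unbroken streak of passes is accepted after 20 trials and an unbroken streak of failures is rejected after 5. Mixed results keep the test running until a boundary is crossed or the trial budget runs out.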

Graduated Thresholds: The Soft Failure Zone

Even with statistical verdicts, you need to decide what constitutes a pass. The most effective pattern replaces binary gates with graduated thresholds.

An LLM-as-judge evaluation returns a score between 0 and 1. Define three zones:

  • Hard failure (below 0.5): Block the merge. Something is clearly wrong.
  • Soft failure (0.5 to 0.8): Allow the merge, but flag for review. The change is ambiguous — it might be a real regression or it might be acceptable behavioral variation.
  • Pass (above 0.8): Merge automatically.

The specific thresholds depend on your risk tolerance and domain. A medical agent needs a higher hard-failure threshold than a content summarizer. The critical insight is that the soft-failure zone exists at all — it's an explicit acknowledgment that some changes require human judgment that no automated system can replace.

Wire these thresholds into your CI so that hard failures block the PR, soft failures add a warning label and require an additional reviewer, and passes proceed normally. This preserves developer velocity for clean changes while catching genuine regressions.
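The three zones map naturally onto a small gating function. The sketch below uses the 0.5 and 0.8 cutoffs from the list above; the `Gate` enum and `gate_for_score` name are hypothetical, and how each gate translates into PR labels or required reviewers depends on your CI provider.

```python
from enum import Enum

class Gate(Enum):
    BLOCK = "block"    # hard failure: block the merge
    REVIEW = "review"  # soft failure: allow, but flag for a human reviewer
    MERGE = "merge"    # pass: merge proceeds normally

def gate_for_score(score, hard_fail_below=0.5, pass_above=0.8):
    """Map an LLM-as-judge score in [0, 1] to a CI gate."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if score < hard_fail_below:
        return Gate.BLOCK
    if score > pass_above:
        return Gate.MERGE
    return Gate.REVIEW

for s in (0.3, 0.65, 0.92):
    print(s, gate_for_score(s).value)
```

Keeping the cutoffs as parameters makes it easy to tighten them for a higher-risk domain without touching the gating logic.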

Snapshot Testing for Agent Trajectories

Output correctness is necessary but not sufficient. An agent that returns the right answer via a completely different execution path might be fine — or it might be one prompt change away from catastrophic failure. You need to test the journey, not just the destination.
