Prompt Regression Tests That Actually Block PRs

· 10 min read
Tian Pan
Software Engineer

Ask any AI engineering team if they test their prompts and they'll say yes. Ask if a bad prompt can fail a pull request and block a merge, and you'll get a much quieter room. The honest answer for most teams is no — they have eval notebooks they run occasionally, maybe a shared Notion doc of known prompt quirks, and a vague sense that things are worse than they used to be. That is not testing. That is hoping.

The gap exists because prompt testing feels qualitatively different from unit testing. Code either behaves correctly or it doesn't. Prompts produce outputs on a spectrum, outputs are non-deterministic, and running enough examples to feel confident costs real money. Those are real constraints. None of them are insurmountable. Teams that have built prompt CI that actually blocks merges are not spending fifty dollars a build — they're running in under three minutes at under a dollar using a few design decisions that make the problem tractable.

Why the "We Test Prompts" Claim Falls Apart

The typical prompt testing workflow looks like this: an engineer changes a system prompt, eyeballs a few outputs in a playground, pastes some examples into a chat thread to compare old vs. new, and ships it. Sometimes there's a shared eval spreadsheet. Sometimes someone wrote a Jupyter notebook that runs twenty examples and outputs a score. But almost never is that score wired to CI in a way that prevents a merge.

There are three reasons this is the status quo.

Non-determinism is used as an excuse. LLMs don't produce the same output twice (unless you force them to), so teams assume you can't write assertions. This is wrong. You can write deterministic assertions for a subset of your cases, and you can write probabilistic assertions with thresholds for the rest. The approach is different from unit tests, but the conclusion — this PR made things worse, don't merge it — is the same.
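The probabilistic side can be as simple as sampling the same case several times and gating on a pass rate. A minimal sketch of that idea — `call_model` and `check` here are placeholders for your inference call and assertion predicate, not any particular library's API:

```python
def passes_probabilistically(call_model, prompt, check, n=5, threshold=0.8):
    """Sample the same prompt n times; pass only if at least
    threshold * n of the outputs satisfy the check predicate."""
    passes = sum(1 for _ in range(n) if check(call_model(prompt)))
    return passes / n >= threshold
```

Tuning `n` and `threshold` per fixture lets you keep flaky-but-important cases in the gate without letting a single unlucky sample block a merge.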

Eval cost is treated as fixed. If you naively run your entire eval suite on every commit at production temperature, yes, you'll spend too much. But you don't have to. The tractable version of prompt CI involves a carefully chosen small fixture set (30–100 examples), not your full production traffic sample. Most of the cost problem disappears when you stop conflating "CI gate" with "comprehensive offline evaluation."

No one owns the failure signal. In a traditional codebase, a broken unit test blocks the merge and pings the PR author. With prompt evals, results end up in a dashboard that someone has to remember to check. If the failure signal isn't integrated into the PR review flow, it will be ignored under deadline pressure. The CI integration isn't optional — it's the entire point.

The Structure of a Lightweight Prompt Test Harness

The goal is a suite that runs in under five minutes, costs under two dollars per run, and produces a pass/fail signal that your CI system can consume like any other test exit code. Here's how to build one.

Golden Fixture Files

A golden fixture is a versioned input/output pair that represents a case your prompt must handle correctly. Think of it like a snapshot test in frontend development — it captures known-good behavior and alerts you when something changes.

The fixture file for a support triage prompt might look like:

- id: billing-cancellation-intent
  input: "I want to cancel my subscription and get a refund for this month"
  expected_intent: billing_cancellation
  expected_sentiment: negative
  must_contain: ["refund", "cancel"]
  must_not_contain: ["upgrade", "recommend"]

- id: technical-bug-report
  input: "The export button does nothing when I click it in Firefox"
  expected_intent: technical_bug
  expected_sentiment: frustrated
  must_contain: ["issue", "team"]

Good golden fixtures share a few properties. They're curated by humans, not auto-generated — a fixture is only valuable if you're confident the expected output is actually correct. They include edge cases and historically problematic inputs, not just easy cases your prompt has always handled well. And they're checked into version control alongside the prompt itself, so prompt changes and fixture changes travel together in the same commit.

The fixtures you write today will encode your understanding of what correct behavior looks like. That's worth investing time in.
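The check logic for a fixture in this shape is plain code. A sketch of one way to evaluate a fixture dict against a model output — the field names follow the example file above, and the function is illustrative rather than any tool's built-in:

```python
def check_fixture(fixture: dict, output: str, predicted_intent: str) -> list:
    """Return human-readable failure messages; an empty list means pass."""
    failures = []
    if predicted_intent != fixture["expected_intent"]:
        failures.append(
            f"intent: got {predicted_intent!r}, expected {fixture['expected_intent']!r}"
        )
    lowered = output.lower()
    for term in fixture.get("must_contain", []):
        if term.lower() not in lowered:
            failures.append(f"missing required term {term!r}")
    for term in fixture.get("must_not_contain", []):
        if term.lower() in lowered:
            failures.append(f"contains forbidden term {term!r}")
    return failures
```

Returning messages instead of raising on the first mismatch means one CI run reports every failing fixture at once, which is what you want when reviewing a red PR.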

Deterministic Assertions at Temperature Zero

For hard assertions — cases where there's a specific correct answer — run your model at temperature=0. This makes the model effectively deterministic: given the same input and the same model version, it will produce the same output nearly every time (the rare exceptions are covered under Flaky Determinism below). You can now write assertions that pass or fail reliably.

This is the piece most teams miss. They run their CI evals at their production temperature (often 0.7 or higher) and then wonder why the results are noisy. Split your fixture set: deterministic cases use temperature=0 and hard assertions; ambiguous or creative cases use a separate evaluation path with soft scoring.

The deterministic subset doesn't need to be large. Thirty to fifty cases covering your most critical behaviors — the ones that would cause real user harm if they regressed — is enough to give you a meaningful gate.
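The inference call for this subset just pins temperature to zero. A minimal sketch assuming the OpenAI Python SDK's `chat.completions` interface — the model name is a placeholder, and swapping in another provider's client changes only this function:

```python
def run_deterministic_case(client, system_prompt: str, user_input: str,
                           model: str = "gpt-4o-mini") -> str:
    """Run one fixture input at temperature=0 so repeated CI runs
    produce the same output for hard assertions."""
    resp = client.chat.completions.create(
        model=model,
        temperature=0,  # the whole point: reproducible outputs
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_input},
        ],
    )
    return resp.choices[0].message.content
```

Passing the client in (rather than constructing it inside) also lets you stub the model entirely when testing the harness itself.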

LLM-as-Judge for Semantic Assertions

Not everything can be expressed as a string match or a classification label. For outputs where "correctness" is semantic — a summary that captures the key points, an explanation that's accurate without being verbatim — you can use a second model call to evaluate the first.

The pattern is simple: you send the model output plus a scoring rubric to a judge model and ask it to score on a scale or emit a pass/fail verdict with reasoning. The judge prompt matters a lot here. Vague rubrics produce inconsistent scores. Specific, decomposed criteria produce scores you can rely on.

A rubric for evaluating a product description output might look like: "Score 1 if the description mentions at least two specific product features, does not contain any factually incorrect claims, and is written in second person. Score 0 otherwise." That's something a judge model can evaluate consistently.
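In code, the judge pattern is two small pieces: assembling the rubric prompt and parsing the verdict. A sketch under the assumption that the judge is instructed to reply with an explicit `SCORE:` line — the rubric text is adapted from the product-description example above:

```python
import re

# Illustrative rubric, following the article's product-description example.
JUDGE_RUBRIC = (
    "Score 1 if the description mentions at least two specific product "
    "features, contains no factually incorrect claims, and is written in "
    "second person. Score 0 otherwise. Reply with 'SCORE: 0' or 'SCORE: 1' "
    "followed by one sentence of reasoning."
)

def build_judge_prompt(candidate_output: str, rubric: str = JUDGE_RUBRIC) -> str:
    """Assemble the message sent to the judge model."""
    return f"Rubric:\n{rubric}\n\nOutput to evaluate:\n{candidate_output}"

def parse_verdict(judge_reply: str) -> bool:
    """Only an explicit 'SCORE: 1' counts as a pass; anything else fails closed."""
    match = re.search(r"SCORE:\s*([01])", judge_reply)
    return match is not None and match.group(1) == "1"
```

Failing closed on unparseable replies matters: a judge that rambles instead of scoring should block the merge, not silently pass it.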

The cost of a judge call is usually smaller than the original inference, and you're only running it on cases that require semantic evaluation. For a 50-case fixture set, this adds a few cents per run.

Failing the PR

The harness needs to emit an exit code. Exit 0 on pass, exit 1 on fail — that's how every CI system in the world knows whether to let the merge through. This sounds obvious but is surprisingly easy to omit when you're iterating on eval notebooks that just print results.
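The top of the harness can stay this simple. A sketch of the exit-code contract, assuming each fixture has already produced a list of failure messages (as in the check function shown earlier or any equivalent):

```python
import sys

def run_suite(fixture_failures: dict) -> int:
    """Print every failure and return 1 if any fixture failed, else 0 —
    the exit code CI reads to block or allow the merge."""
    exit_code = 0
    for fixture_id, failures in fixture_failures.items():
        for msg in failures:
            print(f"FAIL [{fixture_id}]: {msg}")
            exit_code = 1
    return exit_code

# At the end of the harness script:
# sys.exit(run_suite(results))
```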

Tools like Promptfoo and DeepEval are designed for exactly this. Promptfoo's CLI exits with a non-zero code when assertions fail. DeepEval integrates with pytest, so failures surface as test failures in your standard test runner. Either works. What matters is that the failure signal reaches the PR in a format that developers cannot ignore.

Cost Management Without Compromising Coverage

The most common objection to prompt CI is cost. Here's what a realistic budget looks like.

A 50-case fixture set with a mix of deterministic assertions and judge evaluations typically runs in 60–120 seconds and costs $0.20–$0.80 per run, depending on model choice and output length. That's under a dollar per PR. For a team shipping 20 PRs a day, you're looking at $4–16/day — less than most teams spend on coffee.

A few practices keep costs predictable:

Use a smaller judge model. You don't need your flagship model to evaluate structured criteria. A smaller, faster model used as a judge is often just as accurate for well-specified rubrics and costs a fraction of the price.

Run the full fixture set only on prompt changes. Most PRs don't touch prompts. Gate the prompt eval on changed files — if prompts/ or evals/ directories are untouched, skip the expensive fixture run. Your CI system can detect changed files before deciding which jobs to run.
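In GitHub Actions, one way to express this gate is a `paths` filter on the workflow trigger, so the eval job never starts for PRs that leave prompts untouched. A sketch assuming the `prompts/` and `evals/` directory layout mentioned above:

```yaml
# Trigger the prompt eval workflow only when prompt or eval files change.
on:
  pull_request:
    paths:
      - "prompts/**"
      - "evals/**"
```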

Use batch APIs for offline evaluation. When you're doing comprehensive quarterly evals rather than PR gates, batch APIs (available from most major providers) cut costs by roughly 50% in exchange for async completion. Save the synchronous runs for CI; use batch for depth.

The Three Traps to Avoid

Eval Drift

Your judge model will be updated, deprecated, or changed by its provider. Your golden fixtures will accumulate over time and stop reflecting your actual production traffic distribution. Both of these are slow erosions that make your CI gate less meaningful without anyone noticing.

Schedule a quarterly review: run your fixture set, check whether the failing cases are actually wrong or whether your rubrics have drifted, and retire fixtures that no longer represent real scenarios. Track your judge model version explicitly. When you change it, re-run the full fixture set and treat score changes as a migration event, not a surprise.

Flaky Determinism

Even at temperature=0, some models exhibit non-determinism in edge cases due to floating-point parallelism differences across hardware. If you're seeing tests that occasionally flip without prompt changes, don't just increase your failure threshold — investigate the specific cases. Usually the fixture is testing something that sits at a decision boundary. Either tighten the assertion or replace the fixture with a clearer case.

Mistaking Coverage for Signal

A 200-case fixture set feels comprehensive. It isn't. Prompt failures in production often involve inputs that look nothing like your fixtures — a user who writes in broken English, an input that triggers an edge case in your tool schema, a conversation turn with unusual whitespace. Coverage of the fixture set says nothing about coverage of the input space.

The goal of your CI gate is not exhaustive coverage. It's catching the most common regression patterns quickly and cheaply. Comprehensive evaluation — covering diverse input distributions, adversarial cases, and long-tail behaviors — belongs in your pre-release evaluation process, not your PR gate. Confusing the two leads either to expensive gates or to false confidence in cheap ones.

Setting Up the CI Integration

If you're using GitHub Actions, the integration is about ten lines:

- name: Run prompt regression tests
  run: npx promptfoo eval --config evals/ci-fixtures.yaml
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

Promptfoo reads your fixture YAML, runs the assertions, and exits with a non-zero code on failure. The failed assertion details appear in the job output. Your branch protection rules can require this job to pass before merging, same as any other required status check.

For Python shops using DeepEval, the setup is similar — your eval file is a pytest file, so it runs with pytest evals/ci_fixtures.py and integrates with any existing test reporting infrastructure.

The Mindset Shift

The teams that successfully run prompt CI don't think of it as "testing AI" — an activity that feels fundamentally uncertain and hard to quantify. They think of it as change detection: given that I changed this prompt, did it break anything I know should work? That framing is much easier to operationalize.

You're not trying to prove your prompt is good. You're trying to catch regressions before they reach users. The same bar you apply to code — if you changed the login flow, you run the login tests; if those tests fail, you don't merge — applies to prompts. If you changed the support triage prompt, you run the triage fixtures. If three fixtures that used to pass now fail, something changed and you need to understand it before shipping.

The tooling exists. The cost is manageable. What's missing in most teams is the decision to treat prompt changes as code changes — subject to the same review rigor and the same automated safety nets.

Start small: ten fixtures, one judge model, one GitHub Actions step. The goal for week one is not comprehensive coverage — it's a green checkmark that turns red when something breaks. Once you have that, expanding coverage is just adding fixtures. The infrastructure is already there.
