
Prompt Regression Tests That Actually Block PRs

10 min read
Tian Pan
Software Engineer

Ask any AI engineering team if they test their prompts and they'll say yes. Ask if a bad prompt can fail a pull request and block a merge, and you'll get a much quieter room. The honest answer for most teams is no — they have eval notebooks they run occasionally, maybe a shared Notion doc of known prompt quirks, and a vague sense that things are worse than they used to be. That is not testing. That is hoping.

The gap exists because prompt testing feels qualitatively different from unit testing. Code either behaves correctly or it doesn't. Prompts produce outputs on a spectrum, outputs are non-deterministic, and running enough examples to feel confident costs real money. Those are real constraints. None of them are insurmountable. Teams that have built prompt CI that actually blocks merges are not spending fifty dollars a build — they're running in under three minutes at under a dollar using a few design decisions that make the problem tractable.

Why the "We Test Prompts" Claim Falls Apart

The typical prompt testing workflow looks like this: an engineer changes a system prompt, eyeballs a few outputs in a playground, pastes some examples into a chat thread to compare old vs. new, and ships it. Sometimes there's a shared eval spreadsheet. Sometimes someone wrote a Jupyter notebook that runs twenty examples and outputs a score. But almost never is that score wired to CI in a way that prevents a merge.

There are three reasons this is the status quo.

Non-determinism is used as an excuse. LLMs don't produce the same output twice (unless you force them to), so teams assume you can't write assertions. This is wrong. You can write deterministic assertions for a subset of your cases, and you can write probabilistic assertions with thresholds for the rest (see the sketch after this list). The approach is different from unit tests, but the conclusion is the same: this PR made things worse, don't merge it.

Eval cost is treated as fixed. If you naively run your entire eval suite on every commit at production temperature, yes, you'll spend too much. But you don't have to. The tractable version of prompt CI involves a carefully chosen small fixture set (30–100 examples), not your full production traffic sample. Most of the cost problem disappears when you stop conflating "CI gate" with "comprehensive offline evaluation."

No one owns the failure signal. In a traditional codebase, a broken unit test blocks the merge and pings the PR author. With prompt evals, results end up in a dashboard that someone has to remember to check. If the failure signal isn't integrated into the PR review flow, it will be ignored under deadline pressure. The CI integration isn't optional — it's the entire point.
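
To make the probabilistic half of that first point concrete, here's a minimal sketch of a thresholded assertion in Python. Everything in it is illustrative: the OpenAI SDK stands in for whatever client you use, the model name is a placeholder, and `n=10` at a 90% pass rate are arbitrary starting values, not prescriptions.

```python
from openai import OpenAI  # any client works; this SDK is just an example

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def call_model(system_prompt: str, user_input: str, temperature: float) -> str:
    """Single model call; reused by the sketches later in this post."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_input},
        ],
        temperature=temperature,
    )
    return resp.choices[0].message.content


def passes_threshold(system_prompt: str, user_input: str, check,
                     n: int = 10, min_pass_rate: float = 0.9) -> bool:
    """Sample n outputs at production temperature; require a minimum pass rate
    instead of demanding identical output every time."""
    passed = sum(
        1 for _ in range(n)
        if check(call_model(system_prompt, user_input, temperature=0.7))
    )
    return passed / n >= min_pass_rate
```

A `check` here is any predicate over the output text, for example `lambda out: "refund" in out.lower() and "upgrade" not in out.lower()`.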

The Structure of a Lightweight Prompt Test Harness

The goal is a suite that runs in under five minutes, costs under two dollars per run, and produces a pass/fail signal that your CI system can consume like any other test exit code. Here's how to build one.
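
As a rough shape, the whole harness can be one script whose exit code is the gate. The function names below are placeholders that the rest of this section fills in:

```python
import sys


def main() -> int:
    fixtures = load_fixtures("fixtures/triage.yaml")  # sketched below
    failures = [fx for fx in fixtures if not run_case(fx)]  # sketched below
    for fx in failures:
        print(f"FAIL {fx.id}")
    return 1 if failures else 0  # nonzero exit blocks the merge in CI


if __name__ == "__main__":
    sys.exit(main())
```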

Golden Fixture Files

A golden fixture is a versioned input/output pair that represents a case your prompt must handle correctly. Think of it like a snapshot test in frontend development — it captures known-good behavior and alerts you when something changes.

The fixture file for a support triage prompt might look like:

```yaml
- id: billing-cancellation-intent
  input: "I want to cancel my subscription and get a refund for this month"
  expected_intent: billing_cancellation
  expected_sentiment: negative
  must_contain: ["refund", "cancel"]
  must_not_contain: ["upgrade", "recommend"]

- id: technical-bug-report
  input: "The export button does nothing when I click it in Firefox"
  expected_intent: technical_bug
  expected_sentiment: frustrated
  must_contain: ["issue", "team"]
```

Good golden fixtures share a few properties. They're curated by humans, not auto-generated — a fixture is only valuable if you're confident the expected output is actually correct. They include edge cases and historically problematic inputs, not just easy cases your prompt has always handled well. And they're checked into version control alongside the prompt itself, so prompt changes and fixture changes travel together in the same commit.

The fixtures you write today will encode your understanding of what correct behavior looks like. That's worth investing time in.
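
A minimal loader for the fixture file above, assuming PyYAML and the field names shown; the dataclass itself is illustrative:

```python
from dataclasses import dataclass, field

import yaml  # pip install pyyaml


@dataclass
class Fixture:
    id: str
    input: str
    expected_intent: str | None = None
    expected_sentiment: str | None = None
    must_contain: list[str] = field(default_factory=list)
    must_not_contain: list[str] = field(default_factory=list)


def load_fixtures(path: str) -> list[Fixture]:
    with open(path) as f:
        return [Fixture(**case) for case in yaml.safe_load(f)]
```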

Deterministic Assertions at Temperature Zero

For hard assertions, cases where there is a specific correct answer, run your model at temperature=0. This makes decoding greedy: given the same input and the same model version, you get the same output on virtually every run (inference-side nondeterminism can still cause occasional drift, but rarely enough that hard assertions hold). You can now write assertions that pass or fail reliably.

This is the piece most teams miss. They run their CI evals at their production temperature (often 0.7 or higher) and then wonder why the results are noisy. Split your fixture set: deterministic cases use temperature=0 and hard assertions; ambiguous or creative cases use a separate evaluation path with soft scoring.

The deterministic subset doesn't need to be large. Thirty to fifty cases covering your most critical behaviors — the ones that would cause real user harm if they regressed — is enough to give you a meaningful gate.
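
Putting the deterministic half together, here is a sketch of `run_case` for the hard-assertion subset. It reuses the `call_model` helper from earlier and assumes the triage prompt returns JSON with `intent`, `sentiment`, and `reply` fields; that output shape and `SYSTEM_PROMPT` are assumptions for illustration, not something this post prescribes:

```python
import json


def run_case(fx: Fixture) -> bool:
    # Temperature 0 gives (near-)deterministic output, so hard assertions hold.
    raw = call_model(SYSTEM_PROMPT, fx.input, temperature=0)
    try:
        out = json.loads(raw)  # assumes the prompt asks for a JSON reply
    except json.JSONDecodeError:
        return False  # malformed output is itself a regression

    reply = out.get("reply", "").lower()
    checks = [
        fx.expected_intent is None or out.get("intent") == fx.expected_intent,
        fx.expected_sentiment is None
        or out.get("sentiment") == fx.expected_sentiment,
        all(term in reply for term in fx.must_contain),
        all(term not in reply for term in fx.must_not_contain),
    ]
    return all(checks)
```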

LLM-as-Judge for Semantic Assertions

Not everything can be expressed as a string match or a classification label. For outputs where "correctness" is semantic — a summary that captures the key points, an explanation that's accurate without being verbatim — you can use a second model call to evaluate the first.

The pattern is simple: you send the model output plus a scoring rubric to a judge model and ask it to score on a scale or emit a pass/fail verdict with reasoning. The judge prompt matters a lot here. Vague rubrics produce inconsistent scores. Specific, decomposed criteria produce scores you can rely on.

A rubric for evaluating a product description output might look like: "Score 1 if the description mentions at least two specific product features, does not contain any factually incorrect claims, and is written in second person. Score 0 otherwise." That's something a judge model can evaluate consistently.
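
A sketch of that judge call, again with the earlier `call_model` helper standing in for whatever client you use; the rubric string and digit-only verdict protocol mirror the example above and are one reasonable convention, not the only one:

```python
JUDGE_RUBRIC = """You are grading a product description.
Score 1 if it (a) mentions at least two specific product features,
(b) contains no factually incorrect claims, and (c) is written in
second person. Otherwise score 0. Reply with the digit only."""


def judge(output: str) -> bool:
    verdict = call_model(
        JUDGE_RUBRIC,
        f"Description to grade:\n{output}",
        temperature=0,  # keep the judge itself deterministic too
    )
    return verdict.strip().startswith("1")
```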
