Test-Driven Development for LLM Applications: Where the Analogy Holds and Where It Breaks
A team built an AI research assistant using Claude. They iterated on the prompt for three weeks, demoed it to stakeholders, and launched it feeling confident. Two months later they discovered that the assistant had been silently hallucinating citations in roughly 30% of outputs — a failure mode no one had tested for, because the eval suite was built after the prompt had already "felt right" in demos.
This pattern is the rule, not the exception. The LLM development industry has largely adopted test-driven development vocabulary — evals, regression suites, golden datasets, LLM-as-judge — while ignoring the most important rule TDD establishes: write the test before the implementation, not after.
Here is how to do that correctly, and the three places where the TDD analogy breaks down so badly that following it literally will make your system worse.
Evals Written After Prompts Are Weak by Construction
Traditional TDD works because the failing test defines success before any implementation exists. In practice, most LLM teams do the opposite: write the prompt, see outputs that look acceptable, then write tests that validate what the system already produces.
The result is an eval suite that has been unconsciously designed around the system's existing strengths and tuned to miss its blind spots. You end up with high scores on tests that were never going to fail, and no signal on failure modes you haven't thought to look for.
The eval-first workflow inverts this. Before writing any prompt, define 10–20 scenarios covering success cases and expected failure cases. Specify what a correct output looks like for each scenario, and choose an evaluation method: exact match for deterministic outputs, rubric-based scoring for quality, LLM-as-judge for subjective qualities. Build a minimal harness that can score outputs against these scenarios — even a spreadsheet works at this stage.
Then write the prompt to satisfy the evals.
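That harness really can start as small as described. The sketch below assumes a scenarios-as-checks format and a stand-in `model` callable — both are illustrative shapes, not a prescribed API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Scenario:
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True if the output is acceptable

def run_evals(scenarios: list[Scenario], model: Callable[[str], str]) -> dict:
    """Score a model callable against scenarios defined before any prompt exists."""
    results = {s.name: s.check(model(s.prompt)) for s in scenarios}
    return {"pass_rate": sum(results.values()) / len(results), "results": results}

# Scenarios written first, covering success *and* expected-failure cases:
SCENARIOS = [
    Scenario("cites_source", "Summarize document A.",
             lambda out: "[source:" in out.lower()),
    Scenario("declines_unknowable", "Who won the 2199 election?",
             lambda out: "don't know" in out.lower()),
]
```

Even this much gives the iteration loop a number to move. Swap the lambda checks for rubric or judge calls as the suite matures; the contract stays the same.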
The concrete evidence for why this matters: a team used this approach to build a Claude-based research skill, running automated eval cycles overnight. Starting from 41% accuracy on the defined eval scenarios, they reached 92% in four overnight cycles — a trajectory impossible without a pre-defined measurement that gave the feedback loop something to optimize against.
Evals written after the prompt would have been calibrated to whatever the prompt already produced at its initial accuracy level. The ceiling would have been much lower.
Making the Feedback Loop Fast Enough to Matter
The eval-first approach only works if you can run evals and act on results quickly. A feedback loop measured in hours kills iteration momentum. The target is minutes per cycle.
The practical tactics that make this achievable:
Start with fewer examples than you think you need. Fifty curated examples give meaningful signal faster than waiting to build a 500-example dataset. One team found that three issues accounted for most failures after analyzing 100 production traces — a result that emerged from annotating a small, representative sample, not from exhaustive coverage. Expand the dataset as you discover new failure modes, not upfront.
Tier your eval suite by cost. Run cheap, deterministic evals (exact match, regex, JSON schema validation) on every commit. Run expensive LLM-as-judge evals on significant prompt changes or pre-release. This keeps the continuous feedback loop fast while preserving thorough coverage for high-stakes decisions.
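The cheap tier can be plain assertions with no model in the loop at all. A sketch of the deterministic checks named above — exact match, regex, JSON-schema-style validation — with function names of my own invention:

```python
import json
import re

def check_exact(output: str, expected: str) -> bool:
    """Tier 1: exact match for deterministic outputs."""
    return output.strip() == expected.strip()

def check_pattern(output: str, pattern: str) -> bool:
    """Tier 1: regex presence check, e.g. a required citation marker."""
    return re.search(pattern, output) is not None

def check_json_keys(output: str, required: list[str]) -> bool:
    """Tier 1: output parses as a JSON object carrying every required key."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(k in data for k in required)
```

These run in milliseconds on every commit; the LLM-as-judge checks sit behind a separate pytest marker or CI stage reserved for pre-release runs.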
Parallelize test execution. LLM calls are IO-bound and embarrassingly parallel. Switching from sequential to async concurrent execution — the @pytest.mark.asyncio_cooperative pattern, for instance — cuts total eval runtime by 30–40% for suites with many model calls. At scale, the difference between a 10-minute and a 4-minute eval cycle compounds significantly.
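A minimal version of the switch, using only asyncio — the sleep stands in for a model call; real code would await an HTTP client instead:

```python
import asyncio
import time

async def eval_case(case: str) -> bool:
    # Stand-in for one IO-bound LLM call.
    await asyncio.sleep(0.05)
    return True

async def run_concurrent(cases: list[str]) -> list[bool]:
    # gather() overlaps the waits, so N calls take roughly one call's latency.
    return list(await asyncio.gather(*(eval_case(c) for c in cases)))

start = time.perf_counter()
RESULTS = asyncio.run(run_concurrent([f"case-{i}" for i in range(20)]))
ELAPSED = time.perf_counter() - start
```

Run sequentially, twenty 50 ms calls would take a second; run concurrently they finish in roughly one call's latency, which is exactly where the cycle-time savings come from.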
Build domain-specific tooling early. One engineering team found that a custom tool showing the full conversation context on a single screen was "worth the investment" — without it, error analysis required too much context-switching across tabs to be fast. The investment in tooling returned through dramatically faster diagnosis of failures.
Use binary scoring in automated judges. LLM-as-judge evaluations that score on a 1–5 Likert scale produce inconsistent results: a judge that would score the same output as 3 on one run might score it as 4 on another. Binary PASS/FAIL (or at most three options) eliminates this inconsistency and makes threshold decisions straightforward. The feedback loop only moves fast if the scoring is reliable.
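In practice this means prompting the judge for a single verdict and parsing strictly, counting anything ambiguous as a failure. The template and parser below are illustrative, not a prescribed format:

```python
JUDGE_TEMPLATE = """You are grading an answer against a rubric.
Rubric: {rubric}
Answer: {answer}
Reply with exactly one word: PASS or FAIL."""

def parse_verdict(raw_reply: str) -> bool:
    """Map a judge reply to a boolean; ambiguous replies count as FAIL."""
    verdict = raw_reply.strip().upper()
    return verdict == "PASS"  # "FAIL", "4/5", or anything else is a failure
```

Treating every non-PASS reply as a failure keeps the threshold decision honest: a judge that cannot produce a clean verdict is itself a signal worth seeing.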
Where the TDD Analogy Breaks Down
Non-Determinism Is Structural, Not a Bug to Fix
Traditional unit tests assert exact outputs. Given input X, the code must produce exactly Y. LLMs break this at the foundation — and the break runs deeper than most practitioners realize.
Setting temperature=0.0 does not guarantee deterministic outputs. Research published in 2025 demonstrates that identical prompts, identical inference settings, and identical hardware configurations still produce output variation across runs due to floating-point rounding in parallel matrix operations and batching artifacts. The non-determinism is structural, not configurable away.
The necessary adaptation is probabilistic specification. Instead of "this must pass," define "this must pass in at least 80% of runs." Libraries like LangEvals encode this explicitly with patterns like @pytest.mark.pass_rate(0.8). The underlying philosophy shifts: testing LLMs is not about proving correctness, it is about documenting reliability thresholds.
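Without a library, the pattern is just a loop plus a threshold. A sketch, assuming `run_once` is whatever executes one attempt of the eval:

```python
from typing import Callable

def passes_at_rate(run_once: Callable[[], bool], runs: int = 10,
                   threshold: float = 0.8) -> bool:
    """Probabilistic spec: pass if at least `threshold` of the runs pass."""
    passed = sum(run_once() for _ in range(runs))
    return passed / runs >= threshold
```

This is the same contract as `@pytest.mark.pass_rate(0.8)`: the assertion documents a reliability threshold rather than pretending the system is deterministic.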
This has a second-order effect on test design. Stopping a suite on the first failure — pytest's -x fail-fast mode, common in CI configurations — gives a misleading picture of system health. In an LLM eval suite, a single failure on one run might be noise. All results should be logged and aggregated across the full run, yielding a percentage-based health score rather than a binary pass/fail verdict.
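The aggregation itself is simple: run everything to completion, catch errors, report a percentage. The storage shape here is an assumption:

```python
def suite_health(evals: dict, run) -> tuple[float, dict]:
    """Run every eval to completion; an exception is a failure, not a stop."""
    outcomes = {}
    for name, case in evals.items():
        try:
            outcomes[name] = bool(run(case))
        except Exception:
            outcomes[name] = False  # logged as a failure, never a crashed suite
    return sum(outcomes.values()) / len(outcomes), outcomes
```

The returned score is the health number to track over time; the per-eval dict is what you drill into when it moves.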
There Is No Compile-Time Error for Prompt Mistakes
In traditional software, a typo in code or a broken import fails immediately and visibly. The developer gets instant feedback that something is wrong before the code ever runs.
Prompt engineering has no equivalent. A subtle wording change — a single word shift, a restructured sentence, a modified instruction ordering — can silently alter model behavior in ways that only manifest on edge cases at production scale. The "compilation" happens invisibly during inference, and the failure mode might not appear until the system has processed thousands of inputs.
This is why prompt versioning and regression testing are not optional infrastructure. They are the only mechanism that catches what the non-existent compile step would have caught. Teams that treat prompts as informal strings rather than versioned artifacts lose the ability to identify which change caused which regression. When a model upgrade silently drops tone scores from 0.85 to 0.72, the teams with eval infrastructure catch it in 20 minutes; the teams without it learn from customer complaints two weeks later.
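The regression catch described above reduces to comparing versioned score snapshots. A sketch — the tolerance value is illustrative:

```python
def detect_regressions(baseline: dict[str, float], current: dict[str, float],
                       tolerance: float = 0.05) -> dict[str, tuple[float, float]]:
    """Return every metric that dropped more than `tolerance` vs. the baseline."""
    return {
        metric: (baseline[metric], current[metric])
        for metric in baseline
        if metric in current and baseline[metric] - current[metric] > tolerance
    }
```

Run after every prompt or model change, a comparison like this is what turns the missing compile step into a CI signal instead of a customer-complaint cycle.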
Ground Truth Is a Distribution, Not a Point
Classical TDD assumes one correct answer per test case. For most LLM tasks, correct outputs form a distribution. A well-crafted email could take dozens of valid forms. A thorough code review comment could be phrased ten different ways. There is no single "right" output to assert against.
This forces a design decision that has no equivalent in traditional TDD:
- Narrow the output specification until deterministic matching works. This is viable for structured tasks: JSON extraction, classification, entity recognition. For generative tasks, it underspecifies what matters.
- Accept fuzzy evaluation using LLM-as-judge with a well-calibrated rubric. This scales, but introduces a second model that can itself fail, drift, or exhibit biases (favoring verbosity, favoring responses from models in the same family as the judge, favoring items presented first in pairwise comparisons).
- Use human annotation as ground truth. High-stakes, expensive, and slow — but the only source of truth that doesn't inherit the biases of another model.
In practice, the right answer is a combination: human annotation to validate the judge, judge at scale to gate deployments, human review for high-stakes or novel task types.
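One way to encode that combination is a simple router that picks the evaluation channel per task. The field names below are assumptions about how tasks are tagged:

```python
def evaluation_channel(task: dict) -> str:
    """Route each task to the cheapest trustworthy evaluation method."""
    if task.get("high_stakes") or task.get("novel_type"):
        return "human-review"   # the only channel without another model's biases
    if task.get("deterministic"):
        return "exact-match"    # structured outputs: assert directly
    return "llm-judge"          # generative outputs at scale, judge validated by humans
```

The routing logic is trivial; the value is making the choice explicit and reviewable rather than implicit in whoever wrote each test.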
Evals Are Not Portable Across Models
In software, a test suite that passes for one implementation of an interface should pass for any correct implementation. LLM evals do not transfer this way. Testing identical prompts across GPT-3.5-turbo, GPT-4o, and Llama-3-70b reveals different failure modes on each model. A prompt that achieves 87% accuracy on one model might drop to 61% on another with identical evaluation criteria.
The practical implication: eval suites are partially model-specific, and a model swap always requires re-running evals rather than assuming transfer. Infrastructure that versions prompts alongside model configurations, and ties eval results to specific (prompt version, model, settings) tuples, handles this correctly. Generic public benchmarks like HumanEval provide insufficient signal — a model can exceed 90% on a standard benchmark while producing unacceptable outputs on the specific tasks your application requires.
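Concretely, each stored result can be keyed by the full tuple. The record shape and version labels here are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalRecord:
    """One eval result, pinned to exactly what produced it."""
    prompt_version: str   # e.g. a git tag or content hash of the template
    model: str
    temperature: float
    dataset_version: str
    pass_rate: float

record = EvalRecord(prompt_version="prompt-v14", model="model-a",
                    temperature=0.0, dataset_version="golden-v3", pass_rate=0.87)
```

Frozen dataclasses are hashable, so records index cleanly into before/after comparisons across model swaps: "did the swap regress us?" becomes a lookup rather than an archaeology project.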
The Golden Dataset Is the Core Artifact
The single most important artifact in LLM-driven TDD is the golden dataset: a curated set of inputs and expected outputs (or expected evaluation criteria) that defines what success looks like for your specific system.
The dataset is the specification. When the dataset is incomplete or only covers happy paths, the specification is incomplete. When the dataset grows stale and stops being updated with production failures, the specification diverges from reality.
Practical sizing guidance: 50–100 examples get you to a meaningful minimum viable eval; 200–500 cover major use cases and edge cases well enough for production gating; 1,000+ is appropriate for mature systems where new failure modes are regularly discovered and added from production traffic.
The sourcing strategy matters as much as the size. In rough order of quality:
- Real production traces with privacy filtering, drawn from actual user behavior rather than imagined inputs
- Domain expert-authored "must-pass" scenarios with explicit acceptance criteria for cases where the stakes of failure are highest
- Edge cases, ambiguous inputs, and adversarial examples that stress-test the specification's boundaries
- Synthetic data that has been reviewed and promoted by subject matter experts
The dataset is a living artifact, not a one-time build. Every production failure that evals failed to predict is evidence of a gap in the specification. Maintaining a feedback loop from production failures to dataset updates is what keeps the eval suite calibrated to real-world behavior over time.
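That feedback loop can be a one-function habit. The trace fields below are assumptions about your logging format; the key property is that the old dataset version survives the update:

```python
def promote_failure(trace: dict, dataset: list[dict]) -> list[dict]:
    """Turn a production failure the evals missed into a new golden example."""
    example = {
        "input": trace["input"],
        "expected": trace["corrected_output"],  # the human-reviewed right answer
        "source": "production-failure",
        "trace_id": trace["id"],
    }
    return dataset + [example]  # new list: the previous dataset version is untouched
```

Tagging the provenance (`"source"`) matters: it lets you later measure how much of the specification came from real failures versus imagined inputs.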
Closing
TDD's most valuable contribution to LLM development is not the test-before-code workflow — it is the culture of treating the test as the specification. When evals are written before prompts, the team is forced to define success concretely before investing in implementation. When evals are written after, they measure comfortable ground rather than unknown territory.
The analogy breaks on non-determinism, on the absence of compile-time errors, on subjective ground truth, and on model portability. In each of these places, adapting TDD naively makes things worse — asserting exact outputs under stochastic inference, treating benchmark scores as production proxies, or assuming a prompt that passes evals on one model transfers cleanly to another.
The adaptation that works: probabilistic specifications instead of exact assertions, percentage-based health scores instead of binary pass/fail, model-specific eval configurations tied to versioned prompt artifacts, and golden datasets treated as living specifications rather than one-time builds.
The discipline is the same as TDD: define what done looks like before writing any implementation. The tooling and the tolerance for uncertainty are different.
