The Testing Pyramid Inverts for AI: Why Unit Tests Are the Wrong Investment for LLM Features

· 10 min read
Tian Pan
Software Engineer

Your team ships a new LLM feature. The unit tests pass. CI is green. You deploy. Then users start reporting that the AI "just doesn't work right" — answers are weirdly formatted, the agent picks the wrong tool, context gets lost halfway through a multi-step task. You look at the test suite and it's still green. Every test passes. The feature is broken.

This is not bad luck. It is what happens when you apply a deterministic testing philosophy to a probabilistic system. The classic testing pyramid — wide base of unit tests, smaller middle layer of integration tests, narrow top of end-to-end tests — rests on one assumption so fundamental that nobody writes it down: the code does the same thing every time. LLMs violate this assumption at every level. The testing strategy built on top of it needs to be rebuilt from scratch.

Why Unit Tests Give False Confidence for LLM Features

The testing pyramid works because unit tests are cheap, fast, and precise. You call a function, assert the output, done. The implicit contract is: if all unit tests pass, the logic is correct.

That contract breaks immediately when the "logic" lives inside a language model. Consider a prompt-level unit test for a classification feature: you feed the model an input, assert it returns "positive". The test passes. You run it again. It returns "positive". You deploy. In production, under different batch load, different concurrent requests, and the slight floating-point nondeterminism that plagues GPU inference, that same input returns "neutral" 8% of the time. Your test never caught this because it ran deterministically in CI.

Nondeterminism in LLMs is more pervasive than most teams realize. Even at temperature=0, modern inference servers dynamically adjust batch sizes based on concurrent load, producing different outputs for the same input depending on what else is running at the same time. The test environment is always "quiet" — nothing else runs alongside your CI job. Production is never quiet.

Beyond nondeterminism, prompt-level unit tests suffer from a more fundamental problem: they assert the wrong things. A test that checks output == "I'd be happy to help with that" is testing the exact token sequence, not the semantic content. A test that checks "sorry" not in output is testing a surface pattern, not the intent. When you mock the LLM response entirely, you test the plumbing around the model while learning nothing about the model's actual behavior. Developers have documented exactly this failure: AI-generated tests passed cleanly in CI while asserting wrong structural invariants — equality on serialized outputs instead of semantic properties, hardcoded timestamps, mocks that masked the real timing and retry behavior.
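The difference between asserting a token sequence and asserting a semantic property can be made concrete. In this sketch, `extract_label` is a hypothetical helper that normalizes a free-form model response into a canonical label, so the test checks the decision the feature depends on rather than the exact wording:

```python
import re

def extract_label(model_output: str) -> str:
    """Normalize a free-form model response into a canonical label,
    so tests assert the semantic decision rather than exact tokens."""
    text = model_output.lower()
    for label in ("positive", "negative", "neutral"):
        if re.search(rf"\b{label}\b", text):
            return label
    return "unknown"

# Brittle (what not to do): assert the exact output string.
#   assert model_output == "positive"   # breaks on any phrasing change

# Robust: assert the extracted semantic label.
assert extract_label("Sentiment: Positive.") == "positive"
assert extract_label("I'd say this is neutral overall.") == "neutral"
```

This doesn't solve nondeterminism by itself, but it stops the test from coupling to surface phrasing, which is the failure mode the paragraph above describes.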

The worst outcome of prompt-level unit tests is not that they fail — it is that they pass while hiding real failures. They create a false signal of correctness that delays the discovery of problems until users find them.

Where Failures Actually Live: The Tool Boundary

If unit tests don't catch the important failures, where do those failures actually occur? For most LLM applications, the answer is the boundary between the model and the external world: tool calls, API invocations, retrieval results, database writes.

This is the integration layer, and it is the highest-ROI place to invest testing effort for AI systems. Here is why: the LLM is a probabilistic decision-maker, and most of its decisions materialize as choices about which tool to call and what arguments to pass. When the model selects the wrong API endpoint, generates a malformed parameter, hallucinates a field name that doesn't exist in the schema, or calls a deprecated endpoint it remembers from training data — those failures are observable, testable, and consequential.

Integration tests at the tool boundary catch a class of failures that unit tests structurally cannot see:

  • The model consistently calls the search_orders tool when the user asks about returns, but the correct tool is search_return_requests — a tool selection failure that only appears when both tools exist simultaneously in the context.
  • The model generates {"date": "April 17"} when the tool schema requires {"date": "2026-04-17"} — a format mismatch that fails silently in testing but throws a 400 in production.
  • The model chains three tool calls that each succeed individually but accumulate context in a way that corrupts the fourth call — a sequential state failure invisible to per-call unit tests.

The practical approach here is record-and-replay: capture real tool interactions once against live APIs, then replay them deterministically in CI. This eliminates the mock-vs-reality gap that kills traditional integration testing for AI. The recorded responses are real, the agent behavior under test is real, and the test is still fast and cost-free after the initial recording. When a new model version or prompt change causes the agent to make different tool calls against the same recorded inputs, the test catches it.
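A minimal sketch of the record-and-replay idea, assuming a simple JSON cassette keyed by the tool name and arguments (real setups often use a VCR-style library; `ToolRecorder` and its methods are illustrative names, not a specific tool's API). The important property is the failure mode: if a prompt or model change makes the agent issue a tool call that was never recorded, the replay fails loudly instead of silently mocking it.

```python
import hashlib
import json
from pathlib import Path

class ToolRecorder:
    """Record-and-replay sketch for agent tool calls. In 'record' mode
    it invokes the live tool and saves the response; in 'replay' mode
    it returns the saved response deterministically."""

    def __init__(self, cassette: Path, mode: str = "replay"):
        self.cassette = cassette
        self.mode = mode
        self.data = json.loads(cassette.read_text()) if cassette.exists() else {}

    def _key(self, tool: str, args: dict) -> str:
        # Canonical serialization so the same call always hashes the same.
        blob = json.dumps({"tool": tool, "args": args}, sort_keys=True)
        return hashlib.sha256(blob.encode()).hexdigest()

    def call(self, tool: str, args: dict, live_fn=None):
        key = self._key(tool, args)
        if self.mode == "record":
            self.data[key] = live_fn(tool, args)
            self.cassette.write_text(json.dumps(self.data))
        elif key not in self.data:
            # The agent made a tool call that was never recorded: a prompt
            # or model change altered its behavior. Fail loudly.
            raise AssertionError(f"unrecorded tool call: {tool}({args})")
        return self.data[key]
```

Record once against live APIs, commit the cassette, and run CI in replay mode: fast, cost-free, and sensitive to exactly the tool-selection and argument changes described above.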

System-Level Behavioral Evals: The Signal That Actually Predicts User Experience

Integration tests tell you whether the agent executes correctly. Behavioral evals tell you whether the agent does the right thing. These are different questions, and only the second one predicts user satisfaction.

A behavioral eval operates at the system level: given a realistic user input, did the complete system — model, tools, retrieval, orchestration — produce an appropriate output? "Appropriate" is deliberately not "correct" in the binary sense. There is no single correct answer to "summarize this contract and flag the unusual clauses." There are many acceptable answers and many unacceptable ones. Behavioral evals measure which category your system falls into, for a representative sample of inputs.

This is where LLM-as-judge evaluation earns its place. You use a second model to evaluate the output of the first, scoring against a rubric: did the response address the user's actual goal? Does it contain hallucinated facts? Is the tool selection sequence reasonable? Running multiple scoring rounds smooths the judge's own nondeterminism. The result is a distribution of quality scores over your eval set — not a binary pass/fail, but a rate.
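The mechanics can be sketched in a few lines. Here `judge_fn` is a stand-in for a real judge-model call that scores a response against a rubric on a 0-10 scale; the names and the 7.0 pass threshold are illustrative assumptions, not a standard:

```python
from statistics import mean

def judge_score(response: str, rubric: str, judge_fn, rounds: int = 3) -> float:
    """Score one response with an LLM judge, averaging several rounds
    to smooth the judge's own nondeterminism."""
    return mean(judge_fn(rubric, response) for _ in range(rounds))

def eval_pass_rate(outputs, rubric, judge_fn, threshold: float = 7.0) -> float:
    """Turn per-item judge scores into a rate over the eval set --
    a distribution summary, not a binary pass/fail on any one output."""
    scores = [judge_score(o, rubric, judge_fn) for o in outputs]
    return sum(s >= threshold for s in scores) / len(scores)
```

The output is a rate you can track over time and alert on, which is the property the next paragraph depends on.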

The distinction between model-level and system-level evaluation matters operationally. If your system-level score drops, component scores reveal where the degradation lives: retrieval quality falling, tool selection accuracy dropping, or output formatting breaking. Without system-level evals, you have no baseline to alert on. With them, a 3-point drop in behavioral score is a paging condition, not a "hm, seems off" observation.

Behavioral evals also serve as the primary regression detection mechanism after model upgrades. When your LLM provider silently updates the underlying model — which they do, often without announcement — your integration tests will likely still pass (the API surface hasn't changed) while behavioral quality shifts. Only a system-level eval running against a curated set of representative tasks will catch this.

The Inverted Allocation: What the Distribution Should Look Like

The classic testing pyramid allocates effort as a triangle: many unit tests, fewer integration tests, fewest end-to-end tests. For LLM features, the allocation inverts toward the top. Here is a practical breakdown:

Deterministic technical tests (the base — run on every commit, lowest cost): Cover the scaffolding around the LLM: retry logic, timeout handling, schema validation on tool outputs, prompt template rendering, token budget enforcement, output parsing. These tests mock the LLM entirely because they are not testing model behavior — they are testing the code that wraps the model. They should be fast and cheap, and they should exist. But they tell you nothing about whether the feature actually works.
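A scaffolding test in this layer looks like ordinary deterministic testing, because it is. This sketch tests a hypothetical retry wrapper with a mock that fails twice before succeeding; the model is mocked precisely because the wrapper, not the model, is under test:

```python
def call_with_retry(fn, max_attempts: int = 3):
    """Scaffolding under test: a retry wrapper around an LLM call."""
    last_err = None
    for _ in range(max_attempts):
        try:
            return fn()
        except TimeoutError as err:
            last_err = err
    raise last_err

class FlakyMock:
    """Deterministic stand-in for the model: fails N times, then succeeds."""
    def __init__(self, failures: int):
        self.failures = failures
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise TimeoutError("simulated timeout")
        return "ok"
```

Asserting `call_with_retry(FlakyMock(2)) == "ok"` and that exactly three calls were made verifies the retry contract without ever touching a real model.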

Record-and-replay integration tests (the middle — run on prompt or tool changes, moderate cost): Cover the agent's interaction with external systems using recorded real API responses. Run these when anything changes that could affect tool selection, argument generation, or multi-step sequencing: prompt edits, model upgrades, schema changes, new tools added. These catch the class of failures that break real workflows without requiring live API calls in CI.

Behavioral evals (the top — run continuously against production traffic sample, higher cost): Cover end-to-end task completion against realistic inputs. Run on 1–5% of production traffic continuously, and against a curated golden set on every deployment. These are the signal. When they drop, something is wrong with the feature regardless of what the lower layers report.

The shift from the classic pyramid is not about writing fewer unit tests — it is about recognizing that unit tests belong only to the scaffolding layer, not to the model behavior layer. Teams that write prompt-level unit tests as if they were testing a pure function are spending effort on tests that cannot catch the failures their users will actually encounter.

The Organizational Friction This Creates

There is a reason teams default to unit tests even for LLM features: unit tests are familiar, fast to write, and CI integration is trivial. Behavioral evals require new infrastructure — eval datasets, judge prompts, scoring pipelines, threshold tracking. Many teams treat this infrastructure as a luxury, building it "after we ship v1."

This is exactly backwards. Evals built before the feature launches establish the baseline. Evals built after a regression are playing catch-up against a known failure. The teams that report the most stable AI features in production are the ones that treat the eval infrastructure as foundational — not because evals are philosophically important, but because without them, the team has no way to know whether their changes make the feature better or worse.

The practical minimum for a new LLM feature: 50–100 representative task inputs with human-validated expected outputs, a judge prompt that scores the outputs against defined criteria, and a CI gate that fails the build if the behavioral score drops below a threshold. This does not require a dedicated ML platform. It requires the same rigor teams already apply to integration tests — applied to the layer where LLM failures actually live.
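The CI gate itself can be this small. The function below is a sketch, assuming scores on a 0-100 scale and the 3-point drop threshold mentioned earlier; the names and defaults are illustrative, and a real gate would call `sys.exit(1)` on failure:

```python
def ci_gate(scores, baseline: float, max_drop: float = 3.0) -> bool:
    """Fail the build when the mean behavioral score falls more than
    `max_drop` points below the recorded baseline."""
    current = sum(scores) / len(scores)
    return current >= baseline - max_drop
```

Wiring this into CI is one step: run the golden set, compute the mean judge score, and block the deploy when `ci_gate` returns False.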

What This Means Concretely

Stop writing tests that assert exact model outputs. They will not pass consistently, and when they do, they are asserting the wrong things. Write deterministic tests only for the code paths that wrap the model: the retry handler, the schema validator, the token counter.

Invest heavily in the tool boundary. Record real API interactions and replay them in CI. This is the highest-ROI testing layer for agent systems and the place where most consequential failures manifest.

Run behavioral evals continuously. Treat a drop in system-level quality score as a deployment blocker. Build the eval dataset before you ship the feature, not after the first user complaint.

The classic testing pyramid describes how to test deterministic software. LLM features are not deterministic software. The test allocation strategy has to reflect where the failures actually are — which is not at the bottom of the pyramid.


The practical upshot: if your LLM feature has 200 unit tests and 5 evals, you have the allocation backwards. Flip it.
