
The Agent Test Pyramid: Why the 70/20/10 Split Breaks Down for Agentic AI

· 12 min read
Tian Pan
Software Engineer

Every engineering organization that graduates from "we have a chatbot" to "we have an agent" hits the same wall: their test suite stops making sense.

The classical test pyramid — 70% unit tests, 20% integration tests, 10% end-to-end — is built on three foundational assumptions: units are cheap to run, isolated from external systems, and deterministic. Agentic AI systems violate all three at once. A "unit" is a model call that costs tokens and returns different answers each time. An end-to-end run can take several minutes and burn through more API budget than an entire sprint's worth of conventional tests. And isolation is nearly impossible when the agent's intelligence emerges precisely from interacting with external tools and state.

The result is predictable: teams either write no tests and rely on vibes, or they write e2e tests that are too slow to run in CI, too expensive to run on every PR, and too flaky to trust. Neither is a real testing strategy.

What the Classical Pyramid Got Right (and Why It Works for Normal Code)

The test pyramid, originally articulated by Mike Cohn and popularized by Martin Fowler, reflects a real economic insight. Unit tests are fast because they have no I/O — no network, no database, no external services. When a unit test fails, you know exactly which function broke. At scale, a 10,000-test suite should complete in under 90 seconds.

Integration tests are slower because they involve real subsystems — a database write, a service call — but you only need enough to confirm that your modules fit together correctly. E2E tests run the entire system from the user's perspective and are kept deliberately sparse because they're expensive to maintain and slow to execute.

The pyramid's shape encodes cost: cheap tests at the bottom in abundance, expensive tests at the top in moderation. The whole model works because units are deterministic pure functions. The same input produces the same output every time.

An LLM call is not a pure function. It is a probabilistic sampler over a distribution shaped by model weights, temperature, system prompt, prior conversation history, and sometimes the time of day (when model providers roll out silent updates). The "unit" of an agentic system is inherently non-deterministic. You cannot write assertEqual(agent_output, expected) and have it mean anything.
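What does work against a probabilistic sampler is asserting properties the output must satisfy rather than exact equality. A rough sketch, where fake_llm is a purely illustrative stand-in for a real model call:

```python
# Property-based checks survive non-determinism; string equality does not.
import json
import random

def fake_llm(prompt: str) -> str:
    # Simulates non-determinism: wording varies between calls,
    # but the response structure stays stable.
    verb = random.choice(["created", "generated", "produced"])
    return json.dumps({"status": "ok", "message": f"Report {verb}."})

def check_response(raw: str) -> bool:
    # Assert structure and required fields, not exact wording.
    data = json.loads(raw)
    return data.get("status") == "ok" and "message" in data

assert check_response(fake_llm("summarize the report"))
```

The exact message differs on every call, but the structural assertion passes every time.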

How Agents Break the Pyramid at Every Layer

The unit layer has no units

In a conventional service, you can unit-test a parse_json(response) function or a format_prompt(context) helper. Those still exist in an agent codebase and should absolutely be tested. But the agent's actual intelligence — its decisions about which tool to call, how to decompose a task, when to ask for clarification — lives inside the model. You cannot inject a mock LLM and expect meaningful behavior. The model is the logic.

Some teams try to work around this by testing the prompt template in isolation. That catches obvious formatting bugs, but it tells you nothing about whether the rendered prompt actually produces the intended behavior. The gap between "the prompt compiled without errors" and "the agent does what we want" is where most bugs live.

Integration tests become tool-call tests, and tool calls are expensive

In a normal service, integration tests are moderately expensive because they hit a database. For an agent, every integration test calls the model (sometimes multiple times per turn, across multiple turns) and invokes real or semi-real tools. A single integration test run can cost $0.10–$1.00 in API credits and take 30–120 seconds. A suite of 50 integration tests costs $5–$50 per run and blocks CI for 25–100 minutes. Those economics make it impossible to run on every pull request.

End-to-end tests take minutes and cost dollars, but they're the only tests that catch real failures

An agent completing a realistic task — researching a topic, executing a multi-step coding workflow, or handling a support ticket — might make 15–30 model calls, call 5–10 tools, and produce intermediate artifacts that need to be inspected at each stage. This is not a 30-second test. And when it fails, the failure message is typically "the agent did not complete the task" — which could be caused by a flawed prompt, a changed tool interface, a model regression, or a subtle shift in reasoning that emerged from a prior prompt change three steps earlier.

Critically, agentic failures compound. A mistake in step 3 of a 10-step task corrupts every subsequent step. By the time a final assertion fires, the root cause is buried in a transcript that might be thousands of tokens long.

Non-determinism poisons every layer

Anthropic's own eval research found that single-run success rates for agents cluster around 68–74%, while the rate of succeeding in 8 consecutive runs drops to 52–73%. That spread is natural variance, not a bug you introduced; it is simply how probabilistic systems behave. A test that passes 70% of the time is not a reliable quality gate; it is noise with a CI badge.

An Alternative: Three Tiers Built for Agents

What works instead is a three-tier approach that replaces the speed/isolation hierarchy with a cost/fidelity hierarchy appropriate for probabilistic systems.

Tier 1: Prompt contract tests (fast, cheap, deterministic)

These test the parts of your agent that are deterministic: prompt composition, tool schema definitions, response parsers, input sanitizers. The "prompt snapshot" pattern, borrowed from visual regression testing, captures the rendered prompt as a JSON artifact and compares it against a committed baseline. If a code change alters the system prompt text, the snapshot diff catches it — before any model is ever called.

A prompt contract test for a tool definition might assert: the JSON schema for the search_web tool has the required query parameter, its type is string, and no new parameters were added without documentation. These tests complete in milliseconds and run on every commit, at zero model cost.

Prompt snapshot tests specifically catch the most common silent breakage pattern: someone refactors a prompt template helper and accidentally changes the instruction wording, which changes agent behavior, which is only discovered when a user complains two weeks later.
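Both patterns can be sketched in a few lines. The search_web schema and render_prompt helper below are hypothetical stand-ins; in practice the snapshot baseline would live in a committed JSON artifact rather than inline:

```python
# Tier 1: deterministic contract and snapshot checks, no model calls.
SEARCH_WEB_SCHEMA = {
    "name": "search_web",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

def test_search_web_contract():
    params = SEARCH_WEB_SCHEMA["parameters"]
    assert "query" in params["required"]                      # required param present
    assert params["properties"]["query"]["type"] == "string"  # correct type
    assert set(params["properties"]) == {"query"}             # no undocumented params

def render_prompt(task: str) -> str:
    # Stand-in for the real prompt template helper.
    return f"You are a careful research agent. Task: {task}"

def test_prompt_snapshot():
    # Any wording change in the template shows up as a diff here,
    # before any model is ever called.
    assert render_prompt("demo") == "You are a careful research agent. Task: demo"

test_search_web_contract()
test_prompt_snapshot()
```

A refactor that subtly rewords the template fails test_prompt_snapshot immediately, instead of surfacing as a behavior change weeks later.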

Tier 2: Tool interaction tests with recorded fixtures (moderate cost, controlled fidelity)

These correspond roughly to integration tests, but the key adaptation is tool tapes — recorded fixtures of tool responses, analogous to HTTP cassettes in VCR-style testing. Instead of calling the live search_web API or execute_code sandbox in CI, the agent runs against a replay of a previous session's tool responses.

A tool interaction test validates: given this starting state and these recorded tool responses, does the agent call the right tools in the correct sequence, with correct parameters? It also validates error-handling branches: when database_query returns a timeout error, does the agent retry with exponential backoff or gracefully tell the user it cannot complete the task?
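The replay mechanism can be sketched as follows. TapePlayer, TAPE, and agent_step are illustrative assumptions, with agent_step standing in for real model-driven agent logic:

```python
# Tool-tape replay: the agent runs against recorded (tool, args, response)
# triples instead of live tools, and any deviation fails fast.
class TapePlayer:
    def __init__(self, tape):
        self.tape = list(tape)   # copy so the fixture can be reused
        self.calls = []          # transcript of what the agent actually did

    def call(self, tool: str, **args):
        self.calls.append((tool, args))
        expected_tool, expected_args, response = self.tape.pop(0)
        # Fail immediately if the agent deviates from the recorded trajectory.
        assert (tool, args) == (expected_tool, expected_args), \
            f"expected {expected_tool}({expected_args}), got {tool}({args})"
        return response

TAPE = [
    ("search_web", {"query": "agent testing"}, {"results": ["..."]}),
    ("read_file", {"path": "notes.md"}, {"text": "..."}),
]

def agent_step(tools: TapePlayer):
    # Stand-in for real agent logic driven by model decisions.
    tools.call("search_web", query="agent testing")
    tools.call("read_file", path="notes.md")

player = TapePlayer(TAPE)
agent_step(player)
```

When a test fails, player.calls is the exact transcript of what the agent did, which is precisely the clean failure signal the tape pattern buys you.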

Because tool calls are mocked, these tests complete in under 10 minutes for a suite of 30–50 scenarios and cost only the model inference to generate the agent's decisions. They run nightly or on PRs touching agent logic, not on every commit. The tool-tape pattern also solves a real debugging problem: when a test fails, you have the complete recorded transcript of what the agent did, which tool it called incorrectly, and what response it received — a much cleaner failure signal than a live run's chaos.

Tier 3: Goal completion tests (expensive, full fidelity)

These are deliberately kept small: 10–20 scenarios representing your highest-stakes workflows. They run against live models and real tools, in a staging environment. They measure task completion rate — did the agent actually accomplish the stated goal? — using a combination of environment inspection (did the file get created? did the database record get updated?) and LLM-as-judge rubrics for outcomes that can't be verified programmatically.

The critical shift from classical e2e tests is the scoring approach. Instead of binary pass/fail, goal completion tests produce a statistical signal over multiple runs. Rather than asserting "this test passed," you assert "this test passes at least 80% of the time across 5 runs" — what Anthropic's eval framework calls pass@k criteria. This converts the inherent non-determinism from a testing liability into a managed quality metric.
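The statistical gate itself is simple to express. Here run_scenario-style callables are hypothetical stand-ins; a real implementation would launch a live agent run and inspect the environment for goal completion:

```python
# A pass@k-style gate: assert a pass rate over repeated runs,
# not a single binary outcome.
def pass_rate(scenario, runs: int = 5) -> float:
    return sum(1 for _ in range(runs) if scenario()) / runs

def meets_gate(scenario, runs: int = 5, threshold: float = 0.8) -> bool:
    # "Passes at least 80% of the time across 5 runs" as a hard criterion.
    return pass_rate(scenario, runs) >= threshold

# Example: 4 successes out of 5 runs meets an 80% threshold.
outcomes = iter([True, True, False, True, True])
assert meets_gate(lambda: next(outcomes)) is True
```

A single flaky failure no longer breaks the build; only a genuine drop in the pass rate does.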

Goal completion tests run nightly or on release candidates, never on every PR.

CI Infrastructure Patterns at Each Tier

The three-tier model maps to a concrete CI pipeline structure:

  • Every commit: Prompt contract tests run in under 2 minutes, no model calls, no API keys required. These gate every merge.
  • Every PR touching agent logic: Tool interaction tests run against recorded fixtures, targeting under 10 minutes total. Require model API access but no live tools. A breaking PR is blocked from merging.
  • Nightly: Goal completion tests run against live infrastructure, produce statistical pass-rate reports, and alert on regression. A single failing run does not block anything; a trend triggers review.
  • Pre-release: Full goal completion suite across all scenarios, with a statistical threshold (e.g., 80% pass rate across 5 runs per scenario) as a hard gate before production deployment.
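One way to wire the stages above into a test runner is to select tiers from the pipeline stage. The CI_STAGE variable and stage names below are illustrative assumptions, not a specific CI product's API:

```python
# Map pipeline stages to the test tiers they should run.
import os
from typing import List, Optional

TIERS_BY_STAGE = {
    "commit": ["prompt_contract"],
    "pr": ["prompt_contract", "tool_interaction"],
    "nightly": ["prompt_contract", "tool_interaction", "goal_completion"],
    "release": ["prompt_contract", "tool_interaction", "goal_completion"],
}

def tiers_to_run(stage: Optional[str] = None) -> List[str]:
    # Default to the cheapest tier when the stage is unset or unknown.
    stage = stage or os.environ.get("CI_STAGE", "commit")
    return TIERS_BY_STAGE.get(stage, ["prompt_contract"])

assert tiers_to_run("pr") == ["prompt_contract", "tool_interaction"]
```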

This structure ensures that the expensive, slow, flaky tests exist but don't block day-to-day engineering velocity. The cheap, fast tests provide confidence for routine development while the expensive ones provide confidence for releases.

Failure Modes That Escape Traditional Testing

Several categories of agent failures are invisible to conventional pyramid testing:

Trajectory drift: The agent reaches the right final answer via the wrong sequence of tool calls — wasting time, money, or triggering unintended side effects. A final-output assertion says nothing about whether the agent called read_file 12 times when once would suffice, or whether it fired a mutating API request on the way to the correct answer. Only trajectory evaluation catches this.
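A trajectory check operates on the recorded transcript rather than the final output. The call limits and forbidden-tool list below are illustrative assumptions:

```python
# Inspect the transcript of (tool, args) calls, independent of the answer.
from collections import Counter

def check_trajectory(transcript, max_calls_per_tool=3, forbidden=("send_email",)):
    counts = Counter(tool for tool, _ in transcript)
    for tool, n in counts.items():
        # Flag wasteful loops: the right answer after 12 read_file calls
        # still fails the trajectory check.
        assert n <= max_calls_per_tool, f"{tool} called {n} times"
    for tool in forbidden:
        # Flag side-effecting tools the task never authorized.
        assert counts[tool] == 0, f"unexpected side effect: {tool}"
    return True

assert check_trajectory([("read_file", {"path": "a.md"}),
                         ("search_web", {"query": "agent testing"})])
```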

Prompt injection through tool outputs: An adversarial string returned by a web search or code execution can redirect the agent's behavior. Traditional unit tests of the prompt template will never surface this because the injection comes from runtime data, not from the static prompt.

Compounding reasoning errors: Step 3 of a 10-step task produces a subtly wrong intermediate result. Steps 4–10 succeed given that wrong input. The final output looks plausible but is incorrect. A single-step evaluation never triggers; an end-to-end evaluation might not catch it if the final-state check is not granular enough.

Silent model drift: The model provider silently updates a model version. The agent's behavior shifts gradually. None of your deterministic tests catch this because they don't call the model. Only production monitoring or periodic full-suite runs detect the regression, often after it has already affected users.

Tool schema evolution: A third-party tool your agent depends on changes its parameter names or response format. Your prompt contracts pass. Your tool interaction tests pass (they use recorded fixtures). The failure only surfaces in goal completion tests or production. This argues for including at least a few live-tool integration tests that deliberately exercise real external dependencies on a schedule.

What Teams Are Actually Doing

The most mature teams treat agent evaluation with the same rigor as production services:

  • Versioned prompts: stored in version control alongside the code that renders them
  • Automated quality gates: CI pipelines that block merges when agent quality drops below threshold
  • Staged rollouts: canary deployments routing 1–5% of traffic to new agent versions
  • Explicit rollback procedures: clear criteria and tooling to revert when metrics decline

LangChain's own evaluation readiness framework recommends that 60–80% of evaluation effort focus on error analysis before automation — a deliberate inversion of the traditional approach. Rather than starting with automation and debugging failures, teams should manually review 20–50 real production traces first, build a taxonomy of failure types, and then build automated tests that target those specific failure modes. This grounds the test suite in observed failures rather than hypothetical ones.

The rise of dedicated prompt evaluation tooling — promptfoo, Braintrust, DeepEval, Confident AI — signals how central this problem has become. Prompt evaluation and regression testing is now infrastructure-level, not a niche concern. Teams that treat it as an afterthought accumulate invisible quality debt that surfaces as user complaints rather than test failures.

The New Mental Model

The classical test pyramid is a shape chosen to optimize for two properties: speed and confidence. Cheap tests at the bottom give you fast feedback. Expensive tests at the top give you system confidence at a manageable cost. The pyramid shape works when cost tracks with scope.

For agents, cost and scope are decoupled. A single-step evaluation is neither fast nor cheap because it calls a model. But a trajectory evaluation over 10 steps isn't orders of magnitude more expensive than one over 2 steps — the cost curve is sublinear once you've paid the model-call overhead.

The right mental model is not a pyramid but a cost-fidelity grid. Place tests along two axes: how much do they cost to run, and how faithfully do they represent real agent behavior? You want as much of the high-fidelity surface covered as possible, and you want to minimize cost for the lowest-fidelity checks that run most frequently.

Prompt contract tests are cheap and low-fidelity. Tool interaction tests with fixtures are moderate-cost and moderate-fidelity. Goal completion tests are expensive and high-fidelity. You run the cheap ones constantly, the moderate ones often, and the expensive ones deliberately.

The goal is not to invert the pyramid. It is to stop forcing a probabilistic, multi-step, externally-coupled system into a framework built for deterministic, isolated functions — and to build something that actually fits the problem.
