
The Agent Test Pyramid: Why the 70/20/10 Split Breaks Down for Agentic AI

· 12 min read
Tian Pan
Software Engineer

Every engineering organization that graduates from "we have a chatbot" to "we have an agent" hits the same wall: their test suite stops making sense.

The classical test pyramid — 70% unit tests, 20% integration tests, 10% end-to-end — is built on three foundational assumptions: units are cheap to run, isolated from external systems, and deterministic. Agentic AI systems violate all three at once. A "unit" is a model call that costs tokens and returns different answers each time. An end-to-end run can take several minutes and burn through more API budget than an entire sprint of conventional tests. And isolation is nearly impossible when the agent's intelligence emerges precisely from interacting with external tools and state.

The result is predictable: teams either write no tests and rely on vibes, or they write e2e tests that are too slow to run in CI, too expensive to run on every PR, and too flaky to trust. Neither is a real testing strategy.

What the Classical Pyramid Got Right (and Why It Works for Normal Code)

The test pyramid, originally articulated by Mike Cohn and popularized by Martin Fowler, reflects a real economic insight. Unit tests are fast because they have no I/O — no network, no database, no external services. When a unit test fails, you know exactly which function broke. At scale, a 10,000-test suite should complete in under 90 seconds.

Integration tests are slower because they involve real subsystems — a database write, a service call — but you only need enough to confirm that your modules fit together correctly. E2E tests run the entire system from the user's perspective and are kept deliberately sparse because they're expensive to maintain and slow to execute.

The pyramid's shape encodes cost: cheap tests at the bottom in abundance, expensive tests at the top in moderation. The whole model works because units are deterministic pure functions. The same input produces the same output every time.

An LLM call is not a pure function. It is a probabilistic sampler over a distribution shaped by model weights, temperature, system prompt, prior conversation history, and sometimes the time of day (when model providers roll out silent updates). The "unit" of an agentic system is inherently non-deterministic. You cannot write assertEqual(agent_output, expected) and have it mean anything.
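To make the point concrete, here is a toy sketch — fake_llm is a stand-in for a real model call, not an actual API — of why exact-match assertions are meaningless against a sampler:

```python
# Toy illustration: a sampler is not a pure function. fake_llm is a stand-in
# for a real model call; two calls with the same prompt need not agree.
import random

COMPLETIONS = [
    "Paris is the capital of France.",
    "The capital of France is Paris.",
]

def fake_llm(prompt: str, temperature: float = 0.7) -> str:
    # A real model samples from a distribution; random.choice mimics that.
    return random.choice(COMPLETIONS)

a = fake_llm("What is the capital of France?")
b = fake_llm("What is the capital of France?")
# a == b holds only by chance, so an exact-match assertion is a coin flip,
# not a test.
```

Both completions are correct answers; any test that privileges one exact string over the other is testing the sampler's mood, not the agent's behavior.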

How Agents Break the Pyramid at Every Layer

The unit layer has no units

In a conventional service, you can unit-test a parse_json(response) function or a format_prompt(context) helper. Those still exist in an agent codebase and should absolutely be tested. But the agent's actual intelligence — its decisions about which tool to call, how to decompose a task, when to ask for clarification — lives inside the model. You cannot inject a mock LLM and expect meaningful behavior. The model is the logic.

Some teams try to work around this by testing the prompt template in isolation. That catches obvious formatting bugs, but it tells you nothing about whether the rendered prompt actually produces the intended behavior. The gap between "the prompt compiled without errors" and "the agent does what we want" is where most bugs live.

Integration tests become tool-call tests, and tool calls are expensive

In a normal service, integration tests are moderately expensive because they hit a database. For an agent, every integration test calls the model (sometimes multiple times per turn, across multiple turns) and invokes real or semi-real tools. A single integration test run can cost $0.10–$1.00 in API credits and take 30–120 seconds. A suite of 50 integration tests costs $5–$50 per run and blocks CI for 25–100 minutes. Those economics make it impossible to run on every pull request.

End-to-end tests take minutes and cost dollars, but they're the only tests that catch real failures

An agent completing a realistic task — researching a topic, executing a multi-step coding workflow, or handling a support ticket — might make 15–30 model calls, call 5–10 tools, and produce intermediate artifacts that need to be inspected at each stage. This is not a 30-second test. And when it fails, the failure message is typically "the agent did not complete the task" — which could be caused by a flawed prompt, a changed tool interface, a model regression, or a subtle shift in reasoning that emerged from a prior prompt change three steps earlier.

Critically, agentic failures compound. A mistake in step 3 of a 10-step task corrupts every subsequent step. By the time a final assertion fires, the root cause is buried in a transcript that might be thousands of tokens long.
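Under a simple independence assumption, the compounding is easy to quantify. The per-step number below is illustrative, not a measurement from the article:

```python
# Sketch: how per-step reliability compounds across a multi-step task.
# The 0.95 per-step success rate is an assumed, illustrative number.

def task_success_rate(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming steps fail independently."""
    return per_step ** steps

# A step that succeeds 95% of the time looks fine in isolation,
# but chained across 10 steps the whole task succeeds only ~60% of the time.
print(f"{task_success_rate(0.95, 10):.2f}")  # 0.60
```

This is why a final "did the task complete?" assertion is such a blunt instrument: the per-run failure budget is spent long before the last step runs.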

Non-determinism poisons every layer

Anthropic's own eval research found that single-run success rates for agents cluster around 68–74%, while the rate of succeeding in 8 consecutive runs drops to 52–73%. That spread represents natural variance in a system where nothing is actually broken — it's just how probabilistic systems behave. A test that passes 70% of the time is not a reliable quality gate; it is noise with a CI badge.

An Alternative: Three Tiers Built for Agents

What works instead is a three-tier approach that replaces the speed/isolation hierarchy with a cost/fidelity hierarchy appropriate for probabilistic systems.

Tier 1: Prompt contract tests (fast, cheap, deterministic)

These test the parts of your agent that are deterministic: prompt composition, tool schema definitions, response parsers, input sanitizers. The "prompt snapshot" pattern, borrowed from visual regression testing, captures the rendered prompt as a JSON artifact and compares it against a committed baseline. If a code change alters the system prompt text, the snapshot diff catches it — before any model is ever called.
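A minimal sketch of the snapshot pattern — render_system_prompt and the baseline path are hypothetical names standing in for whatever the codebase actually defines:

```python
# Prompt-snapshot sketch: render the prompt, serialize it, diff against a
# committed baseline. render_system_prompt is a hypothetical prompt builder.
import json
from pathlib import Path

def render_system_prompt(tools: list[str]) -> dict:
    # Stand-in for the real prompt-composition code under test.
    return {"role": "system",
            "content": "You are a research agent. Tools: " + ", ".join(tools)}

def check_snapshot(rendered: dict, baseline_path: Path) -> None:
    if not baseline_path.exists():
        # First run: write the baseline; commit this file alongside the code.
        baseline_path.write_text(json.dumps(rendered, indent=2, sort_keys=True))
        return
    baseline = json.loads(baseline_path.read_text())
    assert rendered == baseline, "Rendered prompt drifted from committed baseline"
```

An intentional prompt change then becomes an explicit, reviewable baseline update in the same commit, rather than a silent behavior shift.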

A prompt contract test for a tool definition might assert: the JSON schema for the search_web tool has the required query parameter, its type is string, and no new parameters were added without documentation. These tests complete in milliseconds and run on every commit, zero model cost.
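A sketch of such a contract test, with a hypothetical search_web schema standing in for whatever the agent codebase actually declares:

```python
# Contract test over a tool's JSON schema. SEARCH_WEB_SCHEMA is a
# hypothetical example of a tool definition an agent codebase might hold.
SEARCH_WEB_SCHEMA = {
    "name": "search_web",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

def test_search_web_contract(schema=SEARCH_WEB_SCHEMA):
    params = schema["parameters"]
    # The required parameter exists and has the right type.
    assert "query" in params["required"]
    assert params["properties"]["query"]["type"] == "string"
    # Guard against undocumented parameters sneaking into the schema.
    assert set(params["properties"]) == {"query"}
```

The last assertion is the important one: a new parameter added without updating the test (and its documentation) fails loudly instead of silently widening the tool's surface.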

Prompt snapshot tests specifically catch the most common silent breakage pattern: someone refactors a prompt template helper and accidentally changes the instruction wording, which changes agent behavior, which is only discovered when a user complains two weeks later.

Tier 2: Tool interaction tests with recorded fixtures (moderate cost, controlled fidelity)

These correspond roughly to integration tests, but the key adaptation is tool tapes — recorded fixtures of tool responses, analogous to HTTP cassettes in VCR-style testing. Instead of calling the live web_search API or execute_code sandbox in CI, the agent runs against a replay of a previous session's tool responses.
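A minimal sketch of the tape mechanism — ToolTape and its record/replay modes are illustrative names, not a specific library's API:

```python
# "Tool tape" sketch: record live tool responses once, replay them in CI.
# ToolTape and its interface are illustrative, not from an existing library.
import json
from pathlib import Path

class ToolTape:
    def __init__(self, path: Path, record: bool = False):
        self.path, self.record = path, record
        self.calls = [] if record else json.loads(path.read_text())
        self.cursor = 0

    def call(self, tool: str, args: dict, live_fn=None):
        if self.record:
            # Record mode: hit the real tool and save the response.
            result = live_fn(tool, args)
            self.calls.append({"tool": tool, "args": args, "result": result})
            return result
        # Replay mode: serve recorded responses in order, and fail loudly
        # if the agent calls a different tool than the tape expects.
        entry = self.calls[self.cursor]
        self.cursor += 1
        assert entry["tool"] == tool, f"expected {entry['tool']!r}, got {tool!r}"
        return entry["result"]

    def save(self):
        self.path.write_text(json.dumps(self.calls, indent=2))
```

As with HTTP cassettes, tapes trade fidelity for determinism: they catch regressions in the agent's tool-use logic, but must be re-recorded when the real tool's behavior changes.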

A tool interaction test validates: given this starting state and these recorded tool responses, does the agent call the right tools in the correct sequence, with correct parameters? It also validates error-handling branches: when database_query returns a timeout error, does the agent retry with exponential backoff or gracefully tell the user it cannot complete the task?
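The sequence check itself can be sketched as follows; the transcript format here is a simplified assumption, not a standard:

```python
# Sketch: assert the agent's tool-call sequence against expectations,
# including the error-handling branch. Transcript format is assumed.
def assert_call_sequence(transcript: list[dict], expected: list[str]) -> None:
    actual = [entry["tool"] for entry in transcript]
    assert actual == expected, f"tool sequence {actual} != expected {expected}"

# A healthy retry branch: the first database_query times out,
# the agent retries, and only then does it proceed.
transcript = [
    {"tool": "database_query", "result": {"error": "timeout"}},
    {"tool": "database_query", "result": {"rows": 3}},
]
assert_call_sequence(transcript, ["database_query", "database_query"])
```

Because the tool responses are recorded fixtures, the timeout branch is exercised deterministically on every run instead of waiting for a real outage.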
