Eval Passed, With All Tools Mocked: Why Your Agent's Hardest Failures Never Reach the Harness
Your agent hits 94% on the eval suite. Your on-call rotation is on fire. Nobody in the room is lying; both numbers are honest. What's happening is that the harness is testing a prompt, and production is testing an agent, and those are two different artifacts that happen to share weights.
Mocked-tool evals are almost always how this gap opens. You stub search_orders, charge_card, and send_email with canned JSON, feed the model a user turn, and assert on the final response. The run is cheap, deterministic, and reproducible — every property a CI system loves. It is also silent on tool selection, latency, rate limits, partial failures, and retry behavior, which is to say silent on the set of failures that dominate post-incident reviews.
This isn't an argument against mocked evals. They are the right tool for prompt-level assertions and they always will be. The argument is that a test suite made entirely of mocked tools is a suite that has never seen your agent. Below is the shape of the gap, the economic gravity that keeps pulling teams into it, and a layered approach that catches the incidents a mocked-only suite structurally misses.
The Three Assumptions a Mocked-Tool Eval Silently Makes
Every mocked tool response encodes three assumptions the harness will never check:
- The agent picked the right tool to call.
- It passed the right arguments in the right shape.
- It received a well-formed response on the first try.
In production, all three fail constantly. Agents call search_products when the user asked for an order status because the two tool descriptions overlap by one noun. They pass a timestamp as a string when the schema expects epoch seconds. They hit a 429 from a downstream API and the model's next turn is "I apologize for the confusion, let me try that again" — into an identical call that fails the same way. A mocked eval is blind to all three modes because the mock accepts any argument shape, always returns on the first attempt, and implicitly vouches for a tool choice the harness hand-wired.
One practitioner survey of production agent incidents found roughly four recurring archetypes: premature action without grounding, over-helpfulness that fabricates missing entities, distractor-induced context pollution, and fragile execution under load. All four are tool-interaction failures. Zero of them reliably appear in a mocked-tool suite, because the mock dodges the loop where they live.
Why Teams Mock in the First Place
The economic frame matters, because "just use real tools" isn't a real option for most teams.
Real-tool evals are 10 to 50 times more expensive than mocked ones. You pay for LLM tokens on every turn the agent takes, plus the downstream APIs the tools call, plus the shared fixtures (test accounts, sandbox databases, seed data) that have to exist and stay clean. A 200-case eval suite that runs in 90 seconds on mocks can run in 40 minutes on real tools, and that's the happy path — before third-party flakiness, sandbox rate limits, and the "one test leaves residue that breaks the next test" problem.
Flakiness is the other half of the tax. A real-tool suite that is green 96% of the time is not green — it is noise. Developers stop trusting it, stop running it locally, and stop blocking on it in CI. Within a quarter the suite is advisory, then ignored, then deleted, and the team ends up where they started with mocks plus bitterness.
The mistake isn't mocking. It's treating the mocked suite as the test of record for agent behavior when it is only the test of record for prompt behavior.
The Specific Failures a Mocked Suite Hides
It's worth listing these concretely, because the abstraction "tool-interaction failures" lets teams nod along without changing what they test.
Tool selection errors are the top of the list. Your agent has twelve tools and the model picks the wrong one. Mocks don't catch this because the harness usually hardcodes which mock responds. Even if you set up routing, the mocks don't penalize selecting a tool whose real response would have been useless — the fake response is always plausible.
Argument-shape errors come second. The model outputs {"user_id": "12345"} when the tool expects an integer. In production the API returns a 400; in the mock you wrote the schema loosely because the test doesn't care, and the agent barrels on to a "successful" next turn.
Latency-induced timeouts are the ones that wake people up. A tool's p99 latency crosses the agent's per-turn budget and the orchestrator starts killing turns mid-flight. What the model does when its tool call dies in flight — retry? give up? hallucinate the result? — is entirely determined by your harness wiring and is untested under mocks, which return instantly.
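What the agent does on a timeout is testable, but only if the harness makes the timeout explicit instead of letting the orchestrator kill the turn silently. Here is a minimal sketch of that wiring, assuming an async `call_tool` coroutine and a per-tool budget; both names are hypothetical, not any framework's API:

```python
import asyncio

PER_TOOL_TIMEOUT_S = 10.0  # assumed per-tool budget; tune to your orchestrator

async def call_tool_with_budget(call_tool, name, arguments):
    """Wrap a tool call so a timeout becomes an explicit result the model sees,
    rather than a turn the orchestrator kills mid-flight."""
    try:
        result = await asyncio.wait_for(call_tool(name, arguments), PER_TOOL_TIMEOUT_S)
        return {"tool": name, "status": "ok", "result": result}
    except asyncio.TimeoutError:
        # Surface the timeout to the model instead of dropping the turn.
        # What the model does with this message is exactly what instant mocks never test.
        return {"tool": name, "status": "timeout",
                "result": f"{name} did not respond within {PER_TOOL_TIMEOUT_S}s"}
```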
Rate-limit cascades are next. The agent hits a 429, decides to retry, hits another, and because nothing in the loop is counting attempts, the agent burns its entire token budget on retries of the same failing call. This pattern only exists when real tools are reachable and real backpressure is applied. Mocks have no 429 state.
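Counting attempts is the part the loop is missing. A hedged sketch of a retry budget around a synchronous tool call; `RateLimitError` is a hypothetical exception standing in for whatever your tool wrapper raises on a 429:

```python
import time

class RateLimitError(Exception):
    """Hypothetical exception a tool wrapper raises on a 429 response."""

MAX_ATTEMPTS = 3  # hard cap so retries cannot eat the agent's whole token budget

def call_with_retry_budget(call_tool, name, arguments):
    """Retry a rate-limited tool call a bounded number of times, then fail loudly."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return call_tool(name, arguments)
        except RateLimitError:
            if attempt == MAX_ATTEMPTS:
                # Terminal failure the model can report, instead of another
                # "let me try that again" turn into the same 429.
                return {"tool": name, "status": "rate_limited",
                        "result": f"{name} is rate limited; gave up after {attempt} attempts"}
            time.sleep(2 ** attempt)  # back off before the next attempt
```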
Partial failures are the last one that matters here. A parallel tool call returns two results and one timeout. What does the agent do with the timeout? Report it honestly? Silently omit it? Assume success? Every team finds out in prod because the mocked test dispatched a single synthetic dict and moved on.
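The answer should be a harness decision, not an accident. One possible policy, sketched with hypothetical async tool coroutines: label every failed branch explicitly so the model cannot quietly assume success.

```python
import asyncio

async def run_parallel_calls(calls):
    """Run tool calls in parallel and label failures explicitly in the result set.

    `calls` is a list of (tool_name, coroutine) pairs chosen by the agent.
    """
    results = await asyncio.gather(*(coro for _, coro in calls), return_exceptions=True)
    labeled = []
    for (name, _), result in zip(calls, results):
        if isinstance(result, Exception):
            # One timeout out of N results: say so, rather than omitting it or faking success.
            labeled.append({"tool": name, "status": "failed", "error": repr(result)})
        else:
            labeled.append({"tool": name, "status": "ok", "result": result})
    return labeled
```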
The Eval Pass-Rate / Incident-Rate Divergence Signal
There is a specific organizational tell for a team living inside a mocked-only eval world: the eval pass rate keeps climbing and the production incident rate refuses to follow it. Quarter over quarter, the harness gets greener and the on-call pages don't thin out. Eng leadership asks why and nobody has a clean answer, because the two numbers measure different artifacts.
Measured gaps in the wild run wide. A single published example saw a 23-point delta between a 91% eval score and a 68% production success rate, and a separate 8-point gap between per-component theoretical success and system-level real success driven entirely by coordination failures across tools. Those are the numbers you see when the eval is testing a prompt and the incident tracker is tracking an agent.
The actionable move when this divergence opens up isn't to tighten the mocked suite further. It's to build a second surface of evaluation that sees the failures the mocked suite structurally cannot.
The Hybrid Ladder That Actually Catches Incidents
What works in practice is treating tool interactions as a laddered evaluation problem, where each rung tests a different surface and costs a different amount. Run all three; don't pretend one of them is the other.
Rung one: mocked unit evals. Fast, cheap, deterministic. These assert on prompt-level behavior — tone, refusal, structured-output shape, PII redaction, basic reasoning. They run on every PR. The bet is that prompt behavior is a stable surface the model can't easily regress without someone noticing. Keep the mocks strict on argument shape so argument-shape regressions surface here instead of in prod.
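"Strict" here means the mock validates arguments against the real tool's schema before returning its canned payload. A minimal sketch, assuming a hypothetical search_orders schema and the third-party jsonschema library:

```python
from jsonschema import validate  # pip install jsonschema

# The same schema the real API enforces; keeping it in the mock is what makes the mock strict.
SEARCH_ORDERS_SCHEMA = {
    "type": "object",
    "properties": {
        "user_id": {"type": "integer"},  # integer, so "12345" as a string fails here, not in prod
        "limit": {"type": "integer"},
    },
    "required": ["user_id"],
    "additionalProperties": False,
}

CANNED_RESPONSE = {"orders": [{"id": "ord_1", "status": "shipped"}]}

def mock_search_orders(arguments: dict) -> dict:
    """Mocked tool that rejects malformed arguments instead of accepting anything."""
    validate(instance=arguments, schema=SEARCH_ORDERS_SCHEMA)  # raises ValidationError on bad shape
    return CANNED_RESPONSE
```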
Rung two: recorded-cassette evals. Borrowed from VCR-style HTTP record-replay testing, these capture a real interaction once — real model, real tool calls, real responses — and replay them deterministically afterward. If the agent's next run produces the same sequence of tool calls with the same arguments, the cassette replays and the test is fast and free. If the sequence diverges, the test fails loudly and forces a human to re-record or reject. This is the rung where tool-shape regressions surface: schema changes in downstream APIs, tool-description churn that changes the model's selection, argument-format drift. Run these on every merge to main, not every PR.
Rung three: live-tool smoke suite. A small, hand-curated set of scenarios — five to twenty, not two hundred — that run against real tools in a real sandbox on a scheduled cadence (hourly, nightly, or on canary deploys). This is where latency, rate-limit, and partial-failure behavior surface. Keep it small deliberately; the job of this suite is not coverage, it's to tell you "the agent still works when the network is real," and a small suite you trust beats a large one you learn to ignore.
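One way to keep this rung small and honest is to make each scenario assert only the two things it exists to check: the agent finished, and it finished within budget. A sketch assuming a hypothetical `run_agent` entry point and result shape, plus a custom pytest marker you register yourself:

```python
import time

import pytest

from myagent import run_agent  # hypothetical entry point that hits real sandbox tools

LIVE_SCENARIOS = [
    ("order status lookup", "Where is order #1042?"),
    ("refund request", "I want a refund for my last order."),
    # five to twenty scenarios, hand-curated, not two hundred
]

@pytest.mark.live_smoke  # custom marker; run this suite on a schedule, not on every PR
@pytest.mark.parametrize("name,user_turn", LIVE_SCENARIOS)
def test_agent_against_real_tools(name, user_turn):
    start = time.monotonic()
    result = run_agent(user_turn, sandbox=True)  # hypothetical signature
    elapsed = time.monotonic() - start

    assert result.completed, f"{name}: agent did not reach a final answer"
    assert elapsed < 120, f"{name}: took {elapsed:.0f}s, over the latency budget"
```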
The three rungs cover different failure modes and cost wildly different amounts. A healthy team runs all three, monitors each separately, and uses the gap between rungs as a diagnostic. When the mocked suite is green and the cassette suite is red, a downstream API changed shape. When the cassette suite is green and the live smoke is red, latency or rate limits moved. When all three are green and production is still on fire, your evals have nothing to say about the failure and you have a new incident type to add to rung three.
What to Do on Monday
The practical starting move is modest: add a cassette-replay rung between the mocks and prod. You do not need a new framework. pytest-recording, vcr.py, and language-specific equivalents record HTTP requests and responses into YAML files; a record mode hits the real API once and a replay mode blocks all outbound calls and matches against the cassette. LangChain and similar agent frameworks now document this pattern explicitly because it solves the "my eval is a lie" problem at the right price point.
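A minimal sketch of the cassette rung using pytest-recording (which wraps vcr.py): the `vcr` marker and the record-mode flags are the library's, while the agent entry point and the result fields are assumptions about your own harness.

```python
import pytest

from myagent import run_agent  # hypothetical entry point; its tools make real HTTP calls

@pytest.mark.vcr  # pytest-recording: record HTTP traffic to a YAML cassette, replay it afterward
def test_order_status_flow():
    result = run_agent("Where is order #1042?")
    # Assert on the tool-call sequence, not just the final text, so that
    # tool-selection and argument-shape drift fail this test instead of prod.
    assert [call.tool for call in result.tool_calls] == ["search_orders"]
    assert "shipped" in result.final_answer.lower()
```

Record once with `pytest --record-mode=once` against the real sandbox, then run in CI with `pytest --block-network` so any request that isn't in a cassette fails the test instead of silently reaching a real API.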
Once you have cassettes, the failure mode changes. Instead of "we didn't know the schema changed," you get a PR that fails with a clean diff: the agent called search_orders with limit: 50 on Monday, and today it's calling search_orders with limit: "50" as a string. You still don't know why the model changed, but you know it changed, and you caught it before a user did.
The bigger shift is mental. An eval suite that never touches a real tool is a test of a prompt, not of an agent. Name it accordingly in your own head and in your team's dashboards. Call the mocked suite "prompt evals," call the cassette suite "tool-interaction evals," and call the live smoke "agent evals." Three names for three surfaces, each tested where it lives. The incident rate will follow the metric that actually measures it, and the gap will finally close.
Sources
- https://blog.langchain.com/you-dont-know-what-your-agent-will-do-until-its-in-production/
- https://blog.langchain.com/agent-evaluation-readiness-checklist/
- https://anaynayak.medium.com/eliminating-flaky-tests-using-vcr-tests-for-llms-a3feabf90bc5
- https://arize.com/blog/common-ai-agent-failures/
- https://cresta.com/blog/why-ai-agent-evaluations-fail----and-how-the-swiss-cheese-model-prevails
- https://futureagi.substack.com/p/how-tool-chaining-fails-in-production
- https://wavespeed.ai/blog/posts/agentic-workflow-tool-wiring-patterns-pitfalls/
- https://galileo.ai/blog/multi-agent-ai-failures-prevention
- https://www.willvelida.com/posts/preventing-cascading-failures-ai-agents
- https://www.databricks.com/blog/what-is-agent-evaluation
