The Tool Schema Migration That Broke Your Agent's Retries for Two Weeks
The deprecation notice went out on a Tuesday. The downstream team rotated the response shape on their search tool — results[].snippet became results[].excerpt, a clean rename, six-week window, banner in the docs, three reminder emails to the engineering list. Every human consumer migrated. The agent did not, because the agent does not read email. For fourteen days the retry loop quietly parsed the new payload, found the field it was looking for missing, raised a KeyError, and counted that as a retryable failure. The retry hit the same endpoint, got the same new shape, raised the same error, gave up after three attempts, and returned an apologetic message to the user. The retry budget dashboard stayed green the entire time — retries were never exhausted, they were just permanently failing within budget. Success rate, measured at the tool layer, sat at zero on that path. Nobody looked because there was no page.
This is the shape of the failure that gets the most engineers in 2026: not the dramatic outage, but the silent contract drift where a human-facing migration runs to completion and the agent-facing one never starts because nobody knew there was one to run. The deprecation worked exactly as designed for the consumers it was designed for. The agent was a consumer nobody listed.
The seductive part of the story is that every component behaved correctly in isolation. The downstream team gave notice. The deprecation window was generous. The error handling caught the parse failure cleanly. The retry policy was bounded. The observability stack captured every tool call with structured logs. Each piece, on its own, would pass a design review. The failure lives in the gap between them — in the assumption that a deprecation announcement reaches the entity that has to migrate, and the assumption that a green retry budget means retries are working.
The deprecation channel the agent isn't subscribed to
API deprecation runs on a human substrate. The changelog gets posted, the docs banner appears, the email lands in someone's inbox, the breaking-changes calendar invite gets accepted. Industry best practice now sits at six to eighteen months between deprecation warning and sunset, and that window exists entirely so humans have time to read, prioritize, schedule, and ship the migration work.
An agent calling the same endpoint is subscribed to exactly one of those channels: the response body. It does not receive the email. It does not see the banner. The Deprecation and Sunset HTTP headers exist as a standard, but the agent's tool wrapper almost certainly strips them before the response reaches the model, because the wrapper was written to surface the part the LLM cares about — the answer. The deprecation metadata sits in a side channel the agent layer never wired up.
This is structurally identical to the way agents miss every other operational signal a human operator reads: rate-limit headers, retry-after hints, status-page banners, deprecation notices. The model is fluent in JSON and blind to everything next to it. The migration window that protects every human consumer protects exactly zero of your agent consumers, because the protection mechanism is "read your email."
The first lesson is administrative, not technical: every tool the agent depends on needs an owner who is on the deprecation list for that tool's upstream. Not "the team" — a named person whose job description includes "watch this tool's contract." If you cannot name them, your agent has no migration path that does not begin with an incident.
Why the retry loop made it worse
A well-meaning retry policy is the thing that converts "loud break" into "two-week silent break."
When the response shape changed, the first call's parser raised an exception. The agent harness saw an exception in tool execution, classified it as a transient failure — the same bucket as a network timeout — and retried. The second call returned the same shape. The third call returned the same shape. After three retries the agent gave up and reported the tool as unavailable, which the orchestrator handled by routing around it. The user got a worse answer, not an error.
Three things compounded the silence.
The retry policy treated parsing errors as transient. Most retry libraries default to retrying on any exception, because the alternative is enumerating every recoverable error class and updating that list forever. A schema mismatch is not transient — it will fail identically on every retry until the code changes — but the retry layer cannot tell the difference from a network blip. So it spends three attempts and three sets of tokens proving the same thing.
The retry budget metric measured ceiling, not floor. "Retries used this hour" stays well under the budget when every call exhausts its three retries — three is still three. The dashboard panel that would have caught this is not "retries used" but "retries that recovered." A retry that never recovers is a parse error, not a transient failure, and you want them split.
The fallback path was too graceful. The orchestrator was designed to degrade smoothly when a tool was unavailable, which is the right design when a tool is actually unavailable. It is the wrong design when a tool is always failing, because graceful degradation is indistinguishable from the tool not being in the plan. The agent quietly stopped using the search tool and the user got more confidently-wrong answers without anything labeling them as such.
The pattern that catches this is to split tool errors by recoverability tier. A network timeout is recoverable — retry it. A 5xx with a Retry-After header is recoverable — back off and try again. A parse error against a known schema is not recoverable from the agent's side; the schema changed, no number of retries will fix it, and the right behavior is to alert immediately, fail loudly, and surface the diff between the expected shape and the received shape so a human can see exactly which field moved. Conflating these tiers is what produces fourteen-day silent outages.
The metric you thought you had
"Retry budget" is one of those metrics that sounds like a safety belt and turns out to be a vanity number. A budget bounds how much you spend; it does not tell you whether the spending bought anything. The dashboard panel reads "retry rate: 1.2%, well under the 10% threshold," and the on-call engineer scrolls past it because the panel is doing its job — there is no breach.
What was actually happening underneath the green light:
- A small slice of calls were retrying.
- All of those retries were failing.
- The success rate of the underlying tool on that path was zero.
- The success rate of the overall agent task fell, but it was averaged across many tools and many task types, so the drop showed up as a one-point dip that looked like noise.
Every metric in that paragraph was being measured. None of them were composed into a signal that said "the retry layer is no longer producing recoveries on this path." The closest thing to a useful alert would have been: for any (tool, error_class) pair, alert when retry-recovery-rate drops to zero for more than one hour. That is a different shape of metric than "retry budget" — it is segmented by tool and by error class, and it watches a derivative (the recovery rate, not the absolute count). It is also the kind of metric that is annoying to build and easy to skip, which is exactly why nobody had it.
The deeper lesson is that any metric whose value can stay healthy while the thing it represents is broken is not a metric, it is decoration. Retry rate, error rate, and budget utilization are all in this category for agent systems. The metrics that actually work are recovery rate, end-to-end task completion, and quality-graded output checks — the ones that go red when the user experience degrades, not when a counter crosses a number.
Schema as the contract the agent cannot read between the lines of
The downstream team did not behave badly. They picked a field name that was more accurate. They followed every step of a textbook deprecation: announce, document, dual-write, sunset. The migration window was longer than most teams give. From their side, this was a clean, well-run change.
The failure is that the contract between the tool and the agent is not the OpenAPI spec or the README — it is the exact, current shape of the response payload, locked in at the moment the agent's tool wrapper was written and frozen there until somebody redeploys. A rename from snippet to excerpt is a backwards-incompatible change at that boundary even if it is forward-compatible everywhere else, because the consumer has no semantic understanding of the field — it has a data["snippet"] lookup compiled into Python and it will fail on data["excerpt"] until somebody changes the lookup.
A few practices reduce the blast radius:
Treat tool wrappers as compiled clients, not configuration. They have a version. That version pins the schema shape. When the schema moves, the wrapper version bumps, and there is a deploy that ships the new wrapper. The agent layer should refuse to call a tool whose wrapper is older than the tool's current minor version, the same way you would refuse to deploy a service whose dependencies are unresolved.
Add a contract test that runs against the live tool. A daily job that calls every registered tool with a known input and validates the response against the wrapper's expected schema. When it goes red, you know within hours, not weeks, that the contract has drifted. This is unglamorous and effective. It pays for itself the first time it fires.
Surface the deprecation metadata. If the upstream sends Deprecation, Sunset, or Warning headers, the tool wrapper has to forward them — as structured metadata, not into the model's context window. The agent orchestrator can then raise its own internal alert: "tool X reports sunset in 30 days, no wrapper update scheduled." This is the side-channel the agent layer has to wire up itself, because nobody else will.
Version the tool contract explicitly. If the wrapper says it expects schema v3 and the upstream rolls v4, the wrapper either upgrades or pins. Quietly drifting between unversioned shapes is how a six-week deprecation window becomes a two-week incident.
What "healthy" actually looks like
The end state is not "we never have schema drift." Upstreams will keep deprecating fields, and tool shapes will keep moving, and that is fine — the alternative is freezing the world. The end state is that schema drift is detected in hours, not weeks, and that the detection mechanism does not depend on a human reading email on behalf of the agent.
That requires three things composed together: a recovery-rate metric segmented by tool and error class so a permanent failure does not hide under a healthy budget; a parser that distinguishes schema mismatch from transient failure and refuses to retry the former; and a contract test that probes every tool's live response shape on a schedule independent of user traffic. Each of these is small. Together they collapse the silent-outage window from weeks to a single business day.
Most teams will not build all three until they have lived through a version of this incident. After they do, they discover that the work is much smaller than the cost of the incident, and that the missing piece was not engineering capacity but the recognition that the agent is a consumer of every API it touches and that no consumer should be invisible to its providers — or to itself.
The retry budget dashboard will stay green for the things it is designed to catch. It was never going to catch this one. Build the metric that does, and write down the name of the person who reads the deprecation emails on behalf of every tool the agent depends on. That is the migration plan.
- https://medium.com/@ThinkingLoop/when-one-tool-field-breaks-the-agent-447fdf627fe7
- https://dev.to/adamo_software/tool-use-api-design-for-llms-5-patterns-that-prevent-agent-loops-and-silent-failures-f29
- https://www.compiler.today/api-development/200-ok-the-success-response-that-was-actually-a-critical-error
- https://dev.to/milo_antaeus_784320e2f2f9/your-ai-agent-returns-200-and-is-wrong-the-silent-success-drift-pattern-8m3
- https://medium.com/@ThinkingLoop/13-agent-eval-tests-that-catch-silent-tool-failures-79ac312d70a4
- https://www.agentixlabs.com/blog/general/agent-observability-for-production-trace-tools-cost-and-safety-signals/
- https://oneuptime.com/blog/post/2026-02-02-api-deprecation/view
- https://www.aikido.dev/code-quality/rules/how-to-avoid-breaking-public-api-contracts-maintaining-backward-compatibility
- https://apxml.com/courses/langchain-production-llm/chapter-2-sophisticated-agents-tools/agent-error-handling
- https://futureagi.com/blog/what-is-llm-input-output-validation-2026/
