
Tool Manifest Lies: When Your Agent Trusts a Schema Your Backend No Longer Honors

· 10 min read
Tian Pan
Software Engineer

The most dangerous bug in a production agent isn't the one that throws. It's the one where a tool description says it returns user_id and the backend quietly started returning account_id two sprints ago, and the model is still happily inventing user_id in downstream reasoning — because the manifest said so, and the few-shot history reinforced it, and nothing in the loop ever fetched ground truth.

This is manifest drift: the slow, silent divergence between what your tool descriptions claim and what your endpoints actually do. It rarely produces stack traces. It produces bad decisions with clean audit trails — the worst class of bug in agent systems.

Every team with agents in production has this problem, whether they know it or not. The manifest is a contract, the backend is the reality, and under normal software evolution the two are expected to disagree sometimes. What makes agents different is that the model doesn't know they disagree. It builds an implicit world model out of tool names, descriptions, parameter docs, and prior action-observation pairs — then reasons from that model with full confidence, long after any of it was true.

Why Manifest Drift Is Uniquely Bad for Agents

Traditional API clients break loudly. A renamed field produces a KeyError, a type mismatch throws a deserialization exception, a deprecated endpoint returns 410 and CI catches it. The signal is fast, precise, and attributable.

Agents fail differently. When the manifest says transaction_amount is in cents and the backend silently switches to dollars, the model doesn't crash — it reasons about thresholds that are off by 100x, emits syntactically valid queries, writes perfectly fluent summaries, and ships wrong numbers into a dashboard a human will read tomorrow. When a status enum gains a new value like pending_review, the agent's decision logic (trained on three values from the docs and five from the few-shot examples) quietly buckets the new cases into the wrong branch, or ignores them.

The asymmetry is structural. A language model is an excellent mimic. When its context is full of past action-observation pairs that used customer_id, it will keep emitting customer_id even after the backend renames the field to account_id — because the pattern in the context is louder than any one description change. The model's mental model of your API outlives any single schema update, and it gets more confident over time as few-shot history accretes.

There are three drift classes worth naming:

  • Surface drift. A field is renamed or moved. The schema still validates loosely, but downstream reasoning references a field that no longer exists under that name. The backend returns null or fills from a default; the agent treats the null as meaningful.
  • Semantic drift. The shape is identical but the meaning changed. Units, timezones, currencies, enum semantics, default values, what "active" counts as. These are the hardest to detect because no schema checker will ever flag them.
  • Behavioral drift. The endpoint still accepts the same input and returns the same shape, but its side effects changed. It now hits a different downstream system, enforces a new rate limit class, has different idempotency guarantees, or rolls up results differently.

Only surface drift shows up in a JSON-schema diff. Semantic and behavioral drift are invisible to every static check your CI runs.
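To make the distinction concrete, here is a toy sketch — all field names and values are hypothetical — of what a schema-level diff can and cannot see: it flags the rename, and reports nothing at all for the unit change or the new enum value.

# Toy illustration of the drift classes; every name and value here is hypothetical.
manifest_claim = {"amount_cents": int, "status": str}  # status: "active" | "closed"

# Surface drift: the backend renamed amount_cents -> amount.
surface = {"amount": 1299, "status": "active"}

# Semantic drift: same names, same types, but the value is now in dollars
# and status gained "pending_review", which the manifest never mentions.
semantic = {"amount_cents": 12, "status": "pending_review"}

def schema_diff(claim: dict, response: dict) -> dict:
    """Everything a static schema diff can check: field names and types."""
    shared = set(claim) & set(response)
    return {
        "missing": sorted(set(claim) - set(response)),
        "unexpected": sorted(set(response) - set(claim)),
        "type_mismatch": sorted(k for k in shared if not isinstance(response[k], claim[k])),
    }

print(schema_diff(manifest_claim, surface))   # flags the rename immediately
print(schema_diff(manifest_claim, semantic))  # everything empty: the drift is invisible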

The Model's Mental Model Outlives Your Schema Update

Here's the counterintuitive part. You can update the manifest. You push a new tool description, ship it to the agent framework, cut a release. The problem is that the model's behavior is shaped by the composition of what's in its context window: the tool description, yes, but also the system prompt, the task instructions, the retrieval context, the prior turns' tool calls, the trace examples you include as few-shot, and any summaries of earlier steps. Updating one of those does not flush the others.

Concretely: you rename user_status to account_state in the manifest. The next agent run sees the new description. But its prior turns in the same conversation still reference user_status. Its few-shot examples still show user_status. The retrieval corpus you're pulling documentation from hasn't been re-indexed. The compacted summary of the first twenty turns talks about user_status. The model, being a pattern-matcher, follows the dominant signal — and emits the old field name in its next tool call, or worse, writes analysis that blends the old and new names into a consistent-sounding but wrong narrative.

This is why manifest drift can't be fixed by "just update the tool description." The description is one voice in a room where the model is listening to the whole crowd, and the crowd has accumulated years of the old convention.

The practical consequence: any schema change needs to be treated as a full context hygiene event, not a one-line edit. Few-shot examples get refreshed. Retrieval indexes get rebuilt. Stored trajectories get re-summarized or marked obsolete. Test prompts get regenerated. Otherwise the manifest update is cosmetic — the model sees it and the prior context overrides it.
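Mechanically, the hygiene event can be as unglamorous as a sweep over everything the model might see. A minimal sketch — the rename map, directory layout, and file types are assumptions for illustration:

# Context hygiene sweep after a field rename; paths and the rename map are illustrative.
import pathlib
import re

RENAMES = {"user_status": "account_state"}  # old field name -> new field name

def find_stale_references(paths):
    """Return (file, old_name) pairs wherever an obsolete identifier survives."""
    hits = []
    for path in paths:
        text = path.read_text(encoding="utf-8")
        for old in RENAMES:
            if re.search(rf"\b{re.escape(old)}\b", text):
                hits.append((str(path), old))
    return hits

# Sweep everything the model might see: few-shot examples, stored trajectories,
# compacted summaries, retrieval snippets.
corpus = [p for pattern in ("*.json", "*.md")
          for p in pathlib.Path("agent_context").rglob(pattern)]

for file, old in find_stale_references(corpus):
    print(f"{file}: still references '{old}', should be '{RENAMES[old]}'")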

Contract Tests for Tools: What Actually Works

Static schema validation catches surface drift. It catches nothing else. The pattern that works in production is a contract-test suite that runs against the live backend on every deploy of either the manifest or the endpoint, and alerts when the manifest's claims and the backend's reality disagree.

Three probes cover most of the drift surface:

Live schema probe. For each tool in the manifest, call the endpoint with a canned, idempotent input and compare the response envelope — field names, types, required-ness, nesting — against what the manifest promises. This catches surface drift the moment it ships. It's cheap. Teams skip it because it feels redundant with OpenAPI, but OpenAPI documents intent; the probe documents reality. When they diverge, the probe tells you which one is lying.
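A minimal version of this probe, assuming each manifest entry carries its endpoint, a canned idempotent input, and a declared response envelope (load_manifest() below stands in for however you load yours):

# Live schema probe: compare the real envelope against the manifest's promise.
import requests

def probe_tool(tool: dict) -> list[str]:
    """Call the endpoint with a canned, idempotent input and diff field names."""
    resp = requests.post(tool["endpoint"], json=tool["canned_input"], timeout=10)
    resp.raise_for_status()
    body = resp.json()

    declared = set(tool["response_schema"])  # e.g. {"account_id", "account_state", ...}
    returned = set(body)
    problems = [f"declared in manifest but missing: {f}" for f in declared - returned]
    problems += [f"returned by backend but undeclared: {f}" for f in returned - declared]
    return problems

# Runs on every deploy of either the manifest or the backing service.
for tool in load_manifest()["tools"]:  # load_manifest() is assumed, not a real API
    for problem in probe_tool(tool):
        print(f"[{tool['name']}] {problem}")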

Response-shape validation at call time. At runtime, every tool response gets parsed through the manifest's declared schema with strict mode on. Unknown fields, missing fields, or type mismatches get logged with tool name and tenant. You're not trying to block the call — you're trying to light up a dashboard the moment reality diverges from the contract, so drift gets caught in hours instead of quarters.
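One way to implement strict-mode parsing, using Pydantic v2 as an example — the model, tool name, and payload below are illustrative, not a prescribed API:

# Call-time response-shape validation: log drift, never block the call.
import logging
from pydantic import BaseModel, ConfigDict, ValidationError

logger = logging.getLogger("tool_contract_drift")

class GetAccountResponse(BaseModel):
    # extra="forbid" turns unknown fields into errors; strict=True stops silent
    # type coercion such as "1299" -> 1299.
    model_config = ConfigDict(extra="forbid", strict=True)
    account_id: str
    account_state: str
    balance_cents: int

def check_response(tool_name: str, tenant: str, payload: dict) -> None:
    """Validate a live response against the declared shape and log mismatches."""
    try:
        GetAccountResponse.model_validate(payload)
    except ValidationError as err:
        for e in err.errors():
            logger.warning("drift tool=%s tenant=%s loc=%s error=%s",
                           tool_name, tenant, e["loc"], e["type"])

check_response("get_account", "tenant-42",
               {"account_id": "a_91", "balance": 1299})  # hypothetical drifted payload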

Behavioral drift detection per tool. For each high-stakes tool, keep a set of golden input/output pairs — not exhaustive, just a few representative cases. Replay them nightly against the live endpoint. If the response values (not just shapes) drift outside an acceptable delta, alert. This is the only probe that catches semantic and behavioral drift, and it's the one that actually saves you when an upstream team silently changes a unit from cents to dollars. It requires someone to curate the pairs, which is why most teams skip it. They shouldn't.
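A sketch of the nightly replay, with a placeholder endpoint, pairs, and tolerance you would curate per tool:

# Golden-pair behavioral replay: compare values, not just shapes.
import requests

GOLDEN_PAIRS = [
    # (canned input, expected output fields, relative tolerance for numeric values)
    ({"account_id": "a_91"}, {"balance_cents": 129_900}, 0.01),
]

def replay_golden_pairs(endpoint: str) -> list[str]:
    alerts = []
    for payload, expected, tol in GOLDEN_PAIRS:
        body = requests.post(endpoint, json=payload, timeout=10).json()
        for field, want in expected.items():
            got = body.get(field)
            if isinstance(want, (int, float)):
                if got is None or abs(got - want) > abs(want) * tol:
                    alerts.append(f"{field}: expected ~{want}, got {got}")
            elif got != want:
                alerts.append(f"{field}: expected {want!r}, got {got!r}")
    return alerts

# A silent cents -> dollars switch turns 129_900 into 1_299: far outside a 1% delta.
for alert in replay_golden_pairs("https://api.internal.example/accounts/lookup"):
    print("behavioral drift:", alert)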

These three probes together form the equivalent of consumer-driven contract testing for agents: the agent's manifest is the consumer contract, and the probes verify the backend still honors it. When a probe fails, you fix it before the agent quietly produces a thousand subtly wrong outputs in production.

One discipline matters more than tooling: treat tool descriptions as versioned contracts, not as documentation. A description change is a breaking change if it shifts the agent's selection logic, even when the JSON schema is identical. Pin descriptions per version. Diff them in CI. Require explicit approval to change them. A one-word edit to a description can double the agent's call rate on that tool or redirect traffic to a competing tool in your manifest — and the model will not tell you it happened.
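One lightweight way to enforce the pinning is a CI check that fingerprints every description and fails when one changes without an explicit pin update. A sketch — the lockfile name and the load_manifest() helper are assumptions:

# CI gate: fail when any tool description changes without updating the pin file.
import hashlib
import json
import sys

def description_fingerprints(manifest: dict) -> dict:
    """Hash every description so a one-word edit is visible and reviewable."""
    return {tool["name"]: hashlib.sha256(tool["description"].encode("utf-8")).hexdigest()
            for tool in manifest["tools"]}

def check_pins(manifest: dict, lockfile: str = "tool_descriptions.lock.json") -> int:
    with open(lockfile) as f:
        pinned = json.load(f)
    current = description_fingerprints(manifest)
    changed = [name for name, digest in current.items() if pinned.get(name) != digest]
    for name in changed:
        print(f"description changed without a pin update: {name}", file=sys.stderr)
    return 1 if changed else 0

# In CI: sys.exit(check_pins(load_manifest()))  # load_manifest() is assumed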

The Manifest Ownership Problem

The technical story is the easy half. The hard half is organizational.

The team that writes the tool description is almost never the team that maintains the backing endpoint. Platform teams own the endpoints. AI teams own the manifests. The description lives in the AI team's repo, phrased in language optimized to steer an LLM, while the endpoint lives in the platform team's repo, governed by whatever release process they've had since long before anyone considered "prompting a model" a deploy-time concern.

Most backend teams have no idea their endpoint is being consumed by an agent, let alone that the free-text description of their API in someone else's repo is load-bearing product logic. So they ship a perfectly routine change — tighten a validation rule, rename an internal field, adjust a default — with no awareness that the change has broken a contract nobody told them existed.

This is where drift incidents come from. Not malice, not incompetence. Structural invisibility.

Three organizational practices close the gap:

  • Declare tool ownership at manifest registration. Every tool in the manifest names a primary maintainer and a backing service. When a backing service ships a change, its CI pipeline runs the contract-test suite for any manifest that depends on it. The platform team learns about agent consumers the same way they learn about any other downstream consumer — through a test that fails in their pipeline.
  • Review descriptions like code. A tool description is a prompt fragment that changes model behavior. Treat PRs that modify descriptions the same way you treat PRs that change a production prompt: paired review, behavioral eval on a held-out set, version bump, deployment log. The moment description changes stop being reviewed, an intern can silently redirect your agent's traffic.
  • Run behavioral evals on description changes, not just schema changes. Before merging a description edit, run the agent against a representative task set and compare call-rate, argument quality, and downstream outcome metrics against the previous version. If the description is more persuasive, the agent will call the tool more — that is often not what you want. A minimal call-rate comparison is sketched after this list.
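The call-rate half of that eval can be very small. In the sketch below the traces are illustrative stand-ins for what your harness would produce by replaying the same task set under the old and new description:

# Description-change eval: did the edit move how often the agent picks the tool?
def tool_call_rate(traces: list[list[str]], tool_name: str) -> float:
    """Fraction of task traces in which the agent called the given tool."""
    if not traces:
        return 0.0
    return sum(1 for trace in traces if tool_name in trace) / len(traces)

# Each trace lists the tools called for one task (hypothetical data; in practice
# these come from running a representative task set against each manifest version).
old_traces = [["search_transactions"], ["get_account"], ["get_account"]]
new_traces = [["search_transactions"], ["search_transactions"], ["search_transactions"]]

old_rate = tool_call_rate(old_traces, "search_transactions")
new_rate = tool_call_rate(new_traces, "search_transactions")

# A large swing from a wording-only edit is exactly the signal to catch before merging.
print(f"search_transactions call rate: {old_rate:.0%} -> {new_rate:.0%}")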

None of this is glamorous. It's the boring integration work that separates agent prototypes from systems you can trust with production traffic.

What to Do This Week

If you operate an agent system with more than a handful of tools, you probably already have manifest drift. The question is how much.

A concrete first week's worth of work:

  • Enumerate every tool your agent can call. For each, identify the backing endpoint and the team that owns it. The gaps in that list are your blast radius.
  • Pick your five highest-stakes tools. Write the three probes (live schema, response-shape logging, golden-pair behavioral replay) for each. Wire them into CI so both the manifest repo and the backing-service repo trigger them.
  • Turn on response-shape validation in production for all tool calls. Don't block on failures yet — just log and dashboard. Within a week you'll know which tools are drifting and roughly how badly.
  • Audit your few-shot examples, retrieval corpora, and stored trajectories against the current manifest. Anything using a renamed field or deprecated enum value needs to be regenerated — otherwise your next manifest update will be cosmetic.
  • Require that every description change in the manifest ship with a behavioral eval delta. If you can't measure the change in agent behavior, you don't know what you shipped.

The goal is not to prevent drift. Schemas will change; that's healthy. The goal is to ensure that when they change, the agent notices — before your users do.
