Why Your Agent Works in Dev and Panics in Prod

· 10 min read
Tian Pan
Software Engineer

The agent demo always works. Three customers in the table, one matching record, twelve documents in the vector index, an empty calendar with infinite open slots. The agent picks the right row, retrieves the right document, books the right meeting. Ship it.

Then production hands the same agent ten million customers with three "John Smith"s in the same city, a filter that returns four thousand rows because the agent confidently wrote status != 'closed' when it meant status = 'active', a vector query that returns seven plausible documents the agent has never had to choose between, and a calendar where every slot is a negotiation. The capability that looked correct in dev is qualitatively different in prod — not slightly worse, not flakier, but solving a different problem the dev environment never made it solve.

This is the gap that "it worked locally" hides. For deterministic code, that phrase is already a lie about edge cases. For agents, it is a stronger lie, because the agent's behavior is a function of input distribution, and the input distribution shifts from "trivial" to "ambiguous" the moment you cross the prod boundary.

Sparse dev hides the only test that mattered

The standard dev setup is built for fast iteration. You seed three rows because three rows are enough to make the UI render and the SQL run. You drop twelve documents into the vector index because that's what the README example uses. You create a fake calendar with three meetings because nobody on the team wants to wait for a real one.

Every one of these decisions is correct for engineering velocity and wrong for agent evaluation. The agent's reasoning surface is the cardinality and ambiguity of its inputs, and you have just hidden both. Asking the agent "find the customer named John Smith" against a fixture with exactly one John Smith is not a test of disambiguation — it's a test of whether the agent can call the tool at all. Asking it "find the relevant policy document" against twelve documents on three distinct topics is not a test of retrieval reasoning — it's a test of whether the agent can construct any query.

You watch the agent succeed and conclude the capability is built. What you have actually built is a capability that works in the regime your fixtures encode. In dev, every match is unique, every filter is narrow, every retrieval is unambiguous. In prod, none of those assumptions hold, and the agent encounters a problem shape it has never been asked to solve.
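To make "the regime your fixtures encode" concrete, here is a minimal sketch built on the filter mix-up from the opening. The table name `accounts`, the status values beyond `active` and `closed`, and the helper `count_rows` are all assumptions made for illustration. On a fixture that only ever contains two statuses, the wrong predicate and the right one are indistinguishable; on data with more statuses, they diverge.

```python
import sqlite3

def count_rows(statuses, predicate):
    """Load statuses into an in-memory table and count rows matching the predicate."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, status TEXT)")
    conn.executemany("INSERT INTO accounts (status) VALUES (?)", [(s,) for s in statuses])
    # The predicate is a fixed constant below, never user input.
    (n,) = conn.execute(f"SELECT COUNT(*) FROM accounts WHERE {predicate}").fetchone()
    conn.close()
    return n

INTENDED = "status = 'active'"   # what the agent meant
WRITTEN = "status != 'closed'"   # what the agent wrote

# Sparse dev fixture: only two statuses ever appear.
dev_statuses = ["active", "active", "closed"]

# Hypothetical prod-like mix: extra statuses the dev fixture never contained.
prod_like_statuses = ["active"] * 5 + ["closed"] * 3 + ["pending"] * 4 + ["suspended"] * 2

dev_intended = count_rows(dev_statuses, INTENDED)
dev_written = count_rows(dev_statuses, WRITTEN)
prod_intended = count_rows(prod_like_statuses, INTENDED)
prod_written = count_rows(prod_like_statuses, WRITTEN)

print("dev: ", dev_intended, dev_written)
print("prod:", prod_intended, prod_written)
```

A dev test that checks the row count would pass with either predicate, which is the sense in which the fixture encodes a regime rather than testing the capability.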

The asymmetry is what makes it hard to see. Deterministic code fails the same way at one row and at a million; an off-by-one is an off-by-one. An agent's behavior at one row and at a million rows can be qualitatively different — confidently correct against the small set, confidently wrong against the large one — because what changes between them is not the code path but the reasoning the agent must perform once the inputs stop being unambiguous.

Production density is a different problem class

Walk through what a typical agent tool actually does at production cardinality.

A "find customer" tool that returned one row in dev now returns 47 rows in prod because three people share the name and dozens more match on the fuzzy substring the agent decided to use. The agent has no fixture experience of "what do I do when the result set is larger than I expected" because in dev the result set was always exactly one. It either picks the first row arbitrarily, confidently presents all 47 to the user as if every one were correct, or loops on increasingly narrow filters that exclude the right answer.
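A scaled-down sketch of that shift, assuming a hypothetical `find_customers` tool that does a case-insensitive substring match (the records and field names are invented for illustration). The same call that is trivially unique against the dev fixture becomes a disambiguation problem as soon as name collisions exist.

```python
def find_customers(customers, name_query):
    """Hypothetical tool: case-insensitive substring match on the name field."""
    needle = name_query.lower()
    return [c for c in customers if needle in c["name"].lower()]

# Sparse dev fixture: exactly one possible match.
dev_customers = [
    {"id": 1, "name": "John Smith", "city": "Austin"},
    {"id": 2, "name": "Maria Lopez", "city": "Denver"},
    {"id": 3, "name": "Wei Chen", "city": "Boston"},
]

# Prod-like fixture: exact-name collisions in the same city plus a substring hit.
prod_like_customers = dev_customers + [
    {"id": 4, "name": "John Smith", "city": "Austin"},
    {"id": 5, "name": "John Smith", "city": "Austin"},
    {"id": 6, "name": "John Smithson", "city": "Austin"},
]

dev_hits = find_customers(dev_customers, "john smith")
prod_hits = find_customers(prod_like_customers, "john smith")

print(len(dev_hits), len(prod_hits))
```

Nothing about the tool changed between the two calls. Only the second one asks the agent to decide which row the user meant, and city alone does not settle it.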

A vector-search tool that returned three highly relevant documents in dev now returns the top-K of a much denser distribution where the top three differ from the top seven by a similarity score the agent cannot interpret. In dev there was nothing to interpret — the relevant doc was visibly the relevant doc. In prod the agent has to reason about which of seven plausible documents to trust, and that reasoning was never exercised in fixtures.
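One way to see the difference is to look at how spread out the similarity scores are. The sketch below uses hand-made two-dimensional vectors and plain cosine similarity; the document IDs, the vectors, and the `score_spread` helper are assumptions for illustration, not output from a real embedding model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def top_k(query, index, k):
    """Return the k highest-scoring (score, doc_id) pairs."""
    scored = sorted(((cosine(query, vec), doc_id) for doc_id, vec in index.items()), reverse=True)
    return scored[:k]

def score_spread(results):
    """Gap between the best and worst score in a ranked result list."""
    return results[0][0] - results[-1][0]

query = (1.0, 0.0)

# Sparse dev index: one clearly relevant document, two unrelated ones.
dev_index = {"refund-policy": (0.99, 0.1), "onboarding": (0.1, 1.0), "security": (-0.5, 0.8)}

# Dense prod-like index: seven near-duplicate candidates.
dense_index = {f"refund-policy-v{i}": (1.0, 0.02 * i) for i in range(1, 8)}

dev_results = top_k(query, dev_index, 3)
dense_results = top_k(query, dense_index, 7)

print(round(score_spread(dev_results), 3), round(score_spread(dense_results), 3))
```

In the sparse index the ranking carries the decision by itself. In the dense one the scores are nearly flat, so the choice falls back on reasoning the fixtures never exercised.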

A "find an open slot" tool that succeeded trivially against an empty calendar now surfaces a hundred conflicted slots, each with different stakeholders, different priorities, and different rescheduling costs. In dev, "find a slot" was a lookup. In prod, it is a negotiation. The agent's prompt was tuned for the lookup.

A "filter records by status" tool returned 30 rows in dev where every status was relevant. In prod it returns 40,000 rows, the agent quietly truncates to the first page because the context budget would explode otherwise, and the answer it computes is now a function of an arbitrary database sort order the agent has no idea exists.
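The dependence on sort order is easy to reproduce. This sketch assumes a page size of 100 and synthetic rows with an `amount` field (both invented here); it computes the same aggregate over only the first page under two different orderings.

```python
PAGE_SIZE = 100  # assumed page size; a real tool's limit will differ

def first_page_total(rows, sort_key):
    """Sum `amount` over only the first page, as a truncating agent effectively does."""
    page = sorted(rows, key=sort_key)[:PAGE_SIZE]
    return sum(r["amount"] for r in page)

# 40,000 synthetic rows with deterministic pseudo-varied amounts.
rows = [{"id": i, "amount": (i * 37) % 1000} for i in range(40_000)]

full_total = sum(r["amount"] for r in rows)
by_id = first_page_total(rows, sort_key=lambda r: r["id"])
by_amount = first_page_total(rows, sort_key=lambda r: r["amount"])

print(full_total, by_id, by_amount)
```

Both truncated totals are "the answer" from the agent's point of view, they disagree with each other, and neither is close to the total over all rows.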

None of these are bugs in the model. They are the model encountering an input distribution your fixtures promised would never arrive. The agent's capability boundary is a function of cardinality and ambiguity, not of the abstract task name, and dev environments systematically hide both.

"It worked locally" is structurally a stronger lie for agents

For deterministic code, "it worked locally" usually means "we missed an edge case." The fix is a unit test against the edge case, and the code is the same code regardless of input scale. The reliability gap closes once you remember to write the test.

For agents, "it worked locally" means something deeper. It means the model's reasoning has not been challenged at the dimension where production actually challenges it. The fix is not a missing test case — it is a missing test regime. You cannot patch dev sparseness with one extra fixture, because the failure mode is not a specific input but a distribution shift across all inputs.

The corollary is that an agent passing its dev test suite is not merely weak evidence that it will work in prod. It is actively misleading evidence, because the suite has selected for the regime where the agent always succeeds. A green test signal produced on sparse fixtures means "the agent handles the easy cases," which you could have predicted without running anything. The information you actually need is how the agent behaves at production density, and the suite was not designed to produce that information.

This is also why "we'll add a prod test later" is the wrong sequencing. The behavior at production density is qualitatively different, so the agent you shipped is, in a meaningful sense, not the agent you tested. The right framing is that the agent does not exist as a deployable artifact until it has been exercised against the input distribution it will serve. Anything before that is a prototype.

What production-shaped state actually looks like

Closing the gap is not about a denser fixture file. It is about treating the production distribution as a first-class input to evaluation. Concretely:

  • Production-shadow datasets: seed the test environment from a sanitized snapshot of production state, not from synthetic fixtures. The customer table has the cardinality, the name-collision rate, and the dirty-data tail of the real one. The vector index is sized like the real one. The calendar has the conflict density of the real one. The agent's tools return what they would actually return.
  • A density axis in eval: score the same agent capability at multiple cardinalities (10, 1,000, 100,000 rows) and graph behavior across the axis. Degradation that is invisible at the dev fixture size becomes visible the moment you draw the curve. A capability that holds steady is robust across the range you measured; one that collapses past a threshold tells you roughly where the agent's reasoning stops generalizing.
  • Ambiguity-injection fixtures: build the fixture set so that ambiguity is the default, not a special case. Multiple matches, near-duplicate documents, conflicted slots, partially-relevant rows. The agent should have to disambiguate in every test, because it will have to disambiguate in every production call. A fixture with a unique correct answer is a fixture that didn't test anything.
  • Shadow traffic as a release gate: run the agent against live production state without executing its side effects. Compare what it would have done against what was actually done. The gap is your release signal; the cases where the agent's choice diverges are your eval set for the next iteration. This is the discipline several production teams now use before turning agents loose on irreversible actions, and it is the most direct way to measure behavior at the distribution that matters.
  • An agent-readiness gate, not an agent-correctness gate: don't ship on "the test suite passes." Ship on "the agent has been observed against state shaped like the target environment, at the cardinality of the target environment, and its failure modes there are characterized and bounded." The gate is about whether you have made the production observation, not about whether dev came back green.
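The density axis from the list above can be sketched in a few lines. Everything here is a stand-in: `first_match_agent` substitutes for a real agent, `make_fixture` draws names from a deliberately small pool so collisions grow with table size, and the cardinalities are scaled down so the sketch runs quickly.

```python
import random

NAMES = ["John Smith", "Maria Lopez", "Wei Chen", "Aisha Khan", "Olga Petrov"]

def make_fixture(n_rows, rng):
    """Hypothetical generator: a small name pool means collisions grow with n_rows."""
    return [{"id": i, "name": rng.choice(NAMES)} for i in range(n_rows)]

def first_match_agent(customers, target_name):
    """Stand-in for an agent that picks the first row matching the name."""
    for c in customers:
        if c["name"] == target_name:
            return c["id"]
    return None

def accuracy_at(n_rows, trials, seed=0):
    """Fraction of trials where the agent picks the row the user actually meant."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        customers = make_fixture(n_rows, rng)
        target = rng.choice(customers)  # the specific customer the user has in mind
        picked = first_match_agent(customers, target["name"])
        correct += picked == target["id"]
    return correct / trials

# Same capability, scored at several cardinalities.
curve = {n: accuracy_at(n, trials=50) for n in (10, 1_000, 10_000)}

for n, acc in curve.items():
    print(f"{n:>6} rows: {acc:.2f}")
```

The useful output is the shape of the curve rather than any single number. Swapping the stand-in for a real agent call and the generator for a sanitized production sample would keep the same harness shape.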

The pattern under all of these is the same: stop testing what is convenient to test, and start testing what production will actually do. The instrument for that is fixture density and shadow traffic, not more unit tests.
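For the shadow-traffic half of that instrument, a minimal comparison loop might look like the following. The case format, the `propose_action` callback, and the `naive_agent` stand-in are assumptions for illustration; the point is only that the agent proposes and nothing executes.

```python
def shadow_compare(recorded_cases, propose_action):
    """Dry-run the agent on recorded requests; return divergence rate and divergent cases."""
    divergent = []
    for case in recorded_cases:
        proposed = propose_action(case["request"])  # proposal only, no side effects
        if proposed != case["actual_action"]:
            divergent.append({
                "request": case["request"],
                "proposed": proposed,
                "actual": case["actual_action"],
            })
    rate = len(divergent) / len(recorded_cases) if recorded_cases else 0.0
    return rate, divergent

def naive_agent(request):
    """Stand-in agent: always proposes the earliest open slot."""
    return {"tool": "book_slot", "slot": min(request["open_slots"])}

# Hypothetical recorded traffic: the request plus what was actually done.
recorded_cases = [
    {"request": {"open_slots": ["09:00", "14:00"]}, "actual_action": {"tool": "book_slot", "slot": "09:00"}},
    {"request": {"open_slots": ["09:00", "14:00"]}, "actual_action": {"tool": "book_slot", "slot": "14:00"}},
    {"request": {"open_slots": ["11:00"]}, "actual_action": {"tool": "book_slot", "slot": "11:00"}},
    {"request": {"open_slots": ["10:00", "16:00"]}, "actual_action": {"tool": "book_slot", "slot": "10:00"}},
]

rate, divergent = shadow_compare(recorded_cases, naive_agent)
print(rate, divergent)
```

The divergence rate serves as the release signal, and the `divergent` list is the raw material for the next iteration's eval set.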

Decide whether to ship in production's regime, not dev's

The architectural realization sits underneath all of this. An agent's behavior is a function of the distribution of inputs it sees. Dev environments are a controlled sample from a distribution the agent will never serve. Prod is the distribution. The eval suite needs to live in the second world, not the first.

This means treating fixtures as a representation of the production environment, with their own drift, their own freshness requirements, and their own decay. A six-month-old fixture set no longer represents the problem the agent faces in prod, because the production distribution has shifted underneath it. The fixture corpus needs the same ops discipline as the agent: dated, versioned, re-derived from current state, retired when stale.
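"Dated, versioned, retired when stale" can be as simple as a manifest and a freshness check. The `FixtureManifest` fields and the 90-day threshold below are assumptions chosen for illustration; the right maximum age depends on how fast your production distribution drifts.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass(frozen=True)
class FixtureManifest:
    name: str
    version: str
    derived_on: date  # when the snapshot was taken from production state

def is_stale(manifest, today, max_age=timedelta(days=90)):
    """True when the fixture set is older than the allowed age (90 days is an assumed default)."""
    return today - manifest.derived_on > max_age

manifest = FixtureManifest(name="customers-shadow", version="2024.01", derived_on=date(2024, 1, 15))

fresh_check = is_stale(manifest, today=date(2024, 3, 1))
stale_check = is_stale(manifest, today=date(2024, 7, 15))
print(fresh_check, stale_check)
```

A check like this could run before the eval suite and fail it outright, so that a green result never comes from a snapshot nobody has refreshed.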

It also means that team-velocity arguments against production-shaped testing point in the wrong direction. The cost of running an agent against sparse fixtures is not zero: it is the cost of incidents that happen in week one of production because the agent encountered the input distribution for the first time on a real user. The dev-fast-prod-broken loop is not a velocity gain; it is borrowed time, repaid in incident postmortems and customer-trust burn.

The practical bar is simple: before the agent goes live, you should be able to point at the regime in which it has been observed and say, "this is what production looks like, and the agent's behavior here is acceptable." If the only regime you can point at is dev fixtures, you have not yet evaluated the agent — you have evaluated a different agent that runs against a different problem. Ship the one you've actually tested, or test the one you actually plan to ship. There is no third option that doesn't end with a panicked rollback.
