
Multi-Axis Agent Bisection: When the Regression Lives in the Interaction

11 min read
Tian Pan
Software Engineer

Quality regressed overnight. The on-call engineer pulls up the dashboard, traces a few bad sessions, and starts the obvious bisection: the model provider rotated to a new snapshot at 02:00 UTC, so revert to the pinned older alias. Eval suite still red. Roll back yesterday's prompt change. Still red. Pin the retrieval index back to last week's version. Still red. Each owning team rolls back their own axis in isolation and reports "not us." Three hours in, nobody owns the diagnosis because nobody owns the interaction surface where the regression actually lives — the new model interpreting the new tool description in a way the old model never would have.

This is the failure mode single-axis tooling can't solve. git bisect works because the search space is one-dimensional: a linear sequence of commits. An agent doesn't have one timeline. It has four or five timelines running in parallel — model snapshot, system prompt, tool catalog, retrieval index, sampling config — each with its own owner, its own deploy cadence, and its own "rollback" button that returns just its axis to a known state. The regression you're chasing is often a two-factor interaction, and bisecting along any single axis returns false negatives because the bug only fires on the cross-product cell where the new model meets the new tool description.

The depressing version of this story is that the team eventually finds the cell by accident — usually after rolling back two axes simultaneously for an unrelated reason, watching the eval go green, and reasoning backward from there. The disciplined version requires treating the agent as a multi-dimensional release artifact from day one, not a stack of independently-released pieces that happen to share a runtime.

Why single-axis bisection lies

Modern agents have at least four versioned layers in flight at any given moment, and most teams have more:

  • Model snapshot — the actual weights behind a claude-opus-4-7 or gpt-5.5-turbo alias, which the provider rotates on a schedule the customer doesn't control.
  • System prompt — edited continuously by whoever owns the agent's behavior, sometimes via a prompt registry, sometimes via a commit to the application repo.
  • Tool catalog — the set of tools the agent can call, plus their descriptions, plus their schemas, plus the implementation behind each one. In an MCP world this includes whatever the upstream MCP server decided to ship in its latest release.
  • Retrieval index — the corpus, the embedding model, the chunking strategy, the re-ranker, and the index version. Any of these can be re-built on a separate cadence.
  • Sampling config — temperature, top-p, max tokens, structured output schema, stop sequences. Looks innocuous, but a tightening from temperature 0.7 to 0.3 changes which branches the planner takes.

When the regression isn't in any one of these layers but in the combination, single-axis rollback gives you a clean false negative on every axis: either the axis you rolled back isn't one of the interacting pair, or the rollback never truly restored the old state (an alias re-pin that still resolves to the new snapshot). The model team rolls back the model: still red. The platform team rolls back the prompt: still red. The retrieval team rolls back the index: still red. Each team correctly concludes "the bug is not in my axis." The bug is in the edge between two axes — the place no team owns.

This is exactly the structure of a two-factor interaction in factorial experimental design. The marginal effect of changing model A → B is small. The marginal effect of changing tool description X → Y is small. The interaction effect of doing both at once is large. ANOVA on a factorial dataset would surface it immediately. A series of single-axis rollbacks never will, because each one holds the other changed axis fixed at its new state.
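
To see the weight of the interaction with numbers (illustrative scores, not measurements from any real run), suppose the eval sits at 0.91 on (model A, tool X), 0.90 on (B, X), 0.89 on (A, Y), and 0.58 on (B, Y). The two-factor interaction contrast is the simple effect of A → B under tool Y minus the same effect under tool X:

$$(0.58 - 0.89) - (0.90 - 0.91) = -0.31 - (-0.01) = -0.30$$

Either change on its own costs a point or two; the interaction term accounts for essentially the entire collapse.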

The version envelope: what every agent run must record

The minimum viable instrumentation is to record, for every agent run, the full version envelope as structured fields. Not "the agent" — the agent's entire dependency closure, named and pinned at the moment of execution:

  • model.alias — what the caller asked for (e.g. claude-opus-4-7)
  • model.snapshot — what the alias resolved to at request time (e.g. claude-opus-4-7-20260415)
  • prompt.commit — the SHA of the system prompt source, or the version number from the prompt registry
  • tool_catalog.manifest_hash — a content hash over the full set of tool names, descriptions, and schemas the agent saw
  • retrieval.index_version — the version of the index that served the retrieval call (and ideally the embedding model version separately)
  • sampling.config_hash — temperature, top-p, max tokens, response_format, stop sequences

The non-negotiable detail here is the model snapshot, not the alias. Aliases like claude-opus-4-7 rotate on the provider's schedule, and a session that ran yesterday at 23:59 and one that ran today at 00:01 may have hit different weights with the same alias string. Resolving the alias to a snapshot at request time and logging the resolved value is the only way to make "the regression started after the model rotated" a falsifiable claim instead of a hypothesis.
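
A minimal sketch of the envelope as a record type, in Python. The field names follow the list above; the content-hash helpers are one plausible construction, and the resolved snapshot is assumed to be supplied by whatever call you make to resolve the alias at request time:

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class VersionEnvelope:
    """Full dependency closure of one agent run, pinned at execution time."""
    model_alias: str                  # what the caller asked for
    model_snapshot: str               # what the alias resolved to at request time
    prompt_commit: str                # SHA or registry version of the system prompt
    tool_catalog_manifest_hash: str   # content hash over the tool catalog the agent saw
    retrieval_index_version: str      # index version that served the retrieval call
    sampling_config_hash: str         # temperature, top-p, max tokens, etc.


def hash_tool_catalog(tools: list[dict]) -> str:
    """Content hash over tool names, descriptions, and schemas, order-independent."""
    canonical = json.dumps(sorted(tools, key=lambda t: t["name"]), sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]


def hash_sampling_config(config: dict) -> str:
    """Content hash over the sampling parameters actually sent with the request."""
    return hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()[:16]
```

Attach the serialized envelope to every trace and every eval result as structured fields, not free text.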

Once the envelope is being recorded, two new capabilities become possible. First, you can filter — give me all sessions where tool_catalog.manifest_hash = X AND model.snapshot = Y and look at quality across that cell specifically. Second, you can freeze — pin any subset of axes to known values during investigation, holding everything except the suspect axis constant. Without the envelope, neither is possible, because you don't even know which cell of the cross-product each historical session belonged to.
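
The filter side is a one-liner once the fields exist. Sketched here over an in-memory list of session dicts; the record shape, the score field, and the hash value are assumptions for illustration, not any particular tracing product's schema:

```python
def sessions_in_cell(sessions, *, manifest_hash, model_snapshot):
    """All logged sessions that ran in one specific cell of the cross-product."""
    return [
        s for s in sessions
        if s["envelope"]["tool_catalog_manifest_hash"] == manifest_hash
        and s["envelope"]["model_snapshot"] == model_snapshot
    ]


cell = sessions_in_cell(
    all_sessions,                                  # your trace store, loaded as dicts
    manifest_hash="a1b2c3d4",                      # hypothetical manifest hash
    model_snapshot="claude-opus-4-7-20260415",
)
print(sum(s["score"] for s in cell) / max(len(cell), 1))
```

Freezing is the same envelope applied on the write path: the runtime reads the pinned values instead of resolving them live, so only the suspect axis is allowed to vary during the investigation.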

Bisecting the cross-product, not the timeline

Once you have the envelope, the bisection algorithm changes shape. Instead of "binary search along the commit history," it becomes "search along the cross-product of axes, using an eval suite as the predicate."

A reasonable sketch:

  1. Define the axes you'll search across (typically four or five from the envelope above).
  2. Identify, for each axis, the last known good value and the current value. If you can't, you're already in trouble — that's a missing snapshot store, fix that first.
  3. Run the eval suite at all $2^k$ corners of the cross-product (or a fractional subset if $k$ is large). With four axes that's 16 cells; with five it's 32. Run-cost is real but bounded. (A code sketch of this sweep follows the list.)
  4. The cells where eval scores collapse are the interaction cells. The pair of axes whose values changed simultaneously across red cells is the interaction.
  5. Once the interaction is named ("model B + tool description Y is the bad cell"), the team that owns either axis can fix it — by reverting one side, by adjusting the tool description to be unambiguous under the new model, or by adding an eval case that locks the interaction down.
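
A minimal sketch of the sweep, assuming a run_eval(envelope) function that executes the suite with every axis pinned to the given values and returns an aggregate score. Axis names and values below are illustrative placeholders:

```python
from itertools import product

# (last known good, current) value per axis; values are illustrative placeholders.
AXES = {
    "model_snapshot":  ("claude-opus-4-7-20260301", "claude-opus-4-7-20260415"),
    "prompt_commit":   ("9f3c2e1", "b7d41aa"),
    "tool_manifest":   ("manifest_old", "manifest_new"),
    "retrieval_index": ("idx_old", "idx_new"),
}


def sweep(run_eval, threshold=0.85):
    """Run the eval suite at every corner of the cross-product; return the red cells."""
    red = []
    for bits in product((0, 1), repeat=len(AXES)):   # 2^k corners, k = number of axes
        envelope = {axis: vals[bit] for (axis, vals), bit in zip(AXES.items(), bits)}
        score = run_eval(envelope)
        if score < threshold:
            red.append((envelope, score))
    return red


def interaction_candidates(red_cells):
    """Axes pinned to their *new* value in every red cell: the suspected interaction."""
    return [
        axis for axis, (_, new) in AXES.items()
        if red_cells and all(env[axis] == new for env, _ in red_cells)
    ]
```

If every red cell has the new model snapshot and the new tool manifest, while the other axes take both values across red cells, the candidate list collapses to exactly the interacting pair.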

The full $2^k$ sweep is overkill for many investigations — often a single suspect interaction is obvious from the trace, and you only need to confirm it with a 2×2. But the point is the *shape* of the search: a cross-product, predicated on an eval, with envelope versions as the unit of variation. Single-axis bisection is just the degenerate case where $k = 1$.

The classic git bisect run shape works perfectly for the predicate. The "is this build good or bad" script becomes "run the eval suite at this envelope corner; exit 0 if score above threshold, exit 1 otherwise." LWN documented automated git bisect over a decade ago and the discipline transfers cleanly — the only twist is that the search space is no longer a single linked list of commits.
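
The predicate itself can be a short script in the git bisect run mold, exiting 0 for a good corner and 1 for a bad one. The run_eval_suite stub and the 0.85 threshold below are placeholders for your own harness:

```python
#!/usr/bin/env python3
"""Exit-code predicate: 0 if the envelope corner passes the eval suite, 1 if not."""
import json
import sys


def run_eval_suite(envelope: dict) -> float:
    # Stub: wire this to however your harness runs the suite against a pinned envelope.
    raise NotImplementedError


def main() -> int:
    envelope = json.loads(sys.argv[1])          # the cross-product corner under test
    score = run_eval_suite(envelope)
    print(f"score={score:.3f} for {envelope}", file=sys.stderr)
    return 0 if score >= 0.85 else 1


if __name__ == "__main__":
    sys.exit(main())
```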

The blame budget: making the search space tractable

The cross-product blows up fast. Five axes is 32 cells; six is 64; seven is 128. If your team is rotating model snapshots, prompts, tool descriptions, retrieval indexes, sampling configs, and a system message that gets templated with user-tier-dependent content — you have an exponential search space and a finite eval budget.

The discipline that makes this tractable is a blame budget: a cap on how many axes are allowed to drift between two consecutive eval runs. Concretely, a release-engineering rule that says "no more than two axes change between scheduled eval runs, and the eval runs every 24 hours." Now the cross-product the on-call engineer has to search is at most $2^2 = 4$ cells, not $2^5 = 32$. The bug, if it exists, is constrained by construction to live in a small neighborhood.
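
Enforcing the budget is mechanical once envelopes are structured. A sketch of the release gate, assuming envelopes are plain dicts keyed by the field names from earlier; the budget of two and the field list mirror the rule above:

```python
BLAME_BUDGET = 2  # max axes allowed to drift between consecutive scheduled eval runs

AXIS_FIELDS = (
    "model_snapshot",
    "prompt_commit",
    "tool_catalog_manifest_hash",
    "retrieval_index_version",
    "sampling_config_hash",
)


def drifted_axes(last_evaluated: dict, candidate: dict) -> list[str]:
    """Axes whose value changed since the envelope last run through the eval suite."""
    return [f for f in AXIS_FIELDS if last_evaluated[f] != candidate[f]]


def enforce_blame_budget(last_evaluated: dict, candidate: dict) -> None:
    drifted = drifted_axes(last_evaluated, candidate)
    if len(drifted) > BLAME_BUDGET:
        raise RuntimeError(
            f"{len(drifted)} axes drifted ({', '.join(drifted)}) but the budget is "
            f"{BLAME_BUDGET}; split the change across eval cycles."
        )
```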

This is the agent equivalent of "small commits, frequent merges" from traditional software engineering — and for the same reason: bisection only works when the unit of change is small enough that the search space is small enough to actually search. Teams that ship four prompt edits, a model rotation, a new MCP server, and a re-indexed retrieval corpus on the same day are not shipping; they're gambling.

The corollary is that the named, pinned envelope — not the individual commit on any single axis — is the right unit of rollback. "Roll back to envelope v47" means restoring all four or five axes to the values they had under that named version. "Roll back the prompt" without restoring the model snapshot it was tested against is rolling back to a state that has never been jointly evaluated and may itself be a new bad cell.
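
A sketch of what "envelope as the unit of rollback" looks like in code, with an in-memory registry standing in for wherever your deploy metadata actually lives; the class and names are illustrative, not an existing tool:

```python
class EnvelopeRegistry:
    """Named, pinned envelopes as the unit of promotion and rollback."""

    def __init__(self):
        self._envelopes = {}      # name -> dict of jointly evaluated axis values
        self.production = None    # name of the envelope currently in production

    def tag(self, name, envelope):
        """Store a jointly evaluated envelope under a name like 'v47'."""
        self._envelopes[name] = dict(envelope)

    def promote(self, name):
        """Make a named envelope the production state; returns the values to apply."""
        self.production = name
        return dict(self._envelopes[name])

    # Rollback is just promotion of an older name: every axis is restored together,
    # not only the one that happened to ship most recently.
    rollback_to = promote
```

"Apply" here means pushing each axis's value to its owning system: re-pin the model snapshot, check out the prompt commit, select the tool manifest, point retrieval at the named index.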

Owning the interaction surface

The org failure mode is the deepest one. Each axis has an owner. The model gets pinned by the platform team. The prompt is owned by the agent team that built the feature. The tool catalog comes from a mix of internal and third-party MCP servers. The retrieval index is owned by the search team. The sampling config is in the agent team's runtime. Nobody owns the cross-product, which means when a regression fires in an interaction cell, the diagnosis falls into the gap between teams.

The fix is organizational: name a person or a team that owns the envelope, not any particular axis. That role's job is to maintain the version-envelope schema, run the periodic cross-product evals, gate the blame-budget rule, and own the bisection runbook when an interaction-class regression fires. Without a single owner for the envelope, every regression has at least one round of "not my axis" before anyone starts looking at the interaction.

The model team will still own model rotations. The agent team will still own prompts. But somebody has to own the coupling — the surface where a model snapshot's quirks meet a tool description's ambiguities meet a retrieval re-rank's distribution shift — because that surface is where production regressions actually live, and it is invisible to any single-axis dashboard.

What this means in practice

Three things to do this quarter:

  1. Add the version envelope to your trace schema. Every agent run logs model snapshot (resolved, not aliased), prompt commit, tool catalog manifest hash, retrieval index version, and sampling config hash. If you can't filter sessions by these fields today, that's the first thing to fix.

  2. Make envelopes the unit of rollback. Tag and store named envelopes. "Promote envelope v47 to production" replaces "deploy this prompt change." Rollback restores the joint state that was last known good, not whatever happened to ship most recently on each axis.

  3. Cap the blame budget. Two axes per eval cycle, no exceptions. If three axes need to change, run the eval twice and ship the changes in two cycles. The exponential search space is the enemy; constraining drift is what keeps bisection in the realm of "an afternoon" rather than "a sprint."

The deeper claim is that agentic systems have inherited the multi-version nightmare that traditional distributed systems solved twenty years ago — only now the versioned components include the model weights, which the team didn't write and can't read. The teams that ship reliably are the ones who treat that opaque, rotating, externally-owned dependency the same way SRE treats a third-party API: pin it, version it, log the resolved snapshot, and never let "we upgraded everything at once" be the deploy strategy.

The regression isn't in any single layer. It's in the interaction — and the team that doesn't have tooling to search the interaction surface is the team that, an hour into the next incident, will be reading four green dashboards and a red eval, with no idea where to look next.
