
AI Feature Dependency Graphs: When a Prompt Edit Is a Silent Breaking Change

12 min read
Tian Pan
Software Engineer

A team owns a summarizer. Another team owns the search ranker that ingests those summaries. A third team owns a router that picks between agent personalities based on the ranker's confidence score. None of these teams have a shared on-call rotation, none of them sit in the same standup, and the only contract between them is "the previous feature's output is the next feature's input." On a Tuesday, the summarizer team tightens a prompt to fix a hallucination complaint from a sales demo. The search ranker's quality collapses six hours later. The router starts handing off to the wrong agent personality by Wednesday morning. The post-mortem will record the cause as "prompt change," but the actual cause is that these teams' AI features have quietly composed into a directed graph that nobody drew.

This is the most common shape of an AI outage that doesn't trip any of the alerts you built for AI outages. The model isn't down. The eval suite for the changed feature is green. The token cost line is flat. What broke is the interface between two features, which is a thing your dependency tooling treats as plain text because that's all it is at the API boundary — and treats as inert because plain text doesn't carry a version, a schema, or a deprecation policy.

The reason this category of failure has been slow to get respect is that it looks, on first inspection, like a regular regression. A change went in; quality dropped somewhere. The natural reflex is to blame the change author or the downstream owner. Both reflexes are wrong. The change author shipped a fix that was correct under their own eval surface. The downstream owner is being held responsible for a fixture they never agreed to maintain. The actual broken thing is the absence of a contract.

The Implicit Graph You Already Have

Walk through a typical AI product surface and count the consumption relationships. A classifier feeds a router. A summary feeds a search index. An agent's plan step feeds its execution step, and the execution step's tool outputs feed the next reasoning turn. A retrieval prompt's outputs feed a generation prompt's context. A safety prompt's verdict feeds a post-processing rewrite. Each arrow is a place where one feature's output distribution is another feature's input distribution.
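To make that concrete, here is the implicit graph written down as plain data: a minimal Python sketch with hypothetical feature names, not a real registry. The helpers are the reason to draw it at all; a Tuesday prompt edit to the summarizer reaches everything the transitive function returns.

```python
# The implicit graph from the paragraph above, written down as plain data.
# Feature names are hypothetical; each edge means "the upstream feature's
# output distribution is this feature's input distribution."
CONSUMES = {
    "search_ranker": ["summarizer"],
    "agent_router": ["search_ranker"],
    "generation_prompt": ["retrieval_prompt"],
    "rewrite_step": ["safety_verdict"],
}

def consumers_of(feature: str) -> list[str]:
    """Direct consumers: who parses this feature's output?"""
    return [down for down, ups in CONSUMES.items() if feature in ups]

def transitive_consumers(feature: str) -> set[str]:
    """Everything downstream, however many hops away."""
    found: set[str] = set()
    for down in consumers_of(feature):
        found.add(down)
        found |= transitive_consumers(down)
    return found

# The Tuesday prompt edit to the summarizer reaches both of these:
print(transitive_consumers("summarizer"))  # {'search_ranker', 'agent_router'}
```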

In a traditional services architecture, this graph is visible. The protobuf file is in version control, the service mesh logs the call, the SLO dashboard names the consumer, and a breaking change to the schema fails CI because the consumer's generated client refuses to compile. In an AI product, the same graph exists, but every arrow is a free-form string. The output schema is whatever the model emitted last week. The consumer's parser is a regex or a JSON schema with additionalProperties: true. The "breaking change" surface area is the entire posterior distribution of the upstream model, not the column names in a database.
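To see how little that lax schema protects you, here is a small illustration using the jsonschema library. The schema and the two outputs are invented for the example, but both validate cleanly, even though the second one is three times longer and hedges on every claim, which is exactly the kind of change a downstream prompt can be relying on not happening.

```python
import jsonschema

# The consumer-side "contract" described above: structurally valid,
# distributionally blind. Schema and strings are invented for illustration.
schema = {
    "type": "object",
    "properties": {"summary": {"type": "string"}},
    "required": ["summary"],
    "additionalProperties": True,
}

before = {"summary": "Q3 revenue grew 12%; churn held flat."}
after = {  # post-prompt-edit output: three times longer, hedged, same schema
    "summary": "It seems that, based on the figures available, third-quarter "
               "revenue may have grown by roughly twelve percent, while churn "
               "appears to have remained more or less flat."
}

for output in (before, after):
    jsonschema.validate(instance=output, schema=schema)  # both pass; CI stays green
```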

This is why the prompt-edit-as-breaking-change problem is harder than the service-API-as-breaking-change problem, not easier. A protobuf has roughly five ways to break and they are all enumerable by a linter. A natural language output has thousands of ways to drift — average length, refusal rate, hedging frequency, citation density, the implicit ordering of named entities, the number of bullet points in a list, the verb tense of action items. Each of these distributional axes is something the downstream consumer's prompt may be quietly relying on, often without the downstream team being aware of the dependency themselves.
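A handful of those axes are cheap to measure over a sample of outputs, which is worth showing because it is the raw material for everything the rest of this post proposes. The sketch below is illustrative only; the specific metrics and the refusal regex are assumptions, not a recommended set.

```python
import re
from statistics import mean

# Three of the axes named above, measured over a sample of upstream outputs.
# The metrics and the refusal regex are illustrative, not a recommended set;
# the point is that none of them show up in a schema.
def length_words(text: str) -> int:
    return len(text.split())

def is_refusal(text: str) -> bool:
    return bool(re.search(r"I can't|I cannot|I'm unable to", text, re.IGNORECASE))

def bullet_count(text: str) -> int:
    return sum(1 for line in text.splitlines()
               if line.lstrip().startswith(("-", "*", "•")))

def profile(outputs: list[str]) -> dict[str, float]:
    """A distributional fingerprint a downstream team could pin against."""
    return {
        "avg_length_words": mean(length_words(o) for o in outputs),
        "refusal_rate": mean(float(is_refusal(o)) for o in outputs),
        "avg_bullets": mean(bullet_count(o) for o in outputs),
    }
```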

Recent work on distribution shift in deployed LLM systems has reported performance drops of more than seventy percent under moderate shifts in prompt behavior, and the deployments that surface this most painfully are exactly the multi-stage pipelines where one feature's outputs are another feature's inputs. The drop isn't a model regression. It's a composition regression. The model is doing its job. The pipeline isn't.

The Manifest Has to Be Explicit

Pipelines this fragile only become tractable when the dependency graph stops being implicit. The most lightweight version of this is a manifest — one file per feature that declares two lists: "I consume the output of these features" and "I am consumed by these features." It is not glamorous, and it does not have to be a sophisticated artifact. A YAML file checked into the repo that names upstream sources and downstream consumers is enough to do the thing that the manifest is actually for, which is to convert a prompt edit from a one-team decision into a multi-team coordination event.
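What such a manifest might look like is sketched below, with hypothetical feature names, team names, and field names; none of this is a standard format. The only load-bearing idea is that the two lists, plus the output properties each consumer relies on, are written down in a file a reviewer can diff.

```python
import yaml  # pyyaml; in practice this is just a YAML file checked into the repo

# Hypothetical manifest for the summarizer feature. Field names and owners are
# invented; the format is not a standard, only an illustration of the two
# lists plus the output surface each consumer relies on.
MANIFEST = yaml.safe_load("""
feature: summarizer
owner: team-content
consumes:
  - source_documents
consumed_by:
  - feature: search_ranker
    owner: team-search
    relies_on:
      - summary length stays under 60 words
      - output is exactly one JSON object with keys summary and entities
  - feature: agent_router
    owner: team-agents
    relies_on:
      - confidence is a float calibrated to 0.0-1.0, not a 1-5 ordinal
""")

# This list is what a PR template or merge check reads.
print([c["owner"] for c in MANIFEST["consumed_by"]])  # ['team-search', 'team-agents']
```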

The manifest is the place where a PR template can ask, mechanically, "who consumes this feature's output, and have they reviewed this change?" The manifest is the place where a release process can refuse to merge until the named consumers acknowledge the prompt diff. The manifest is the place where, when the post-mortem asks "did anyone know that the ranker depended on the summarizer's average length staying under forty words," the answer is either "yes, it was in the manifest" or "no, and that's the gap we're closing now."

Three properties of a manifest that earn their keep:

  • It names features, not files. A feature is a contract: a user-visible behavior that a single team owns. A prompt file is an implementation detail of that contract. Naming features rather than files keeps the manifest stable across refactors where a single feature gets split into multiple prompts or vice versa.
  • It declares the output surface the consumer relies on. Not the full prompt — the distributional properties the downstream prompt is sensitive to. "Summary length under sixty words." "Output is exactly one JSON object with these keys." "Confidence score is calibrated to the 0.0–1.0 range, not a 1–5 ordinal." The manifest is the answer to the question "what about the upstream output would break me if it changed."
  • It is the input to release coordination. Without the manifest, a prompt PR is a one-team artifact and the consumers find out from production. With the manifest, the PR template lists the consumers and either auto-tags them as reviewers or requires their sign-off via a check.
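A minimal sketch of that last property, assuming the hypothetical manifest format above: the check encodes only the decision of whether every named consumer has acknowledged the prompt diff, and leaves out how approvals are actually fetched.

```python
# Decision logic only: how approvals are fetched (GitHub reviews, a bot
# comment, CODEOWNERS) is deliberately left out, and the manifest shape
# follows the hypothetical sketch above.
manifest = {
    "feature": "summarizer",
    "consumed_by": [
        {"feature": "search_ranker", "owner": "team-search"},
        {"feature": "agent_router", "owner": "team-agents"},
    ],
}

def missing_signoffs(manifest: dict, approved_by: set[str]) -> list[str]:
    """Consumers named in the manifest who have not acknowledged the prompt diff."""
    return [c["owner"] for c in manifest.get("consumed_by", [])
            if c["owner"] not in approved_by]

blockers = missing_signoffs(manifest, approved_by={"team-search"})
if blockers:
    raise SystemExit(f"prompt change not acknowledged by: {', '.join(blockers)}")
```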

Contract Tests That Run Against Reality, Not Fixtures
