
The Hidden Edges Between Your AI Features: When One Prompt Edit Regresses Three Other Teams

9 min read
Tian Pan
Software Engineer

A platform engineer changes the opening sentence of the company's "house style" preamble — a single line that anchors voice across customer-facing assistants. The change ships behind a flag. By Tuesday, the search team's relevance regression has spiked, the support bot's eval pass-rate has dropped four points, and the onboarding agent's retry rate has doubled. None of those teams touched their own code. None of them got a heads-up. The platform engineer has no idea any of this happened, because nobody was on the receiving end of an alert that said "your edit just broke three downstream features."

This is the failure mode that defines the second year of an AI org's life. The first year, every team builds its own thing in a corner. The second year, those corners start sharing artifacts — a prompt fragment here, a seeded eval set there, a tool schema reused as a contract — and the moment that sharing becomes implicit, the dependency graph between AI features becomes invisible. You now have a distributed system whose edges no one can name.

The discipline that fixes this is not a new platform. It's drawing the graph.

The artifacts your AI features actually share

When engineers talk about service dependencies, they mean APIs. Service A calls service B's /v1/users endpoint; that fact is tracked in an API gateway; the dependency is one line in a service catalog; and a breaking change forces a versioned migration. The graph is observable.

AI features compose through artifacts that nobody catalogs. The artifacts you should have already drawn on a wall include:

  • Prompt fragments: a "house style" preamble, a "be concise" directive, a refusal template. These get pasted into multiple system prompts. The copy is independent — each feature has its own — but the source of truth is informal, often a Notion page or a Slack thread.
  • Eval seeds: feature B's eval set was bootstrapped by sampling feature C's production traces a quarter ago. Feature B now thinks it owns its golden dataset; in fact its calibration is tethered to feature C's behavior at a point in time.
  • Tool schemas: feature A reuses feature B's intent classifier as a tool. The classifier's JSON schema is the contract. When B refactors a parameter name, A's tool description shifts and its agentic loop gets noisier (a minimal sketch follows this list).
  • Judge rubrics: the LLM-as-judge prompt your evals call was forked from another team's a year ago. Both copies have drifted. Both anchor on specific phrases that the model under test sometimes parrots.
  • Embeddings: feature D's retrieval index uses an embedding model that another team committed to in production. When that team upgrades, D's neighborhoods rotate.
  • Traces: half of your eval coverage is a snapshot of last quarter's production traffic. The product changed. Your evals didn't.
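
To make the tool-schema edge concrete, here is a minimal sketch. The feature names, tool name, and parameters are invented for illustration; the point is that feature A's tool definition is a verbatim copy of feature B's published schema, so the schema is the contract whether or not anyone wrote it down.

```python
# Illustrative only: the tool name, parameters, and feature names are invented.

# Feature B publishes its intent classifier as a tool. In practice, this
# JSON-style schema is the contract.
INTENT_CLASSIFIER_SCHEMA = {
    "name": "classify_intent",
    "description": "Classify a customer message into a support intent.",
    "parameters": {
        "type": "object",
        "properties": {
            "message": {"type": "string"},  # consumers rely on this name
            "channel": {"type": "string"},  # renaming this to "source" shifts
        },                                  # every consumer's tool description
        "required": ["message"],
    },
}

# Feature A reuses the classifier inside its own agent loop. The edge is this
# copy; nothing records that A now depends on B's parameter names.
FEATURE_A_TOOLS = [INTENT_CLASSIFIER_SCHEMA]
```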

The 2015 paper on hidden technical debt in ML systems names this pattern: artifacts get undeclared consumers. The output of one piece of the system is silently used as input to another. The 2026 version of that paper would just substitute "prompt fragment" or "eval seed" for "model output." Same failure, new artifacts.

How the edges form, and why they stay invisible

The edges between AI features don't form through a design review. They form through three completely human shortcuts.

The first is the copy-paste. An engineer on team A is writing a system prompt and remembers a clean preamble that the assistant team uses. Copy, paste, ship. There's now an edge between A's prompt and the assistant's preamble — but the edge is just text, and text doesn't have a graph node.

The second is the "borrowed from" message. Team B is bootstrapping evals and DMs team C: "Hey, can I grab a few hundred of your traces to seed our golden set?" Team C says sure. The edge is now in a Slack thread that will fall out of search in six months. Team B's eval calibration is anchored to team C's distribution at one moment in time. When team C changes their feature, team B's evals don't shift — they freeze, and the gap between the eval set and reality silently widens.

The third is the convention drift. The org has a "voice and tone" doc that everyone is supposed to apply. Three teams interpret it slightly differently when they write their preambles. Six months in, those three preambles are now three forks of the doc, drifting independently, and a new feature reads one of them as "the canonical version" because it happens to be linked from the most recent onboarding deck.

Each of these shortcuts is fine in isolation. The problem is that all three are happening, in parallel, across every team, every quarter. The graph that results is real — every edge has measurable behavioral consequences — but it lives in nobody's head, nobody's catalog, and nobody's CI.

The CACE principle from the 2015 paper — changing anything changes everything — was originally a statement about ML model entanglement. In an LLM-era stack, it generalizes to the artifact graph. Touch a shared artifact, change every feature that depends on it. The difference is that with model entanglement, at least the model was an explicit object you could version. The shared artifacts in an AI feature org are mostly text, and text has no version unless someone writes one.
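
One lightweight way to give shared text a version is to treat the fragment as a file with an explicit version string and a content hash that consumers can record. This is a sketch, not a prescribed format; the path, version scheme, and helper name are assumptions.

```python
# Sketch only: the file path, version scheme, and helper are assumptions.
import hashlib
from pathlib import Path

PREAMBLE_PATH = Path("prompts/shared/house_style_preamble.txt")
PREAMBLE_VERSION = "2.3.0"  # bumped by the owning team on any wording change


def load_preamble() -> tuple[str, str]:
    """Return the preamble text and a short content hash.

    A consumer that stores this hash next to its eval baseline can later
    detect that the shared text changed underneath it.
    """
    text = PREAMBLE_PATH.read_text(encoding="utf-8")
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    return text, digest
```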

The "undeclared consumer" failure mode in production

What does the failure look like? It looks like a regression you can't attribute.

A senior engineer on the assistant team ships a "small wording cleanup" to the house-style preamble. Two days later, the support bot's escalation rate ticks up. The support team investigates. Their model hasn't changed. Their prompt hasn't changed. Their evals are green. Their tool calls are firing normally. They spend a week chasing it, eventually shrug, and chalk it up to "model variance." Three weeks later, they pare back a feature to recover the metric.

The dependency was real. The wording cleanup shifted how the model interpreted a downstream instruction that referenced a phrase the preamble used to anchor. The downstream prompt didn't quote the preamble — it just relied on the preamble having shaped the model's prior. The fix would have been a one-line revert. Nobody knew to revert. The week of investigation cost more engineering time than the original wording change took to write.

Multiply this incident by every quarter. The visible cost is the engineering time spent debugging unattributable regressions. The invisible cost is the chilling effect: teams stop refactoring shared artifacts because they can't predict the blast radius. Now your prompt library is encrusted with dead language because every cleanup is too risky to ship.

This is the same dynamic the 2015 paper warned about. Undeclared consumers radically increase the cost of changing the source artifact. Eventually, the artifact ossifies. The org loses the ability to evolve its own prompts.

The discipline that has to land

Drawing the graph is a five-part discipline. None of these steps require a new platform. They require treating shared artifacts the way you already treat shared services.

1. Catalog the artifacts as first-class. Every prompt fragment, eval seed, tool schema, judge rubric, embedding model, and shared trace dataset gets a name and a row. Each row has an owner. The catalog can be a Notion table for the first quarter; the form matters less than the fact that it exists. The point is that you can answer "who owns this preamble" with a name, not a Slack search.
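
As a sketch of what "a name and a row" can mean in practice, here is one possible shape for a catalog entry. The field names and example values are assumptions; a Notion table or a YAML file in a shared repo serves the same purpose.

```python
# One possible catalog row; field names and example values are hypothetical.
from dataclasses import dataclass, field


@dataclass
class SharedArtifact:
    name: str        # "house-style-preamble"
    kind: str        # "prompt-fragment" | "eval-seed" | "tool-schema" | ...
    owner: str       # a person or team, not a Slack search
    location: str    # the repo path or doc URL that is the source of truth
    version: str     # bumped by the owner on any change
    consumers: list[str] = field(default_factory=list)  # the named edges


CATALOG = [
    SharedArtifact(
        name="house-style-preamble",
        kind="prompt-fragment",
        owner="assistant-platform",
        location="prompts/shared/house_style_preamble.txt",
        version="2.3.0",
        consumers=["support-bot", "onboarding-agent", "search-rewrite"],
    ),
]
```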

2. Name the edges. When team A reuses team B's tool as a dependency, that edge is a contract that somebody owns. The contract specifies what A relies on (this parameter exists, this output shape, this behavior) and what B is allowed to change without coordination. This is consumer-driven contract testing applied to AI artifacts. The edge is the contract, and the contract has a maintainer.
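
A consumer-driven contract can be as small as a test the consuming team owns and runs against the producer's published schema. This sketch reuses the hypothetical classifier schema from earlier; the module path is an assumption.

```python
# Consumer-owned contract test, with hypothetical names. Team A asserts only
# the parts of team B's schema it actually relies on, which tells B exactly
# what it can change without coordination.
from feature_b.tools import INTENT_CLASSIFIER_SCHEMA  # B's published definition


def test_intent_classifier_contract():
    params = INTENT_CLASSIFIER_SCHEMA["parameters"]

    # A relies on the tool name and on these two parameters existing.
    assert INTENT_CLASSIFIER_SCHEMA["name"] == "classify_intent"
    assert "message" in params["properties"]
    assert "channel" in params["properties"]

    # A relies on "message" staying required; everything else is B's to change.
    assert "message" in params["required"]
```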

3. Run a CI check on shared artifact changes. When a shared artifact is modified, the CI step looks up its consumers and fails the PR if a consumer's eval suite hasn't been re-run against the new version. This is the closest analog to a service-mesh integration test. It's also the place where modern prompt management tooling earns its budget — the tools that exist to version, test, and compare prompts are useful precisely because they make the cross-consumer regression check executable.
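
What the CI step does can be sketched in a few dozen lines, independent of which prompt-management tool executes the eval runs. The catalog file, the per-consumer marker files, and the git invocation below are all assumptions made for illustration.

```python
# CI-step sketch; the catalog format, marker files, and branch names are assumed.
import json
import subprocess
import sys
from pathlib import Path


def changed_files() -> set[str]:
    out = subprocess.run(
        ["git", "diff", "--name-only", "origin/main...HEAD"],
        capture_output=True, text=True, check=True,
    )
    return set(out.stdout.split())


def main() -> int:
    catalog = json.loads(Path("artifact_catalog.json").read_text())
    changed = changed_files()
    failures = []

    for artifact in catalog:
        if artifact["location"] not in changed:
            continue
        for consumer in artifact["consumers"]:
            # Convention (assumed): after re-running its eval suite against the
            # new version, a consumer commits a marker recording what it tested.
            marker = Path(f"evals/{consumer}/last_verified.json")
            verified = json.loads(marker.read_text()) if marker.exists() else {}
            if verified.get(artifact["name"]) != artifact["version"]:
                failures.append(
                    f"{consumer} has not re-run evals against "
                    f"{artifact['name']} v{artifact['version']}"
                )

    for failure in failures:
        print(f"BLOCKED: {failure}")
    return 1 if failures else 0


if __name__ == "__main__":
    sys.exit(main())
```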

4. Add a dependency review to the AI architecture process. When a team ships a new AI feature, the review asks the same question it asks for a microservice: what depends on you, and what do you depend on? The team draws the artifact edges before the launch. The edges go into the catalog. The launch is conditional on the edges being named.
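
The review itself is a checklist, but the "launch is conditional on the edges being named" part can be enforced mechanically. A minimal sketch, assuming the launching team declares its upstream edges and the catalog is the set of rows from step 1:

```python
# Pre-launch check, with assumed structures: block the launch unless every
# declared upstream edge resolves to a cataloged artifact with a named owner.
def review_dependencies(declared_edges: list[str], catalog: list[dict]) -> list[str]:
    """Return problems; an empty list means every edge is named and owned."""
    by_name = {row["name"]: row for row in catalog}
    problems = []
    for edge in declared_edges:
        row = by_name.get(edge)
        if row is None:
            problems.append(f"depends on '{edge}', which is not in the catalog")
        elif not row.get("owner"):
            problems.append(f"'{edge}' is cataloged but has no owner")
    return problems
```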

5. Audit the graph quarterly. Edges decay. Evals seeded from old traces stop being representative. Forked rubrics drift apart. The quarterly review walks the graph, flags edges where the source has changed without the consumer re-baselining, and schedules the cleanup work. This is the AI-era equivalent of a service ownership review.
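
The audit is mechanical once the catalog carries a content hash per artifact and each consumer records the hash it last re-baselined against. A sketch, with both structures assumed:

```python
# Quarterly-audit sketch. `baselines` maps consumer -> {artifact name -> hash
# the consumer last re-baselined on}; both structures are assumptions.
def audit_graph(catalog: list[dict], baselines: dict[str, dict[str, str]]) -> list[str]:
    stale = []
    for artifact in catalog:
        current = artifact["content_hash"]
        for consumer in artifact["consumers"]:
            recorded = baselines.get(consumer, {}).get(artifact["name"])
            if recorded != current:
                stale.append(
                    f"{consumer} -> {artifact['name']}: source changed since the "
                    f"last re-baseline (recorded {recorded}, current {current})"
                )
    return stale
```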

The investment is real — somebody has to maintain the catalog, write the CI, and run the reviews — but the alternative is the regression-attribution tax compounding every quarter.

What you actually ship when you draw the graph

The architectural realization underneath all of this is that AI features compose through artifacts the team doesn't track. Prompts, evals, traces, embeddings, judge rubrics. Every one of those is a place where an undeclared consumer can latch onto an undeclared producer, and every one of those edges is invisible to your service catalog because your service catalog still thinks dependencies live in HTTP calls.

The org that hasn't drawn the artifact graph is one shared edit away from a multi-feature regression nobody can attribute. The org that has drawn it gets two things at once: it can evolve its shared artifacts without breaking downstream features, and it can answer the postmortem question "why did three features regress at once" with a name and a diff instead of a shrug.

Start small. Pick the three artifacts that get reused most often — usually the house-style preamble, the LLM-as-judge rubric, and one or two heavily-shared tool schemas — and name their consumers. The first time you find an edge nobody knew existed, you'll know why this matters. The first time CI catches a shared-artifact change before it lands, you'll know it's working.

The graph already exists. The only question is whether it lives in your catalog or in the postmortem.

References:
D. Sculley et al., "Hidden Technical Debt in Machine Learning Systems," Advances in Neural Information Processing Systems (NeurIPS), 2015.