
The Prompt Ownership Problem: When Conway's Law Comes for Your Prompts

· 11 min read
Tian Pan
Software Engineer

Every non-trivial AI product eventually develops a prompt that nobody is allowed to touch. It has three conditional branches, two inline examples pasted in during a customer-reported incident, and a sentence that begins with "IMPORTANT:" followed by a tone instruction nobody remembers writing. The prompt is 1,400 tokens. The PR that last modified it was reviewed by an engineer who has since changed teams. When a new model comes out, nobody is confident the prompt will still work. When evals regress, nobody is sure whether the prompt, the model, the retrieval pipeline, or a downstream tool caused it. The string is shared across four services. Every team has a local override. None of the overrides are documented.

This is the prompt ownership problem, and it is the single most under-discussed failure mode in multi-team AI engineering. It is not a technical problem. It is Conway's law reasserting itself at the token level. An organization's prompts end up mirroring its org chart, its RACI gaps, and its coordination tax — and the model, which does not care about your Jira hierarchy, produces correspondingly incoherent behavior for end users who do not care either.

The reason prompts drift into no-man's-land faster than code is that they sit at the intersection of three disciplines — behavioral design, model reasoning, and evaluation — each of which is usually staffed by a different team with different metrics and different review processes. A prompt is simultaneously a product surface (what the assistant should say), a modeling artifact (how the model should be guided), and a regression risk (what your evals will light up when it changes). When one role writes, another deploys, a third evaluates, and a fourth fields the support tickets, slow quality leaks go undetected because no single person owns the whole loop. By the time somebody notices, the accountability is diffuse enough that "who broke this" becomes an archaeological question.

Why "the prompt works for my team" is a load-bearing excuse

The most revealing sentence in any prompt review is: "It works fine for my team's use case." What this sentence is really saying is that the speaker has tested the prompt against a distribution of inputs, expectations, and downstream consumers that matches their team's slice of the product — and is implicitly asking other teams to accept that evidence as general-purpose validation.

It almost never generalizes. A support assistant prompt that works for billing-related questions may fail for refund flows because the tone instructions were tuned against complaint-resolution data, not policy-disclosure data. A summarization prompt that works for engineering changelogs may produce dangerously confident summaries of legal memos. A prompt that scores 94% on one team's eval set may score 71% on another's, not because the prompt is worse, but because the distributions are different and the eval sets were authored by people optimizing for different tails.

The phrase works because it is locally true, socially defensible, and impossible to disprove without running the eval again on the other team's data. And most organizations do not have a shared eval infrastructure that makes that disproof cheap. So the prompt ships, a silent regression lands for some other team's users, and the failure shows up three weeks later in a support ticket that gets triaged to yet another team.

Every multi-team AI org needs a standing answer to this sentence. The answer is not "prove me wrong." The answer is a shared eval harness where any stakeholder can point their own representative fixtures at a candidate prompt change before it lands. Without that, "works for my team" wins every argument by default.
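One minimal shape such a harness can take is sketched below in Python. All names here are illustrative, not a real system: teams register their own fixtures, and any candidate prompt change is scored against every team's slice before merge, which is exactly the cheap disproof the paragraph above calls for.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Fixture:
    """One representative input plus an acceptance check on the output."""
    input: str
    check: Callable[[str], bool]  # True if the model output is acceptable

@dataclass
class EvalHarness:
    # fixtures keyed by the team that contributed (and owns) them
    fixtures: Dict[str, List[Fixture]] = field(default_factory=dict)

    def register(self, team: str, cases: List[Fixture]) -> None:
        self.fixtures.setdefault(team, []).extend(cases)

    def score(self, run_prompt: Callable[[str], str]) -> Dict[str, float]:
        """Score a candidate prompt (wrapped as a callable) per team."""
        results = {}
        for team, cases in self.fixtures.items():
            passed = sum(1 for c in cases if c.check(run_prompt(c.input)))
            results[team] = passed / len(cases)
        return results

# Each team contributes its own fixtures; any change runs against all of them.
harness = EvalHarness()
harness.register("billing", [Fixture("refund policy?", lambda out: "refund" in out)])
harness.register("support", [Fixture("reset password", lambda out: "password" in out)])

def candidate(inp: str) -> str:
    # Stand-in for a real model call with the candidate prompt.
    return f"Here is guidance about {inp}"

print(harness.score(candidate))  # → {'billing': 1.0, 'support': 1.0}
```

The important property is not the scoring logic, which is trivial, but the registration boundary: fixtures are contributed and owned per team, so "works for my team" becomes a number next to every other team's number.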

The RACI gap: four roles, zero owners

A useful exercise is to map the actual lifecycle of a production prompt and ask, for each stage, who is Responsible, Accountable, Consulted, and Informed. In most orgs, the answer breaks down something like this:

  • Authorship. A product engineer or prompt engineer writes the initial draft, usually inside a feature branch of application code. They are responsible. Nobody is accountable yet, because the prompt has not landed.
  • Review. The PR gets reviewed by a peer, who checks that the code change compiles and the prompt is not obviously offensive. The reviewer is rarely a prompt specialist. They are not accountable for behavioral regressions; they are accountable for the diff.
  • Deployment. A release engineer ships the build. They do not know the prompt changed. They are accountable for availability, not output quality.
  • Evaluation. If the org has an eval team, they might run the next scheduled eval batch on the release. By the time they flag a regression, the change has been in production for days. They are informed, not accountable.
  • Incident response. When something goes wrong, a support ticket lands, a triage engineer investigates, and a small war room forms. The war room has to reverse-engineer who owns the prompt in order to fix it.

The pattern is that four different roles touch the prompt, and the accountable party — the person whose on-call pager fires when the prompt regresses — is often not even one of them. The golden rule of RACI, that exactly one person is accountable, gets violated silently. Accountability diffuses into the seams between roles, which is where work goes to die.

Fixing this does not require a new tool. It requires a named owner per prompt. A CODEOWNERS-style file that maps every production prompt — not every file that contains a prompt, every prompt — to a specific person and a specific team. That person reviews every change. That person is paged when a regression lands. That person has the authority to block a change, including one that originates in a different team. The moment you add this file, a handful of prompts will turn out to have no owner, and you will learn more about your org in that exercise than in the previous six months of retros.
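A minimal version of that file is a checked-in mapping plus a CI check that fails when any prompt lacks an owner. The sketch below assumes prompts live as text files under a prompts/ directory; the layout, names, and owners are all hypothetical.

```python
import sys
from pathlib import Path

# Hypothetical PROMPT_OWNERS mapping: prompt file -> (owner, team).
# In a real repo this would live in its own checked-in file, CODEOWNERS-style.
PROMPT_OWNERS = {
    "support/triage.txt": ("alice", "support-platform"),
    "billing/refund_flow.txt": ("bob", "billing"),
}

def unowned_prompts(prompt_dir: str) -> list:
    """Return prompt files that have no entry in PROMPT_OWNERS."""
    root = Path(prompt_dir)
    if not root.is_dir():  # nothing to check (e.g. repo has no prompts yet)
        return []
    return [
        str(p.relative_to(root))
        for p in sorted(root.rglob("*.txt"))
        if str(p.relative_to(root)) not in PROMPT_OWNERS
    ]

if __name__ == "__main__":
    missing = unowned_prompts("prompts")
    if missing:
        print("Prompts with no accountable owner:", missing)
        sys.exit(1)  # fail CI until a named owner lands
```

Running this once against an existing repo is the exercise described above: the nonempty list it prints is your map of no-man's-land.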

The shared prompt library failure mode

Once an organization has more than three or four products using an LLM, someone — usually a platform team — suggests a shared prompt library. It sounds correct. Prompts are expensive to write, expensive to eval, and have general-purpose utility. Sharing them should reduce duplication, improve consistency, and let a small team of prompt experts raise the quality floor for everyone.

What actually happens is that the shared library becomes a read-only artifact from the consumer's perspective. Every team forks it. The fork lands in the team's own repo. The fork grows team-specific modifications. The original library continues to evolve in its own repo, with its own maintainers, disconnected from the forks. When the upstream library ships a quality improvement, none of the forks pick it up, because no forked team has the budget to reconcile the divergence and re-run their evals. When the upstream library ships a regression, none of the forks catch it, because the forks have diverged. The shared library has produced two kinds of debt simultaneously: divergence debt and maintenance debt. Both are invisible until the day they are not.

The alternative pattern that actually survives contact with real orgs is thin shared primitives and thick team-owned compositions. The shared library owns small, testable building blocks — a safety preamble, a structured output contract, a multi-turn pattern for clarification. Teams compose these into product-specific prompts. The library's contract with consumers is narrow and versioned. Consumers upgrade on their own schedule. Regressions have a narrow blast radius because each primitive has its own eval suite. This does not eliminate coordination, but it replaces the unbounded coordination of "whenever the shared library changes, everything might break" with the bounded coordination of a versioned dependency.
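The primitives-and-compositions split can be sketched concretely. In the illustrative Python below (all fragment names and version numbers are invented), the platform team publishes small versioned fragments, and a product team owns a thick composition pinned to the exact versions it has evaluated against.

```python
# "Thin shared primitives, thick team-owned compositions" — a sketch.
# The platform team ships small versioned blocks; each (name, version) pair
# is immutable and has its own eval suite on the platform side.
SHARED_PRIMITIVES = {
    ("safety_preamble", "1.2.0"): "Never reveal internal system details.",
    ("json_output", "2.0.0"): "Respond only with a valid JSON object.",
}

def primitive(name: str, version: str) -> str:
    """Resolve a pinned primitive; failing loudly beats silent drift."""
    try:
        return SHARED_PRIMITIVES[(name, version)]
    except KeyError:
        raise KeyError(f"{name}@{version} is not published; check the changelog")

def billing_assistant_prompt(question: str) -> str:
    # Team-owned composition: product-specific, and pinned to the primitive
    # versions this team has actually re-evaluated against.
    return "\n\n".join([
        primitive("safety_preamble", "1.2.0"),
        "You are a billing assistant. Cite the policy you rely on.",
        primitive("json_output", "2.0.0"),
        f"Customer question: {question}",
    ])

print(billing_assistant_prompt("Why was I charged twice?"))
```

The design choice doing the work is the explicit version pin: upgrading a primitive is a visible diff in the consuming team's repo, on that team's schedule, rather than an invisible change rippling out of a shared repo.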

The deeper lesson is that a shared prompt library is a product, not a utility. It needs its own PM, its own deprecation policy, its own support rotation, and its own SLA for consumer migration. Teams that launch a shared library without these things are launching a free service that nobody owns, and free services that nobody owns do not get maintained.

The eval ownership asymmetry

The second ownership problem sits one layer lower. Even if every prompt has a named owner, the eval suites that gate prompt changes usually do not. Eval sets in most organizations are authored by whichever engineer happened to care about a particular failure mode, stored in whichever directory they were first written, and maintained on an ad-hoc basis by whichever team currently feels pain.

This creates a sharp asymmetry. The person who changes a prompt is accountable for it. The person whose eval fires on that prompt change is, at best, consulted. The eval authors are usually not even informed that their eval ran. So when an eval regresses, the prompt owner has two unappealing options: fix the prompt, or debug somebody else's eval. Debugging somebody else's eval is slow, low-status work with no clear success criterion, because the eval itself may be wrong. Many prompt changes quietly ignore eval regressions by dismissing them as "flaky" or "known failures," because doing anything else is punitively expensive.

The way out of this is to make eval sets first-class owned artifacts, just like prompts. Every eval set has a maintainer. Every failure mode captured by an eval has a link to the incident or ticket that motivated it. When an eval flakes, the maintainer fixes it on a timeline; when it regresses, the prompt owner calls the maintainer. This is the organizational analog of test ownership in regular software engineering, except that LLM evals are noisier and more subjective, which raises the coordination stakes.
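One lightweight way to make that ownership machine-readable is to attach a maintainer and a motivating incident to every eval case at the schema level. The sketch below is illustrative; the field names, the maintainer handle, and the ticket ID are invented placeholders.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalCase:
    """An eval case as a first-class owned artifact, not an anonymous fixture."""
    case_id: str
    input: str
    expected_behavior: str     # human-readable acceptance criterion
    maintainer: str            # who fixes this case when it flakes
    motivating_incident: str   # the ticket/incident that justified the case

# A regression on this case now comes with a name to call
# and a documented reason for the case to exist.
REFUND_TONE = EvalCase(
    case_id="refund-tone-001",
    input="I want my money back right now.",
    expected_behavior="States the refund policy without hedging or over-apologizing.",
    maintainer="alice",               # hypothetical owner
    motivating_incident="TICKET-1432",  # hypothetical ticket reference
)
```

With this in place, "is this eval wrong or is my prompt wrong" stops being archaeology: the maintainer field answers who to ask, and the incident field answers why the case is there.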

Orgs that treat evals as a shared-infrastructure chore inevitably end up with an eval suite that nobody trusts, which is functionally equivalent to no eval suite at all. Trust is the output of visible maintenance over time. Without named maintainers, there is no maintenance, and trust does not form.

Rewiring the org so the loop closes

If all of this sounds like an organizational problem rather than an AI problem, that is exactly the point. Conway's law says your system architecture will reflect your communication structure. The practical implication for prompt engineering is that you should design the communication structure first and let the prompt architecture follow.

A few concrete rewiring patterns that work in practice:

  • One accountable owner per production prompt, documented in a file checked into the repo. No ambiguity, no diffuse accountability.
  • Shared eval harness with team-contributed fixtures. Any stakeholder can add a regression case. Any prompt change runs the full harness before merge.
  • Prompt changes travel with their eval diffs. A PR that modifies a prompt shows which eval cases moved and by how much, in the PR description, as a gate on the merge.
  • Incidents generate evals, not just fixes. Every prompt-related incident must land an eval case that would have caught it. This converts tribal knowledge into institutional memory.
  • Named stewards for shared primitives. The shared prompt library and its components have on-call owners with deprecation authority.
  • A single cross-functional role — call it the prompt steward, the behavior lead, the model-product engineer — that holds the whole loop. One person who can write a prompt, read an eval, talk to a model researcher, and reply to a support ticket. This role is the organizational answer to "the prompt works for my team" because there is now somebody whose job is to notice when it does not work for everyone else.
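The "prompt changes travel with their eval diffs" pattern in the list above can be sketched as a small merge gate: compare per-case scores for the baseline and candidate prompt, surface the cases that moved in the PR description, and block the merge on regressions past a tolerance. Thresholds and case names below are illustrative.

```python
def eval_diff(baseline: dict, candidate: dict, tolerance: float = 0.02):
    """Return human-readable score deltas and whether the change may merge."""
    lines, mergeable = [], True
    for case in sorted(baseline):
        delta = candidate.get(case, 0.0) - baseline[case]
        if abs(delta) > 1e-9:
            lines.append(f"{case}: {delta:+.2f}")
        if delta < -tolerance:
            mergeable = False  # a regression past tolerance blocks the merge
    return lines, mergeable

# Hypothetical scores for two eval cases, before and after a prompt change.
report, ok = eval_diff(
    {"refund-tone-001": 0.95, "billing-policy-002": 0.88},
    {"refund-tone-001": 0.91, "billing-policy-002": 0.90},
)
print("\n".join(report), "| mergeable:", ok)
# refund-tone-001 moved by -0.04, past the 0.02 tolerance, so ok is False
```

The report is what lands in the PR description; the boolean is the gate. The point is that the eval movement becomes part of the diff the reviewer sees, not something an eval team discovers days later.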

The organizations that are shipping AI products without chronic quality leaks are the ones that have already absorbed this lesson. Their org charts have a role whose title contains the word "prompt" or "behavior" or "model" in a way that crosses functions. Their repositories have a PROMPT_OWNERS file. Their deploy pipelines have an eval gate. Their incident postmortems include a "what eval would have caught this" section. None of this is exotic. It is the same discipline that regular software engineering learned in the 2010s, applied to a new asset class.

The takeaway

Prompts are not strings. They are production surfaces, modeling artifacts, and regression risks, all at once. When the authorship, evaluation, deployment, and support of a prompt live in four different teams with four different metrics, Conway's law guarantees that the prompt will mirror the coordination tax of those teams — and your users, who experience the assistant as a single voice, will get the inconsistency. The fix is not a better tool. The fix is naming an owner, building a shared eval, and putting one person in a seat whose whole job is to hold the loop. Do that, and the prompt stops being an organizational liability. Skip it, and every model upgrade, every feature launch, and every team reorg becomes an opportunity for a silent regression to land on users who have no idea why the assistant suddenly got worse.
