
The Prompt Author Identity Problem: Three Roles Editing the Same File

13 min read
Tian Pan
Software Engineer

Pull up the git blame on any year-old production system prompt and you will find something the engineering team is not ready to admit: the file has three authors, none of whom share a definition of what a "change" is. The engineer who refactored the instruction blocks last month logged the commit as "no functional change, just reordering for clarity." The product manager who reads the file once a quarter would describe the same diff as "you rewrote the voice — customers will notice." The ML engineer running the regression suite would call it "you broke few-shot example three, and the eval has been red ever since."

All three are right. The prompt is simultaneously code, spec, and hyperparameter, and every team that ships an AI feature long enough discovers that the file's commit history is a slow-motion three-way authorship dispute that CODEOWNERS does not capture and the diff viewer does not surface.

The standard reaction is to add more reviewers. That does not work, because the reviewers do not disagree on whether the change is good — they disagree on what the change is. Until the org separates the three authorship roles structurally, "two approvals required" just means whichever two people happen to be available end up rubber-stamping a change none of them fully understand.

Three Mental Models, One Text File

Engineering treats the prompt as code. The instinct is to apply the same hygiene every other text artifact in the repo gets: lint it, version it, write tests against it, refactor when the structure rots, deduplicate when two sections drift. An engineer reading a long system prompt sees redundant phrasing the way they would see redundant code — something to tighten. They will happily collapse three nearly identical instructions into one, because that is what good engineers do.

Product treats the prompt as a spec. To product, the prompt is the closest thing the codebase has to a written description of what the feature does. The phrasing matters because phrasing is the contract: "be concise" and "be brief" are not the same instruction, and a customer reading the feature description in marketing copy versus the feature behavior in production should not get whiplash. Tone is not decoration — it is the feature surface. When product reads "no functional change, just reordering for clarity," product hears "you rewrote the spec without consulting me."

ML engineering treats the prompt as a hyperparameter. The phrasing is a knob, the few-shot examples are training data, and the only thing that matters is what the eval suite reports. An ML engineer will happily delete a sentence that reads beautifully if the eval says it is not pulling weight, and will defend a sentence that reads terribly if the eval says it is. To ML, "broke the eval" is the only sentence in the disagreement that has empirical content.

None of these models are wrong. They are addressing different layers of the same artifact, and the file format does not distinguish between them.

What a "Change" Means to Each Role

The three roles disagree on the definition of "change" because they each see a different surface of the prompt as load-bearing:

  • For engineering, load-bearing means the instruction is non-redundant and the structure is parseable. A change is a change when behavior is altered; rephrasing for clarity is invisible.
  • For product, load-bearing means the words a model emits map to the words a customer would expect to read. A change is a change when a customer-detectable trait of the output shifts — formality, length, register, ordering.
  • For ML, load-bearing means the eval suite scores the prompt above its current baseline. A change is a change when a metric moves.

These definitions overlap but do not coincide. Engineering can ship a refactor that product calls a tone regression and ML calls a 1.8-point eval drop. Product can ship a wording tweak that engineering files under "cosmetic" and ML notices because the eval picked up a 0.4-point lift the team had been chasing for two sprints. ML can swap a few-shot example because the eval improved, and engineering will find out two months later when an unrelated change accidentally rewords the example back to its previous form and the eval mysteriously regresses.

The result is that "did the prompt change?" is an unanswerable question without specifying for whom. The PR template that asks for a description of the change is collecting an answer that is true for one role and incomplete for the other two.

The Tragedy of the Commons in the Prompt Repo

When write access to the prompt is given to everyone who needs it, and no role's veto power is encoded anywhere, the file decays in a specific pattern.

First, each role makes unilateral edits within its own mental model. Engineering refactors sections without product review because "it is just structure." Product tweaks wording without ML review because "it is just tone." ML adds and removes few-shot examples without engineering review because "it is just eval-tuning." Each edit is locally justifiable.

Second, the edits compound. The refactor moves a sentence that product wrote in a specific position because it anchored a behavioral trait. The wording tweak slightly reorders a clause that ML had calibrated against a specific failure mode. The eval-tuning edit deletes a sentence whose only purpose was, unbeknownst to ML, to keep a particular customer-facing behavior consistent.

Third, when something breaks in production six months later, none of the three roles can fully explain the prompt anymore. Engineering can describe the structure but not the eval rationale. Product can describe the voice but not the few-shot calibration. ML can describe the eval-tuning history but not why certain wording exists. The artifact has become a tragedy of the commons where the team's collective intelligence about it is lower than the sum of the parts.

This is the failure mode where teams declare a "prompt freeze" and try to rewrite the file from scratch — and discover, to their horror, that the rewrite scores worse on the eval and reads worse to product simultaneously, because the existing prompt had accumulated more constraints than anyone realized.

Structural Separation Is the Fix

The path out is to stop treating the prompt as a monolithic file and start treating it as a structured artifact with role-owned sections. The format does not need to be exotic — it can be as simple as named sections with explicit ownership comments, or as elaborate as a templated prompt where each section is sourced from a separate file with its own CODEOWNERS rule.

A workable structure has three top-level sections, each owned by exactly one role:

The system instructions section is owned by engineering. It defines the model's operating context: what tools are available, what the output format must look like, what safety rails apply, what failure modes are explicitly handled. This is the part of the prompt that has the structure of code. Engineering can refactor this section without consulting the other roles, as long as the contract it exposes to the other sections is preserved.

The behavioral spec section is owned by product. It defines the voice, tone, persona, register, brand alignment, and behavioral promises the product makes to the customer. Product can rewrite this section without engineering signoff, because changing the spec is product's job. Engineering and ML can advise, but they cannot unilaterally edit it any more than they would unilaterally edit the marketing copy.

The few-shot examples section is owned by ML. It contains the calibrated input-output pairs the eval suite depends on, with metadata about which failure modes each example guards against. ML can add, remove, and reorder examples without consulting the other roles, as long as the eval suite stays green. Engineering and product can flag examples that conflict with the system instructions or the behavioral spec, but they cannot rewrite them.
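That ownership only holds if each example carries its eval rationale alongside its text. A minimal sketch of what such a record could look like — the field names and failure-mode tags are illustrative assumptions, not a standard schema:

```python
from dataclasses import dataclass, field

@dataclass
class FewShotExample:
    """One calibrated input-output pair in the ML-owned section.

    Field names are placeholders, not an established schema.
    """
    example_id: str                # stable ID so evals can reference the example
    user_input: str                # the input shown to the model
    expected_output: str           # the calibrated response
    guards_against: list[str] = field(default_factory=list)  # failure modes this example anchors
    eval_cases: list[str] = field(default_factory=list)      # eval cases that depend on it
    owner: str = "ml-eng"          # owning role, mirroring the CODEOWNERS entry

# Hypothetical record: this example exists to pin a refusal behavior, so
# deleting it should turn the two listed eval cases red rather than pass silently.
refusal_anchor = FewShotExample(
    example_id="fs-017",
    user_input="Can you just give me the raw database dump?",
    expected_output="I can't share raw records, but here is an aggregated summary...",
    guards_against=["overcompliant-data-export", "tone-too-curt-on-refusal"],
    eval_cases=["eval/refusals/raw_export.yaml", "eval/tone/refusal_register.yaml"],
)
```

The schema itself is not the point. The point is that when someone proposes deleting fs-017, the record says which eval cases should go red and which customer-facing behavior it was holding in place.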

Modern prompt management platforms encode some of this separation natively — production aliases held by engineering managers, draft versions open to product, eval-gated promotion controlled by ML. But the same discipline transfers to a plain repo with a few conventions and a CODEOWNERS file that names a different team for each section.
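That CODEOWNERS file does not need to be elaborate. A sketch, assuming the prompt is assembled from three section files — the paths and team handles below are placeholders:

```
# Hypothetical layout: one file per prompt section, one owning team per file
prompts/support-agent/system_instructions.md   @acme/platform-eng
prompts/support-agent/behavioral_spec.md       @acme/product
prompts/support-agent/few_shot_examples.md     @acme/ml-eng
```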

The PR Template Should Force a Change-Type Declaration

Even with structural separation, edits cross section boundaries. Product wants to add a new behavioral promise that requires a new few-shot example to anchor it. ML wants to tune the eval against a new failure mode that requires a tweak to the system instructions. Engineering wants to refactor in a way that touches all three.

For these cross-cutting changes, the PR template needs a change-type taxonomy:

  • Behavior change — alters what the model does in a customer-detectable way. Requires product signoff. Triggers a full eval run plus an A/B harness if the surface is high-traffic.
  • Tone change — alters how the model phrases its output without changing what it does. Requires product signoff. Triggers a stylometric eval if one exists.
  • Structural refactor — reorganizes the prompt without intent to change behavior. Requires engineering signoff. Triggers a full eval to verify the refactor was actually behavior-preserving.
  • Eval-tuning edit — adjusts the prompt to fix a specific eval regression. Requires ML signoff. Triggers a check that the edit does not move the customer-detectable behaviors the behavioral spec governs.

Labeling each PR with one of these tags forces the author to declare their model of the change before review. A reviewer who sees a "structural refactor" tag knows to verify the eval did not move. A reviewer who sees a "tone change" tag knows to scan the diff for behavioral surface changes, not for clarity wins. The taxonomy turns "did this change behavior?" from an argument into a verifiable claim.
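The claim becomes machine-checkable with a small CI step that maps each declared change type to its required signoff and triggered checks. A sketch under assumed label names, team names, and check names — none of these are an existing tool's API:

```python
# Hypothetical CI helper: map the declared change type to required reviews and checks.
# Labels, signoff groups, and check names are placeholders for the team's own conventions.

CHANGE_TYPES = {
    "behavior-change":     {"signoff": "product",     "checks": ["full-eval", "ab-harness"]},
    "tone-change":         {"signoff": "product",     "checks": ["stylometric-eval"]},
    "structural-refactor": {"signoff": "engineering", "checks": ["full-eval"]},
    "eval-tuning":         {"signoff": "ml",          "checks": ["behavioral-spec-diff"]},
}

def requirements_for(pr_labels: set[str]) -> dict:
    """Return the signoff and checks implied by the PR's change-type label.

    Fails loudly if the author declared zero or more than one change type,
    so "did this change behavior?" is settled before review, not during it.
    """
    declared = [label for label in pr_labels if label in CHANGE_TYPES]
    if len(declared) != 1:
        raise ValueError(
            f"PR must declare exactly one change type, got {declared or 'none'}"
        )
    return CHANGE_TYPES[declared[0]]

# Example: a PR tagged as a structural refactor still triggers a full eval run,
# because "behavior-preserving" is a claim the eval has to verify.
print(requirements_for({"structural-refactor", "needs-docs"}))
# -> {'signoff': 'engineering', 'checks': ['full-eval']}
```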

The taxonomy also captures a category of changes that quietly destroy prompt repos: the unintentional cross-cutting edit. An engineer who labels a PR "structural refactor" but accidentally changes the wording of a behavioral promise is now in a category mismatch the reviewer can catch. Without the taxonomy, the diff just looks like a refactor and ships.

Shared Vocabulary, Not Shared Veto

A common failure mode in fixing this problem is over-correction: requiring every role to sign off on every change. That trades one tragedy for another — the prompt becomes uneditable because three calendars never align, and the team's iteration velocity collapses below the rate at which the underlying model is changing.

The goal is shared vocabulary, not shared veto. The three roles need to agree on:

  • What sections exist and who owns each.
  • What change types exist and which require which signoffs.
  • What "the prompt does" means at the level the customer sees, the level the eval measures, and the level the model parses.
  • What disagreements are about implementation (resolvable by the owning role) versus direction (resolvable only by escalation).

When the vocabulary is shared, the disagreements happen at design time. Product proposes a new behavioral promise and engineering points out that the system instructions cannot enforce it without a new tool, before either of them has wasted a sprint. ML proposes deleting a few-shot example whose eval contribution has decayed, and product flags that the example is the only thing preserving a customer-facing trait the spec depends on, before the example is gone and the trait quietly disappears.

When the vocabulary is not shared, the disagreements happen at merge time, in a comment thread on a PR, and the person with the strongest opinion on that day wins.

The Architectural Gap the Tooling Has Not Closed

The diff viewer in the engineering team's PR tool was designed for code, where the question "what changed?" has a single answer. The prompt repo needs a diff viewer that answers three questions simultaneously: did the structure change, did the behavioral surface change, and did the eval-measured behavior change? Some prompt management platforms are starting to ship versions of this — eval deltas alongside text diffs, semantic comparison across versions, change-type tagging — but the integration with the rest of the org's review surface is patchy.

Until that tooling matures, teams have to build the discipline manually. Section-owned files in a repo with role-named CODEOWNERS entries. A PR template with a change-type field that is a required dropdown, not an optional comment. An eval gate that runs on every PR and surfaces the delta in the review thread, so the eval-measured definition of "change" is visible regardless of which role authored the PR. A glossary in the repo that defines the shared vocabulary, so a new hire on any of the three teams reads the same definition of "behavior change" their counterparts read.
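The eval gate in particular does not need to be sophisticated to be useful. A minimal sketch, assuming the eval suite can be invoked as a function that returns per-case scores — the function name and the one-point threshold are placeholders for whatever the team's harness actually exposes:

```python
# Minimal eval-gate sketch. `run_eval_suite` and the max_drop threshold are
# assumptions; substitute the team's real eval harness and tolerance.

def run_eval_suite(prompt_path: str) -> dict[str, float]:
    """Placeholder: run the eval suite against a prompt file, return case -> score."""
    raise NotImplementedError("wire this to the team's real eval harness")

def eval_gate(baseline_prompt: str, pr_prompt: str, max_drop: float = 1.0) -> str:
    baseline = run_eval_suite(baseline_prompt)
    candidate = run_eval_suite(pr_prompt)
    lines = []
    worst_drop = 0.0
    for case, base_score in sorted(baseline.items()):
        delta = candidate.get(case, 0.0) - base_score
        worst_drop = min(worst_drop, delta)
        lines.append(f"{case}: {delta:+.2f}")
    verdict = "PASS" if -worst_drop <= max_drop else "FAIL"
    # The report is posted to the PR thread, so the eval-measured definition
    # of "change" sits next to the text diff for every role to see.
    return f"eval gate: {verdict}\n" + "\n".join(lines)
```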

The cost of building this manually is real. The cost of not building it is higher and harder to see, because it shows up as a prompt nobody can fully explain, an eval regression nobody can fully attribute, and a feature surface that drifts from its spec at a rate the org cannot measure.

The Prompt Is a Contract, Not a Config

The reframe leadership has to make is that a system prompt in production is not a configuration file the AI team owns, the way the AI team owns a feature flag or a model temperature. It is a contract between three roles, each of whom has legitimate authorship over a different part of it. The org structures every other multi-stakeholder artifact this way — the marketing-engineering-legal review on a customer-facing landing page, the design-engineering-PM review on a major UX change, the legal-security-engineering review on a data-handling contract — and the prompt should not be the exception.

The teams that figure this out will ship faster, because the disagreements that used to derail a merge happen at design time and the reviews on individual PRs become narrower and faster. The teams that do not will keep treating each prompt edit as a unique event, will keep rediscovering the same three-way authorship dispute every quarter, and will keep ending up with prompts they cannot fully explain six months after the engineer, PM, and ML engineer who last touched them have all moved on.

The next person to git-blame the file should be able to read the section headers and know who to ask. That is the bar.
