Prompt Position Is Policy: The Silent Merge Conflict When Three Teams Co-Own a System Prompt
The diff in your prompt repo says three lines changed. The behavioral diff in production says everything changed. The safety team moved a refusal rule from line 14 to line 87 to "group it with related guardrails," the product team didn't notice because the wording was identical, and a week later the eval suite is showing a 9-point drop on adversarial inputs. Nobody edited the rule. Somebody moved it. In a 2,400-token system prompt with primacy bias on guardrails and recency bias on instruction-following, moving a rule is a behavioral change as load-bearing as rewriting it — and your tooling surfaces neither.
This is the merge-conflict pattern that AI teams discover at the end of a regression review, not the beginning of one. The system prompt grew past 2K tokens sometime in late 2025. The safety team owns the top, the product team owns the middle, the agent team owns the bottom, and three months of "small edits" have silently rearranged everyone else's intent because the line-based diff tool that worked fine for code can't tell you that an instruction crossed a section boundary. The bug isn't in any single edit. The bug is that position is now policy, and you have no policy on position.
Why position became load-bearing in 2025
Through 2023 and most of 2024, the typical production system prompt was 200 to 800 tokens. At that size, position effects are real but small enough that the model's instruction-following swamps them. You write "Refuse requests that..." near the top, "Use a friendly but concise tone" in the middle, "When calling tools, prefer X over Y" at the bottom, and the model basically does what you said. Position arguments felt like prompt-engineering folklore.
Two things changed. First, system prompts crossed 2K tokens for any team running an agent with non-trivial tool catalogs, RAG context shaping, output-format contracts, and multi-locale persona rules. Second, the long-context literature caught up to what practitioners had been working around: the Lost in the Middle paper demonstrated a U-shaped performance curve in which information at the beginning and end of the context is recalled and acted upon at much higher rates than information in the middle, even for explicitly long-context models. Follow-up work in 2025 (the Serial Position Effects of Large Language Models paper is the cleanest reference) formalized this for instruction-following specifically. Mechanistic work from MIT in 2025 traced it to a structural property of causal-masked attention: tokens at the front accumulate attention weight from every subsequent token, while a token at position 500 in a 2,000-token prompt is only attended to by tokens 501 through 2,000, leaving it in the degraded middle of the U.
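If the arithmetic behind that last claim isn't intuitive, here's the asymmetry in toy form: under a causal mask a token can only be attended to by tokens at or after its own position, so the front of the prompt collects attention opportunities from everything downstream while the middle collects far fewer. The snippet below just counts those opportunities; it illustrates the shape of the bias rather than reproducing any paper's measurements, and the 2,000-token length is arbitrary.

```python
# Toy illustration: under a causal mask, how many tokens can attend
# to the token at a given position in an N-token prompt?
N = 2_000  # illustrative prompt length in tokens

def downstream_attenders(position: int, length: int = N) -> int:
    """Count of tokens (position .. length-1) whose attention can reach this token."""
    return length - position

for pos in (0, 100, 500, 1_000, 1_900):
    print(f"token {pos:>5}: reachable by {downstream_attenders(pos):>5} tokens")

# token     0: reachable by  2000 tokens
# token   100: reachable by  1900 tokens
# token   500: reachable by  1500 tokens
# token  1000: reachable by  1000 tokens
# token  1900: reachable by   100 tokens
```

Counting reachability is a crude proxy for the attention-weight accumulation the mechanistic work actually measures, but it is enough to see why the front of the prompt is structurally privileged.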
The practical consequence is that an instruction's position in a 2K+ token system prompt is comparable in influence to its content. A guardrail moved from token 100 to token 1,200 doesn't have the same enforcement strength even when the wording is identical. A tool-use rule moved from the bottom of the prompt to the middle becomes more advisory and less imperative, because recency bias was doing real work in the bottom slot.
For teams whose system prompts have crept toward 4K or 8K tokens — common for agent products with substantial tool documentation or detailed output schemas — the U gets steeper and the middle gets quieter. The 2026 mechanistic paper on transformer position bias argues the U-shape is partly a geometric property of the decoder rather than a training artifact, which means scaling the model up doesn't make this go away.
The three-team failure pattern
The shape of the failure looks like this. A safety team owns refusals, content policy, and PII rules — they want their stuff at the top because primacy bias is the friend of guardrails. A product team owns persona, tone, brand voice, and example outputs — they want their stuff somewhere visible because tone is what the user perceives. An agent team owns tool-use rules, output-format contracts, and structured-output schemas — they want their stuff at the bottom because recency bias is the friend of instruction-following on the immediate next action.
In the 200–800 token regime, all three teams could put their stuff anywhere and the model would mostly behave. In the 2K+ token regime, the three preferences become structural requirements. Guardrails must be at the top. Tool rules must be at the bottom. Persona is the only thing that can absorb the middle without hurting much. Once the prompt crosses that threshold, the implicit ordering becomes load-bearing — but nobody wrote the ordering down.
So the safety team adds a new refusal rule. They put it at line 14 because that's where their other rules live. The product team is editing the persona section and notices the file is getting long, so they consolidate by moving the new refusal "with the other safety items" — which they put in a section starting at line 80 because that grouping made narrative sense to them. The agent team adds a tool rule at the bottom. The system prompt got longer, the refusal moved from primacy slot to middle slot, and three months of evals later somebody notices the model is more easily talked into edge-case violations than it used to be.
Nobody changed any words. The semantics of the prompt, as the model experiences it, changed substantially. The line-based diff in the PR review showed a clean refactor.
What a position-aware diff would look like
A position-aware diff treats a system prompt as a sequence of sections with positions, not a sequence of lines with content. The minimum useful version answers four questions about every PR:
Did any instruction cross a section boundary? A guardrail moving from the "safety" region to the "tools" region is a major change even if the wording is identical. The diff should flag this with the same prominence a code review gives a function moving between modules.
Did the relative position of any owned section shift? If the safety section spans tokens 0–400 in main and tokens 400–800 in the PR, that is the kind of change that gets a 9-point eval drop on adversarial inputs. The diff should show position-as-percentage-of-total and call out shifts greater than some threshold (10% of total length is a reasonable starting point).
Did the total prompt length cross a position-bias threshold? Crossing the 2K-token line, or the 4K-token line, or the model's stated context-saturation point, changes how much position bias is in play. A diff that says "prompt grew from 1.8K to 2.3K tokens" should be flagged because it crossed a regime boundary, not just because it grew.
Are guardrails still in the primacy zone and tool rules still in the recency zone? A simple invariant check: enumerate the rules tagged "guardrail" and "tool" and verify their position percentiles fall in the expected zones. This catches the case where well-meaning consolidation shoved a guardrail into the middle; a sketch of the check follows this list.
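Here's what that fourth check can look like in practice. This is a sketch, not a standard: the 20%/80% zone thresholds, the tag names, and the assumption that your harness can already hand you each rule's start token are all choices you'd tune against your own evals.

```python
from dataclasses import dataclass

# Illustrative zone thresholds: guardrails must start in the first 20% of the
# prompt, tool rules in the last 20%. Tune these against your own eval results.
PRIMACY_MAX_PCT = 0.20
RECENCY_MIN_PCT = 0.80

@dataclass
class Rule:
    name: str
    tag: str          # "guardrail", "tool", "persona", ...
    start_token: int  # where the rule begins in the assembled prompt

def check_position_invariants(rules: list[Rule], total_tokens: int) -> list[str]:
    """Return violations: tagged rules sitting outside their expected zone."""
    violations = []
    for rule in rules:
        pct = rule.start_token / total_tokens
        if rule.tag == "guardrail" and pct > PRIMACY_MAX_PCT:
            violations.append(
                f"{rule.name}: guardrail at {pct:.0%} of prompt, expected in the top {PRIMACY_MAX_PCT:.0%}")
        if rule.tag == "tool" and pct < RECENCY_MIN_PCT:
            violations.append(
                f"{rule.name}: tool rule at {pct:.0%} of prompt, expected in the bottom {1 - RECENCY_MIN_PCT:.0%}")
    return violations
```

Wired into CI, a non-empty violations list blocks the PR the same way a lint failure would.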
The first version of this tool doesn't need any ML. A YAML manifest declaring section ranges and ownership, a parser that splits the prompt by section markers (XML tags or comment delimiters), and a CI check that diffs section positions and flags movements: that's a weekend project. The hard part is getting the three teams to agree on the manifest.
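A sketch of those pieces under stated assumptions: XML-style section markers, PyYAML for the manifest, character offsets as a cheap proxy for token positions, and a 10% shift threshold. The section names, team names, and marker format are illustrative, not a standard.

```python
import re
import yaml  # assumes PyYAML is available

# manifest.yaml -- the artifact the three teams have to agree on
MANIFEST = yaml.safe_load("""
sections:
  - name: guardrails
    owner: safety-team
  - name: persona
    owner: product-team
  - name: task_output
    owner: product-team & agent-team   # jointly owned
  - name: tools
    owner: agent-team
""")

def section_spans(prompt: str) -> dict[str, tuple[float, float]]:
    """Map section name -> (start, end) as a fraction of total prompt length.

    Uses character offsets as a cheap proxy for token positions.
    """
    spans = {}
    for section in MANIFEST["sections"]:
        name = section["name"]
        match = re.search(rf"<{name}>.*?</{name}>", prompt, re.DOTALL)
        if match:
            spans[name] = (match.start() / len(prompt), match.end() / len(prompt))
    return spans

def position_diff(old_prompt: str, new_prompt: str, threshold: float = 0.10) -> list[str]:
    """Flag sections whose relative start position shifted by more than `threshold`."""
    owners = {s["name"]: s["owner"] for s in MANIFEST["sections"]}
    old, new = section_spans(old_prompt), section_spans(new_prompt)
    flags = []
    for name in old.keys() & new.keys():
        shift = abs(new[name][0] - old[name][0])
        if shift > threshold:
            flags.append(f"{name} (owner: {owners[name]}): start moved {shift:.0%} of prompt length")
    flags += [f"{name}: section removed" for name in old.keys() - new.keys()]
    flags += [f"{name}: section added" for name in new.keys() - old.keys()]
    return flags
```

Run it in CI against main and the PR branch; anything in the returned list gets the same treatment as a failing test.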
Sectioned system prompts as a refactor target
Most production system prompts in 2026 are still written as a single flat string. The refactor that has to land is sectioning the prompt into named, owned regions with explicit ordering. The structure that works for most teams looks like:
A guardrails section at the top, owned by safety, containing refusals, content policy, PII handling, and jurisdiction-specific rules. This region is invariant: nobody else can edit it, and the safety team commits to keeping it appendable rather than reorderable.
A persona and context section in the upper-middle, owned by product, containing brand voice, tone calibration, audience assumptions, and high-level capability framing. This region absorbs the middle of the U-shape because tone violations are recoverable in a way guardrail violations are not.
A task and output section in the lower-middle, owned jointly by product and agent, containing the user-facing task framing and the output contract. This region is where most edits actually happen and where the most coordination is required.
A tool-use and immediate-action section at the bottom, owned by agent, containing tool selection rules, structured-output schemas, and the format the model should emit on its very next turn. This is the recency slot and it has to stay imperative.
Each section starts and ends with explicit markers: XML tags, comment delimiters, or whatever your eval harness can parse. Each section has an owner annotation in a manifest file. PRs that touch a section require review from that section's owner. PRs that move a rule across a section boundary require review from both owners and a flag in the PR description naming the position change as load-bearing.
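The review-routing half of that is mechanical enough to sketch. Assuming the diff step described earlier already reports which sections a PR touched and which rules crossed a boundary, the ownership lookup is just a CODEOWNERS-style mapping; the section and team names below are placeholders.

```python
# CODEOWNERS-style mapping from prompt section to the team(s) that must review it.
# Derive this from the same manifest the diff tool reads so the two can't drift apart.
OWNERS: dict[str, set[str]] = {
    "guardrails": {"safety-team"},
    "persona": {"product-team"},
    "task_output": {"product-team", "agent-team"},  # jointly owned
    "tools": {"agent-team"},
}

def required_reviewers(touched_sections: set[str],
                       boundary_crossings: list[tuple[str, str]]) -> set[str]:
    """Teams whose sign-off a PR needs.

    touched_sections: sections whose text changed.
    boundary_crossings: (from_section, to_section) pairs for rules that moved
    across a section boundary; both sections' owners must review those.
    """
    reviewers: set[str] = set()
    for section in touched_sections:
        reviewers |= OWNERS[section]
    for src, dst in boundary_crossings:
        reviewers |= OWNERS[src] | OWNERS[dst]
    return reviewers

# A rule moved out of guardrails into persona: both teams have to sign off.
print(required_reviewers({"persona"}, [("guardrails", "persona")]))
# -> {'safety-team', 'product-team'} (set ordering may vary)
```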
This is more discipline than most teams have today. It's also less discipline than the average code repo has for module ownership, where CODEOWNERS files have been load-bearing for a decade.
What to put in your eval suite
Sectioned prompts and position-aware diffs are necessary but not sufficient. The eval suite has to verify the position invariants are doing the work you think they are. Three additions to the existing eval set:
A position-permutation eval. For each of your top failure-mode categories (guardrail violations, off-tone outputs, tool misuse), run the suite once with the rule in its canonical position and once with the rule moved to the middle of the prompt. The delta tells you how much your behavior depends on position vs. content. If the delta is small, you can tolerate looser ordering discipline. If it's large, your CI has to be strict about it. Most teams are surprised by how large the delta is for guardrails specifically. A harness sketch follows this list.
A length-threshold eval. Run the suite at the current prompt length, at 1.5x, and at 2x — synthesize the extra content with neutral filler that sits in the middle. This tells you how robust the prompt is to growth, which is a thing that will happen whether you plan for it or not. A prompt that passes at 2K but fails at 4K is a prompt that will silently regress over the next two quarters as the three teams keep adding their stuff.
A cross-section eval. Specifically test the interactions between sections — a refusal rule plus a persona rule plus a tool rule that all touch the same edge case. The failure mode you're hunting is "the persona section softened the refusal because the model is doing recency-weighted averaging across instructions." This won't show up in section-level evals.
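A sketch of the first two evals, assuming you already have an eval runner that takes an assembled prompt and returns a pass rate for one failure-mode category. The run_eval callable, the filler text, and the character-based midpoint are placeholders for whatever your harness actually does; none of this is a standard API.

```python
from typing import Callable

def move_rule_to_middle(prompt: str, rule: str) -> str:
    """Remove `rule` from its canonical slot and reinsert it at the midpoint of the prompt."""
    assert rule in prompt, "rule must appear verbatim in the canonical prompt"
    stripped = prompt.replace(rule, "", 1)
    mid = stripped.rfind("\n", 0, len(stripped) // 2) + 1  # snap to a line boundary
    return stripped[:mid] + rule + "\n" + stripped[mid:]

def pad_to_length(prompt: str, factor: float) -> str:
    """Grow the prompt by roughly `factor` with neutral filler placed in the middle."""
    filler_unit = "\n(neutral filler for the length-threshold eval)\n"
    repeats = int(len(prompt) * (factor - 1.0) / len(filler_unit))
    mid = len(prompt) // 2
    return prompt[:mid] + filler_unit * repeats + prompt[mid:]

def position_permutation_delta(prompt: str, rule: str,
                               run_eval: Callable[[str], float]) -> float:
    """Pass-rate drop when a rule is moved from its canonical slot to the middle.

    run_eval(prompt) -> pass rate for one category (e.g. adversarial guardrail
    cases); it is your existing eval runner, assumed rather than provided here.
    """
    canonical = run_eval(prompt)
    permuted = run_eval(move_rule_to_middle(prompt, rule))
    return canonical - permuted
```

A large delta for guardrails is the signal that the position checks in CI need to be blocking rather than advisory; pad_to_length run at 1.5x and 2x gives you the length-threshold numbers for the second eval.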
The architectural realization
The system prompt is shared infrastructure. Three or more teams write to it. The teams have legitimate, structural preferences about position that conflict with each other. The model treats position as roughly equal in importance to content for prompts past a certain length. And the tooling — diff tools, review processes, version-control systems — surfaces line-level changes and is blind to position-level changes.
This is the same problem that monorepos solved for code with CODEOWNERS files, that microservices solved for APIs with schema registries, and that Kubernetes solved for config with admission controllers. Prompt repos haven't gone through that maturation yet because they're new and because most prompts were short enough that the problem didn't bite. In 2026 they are no longer short enough. The merge conflicts are real, the tooling doesn't see them, and the eval suite catches the regression weeks after the change that caused it.
The team that ships a position-aware diff tool, a sectioned-prompt manifest, and a position-permutation eval as a paved-road internal tool will spend a quarter on it and save a year of mystery regressions. The team that doesn't will keep paying the tax — one nine-point eval drop at a time, attributed to a model upgrade or a config drift or anything except the line-based diff that hid the actual change.
Position is policy. Write the policy down.
- https://arxiv.org/abs/2307.03172
- https://aclanthology.org/2025.findings-acl.52.pdf
- https://arxiv.org/html/2406.15981v1
- https://atlassc.net/2026/03/30/the-architecture-of-prompt-sequencing
- https://arxiv.org/html/2507.13949
- https://www.lakera.ai/blog/prompt-engineering-guide
- https://launchdarkly.com/blog/prompt-versioning-and-management/
- https://www.braintrust.dev/articles/best-prompt-versioning-tools-2025
- https://www.getmaxim.ai/articles/prompt-versioning-best-practices-for-ai-engineering-teams/
- https://visualstudiomagazine.com/articles/2023/06/27/complex-prompting.aspx
