The Policy File: Why Your Refusal Rules Don't Belong in Your System Prompt
A safety reviewer at a fintech startup pushed a four-line addition to the system prompt last quarter. The change: a refusal rule preventing the assistant from giving specific tax advice for a jurisdiction the company didn't have a license to operate in. Reasonable, narrow, audit-clean. The rule landed on Tuesday. By Friday the eval suite was showing a 7-point drop on a customer-onboarding flow that had nothing to do with tax — the model had started hedging on every question that mentioned a country, including "what currency does this account hold." The product team backed out the change. The safety team re-shipped it the following week with slightly different wording. Three weeks later, the same regression appeared in a different shape, and the next safety edit broke a different unrelated flow.
The bug here isn't the wording. The bug is that the refusal rule is in the wrong place. It's wedged inside a 2,400-token artifact that also contains the assistant's conversational voice, its formatting contract, its task instructions, and a half-dozen other policy clauses — and every edit to any of those concerns is a behavioral edit to all of them, because the model can't tell which sentence is policy and which is style. Production system prompts grow into tangled monoliths because three orthogonal concerns are pretending to be one, and teams that haven't factored them out pay the integration tax on every edit.
Three concerns, three reviewers, three cadences
Open the system prompt of any sufficiently mature LLM product and you'll find three kinds of text wedged together with no separator more meaningful than a blank line.
The first kind is conversational instruction — the assistant's voice, persona, default tone, how to ask clarifying questions, how to handle ambiguity. ("You are a helpful assistant for an e-commerce store. Be warm but concise. When the user is upset, lead with empathy.") This is a product-writer concern. The reviewer is the head of product or a designer. It changes when the brand voice changes, which is rare.
The second kind is output formatting — the structural contract on the response. Markdown vs plain text, code in fenced blocks, bullet rules, never include emoji unless asked. ("Respond in markdown. Use H3 headings for product categories. Quote prices with the local currency symbol.") This is an engineering concern. The reviewer is whoever owns the renderer downstream of the model. It changes when the renderer changes, which is also rare, but for completely different reasons.
The third kind is policy — the refusal rules, jurisdictional constraints, competitor-mention bans, escalation triggers, anything the model must or must not do regardless of how the user phrased the request. ("Never recommend a competitor product by name. Refuse requests for medical or legal advice. If the user mentions self-harm, escalate to the crisis-resources flow.") This is a compliance and trust-and-safety concern. The reviewer is legal, compliance, or T&S. It changes whenever the regulatory or risk landscape shifts, which is often — and sometimes urgently, in response to a single bad incident.
Three concerns, three different reviewers, three completely different reasons to change. And in most production systems, all three live in the same string, get diffed by the same tool, get reviewed by the same person, and ship on the same release cadence. The team learns this the first time a one-line policy edit causes a behavioral regression on a feature the policy edit had nothing to do with — and then learns it again, and then learns it again, until somebody finally proposes the factoring.
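Concretely, the monolith usually looks something like the sketch below: one string constant, three owners, and nothing but source comments (which the model never sees) to mark which sentence belongs to whom. The assistant text is adapted from the examples quoted above; the variable name is illustrative.

```python
# One string, three concerns. The ownership comments exist only in the source
# file; the compiled prompt is an undifferentiated wall of instructions.
SYSTEM_PROMPT = (
    # Conversational instruction: owned by product, changes with the brand voice.
    "You are a helpful assistant for an e-commerce store. Be warm but concise. "
    "When the user is upset, lead with empathy.\n\n"
    # Output formatting: owned by whoever owns the renderer downstream.
    "Respond in markdown. Use H3 headings for product categories. "
    "Quote prices with the local currency symbol.\n\n"
    # Policy: owned by legal / trust-and-safety, changes with the risk landscape.
    "Never recommend a competitor product by name. "
    "Refuse requests for medical or legal advice. "
    "If the user mentions self-harm, escalate to the crisis-resources flow."
)
```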
Why the monolith always degrades
The monolith doesn't fail because any individual edit is wrong. It fails because the edits interact, and the interactions are invisible at edit time.
A long system prompt has position bias — instructions near the top and bottom carry more weight than instructions in the middle. Refusal rules placed early read as identity ("you are the kind of assistant that doesn't do X"); placed late they read as the most recent instruction the model saw. Same words, different behavior. So when the safety team adds a fifth refusal clause and groups it with the existing four for tidiness, they shift the relative position of every other clause — and behavior moves on tasks nobody touched.
Worse, refusal language is generative. Telling the model "do not give tax advice for jurisdictions outside the US" doesn't just block tax advice. It changes the conditional distribution over every response that mentions a non-US jurisdiction, because the model is now primed for caution in that semantic neighborhood. The hedge spreads. The model starts adding disclaimers to currency conversions, to time-zone questions, to country-of-origin lookups — anything that brushes against the activated concept.
This is why teams who measure see the same pattern: a clean, narrow, well-intentioned policy edit produces a diffuse, hard-to-attribute regression across unrelated evals. The fix isn't better wording. The fix is that the policy shouldn't have been concatenated into the prompt in the first place. It should have been a separate artifact, with its own evaluator, its own taxonomy, and its own compile-time injection point.
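As a rough sketch of what that separate artifact could look like: rules stored as structured records with stable ids and a category from the team's own taxonomy, compiled into the prompt at one fixed injection point. PolicyRule, POLICY, and compile_system_prompt are illustrative names, not any particular framework's API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PolicyRule:
    id: str        # stable identifier that evals and audit logs can reference
    category: str  # taxonomy bucket: "jurisdiction", "competitor", "escalation", ...
    text: str      # the single sentence the model will actually see

POLICY = [
    PolicyRule("pol-017", "jurisdiction",
               "Do not give specific tax advice for jurisdictions outside the US."),
    PolicyRule("pol-018", "competitor",
               "Never recommend a competitor product by name."),
]

def compile_system_prompt(voice: str, formatting: str, rules: list[PolicyRule]) -> str:
    """Assemble the prompt at build time. The policy block lands in one fixed
    position, so adding a rule never reshuffles the voice or formatting text."""
    policy_block = "\n".join(f"- [{r.id}] {r.text}" for r in rules)
    return f"{voice}\n\n{formatting}\n\nPolicy (always applies):\n{policy_block}"
```

The stable ids are what make regressions attributable: when an eval moves after a policy release, the report can name a rule rather than just "the prompt changed."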
What factoring looks like
The factoring has three pieces, one per concern, and each piece becomes a different kind of object in your codebase.
Conversational instruction stays as prose. It's the only piece that genuinely needs to be free-text, because voice is hard to schematize. It lives in a versioned file owned by product, reviewed by product, and changed on the rare occasions product cares to change it. Engineering touches it only to update placeholders.
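Here is a minimal sketch of the engineering side of that arrangement, assuming the prose lives in a product-owned file such as prompts/voice.md and marks its placeholders with string.Template's $name syntax; the path and variable names are hypothetical.

```python
from pathlib import Path
from string import Template

def load_voice(store_name: str, support_hours: str) -> str:
    """Fill in the product-owned persona file. Engineering's only interface to
    this artifact is the placeholder set; the prose itself is product's to edit."""
    raw = Path("prompts/voice.md").read_text()
    # substitute() raises KeyError if product adds a placeholder that engineering
    # has not wired up yet: a loud failure at prompt-compile time, not a silent
    # blank in production.
    return Template(raw).substitute(
        store_name=store_name,
        support_hours=support_hours,
    )
```

The useful property is that a diff to prompts/voice.md only ever shows prose changes, and the placeholder set is the entire contract the engineering reviewer has to check.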
