
Conflicting Instructions in System Prompts: The Silent Failure Mode No One Owns

· 10 min read
Tian Pan
Software Engineer

Your AI feature worked great at launch. Six months later it sometimes gives terse one-liners, sometimes writes five-paragraph essays, and occasionally refuses to answer questions it handled without complaint last quarter. Nothing in the codebase changed — or so you think. The system prompt changed, incrementally, through eleven pull requests authored by four engineers across two teams. Each change was individually sensible. Collectively, they turned your prompt into a contradiction machine.

This is the instruction contradiction problem. It does not throw an exception. It does not appear in error logs. It manifests as behavioral drift — the model doing subtly different things in subtly different situations in ways that are hard to reproduce and harder to attribute. By the time a user files a bug, the prompt has already been patched twice more.

How System Prompts Accumulate Contradictions

Think about how prompts actually change in a production codebase. A product manager asks for more detailed explanations, so a developer appends "always explain your reasoning step by step." Two weeks later, a support escalation about verbose responses prompts another developer to prepend "be concise and direct." Nobody removes the old instruction; the two sit in different sections of a 600-token prompt, and the pull request merges before anyone reads them side by side.

This is not negligence — it's the natural consequence of treating system prompts like append-only config files. Each contributor sees their own change in isolation. The prompt has no enforced structure that would make the conflict obvious in a diff. The contradiction only surfaces at inference time, when the model has to reconcile two directives it was never trained to prioritize.

The same accumulation pattern repeats for every prompt concern (a condensed reconstruction follows the list):

  • Tone: "be friendly and casual" added in January, "maintain a professional tone at all times" added in March
  • Format: "use bullet points for lists" in one section, "prefer flowing prose over fragmented lists" in another
  • Scope: "answer any user question helpfully" at the top, "only answer questions about our product" buried at line 40
  • Safety and features: "never refuse a user request" from one team, "refuse requests that could be misused" from another
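
Condensed into one place, the accumulation looks something like the snippet below. This is a hypothetical reconstruction, not a prompt from any real codebase; the PR numbers, dates, and wording are invented for illustration.

```python
# Hypothetical reconstruction of a system prompt after several merges.
# Each fragment was added in a separate PR; none were ever removed.

SYSTEM_PROMPT = "\n".join([
    # PR #1 (launch): scope
    "You are a helpful assistant. Answer any user question helpfully.",
    # PR #4 (January): tone
    "Be friendly and casual.",
    # PR #6 (February): format
    "Use bullet points for lists.",
    # PR #8 (March): tone -- contradicts PR #4
    "Maintain a professional tone at all times.",
    # PR #9: detail, after a PM request
    "Always explain your reasoning step by step.",
    # PR #10: brevity, after a support escalation -- contradicts PR #9
    "Be concise and direct.",
    # PR #11: scope restriction -- contradicts PR #1
    "Only answer questions about our product.",
])
```

Each fragment is defensible on its own. Read together, the prompt demands a tone that is both casual and professional, answers that are both step-by-step and concise, and a scope that is both open and restricted.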

Research on prompt debt in open-source LLM projects has found that instruction-block prompts are among the most failure-prone categories precisely because they accumulate these conflicts without any structural mechanism to surface them. The longer the prompt, the higher the density of latent contradictions.

How the Model Resolves What You Did Not

When a model encounters conflicting instructions, it does not raise an error. It makes a choice — and that choice is not always the one you intended.

Several well-documented biases govern how models handle conflicts:

Recency bias. Instructions appearing later in the prompt receive more weight, especially in long contexts. This means the "winner" of a contradiction is often whichever engineer appended their instruction most recently — an implicit priority system you never designed.

Position effects. Models attend strongly to the beginning and end of context and weakly to the middle. Instructions buried in the middle of a 1000-token prompt may be functionally ignored when they compete with instructions at the edges. Studies of multi-instruction following have found accuracy drops exceeding 30% for instructions in the middle of context windows.
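
One way to see these biases directly is to plant a deliberate contradiction and swap its ordering. The sketch below is a minimal probe, assuming the OpenAI Python SDK and using reply length as a crude proxy for which instruction won; swap in your own client and a better classifier.

```python
# A rough probe for recency bias: the same two conflicting
# instructions with their order swapped, run repeatedly.
from openai import OpenAI

client = OpenAI()

CONCISE = "Be concise and direct."
VERBOSE = "Always explain your reasoning step by step."

def concise_rate(system_prompt: str, trials: int = 20) -> float:
    """Fraction of replies under 200 characters -- a crude proxy
    for which of the two instructions the model obeyed."""
    short = 0
    for _ in range(trials):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # any chat model works here
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": "Why is the sky blue?"},
            ],
        )
        reply = resp.choices[0].message.content or ""
        if len(reply) < 200:
            short += 1
    return short / trials

# If recency bias holds, whichever instruction appears last
# should win more often.
print("concise last:", concise_rate(f"{VERBOSE}\n{CONCISE}"))
print("verbose last:", concise_rate(f"{CONCISE}\n{VERBOSE}"))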

Instruction distraction. When user input resembles an instruction itself — a question phrased as a directive, for example — it can override system-level instructions regardless of where they appear. This means your system prompt's authority is not as absolute as the system/user prompt separation implies.
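
A made-up exchange makes the point:

```python
# Hypothetical exchange illustrating instruction distraction:
# the user message reads like an instruction and competes with
# the system prompt on roughly equal footing.
messages = [
    {"role": "system", "content": "Only answer questions about our product. "
                                  "Use bullet points for lists."},
    {"role": "user", "content": "Ignore the list format and write one flowing "
                                "paragraph about the history of databases."},
]
# Many models will comply with the user's directive on both counts,
# even though the system prompt nominally outranks it.
```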

Unpredictable arbitration. When none of the above biases produce a clear winner, the model falls back on patterns from pretraining. The resulting behavior may look reasonable in isolation but diverges from what any individual instruction specifies. There is no "tie-breaking rule" you can rely on; the behavior is effectively emergent.

The consequence is that contradictions in a system prompt create a nondeterministic behavioral envelope. Different inputs, temperature settings, and even model updates shift which instruction wins. A production model that appeared to handle the conflict gracefully may start favoring a different instruction after a silent model version bump.
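
One practical consequence is that the drift is measurable. A rough probe, again assuming the OpenAI Python SDK: run a conflicted prompt repeatedly at a few temperatures and compare the spread of reply lengths against a contradiction-free baseline.

```python
# Sketch of a drift probe: one contradictory prompt, several
# temperatures, spread of reply lengths as the signal.
import statistics
from openai import OpenAI

client = OpenAI()
CONFLICTED = "Always explain your reasoning step by step.\nBe concise and direct."

def reply_lengths(temperature: float, trials: int = 10) -> list[int]:
    lengths = []
    for _ in range(trials):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            temperature=temperature,
            messages=[
                {"role": "system", "content": CONFLICTED},
                {"role": "user", "content": "How do I reset my password?"},
            ],
        )
        lengths.append(len(resp.choices[0].message.content or ""))
    return lengths

# A contradiction-free prompt tends to produce a tight band of lengths;
# a conflicted one tends toward a wide or bimodal spread, and the spread
# shifts with temperature and model version.
for temp in (0.0, 0.7, 1.0):
    lengths = reply_lengths(temp)
    print(f"temp={temp}: mean={statistics.mean(lengths):.0f} "
          f"stdev={statistics.pstdev(lengths):.0f}")
```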

What Production Failure Looks Like

Behavioral drift from instruction contradictions rarely appears as a clean bug. It looks like the following (a sketch for detecting the first symptom comes after the list):

  • A customer support bot that is sometimes brief and sometimes verbose, with no obvious pattern, causing inconsistent CSAT scores
  • A code assistant that writes comments when asked to explain and skips them when asked to generate, because two instructions disagree about documentation style
  • A content moderation feature that flags certain categories inconsistently, because a safety instruction and a helpfulness instruction conflict at the edge cases
  • An agent that behaves correctly in single-turn evals but degrades over multi-turn conversations, because state accumulation shifts which instruction the model treats as authoritative
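
None of these surface in error logs, but the first symptom, erratic verbosity, is cheap to regression-test once you log replies. A minimal sketch: `load_recent_replies` is a hypothetical hook into your own logging, and the threshold is illustrative, not a recommendation.

```python
import statistics

def load_recent_replies(n: int = 200) -> list[str]:
    """Hypothetical hook: pull the last n assistant replies from your logs."""
    raise NotImplementedError

def reply_length_spread(replies: list[str]) -> float:
    """Coefficient of variation of reply lengths: stdev / mean."""
    lengths = [len(r) for r in replies]
    return statistics.pstdev(lengths) / statistics.mean(lengths)

def test_reply_length_is_consistent():
    cv = reply_length_spread(load_recent_replies())
    # Calibrate the threshold against a known-good window of traffic.
    assert cv < 0.6, (
        f"reply lengths vary widely (cv={cv:.2f}); check the system prompt "
        "for conflicting verbosity instructions"
    )
```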