Forced Conformance Bias: When the Model Rounds Your Intent to the Distribution Mode
A user asks for "a haiku about Postgres replication." The model returns a five-line poem about databases that mentions servers and synchronization, sounds confident, scans like English, and is not a haiku. A different user asks for "a regex that matches IPv6 addresses but explicitly rejects IPv4-mapped forms." The model returns a regex that matches IPv6 addresses, including the IPv4-mapped forms it was told to reject, and asserts in prose that the regex meets the spec. A third user asks for "an explanation of monads using only cooking metaphors, no mention of functions or types." The model gives a mostly-cooking explanation that uses the words "function" twice and "type" three times.
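Of the three, the regex failure is the cheapest to catch mechanically, which makes it a good illustration of how quiet the miss is. Below is a minimal sketch of the check the user would have needed, using Python's standard ipaddress module; the candidate pattern is a deliberately sloppy stand-in for a model answer, not output from any particular model:

```python
import ipaddress
import re

# Stand-in for a model-produced "IPv6 but not IPv4-mapped" regex.
# Deliberately sloppy: it accepts any run of hex digits, colons, and dots,
# so it also matches the IPv4-mapped forms the prompt excluded.
candidate = re.compile(r"^[0-9A-Fa-f:.]{2,45}$")

def accepts_ipv4_mapped(pattern: re.Pattern[str], s: str) -> bool:
    """True if the pattern matches s and s parses as an IPv4-mapped IPv6 address."""
    try:
        addr = ipaddress.IPv6Address(s)
    except ipaddress.AddressValueError:
        return False
    return addr.ipv4_mapped is not None and bool(pattern.fullmatch(s))

assert candidate.fullmatch("2001:db8::1")                  # plain IPv6: fine
assert accepts_ipv4_mapped(candidate, "::ffff:192.0.2.1")  # the quiet spec violation
```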
None of these is a refusal. None is an obvious hallucination. The model didn't say "I can't do that." It produced a confident, well-formed response that quietly relaxed the part of the request furthest from its training distribution mode, and the user has to be paying close attention to notice. The failure mode has a name worth using: forced conformance bias — the model rounds your intent toward the typical answer, the user reads the result as a faithful response, and the eval suite that should have caught it was itself drawn from typical phrasings.
This is not a model quality problem in the usual sense. The model is doing exactly what its training pushed it toward. It is a product reliability problem, and a team whose evals live at the mode of the intent distribution is calibrating against the easy half of its actual workload.
The mechanism: typicality bias all the way down
The cause sits inside the post-training pipeline. Recent work on RLHF and preference data has shown that human annotators systematically prefer more typical responses, independent of task-specific correctness. Mere-exposure effects, processing fluency, and the cognitive ease of evaluating familiar text all push raters toward the answer that pattern-matches what a competent response usually looks like in that context. The bias is not small — empirical estimates put the typicality coefficient near 0.57, meaning a meaningful chunk of "preferred" in preference annotation is just "more typical."
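One compact way to write that finding down (the notation here is illustrative, following the decomposition common in the mode-collapse literature): the reward an annotator effectively expresses splits into task utility plus a typicality term,

$$\hat{r}(x, y) \;=\; r_{\text{true}}(x, y) \;+\; \alpha \,\log \pi_{\text{ref}}(y \mid x), \qquad \alpha \approx 0.57,$$

where $\pi_{\text{ref}}$ is the reference (base) policy, $r_{\text{true}}$ is the reward a correctness-only rater would assign, and $\alpha$ is the typicality coefficient.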
When this signal is plugged into a KL-regularized RLHF objective, the math compounds the bias. The KL term anchors the policy to a non-uniform reference distribution, and the reward gradient sharpens that reference rather than broadening it. Output distributions collapse onto the modes of the underlying base model, and minority responses — which include most unusual user requests — get probability-mass-starved. Researchers call the extreme version "preference collapse." Practitioners encounter the milder, everyday version every time they ask for something even slightly off-distribution.
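The compounding is visible in the standard closed form. Maximizing $\mathbb{E}_{\pi}[\hat{r}] - \beta\,\mathrm{KL}(\pi \,\|\, \pi_{\text{ref}})$ gives the usual optimum $\pi^*(y \mid x) \propto \pi_{\text{ref}}(y \mid x)\,\exp(\hat{r}(x, y)/\beta)$, and substituting the typicality-biased reward from above yields

$$\pi^*(y \mid x) \;\propto\; \pi_{\text{ref}}(y \mid x)^{\,1 + \alpha/\beta}\, \exp\!\left(\frac{r_{\text{true}}(x, y)}{\beta}\right).$$

For any positive $\alpha$, the reference distribution is raised to a power greater than one, so its modes sharpen and its tails thin. That exponent is the mass starvation in one symbol. (This is a sketch under the decomposition above, not a derivation from any one paper's exact setup.)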
The visible symptoms are predictable once you know what to look for:
- Uncertainty suppression. Aligned LLMs exhibit substantially lower output uncertainty than professional human writers on the same creative-writing tasks, and they reach for the same well-worn lexical patterns regardless of prompt.
- Diversity loss. Sampled completions from RLHF'd models are markedly less diverse than samples from the underlying base model for the same prompt; a minimal way to measure the gap appears just below.
- Constraint relaxation. When a request packs in unusual constraint combinations, the model satisfies the constraints that align with frequent training examples and quietly drops or hedges the constraints that don't.
The user-visible effect of all three: an answer that looks correct, sounds confident, and is subtly wrong in the dimension that mattered.
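Diversity loss in particular is cheap to quantify. A minimal distinct-n proxy (one of several standard choices; the function below is a sketch, not tooling from the cited work) makes the base-versus-aligned comparison concrete:

```python
def distinct_n(samples: list[str], n: int = 2) -> float:
    """Share of unique n-grams across a set of sampled completions.
    Lower scores mean the samples reuse the same phrasings, i.e. the
    output distribution has collapsed onto a few modes."""
    ngrams: list[tuple[str, ...]] = []
    for text in samples:
        tokens = text.split()
        ngrams.extend(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / max(len(ngrams), 1)

# Usage: sample the same prompt k times from the base model and from the
# aligned model, then compare distinct_n(base_samples) against
# distinct_n(aligned_samples). A large gap is the symptom in numbers.
```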
Why standard evals don't catch it
The default eval pipeline is a closed loop with the failure baked in. Eval prompts are usually written by the same engineers and contractors who write the training data, are sourced from product analytics that surface typical user intents, and are graded on holistic "helpfulness" rubrics that reward fluent responses. Each step pushes the eval distribution toward the same mode the model is biased toward.
The result is a benchmark on which the model looks fine, leadership sees the green dashboard, and a non-trivial fraction of production traffic — the part that lives in the long tail of intent — gets silently approximated. The team's first signal that anything is wrong is usually a sharp-eyed customer complaint. By that point, the team has shipped against an eval that was selecting for the failure rather than catching it.
There are three ways this shows up in instruction-following benchmarks specifically. Models that score well on IFEval often degrade sharply on multi-constraint extensions like RECAST and IFEval++, which pack four or more constraints into a single prompt. They degrade again on "cousin prompts": rephrasings, distractors, and constraint variations built around the same underlying intent. And performance drops further still in multi-turn settings like DriftBench, where models have been measured losing nearly 40% of their constraint adherence as a conversation accumulates. Each of these benchmarks works by deliberately drifting away from the typical phrasing, and each surfaces a behavior gap that a typical-phrasing eval can't see.
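The cousin-prompt move is easy to replicate in-house. An illustrative generator (a toy, not the benchmarks' actual tooling): hold the intent fixed and drift the phrasing, the distractor content, and the constraint set.

```python
from itertools import product

# One seed intent, phrased several ways.
SEED_PHRASINGS = [
    "Write a haiku about Postgres replication.",
    "Postgres replication, but as a haiku.",
]

# Suffixes that drift the prompt off its modal form without changing the task.
PERTURBATIONS = [
    "",                                               # unperturbed control
    " (Unrelated: the failover drill is on Friday.)", # distractor to ignore
    " Do not use the word 'server'.",                 # extra constraint
]

def cousin_prompts() -> list[str]:
    """Cartesian product of phrasings and perturbations: six cousins here,
    all carrying the same intent at different distances from the mode."""
    return [p + s for p, s in product(SEED_PHRASINGS, PERTURBATIONS)]
```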
If your eval suite hasn't been deliberately stressed against off-mode prompts, it doesn't matter how high the score is. You're measuring something else.
The discipline that lands: an off-mode eval slice
The fix is not to retire the regular eval. It's to add a slice that the regular eval is structurally incapable of replacing.
An off-mode eval slice is a deliberately constructed set of prompts at the tails of the intent distribution. Build it from three sources:
- Uncommon constraint combinations. Take constraints the model has clearly seen in isolation (write a poem, use only one syllable per word, avoid the letter "e", produce exactly seven lines) and stack them in combinations that almost certainly never appeared in training. Keep each constraint individually testable (a grading sketch follows this list).
- Rare format requests. Ask for outputs in formats the model rarely emits — a Mermaid sequence diagram describing a domain the model usually answers in prose, a CSV with a specific delimiter and quote escape rule, a single sentence of fixed character length.
- Edge-of-distribution intents. Borrow the shape of real long-tail queries from your analytics: the intents that show up once a week and never return as a cluster. These are the intents the typical-eval pipeline has never seen and the model has never been graded on.
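Here is a minimal sketch of what one slice entry and its grader can look like (the constraint choices and names are illustrative; the point is the per-constraint booleans, not these specific checks):

```python
from typing import Callable

# A stacked, off-mode prompt: each constraint is common on its own,
# the combination almost certainly is not.
PROMPT = "Write a poem of exactly seven lines that never uses the letter 'e'."

# One independent, programmatic check per constraint.
CHECKS: dict[str, Callable[[str], bool]] = {
    "exactly seven lines": lambda out: len(out.strip().splitlines()) == 7,
    "no letter 'e'": lambda out: "e" not in out.lower(),
}

def grade(output: str) -> dict[str, bool]:
    """Per-constraint pass/fail instead of a holistic score. Constraint
    relaxation shows up as a single False in an otherwise green row,
    which a lone 'helpfulness' grade would have averaged away."""
    return {name: check(output) for name, check in CHECKS.items()}
```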
References
- https://arxiv.org/abs/2510.01171
- https://arxiv.org/abs/2405.16455
- https://arxiv.org/html/2602.16162v1
- https://arxiv.org/html/2505.19030
- https://arxiv.org/html/2512.14754v1
- https://arxiv.org/html/2604.17650
- https://news.mit.edu/2025/shortcoming-makes-llms-less-reliable-1126
