The Capability Elicitation Gap: Why Upgrading to a Newer Model Can Break Your Product
You upgraded to the latest model and your product got worse. Not catastrophically — the new model scores higher on benchmarks, handles harder questions, and refuses fewer things it shouldn't. But the thing your product actually needs? It's regressed. Your carefully tuned prompts produce hedged, over-qualified outputs where you need confident assertions. Your domain-specific format instructions are being helpfully "improved" into something generic. The tight instruction-following that made your workflow reliable now feels loose, as if the model is running on autopilot.
This is the capability elicitation gap: the difference between what a model can do in principle and what it actually does under your prompt in production. And it gets systematically wider with each safety-focused training cycle.
Understanding why this happens — and what to do about it — is one of the more consequential engineering problems for teams building on top of frontier models.
Why Safety Training Suppresses Specific Capabilities
Post-training alignment via RLHF and its variants (DPO, Constitutional AI) improves models along axes that are broadly desirable: reduced toxicity, better refusal rates on harmful content, more calibrated uncertainty. But these changes aren't free. They occur by adjusting model weights in ways that shift behavior across all inputs, not just harmful ones.
The mechanics matter here. Safety behaviors aren't cleanly isolated: the gradient updates that reduce harmful output also suppress token distributions that merely correlate with risky behavior. Confident assertions share linguistic patterns with misinformation. Domain jargon overlaps with the in-group expert knowledge that can be misused. Precise instruction-following without caveats looks like manipulation when the request is made in bad faith. So the training pushes against all of these patterns globally.
Research has formalized this as the "alignment tax": the tax scales with the squared projection of the safety-training update direction onto the capability subspace. In plain terms: the more the safety update overlaps with the directions that carry general capability, the more you lose on production tasks. Newer papers show this is particularly pronounced in reasoning models, where safety alignment demonstrably degrades step-by-step reasoning quality.
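In symbols, a sketch of that scaling claim (the notation here is illustrative, not drawn from any specific paper): let $u$ be the unit direction of the safety-training weight update and $P_C$ the orthogonal projection onto the capability subspace. Then

$$\text{tax} \;\propto\; \lVert P_C\, u \rVert^2,$$

so an update orthogonal to the capability subspace costs nothing, while one lying entirely inside it pays the full tax.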
The key insight practitioners often miss: the capability isn't destroyed, it's suppressed. The model still has access to those behaviors — the weights haven't been zeroed out. They've been down-weighted in a way that makes them harder to trigger. This is why elicitation works.
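And elicitation can be as simple as showing the model what the suppressed behavior looks like. Below is a minimal few-shot sketch, assuming a chat-style messages API; the exemplars and the target style (terse, assertive answers) are illustrative, not a recommendation for any particular product:

```python
# Few-shot elicitation: prepend exemplars of the behavior the safety tuning
# has down-weighted (here, terse assertive answers) so the model imitates
# them instead of defaulting to hedged, over-qualified prose.
ASSERTIVE_EXEMPLARS = [
    {"role": "user", "content": "What HTTP status code signals a missing resource?"},
    {"role": "assistant", "content": "404."},
    {"role": "user", "content": "Which SQL clause filters grouped rows?"},
    {"role": "assistant", "content": "HAVING."},
]

def build_messages(system_prompt: str, user_input: str) -> list[dict]:
    """Assemble a chat request whose exemplars demonstrate the target style."""
    return (
        [{"role": "system", "content": system_prompt}]
        + ASSERTIVE_EXEMPLARS
        + [{"role": "user", "content": user_input}]
    )
```

Two or three exemplars in the target register often recover behavior that direct instructions alone no longer trigger reliably.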
What Actually Regresses and Why It's Hard to Notice
Not all capability regressions are equal. The ones that hurt production systems most fall into a few patterns:
Instruction compliance drift. The model follows the spirit of your instructions rather than their letter. You ask for a JSON object with exactly the five fields you specified, and you get six fields because the model decided one more would be helpful. This kind of drift is subtle: outputs look right but fail downstream parsing. A strict schema check, sketched after this list, catches it mechanically.
Assertiveness collapse. Newer aligned models add hedges and qualifiers that older versions skipped. Where GPT-3.5 would write "the answer is X," a newer safety-tuned model writes "it's worth noting that X is generally considered the answer, though this may vary." This is fine for general-purpose chatbots but breaks applications that need an authoritative tone: customer-facing communications, domain expert simulations, structured decision tools. (The hedge-density metric sketched after this list is one cheap way to track it.)
Style and format suppression. If your application depended on a specific output format (particular markdown structure, code style, length constraints), a new model may override this with what its training deemed "better" formatting. The model has absorbed enough RLHF signal about what "good output" looks like that your explicit instructions compete with internalized preferences.
Domain vocabulary normalization. Models fine-tuned on broad safety corpora often regularize domain-specific language toward generic alternatives. Medical, legal, financial, and technical domains all have precise vocabulary where word choice matters. Aligned models may systematically substitute precise terms with accessible paraphrases.
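The first two patterns, at least, are cheap to detect mechanically. A minimal sketch of two such checks, assuming a prompt that demanded exactly five JSON fields; the field names and the hedge list are illustrative placeholders:

```python
import json

REQUIRED_FIELDS = {"id", "title", "score", "summary", "tags"}  # hypothetical schema
HEDGES = ("it's worth noting", "generally considered", "may vary", "arguably")

def check_field_compliance(raw_output: str) -> list[str]:
    """Flag instruction-compliance drift: missing or 'helpfully' added JSON fields."""
    try:
        obj = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(obj, dict):
        return ["top-level value is not a JSON object"]
    fields = set(obj)
    problems = []
    if missing := REQUIRED_FIELDS - fields:
        problems.append(f"missing fields: {sorted(missing)}")
    if extra := fields - REQUIRED_FIELDS:
        problems.append(f"unrequested extra fields: {sorted(extra)}")
    return problems

def hedge_density(text: str) -> float:
    """Crude assertiveness metric: hedge phrases per 100 words.
    Track it across model versions; a sudden jump flags assertiveness collapse."""
    words = max(len(text.split()), 1)
    hits = sum(text.lower().count(h) for h in HEDGES)
    return 100 * hits / words
```

Neither check is sophisticated, and that's the point: these failures are trivial to catch mechanically once you look for them.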
The reason these regressions are hard to notice: standard benchmark scores don't capture them. MMLU measures breadth of knowledge. HumanEval measures code correctness. Neither captures whether your specific format is preserved or whether your tone requirements are met. You need your own evaluation suite — and most teams don't build one until after the first painful regression.
Detecting Regressions Before You Ship
The fundamental discipline here is treating model upgrades like software dependency upgrades: you don't ship without running your test suite first. The challenge is building that test suite.
Construct a golden dataset. Take 50–200 representative inputs from your production traffic, ideally across the distribution of use cases your product handles. Include edge cases that have burned you before. For each input, create a reference output or a set of criteria that a good output must satisfy. This dataset becomes your regression benchmark.
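Once the golden set exists, gating an upgrade on it takes little code. A minimal sketch, assuming a JSONL file of {"input": ..., "checks": [...]} records and a call_model stub standing in for your provider's SDK (both the file format and the function are assumptions, not a specific API):

```python
import json
from pathlib import Path

def call_model(prompt: str, model: str) -> str:
    """Stand-in for your provider's completion call. Replace with your SDK."""
    raise NotImplementedError

def run_regression(golden_path: str, candidate_model: str) -> float:
    """Replay every golden case against a candidate model and report the pass rate.

    Here 'checks' are substrings the output must contain -- swap in richer
    criteria (schema validation, hedge-density thresholds, tone classifiers)
    as your product demands.
    """
    lines = Path(golden_path).read_text().splitlines()
    cases = [json.loads(line) for line in lines if line.strip()]
    passed = 0
    for case in cases:
        output = call_model(case["input"], model=candidate_model)
        if all(check in output for check in case["checks"]):
            passed += 1
        else:
            print(f"REGRESSION on input: {case['input'][:60]!r}")
    rate = passed / len(cases)
    print(f"{passed}/{len(cases)} golden cases passed ({rate:.0%})")
    return rate
```

If the pass rate drops below the incumbent model's baseline, the upgrade doesn't ship until the prompts are re-tuned or the regression is accepted deliberately.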
