What Structured Outputs Actually Cost You: The JSON Mode Quality Tax
Most teams adopt structured outputs because they're tired of writing brittle regex to extract data from model responses. That's a reasonable motivation. What they don't anticipate is discovering months later, when they finally measure task accuracy, that their "reliability improvement" also degraded the quality of the underlying content by 10 to 15 percent on reasoning-heavy tasks. The syntactic problem was solved. A semantic one was introduced.
This post is about understanding that tradeoff precisely — what constrained decoding actually costs, when the tax is worth paying, and how to build the evals that tell you whether it's hurting your system before you ship.
How Constrained Decoding Works
The mechanism matters for understanding the failure mode. At every generation step, a language model produces a probability distribution over its entire vocabulary — tens of thousands of tokens. Constrained decoding (the machinery behind JSON mode, structured outputs APIs, and frameworks like Outlines and XGrammar) works by masking that distribution before sampling. Tokens that would produce invalid output under your schema get zeroed out. The model can only pick from what remains valid.
The constraint is implemented with finite state machines (FSMs) for regex-expressible patterns (which covers most JSON schemas), or pushdown automata (PDAs) for full context-free grammars. Libraries like XGrammar — now the default in vLLM and SGLang — compile your schema into these automata ahead of time, achieving sub-40-microsecond token mask generation at inference time.
The problem is fundamental: the model's preferred token at any step might not be a valid token under your constraint. When the top 10 tokens are all masked, the model is forced to sample from lower-probability alternatives. Those alternatives are syntactically valid. They may be semantically wrong, stilted, or incomplete. Over the course of generating a response, these forced suboptimal selections accumulate.
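The masking step can be sketched in a few lines. This is an illustrative toy with made-up numbers, not any library's actual implementation: the model's preferred token is masked out by the grammar, so it is forced onto a lower-probability alternative.

```python
# Minimal sketch of constrained decoding's core step: mask the next-token
# distribution so only schema-valid tokens remain, renormalize, then pick
# from what survives.
import math

def masked_sample_greedy(logits, valid_token_ids):
    """Zero out invalid tokens, renormalize, and return the argmax.

    `logits` maps token_id -> raw score; `valid_token_ids` is the set the
    grammar automaton allows at this step.
    """
    # Softmax over all tokens: the model's unconstrained preference.
    total = sum(math.exp(v) for v in logits.values())
    probs = {t: math.exp(v) / total for t, v in logits.items()}

    # Apply the constraint mask: invalid tokens get probability zero.
    masked = {t: p for t, p in probs.items() if t in valid_token_ids}
    norm = sum(masked.values())
    masked = {t: p / norm for t, p in masked.items()}

    # Greedy pick from the surviving tokens.
    return max(masked, key=masked.get)

# Toy vocabulary: the model strongly prefers token 0, but the grammar only
# allows tokens 2 and 3, forcing a much lower-probability choice.
logits = {0: 5.0, 1: 4.0, 2: 1.0, 3: 0.5}
choice = masked_sample_greedy(logits, valid_token_ids={2, 3})
print(choice)  # token 2: syntactically valid, far from the model's top pick
```

Real implementations apply the mask as a bitset over logits on the GPU rather than a Python dict, but the forced-substitution effect is the same.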
Syntactic correctness is guaranteed. Semantic quality is not.
The Evidence for Quality Degradation
Research presented at NeurIPS 2024 measured constrained generation against free-form generation followed by parsing, and found 10 to 15 percent performance degradation on reasoning tasks under constrained conditions. The mechanism is exactly what you'd expect: when the model can't freely pick its preferred token, it makes incrementally worse choices, and those errors compound over multi-step reasoning.
This doesn't mean constrained generation always loses. For simpler extraction tasks — pulling named fields from text, classification into a fixed label set, structured data normalization — the quality hit is minimal. The task doesn't require the model to chain together reasoning steps where each token matters; it's filling a template. Constraints cost less when the answer space is already constrained by nature.
The hit is worst for tasks that require:
- Multi-step reasoning where the model's working space is the output itself (chain-of-thought flattened into a JSON field)
- Complex nested schemas with more than 10 fields or more than two nesting levels
- Open-ended generation trapped in a fixed string field (the model's creativity is penalized twice: by the schema and by the token masking)
Researchers have also identified three categories of structural output variation even within constrained generation: schema variation (the model generates a different field structure entirely), expression variation (semantic paraphrasing), and semantic variation (the underlying content changes meaning). Only the first is caught by schema validation.
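A toy check makes the gap concrete. Here a hand-rolled required-fields-and-types validator (stdlib only, standing in for a real JSON Schema validator) rejects the schema variation but happily accepts an output whose content is wrong; the field names are illustrative.

```python
# Schema validation catches only structural variation, not semantic errors.
import json

SCHEMA = {"sentiment": str, "confidence": float}

def validates(raw: str) -> bool:
    """True if the output parses and matches the expected flat schema."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return (set(obj) == set(SCHEMA)
            and all(isinstance(obj[k], t) for k, t in SCHEMA.items()))

# Schema variation: wrong field structure -> caught.
assert not validates('{"label": "positive"}')

# Semantic variation: imagine the review "This product broke after one day"
# labeled positive. Structurally perfect, semantically wrong -> passes.
assert validates('{"sentiment": "positive", "confidence": 0.98}')
```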
The Other Side: Speed and Reliability
Constrained decoding isn't purely a cost. For simpler schemas, it's often faster. Modern implementations can achieve 50 percent latency reduction over unconstrained generation by skipping boilerplate. When the schema's scaffolding is fixed (curly braces, field names, quote marks), the model only needs to generate the values, and the constraint mechanism handles the rest. Speculative techniques such as the DOMINO algorithm push this further by jumping over multiple tokens at once in predictable structural regions.
The reliability improvement is real and significant:
| Approach | Parse failure rate |
|---|---|
| Prompt engineering only | 5–20% |
| JSON mode (no schema) | 1–5% |
| Constrained decoding with schema | <0.1% |
A team doing financial data extraction dropped validation failures from 27 percent to 2 percent by switching to constrained decoding — a 92 percent improvement. For systems where parsing failures require human remediation, that's a large operational win.
The question is whether you're making the right tradeoff for your workload. A 92 percent reduction in parse failures means little if your content accuracy also fell by 12 percent and you didn't measure it.
Provider Differences Matter
Providers implement structured outputs differently, and the differences have real consequences:
OpenAI (Strict mode, released Aug 2024): Server-side schema enforcement, mathematically guaranteed valid JSON output, lowest failure rate. The constraint is applied before the response reaches you.
Anthropic Claude: Structured outputs via tool use, not grammar-constrained decoding. The model is trained to follow tool schemas but isn't forced to by token masking. Failure rates are 0.5 to 5 percent depending on schema complexity. Claude's semantic quality on complex reasoning tasks tends to be better than natively constrained approaches, but you need client-side validation.
Google Gemini: Response schema with strict JSON enforcement, server-side, comparable to OpenAI's approach. Handles complex nested schemas well in benchmarks.
Mistral: JSON mode enforces shape but not strict schema compliance. Client-side validation required. Suitable for cost-sensitive workloads where occasional failures are acceptable.
For self-hosted inference: XGrammar (default in vLLM as of 2025) is the current production-grade choice. Outlines is simpler to use but has compilation timeouts on complex schemas — 40 seconds to over 10 minutes for certain patterns — which is a problem if schemas are user-defined.
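For the providers that enforce shape but not strict schema compliance, the client-side validation the text mentions usually takes the form of a parse-validate-retry loop. A hedged sketch follows; `call_model` is a hypothetical stand-in for your provider SDK call, stubbed here so the loop is runnable.

```python
# Client-side validation loop for providers without server-side strict
# schema enforcement: parse, validate, retry on either failure mode.
import json

def call_model(prompt: str, attempt: int) -> str:
    # Stub: first attempt returns an invalid payload, second a valid one.
    # In practice this would be your Anthropic/Mistral SDK call.
    return '{"amount": "not-a-number"}' if attempt == 0 else '{"amount": 42.5}'

def validate(obj: dict) -> bool:
    """Minimal illustrative check; a real system would use a schema validator."""
    return isinstance(obj.get("amount"), (int, float))

def extract_with_retry(prompt: str, max_attempts: int = 3) -> dict:
    for attempt in range(max_attempts):
        try:
            obj = json.loads(call_model(prompt, attempt))
        except json.JSONDecodeError:
            continue  # parse failure: retry
        if validate(obj):
            return obj  # parsed AND schema-valid
    raise ValueError(f"no valid output in {max_attempts} attempts")

result = extract_with_retry("Extract the invoice amount.")
print(result)  # {'amount': 42.5}
```

Note that retries cost latency and tokens, which is part of why the 0.5 to 5 percent failure rates above matter operationally.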
The Schema Design Variable
How much the quality tax costs you isn't just a function of constrained vs. unconstrained. Schema design is the biggest lever within your control.
Quality degradation correlates strongly with schema complexity:
- 2–5 fields, flat structure: Less than 2% quality impact
- 10–20 fields, 2–3 nesting levels: 5–10% impact
- Deep nesting, large enums, many constraints: 10–20%+ impact
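Given the thresholds above, a rough complexity check on your schema is cheap to automate. This sketch counts fields and nesting depth in a JSON Schema dict; the traversal covers only `object` and `array` types and is a heuristic, not a complete JSON Schema walker.

```python
# Heuristic: flag schemas likely to pay a noticeable quality tax by
# counting total fields and maximum nesting depth.

def schema_stats(schema: dict, depth: int = 1) -> tuple[int, int]:
    """Return (total field count, max nesting depth) for a JSON Schema dict."""
    props = schema.get("properties", {})
    fields, max_depth = len(props), depth
    for sub in props.values():
        if sub.get("type") == "object":
            f, d = schema_stats(sub, depth + 1)
            fields += f
            max_depth = max(max_depth, d)
        elif sub.get("type") == "array" and "items" in sub:
            f, d = schema_stats(sub["items"], depth + 1)
            fields += f
            max_depth = max(max_depth, d)
    return fields, max_depth

flat = {"type": "object", "properties": {"name": {"type": "string"},
                                         "label": {"type": "string"}}}
nested = {"type": "object", "properties": {
    "invoice": {"type": "object", "properties": {
        "lines": {"type": "array", "items": {"type": "object", "properties": {
            "sku": {"type": "string"}, "qty": {"type": "integer"}}}}}}}}

print(schema_stats(flat))    # (2, 1): comfortably in the <2% impact band
print(schema_stats(nested))  # (4, 3): three nesting levels already
```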
The most common mistake is translating your data model directly into a schema without considering what you're asking the model to do. A schema with 40 fields, half optional, with nested arrays and discriminated unions, is asking the model to navigate an enormous constraint space while also generating correct content. The cognitive load of the constraint and the cognitive load of the task are both real, and they compound.
The second most common mistake is burying the important content in a string field deep in a nested structure. Token masking applies to all tokens in the generation, including the ones inside your most important fields. If the model's best phrasing for a complex explanation is blocked by a constraint on an outer structural element, the quality of that explanation degrades.
When to Use Which Approach
Use constrained decoding when:
- Your downstream system writes directly to a database or calls an API and cannot tolerate parse failures
- The task is extraction or classification, not complex reasoning
- You can measure and confirm the quality tax is acceptable for your use case
- Schema complexity is low to moderate (under 15 fields, limited nesting)
Parse unstructured output when:
- The task requires multi-step reasoning or open-ended generation that fills most of the response
- You can implement retry logic to handle occasional parse failures
- Quality matters more than syntactic reliability for your application
- Your schema is complex enough to introduce meaningful compilation overhead
The hybrid approach is often the right answer in practice: generate with a minimal structural constraint (a flat JSON envelope with few required fields), and let the model write freeform into text fields. Validate the envelope, but don't try to constrain the content inside it. Reserve strict schema enforcement for the fields that actually require it — identifiers, labels, foreign keys — and treat explanatory text as unstructured prose.
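The envelope idea can be sketched as a validator: strict checks on the identifier and label fields downstream systems key on, presence-only checks on the freeform text. The field names here are illustrative, not a prescribed schema.

```python
# Hybrid approach: strictly validate only the envelope fields, and treat
# explanatory text as opaque prose whose content is never constrained.
import json

REQUIRED_STRICT = {"record_id": str, "label": str}  # identifiers, labels
FREEFORM = {"explanation"}  # checked for presence only, never for content

def validate_envelope(raw: str) -> dict:
    obj = json.loads(raw)  # raises on parse failure
    for field, typ in REQUIRED_STRICT.items():
        if not isinstance(obj.get(field), typ):
            raise ValueError(f"bad envelope field: {field}")
    for field in FREEFORM:
        if field not in obj:
            raise ValueError(f"missing freeform field: {field}")
    return obj

out = validate_envelope(json.dumps({
    "record_id": "txn_0042",
    "label": "refund",
    "explanation": "The customer was double-charged, so this transaction "
                   "reverses the duplicate capture rather than the sale.",
}))
print(out["label"])  # refund
```

The corresponding generation-side schema constrains only the three keys and their types, leaving the model free to phrase the explanation however it prefers.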
Evaluating the Tax on Your System
The common failure mode in production is measuring parse failure rate and assuming it measures quality. It measures a proxy. Parse failure rate tells you nothing about whether the content of valid outputs is correct.
A proper evaluation runs the same tasks under constrained and unconstrained conditions, with task-level success metrics (not schema validity metrics), and measures the delta. A few practical steps:
Run 10 to 20 samples per task type. Single-sample comparisons hide variance. At 10 samples, you can detect a 10 percent quality difference with reasonable confidence. At 20, you can see 7 percent differences.
Measure task accuracy separately from parse success. Write an automated judge (or use a sample of human reviews) that evaluates whether the content of the output was correct — not just whether it parsed. The gap between these two metrics is your quality tax.
Test across temperature values. Constrained generation degrades faster at high temperatures. Research shows quality loss of 30 percent or more at temperature ≥ 0.7. For most production constrained output systems, temperature should be 0.1 to 0.3 for consistency. If your application requires higher temperature for diversity, constrained outputs may be particularly costly.
Benchmark your actual schema, not a simplified version. Benchmarks like JSONSchemaBench test 10,000 real-world schemas and show enormous variance. Average benchmark performance tells you nothing about whether your specific schema — with its specific nesting, optional fields, and enum sizes — will compile efficiently and run without quality loss.
Monitor semantic errors in production separately from parse errors. Semantic errors (valid JSON with wrong content) are silent. They don't trigger your error handlers. Building a separate evaluation track that samples structured outputs and validates content accuracy is the only way to detect them before they accumulate into a customer support problem.
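The core of this eval can be sketched as a paired comparison. The sample results below are hard-coded stand-ins for real runs (your harness would populate them from actual model calls plus a content judge); the point is that parse success and content accuracy are scored as separate columns.

```python
# Eval sketch: run the same tasks constrained and unconstrained, score
# content accuracy separately from parse success, and report the delta.

def summarize(results):
    """results: list of (parsed_ok, content_correct) pairs, one per sample."""
    n = len(results)
    parse_rate = sum(p for p, _ in results) / n
    # Content accuracy is judged independently of whether the output parsed.
    accuracy = sum(c for _, c in results) / n
    return parse_rate, accuracy

# 10 samples per condition (illustrative stand-in data, not measurements).
constrained   = [(True, c) for c in (1, 1, 1, 0, 1, 0, 1, 1, 0, 1)]
unconstrained = [(p, 1) for p in (1, 1, 0, 1, 1, 1, 1, 1, 0, 1)]

cp, ca = summarize(constrained)
up, ua = summarize(unconstrained)
print(f"parse failures: constrained {1 - cp:.0%}, unconstrained {1 - up:.0%}")
print(f"quality tax (accuracy delta): {ua - ca:+.0%}")
```

In this toy data, constrained generation wins on parse failures and loses on content accuracy, which is exactly the tradeoff the two separate metrics are designed to expose.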
Conclusion
Structured outputs solve a real problem: parsing failures are operationally expensive and user-facing failures are embarrassing. Constrained decoding closes that reliability gap. But the reliability improvement is syntactic, and the quality loss is semantic. Most teams measure the first and not the second, which is how a useful technique becomes an invisible drag on the product.
The right mental model is to treat constrained decoding as a trade, not a free upgrade. You're buying syntactic reliability. You're paying with some fraction of semantic quality. Whether that trade is good depends on your task, your schema complexity, and your success metric — and you can only know if you measure it directly on your workload.
For production systems, the default approach should be: constrain what you must, leave freeform what you can, and run the eval that tells you what you're actually giving up.
- https://arxiv.org/html/2501.10868v1
- https://arxiv.org/abs/2512.23712
- https://proceedings.neurips.cc/paper_files/paper/2024/file/2bdc2267c3d7d01523e2e17ac0a754f3-Paper-Conference.pdf
- https://arxiv.org/html/2403.06988v1
- https://www.lmsys.org/blog/2024-02-05-compressed-fsm/
- https://arxiv.org/pdf/2411.15100
- https://platform.claude.com/docs/en/build-with-claude/structured-outputs
- https://blog.vllm.ai/2025/01/14/struct-decode-intro.html
- https://applied-llms.org/
- https://python.useinstructor.com/blog/2024/09/26/bad-schemas-could-break-your-llm-structured-outputs/
- https://mbrenndoerfer.com/writing/constrained-decoding-structured-llm-output
- https://www.aidancooper.co.uk/constrained-decoding/
