Markdown Beats JSON: The Output Format Tax You're Paying Without Measuring
Most teams flip JSON mode on the day they ship and never measure what it costs them. The assumption is reasonable: structured output is a correctness win, so why wouldn't you take it? The answer is that strict JSON-mode constrained decoding routinely shaves 5–15% off reasoning accuracy on math, symbolic, and multi-step analysis tasks, and nobody notices because the evals were run before the format flag was flipped — or the evals measure parseability, not quality.
The output format is a decoding-time constraint, and like every constraint it warps the model's probability distribution. The warp is invisible when you look at logs: the JSON is valid, the schema matches, the field types line up. What you cannot see in the logs is the reasoning that the model would have produced in prose but could not fit inside the grammar you gave it. The format tax is real, well-documented in the literature, and almost universally unmeasured in production.
This post is about when to pay it, how to stop paying it when you don't have to, and what a format-choice decision tree actually looks like for engineers who want structured output and accuracy at the same time.
The Mechanism: Why Constrained Decoding Warps Quality
At each decoding step, a language model produces a probability distribution over its vocabulary. Constrained decoding applies a logit mask: any token that would violate the target grammar gets its probability zeroed out, and the remaining tokens are renormalized. Structurally this is clean — the output is guaranteed to parse. Statistically it is a lie the model is being told about its own distribution, and the lie compounds.
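To make the mechanism concrete, here is a toy sketch of one constrained decoding step. The names are illustrative, not any library's API; real engines such as Outlines compute the allowed-token set from a compiled grammar or FSM at every step.

```python
import numpy as np

def constrained_step(logits: np.ndarray, allowed_token_ids: list[int]) -> np.ndarray:
    """One decoding step under a grammar constraint (toy illustration).

    Tokens outside the grammar's allowed set get their logits pushed to -inf,
    then the survivors are renormalized. The output always parses; the
    distribution the model actually learned is silently reshaped.
    """
    mask = np.full_like(logits, -np.inf)
    mask[allowed_token_ids] = 0.0
    masked = logits + mask
    probs = np.exp(masked - masked.max())  # softmax over the surviving tokens only
    return probs / probs.sum()
```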
Two specific failure modes show up in practice. The first is the tokenization seam. Models are trained on specific tokenizations of common strings — the word "because" tokenizes one way in prose, another way after a JSON opening quote, and the probability mass the model learned during training sits on the first tokenization, not the second. When the grammar forces the second tokenization path, the model is now operating on a sequence it rarely saw during pretraining. Output quality degrades subtly — grammatically valid, semantically thinner.
The second is the field-ordering trap. A JSON schema like {"answer": string, "reasoning": string} looks innocuous and is catastrophic: the model is forced to emit the answer token before it has generated any reasoning tokens. Chain-of-thought that the model would have used to check its work is now either absent (if the model treats the answer field as terminal) or fabricated post-hoc to match a decision that was already committed. The "Let Me Speak Freely?" paper (arXiv:2408.02442) documented substantial drops on GSM8K, Last Letter, and Shuffled Objects under stricter format constraints for exactly this reason.
The Empirical Gap Across Formats
Format choice is not neutral across tasks. Several measured results from 2024–2026 are worth internalizing before you pick a default:
- Reasoning tasks degrade under strict JSON. The "Let Me Speak Freely?" study showed that moving from natural-language prompting to format-restricting JSON instructions produced significant accuracy drops on reasoning benchmarks. The two-step "NL-to-Format" alternative — generate prose, then convert — recovered performance to near-unrestricted levels.
- Classification and extraction can improve under JSON. The same paper found that on DDXPlus and similar classification datasets, JSON-mode performed competitively and sometimes better, because the constraint collapses the output space in exactly the way the task wants.
- Format preference is model-dependent. The prompt-format study at arXiv:2411.10541 found GPT-3.5-turbo varied by up to 40% across templates on code translation, preferring JSON, while GPT-4 preferred markdown and was more robust to format variation overall. Larger models absorb the format tax better; smaller models wear it on their sleeves.
- Token cost is not a tie. Markdown uses roughly 34–38% fewer tokens than JSON for the same nested content, and XML needs about 80% more tokens than markdown. On output you pay this tax twice on every request: once in dollars, once in latency.
- Constrained decoding is not universally bad. JSONSchemaBench showed that on tasks with minimal structure like GSM8K, constrained decoding can actually improve downstream performance by up to 4% by keeping the model on-rails, and can speed generation by ~50%. The picture is nuanced: the damage is specific to tasks where the grammar blocks reasoning paths the model would otherwise take.
If you take one calibration point away, take this: the format tax is heavier on small models, heavier on reasoning-heavy tasks, and heavier when the schema puts the answer before the reasoning. It is lighter on extraction, lighter on large frontier models, and lightest when the schema cooperates with the task.
The Three Formats, Ranked for What They Are Actually Good At
Markdown for reasoning and prose
When the model needs to think, markdown is the lowest-friction output format available. It imposes no grammar, no logit masks, no tokenization boundaries the model hasn't seen a billion times in pretraining. Bullets, headers, and code fences are native to the training distribution. For any task where the primary quality metric is the correctness or completeness of reasoning — agent planning, multi-step debugging, customer support that requires weighing context — markdown should be the default.
Markdown's weakness is machine consumption. Parsing markdown into typed fields downstream is fragile. If the next step in your pipeline is a human reading the output, markdown wins. If the next step is a function that needs {city: string, temperature: number}, markdown loses.
XML tags for extraction and boundary-setting
XML tags occupy a useful middle ground. They are structured enough to delimit fields reliably — <quote>...</quote>, <answer>...</answer>, <reasoning>...</reasoning> — but lightweight enough that they don't force constrained decoding. The model is writing prose with markers, not navigating a grammar.
Anthropic specifically trained Claude to recognize XML tags as a prompt-organizing mechanism, and the pattern works in both directions: as input structure and as output structure. For extraction tasks ("pull all the dates mentioned in this document") and for tasks with mixed prose + structured content (a reasoning block followed by a structured summary), XML tags give you 90% of the parseability of JSON at roughly the quality of markdown.
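Parsing that output is correspondingly simple. A minimal sketch, assuming you only need the first occurrence of each tag; production code should handle missing or repeated tags explicitly.

```python
import re

def extract_tag(text: str, tag: str) -> str | None:
    """Return the contents of the first <tag>...</tag> block, or None if absent."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
    return match.group(1).strip() if match else None

# Usage:
#   reasoning = extract_tag(response_text, "reasoning")
#   answer = extract_tag(response_text, "answer")
```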
The weakness is verbosity. XML is the most token-hungry of the three formats. If you are paying per-token on output, an XML-heavy response is the most expensive choice — about twice the cost of the markdown equivalent.
Strict JSON (schema-enforced) for machine-to-machine handoffs
Strict JSON mode — whether OpenAI's structured outputs, Anthropic's tool-calling, or an FSM-based library like Outlines — is the right choice when the output must flow directly into code without a human in the loop. API responses, tool arguments, database writes, inter-service messages. The downstream consumer is a parser, and the cost of a malformed field is higher than the cost of a few percent of reasoning quality.
The trick is not to reach for strict JSON any earlier than that. A system prompt that says "respond in JSON" with a loose example is a very different animal from strict schema-enforced decoding with logit masking. The loose version pays most of the format tax (the model is still biased toward JSON-shaped outputs) with few of the guarantees. The strict version pays the full tax and gets the full guarantee. The middle — "JSON mode" without schema — is often the worst of both worlds.
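For reference, the strict end of that spectrum looks roughly like the sketch below, using the OpenAI Python SDK's json_schema response format; the model name, prompt, and schema are placeholders, and other providers expose the equivalent through tool definitions.

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{"role": "user", "content": "Report the city and temperature from this sensor log: ..."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "weather_reading",
            "strict": True,  # turns on schema-enforced (constrained) decoding
            "schema": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "temperature": {"type": "number"},
                },
                "required": ["city", "temperature"],
                "additionalProperties": False,
            },
        },
    },
)

print(response.choices[0].message.content)  # guaranteed to parse against the schema
```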
The Dual-Pass Pattern: Have Both
The NL-to-Format pattern is the single highest-ROI prompt-engineering move most teams haven't made. It works in two calls:
- Pass one: generate freely. Prompt the model to produce the answer in markdown or free prose, with whatever chain-of-thought it wants. No format constraint. Use a reasoning-capable model if the task warrants it.
- Pass two: convert. Feed the markdown output into a second, cheaper model call with a strict JSON schema. The task is extraction, not reasoning, and extraction is exactly the task strict JSON is good at. Both passes are sketched in code below.
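A minimal sketch of the two calls, again assuming the OpenAI Python SDK; the model names, prompt wording, and schema are placeholders for whatever your pipeline actually uses.

```python
import json
from openai import OpenAI

client = OpenAI()

def answer_then_extract(question: str) -> dict:
    # Pass one: unconstrained reasoning in prose. No schema, no logit mask.
    prose = client.chat.completions.create(
        model="gpt-4o",  # illustrative: your strongest reasoning model
        messages=[{"role": "user", "content": question}],
    ).choices[0].message.content

    # Pass two: pure extraction with a cheaper model under a strict schema.
    structured = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative: a small, cheap extraction model
        messages=[{
            "role": "user",
            "content": f"Extract the final answer from the analysis below.\n\n{prose}",
        }],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "final_answer",
                "strict": True,
                "schema": {
                    "type": "object",
                    "properties": {"final_answer": {"type": "string"}},
                    "required": ["final_answer"],
                    "additionalProperties": False,
                },
            },
        },
    ).choices[0].message.content

    return json.loads(structured)
```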
Three things make this pattern attractive in production:
- The reasoning pass runs against the model's strongest decoding path, recovering the accuracy lost under constrained decoding.
- The conversion pass is cheap. It's a small model, short output, trivial task. The added latency is usually 200–500ms; the added cost is a fraction of the reasoning pass.
- Failure modes become observable. If the first pass produces bad reasoning, you can see it. If the second pass drops a field, you can retry it against the cached prose output without re-running the expensive call.
Teams sometimes object on latency grounds. The objection is worth engaging with (a second hop does add time), but it is rarely the right bottleneck to optimize. If reasoning quality is the actual constraint on your product, losing 10% of it to save 300ms of latency is a bad trade. If latency is the actual constraint, the first question is whether you need reasoning at all, not whether to shave a few hundred milliseconds off a pipeline that reasons poorly.
When the Field Order Alone Fixes It
If you insist on a single-pass structured output — and for many latency-sensitive applications that's the right call — the cheapest fix is schema field order. Put the reasoning field first:
{
  "reasoning": "...",
  "final_answer": "..."
}
not
{
  "final_answer": "...",
  "reasoning": "..."
}
This is the same advice OpenAI gives in their structured outputs documentation, and it works because the model literally generates tokens in order. A reasoning field placed before the answer gives the model room to think inside the structured output. A reasoning field placed after the answer is a fiction — the model is writing a justification for a decision it already committed to.
This one change recovers most of the reasoning-quality loss for free. It doesn't require a second call, doesn't change your pipeline, and is a one-line schema edit. The teams who most often miss it are the ones who designed their schema to match their downstream type shape, where the answer naturally comes first, and never thought to reorder it. Reorder it.
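If your schema comes from a typed model, the fix is the declaration order. A minimal sketch assuming Pydantic, whose generated JSON schema preserves field order, and a structured-output backend that emits keys in schema order, which strict-mode implementations typically do.

```python
from pydantic import BaseModel

class Verdict(BaseModel):
    reasoning: str     # declared first: the model writes its working before committing
    final_answer: str  # declared last: emitted only after the reasoning tokens exist

# Verdict.model_json_schema() lists "reasoning" before "final_answer", so a
# backend that follows schema order generates the fields in that order too.
```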
A Decision Tree You Can Use Monday
The question to ask before picking a format is "what does the next consumer of this output need?"
- Human reader, reasoning-heavy: markdown, single pass.
- Human reader with mixed structured + prose: XML tags, single pass.
- Machine consumer, reasoning-light (classification, extraction, tool arguments): strict JSON, single pass, reasoning field first if any.
- Machine consumer, reasoning-heavy (planning, multi-step analysis, code generation where the output is parsed): two-pass — markdown reasoning, JSON extraction.
- Unsure which category: default to markdown + cheap conversion pass. The downside is small; the upside is recovering any reasoning quality the format would have cost. (The whole tree collapses into a small function, sketched below.)
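For teams who like their defaults executable, the tree collapses into a few lines; the category names are this post's shorthand, not any library's API.

```python
def pick_output_format(consumer: str, reasoning_heavy: bool) -> str:
    """Map the decision tree above onto a default format choice."""
    if consumer == "human":
        # Swap in XML tags when the output mixes prose with structured fields.
        return "markdown, single pass"
    if not reasoning_heavy:
        return "strict JSON, single pass, reasoning field first"
    # Machine consumer with heavy reasoning, or unsure: take the dual-pass default.
    return "dual-pass: markdown reasoning, then JSON extraction"
```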
The meta-point is that output format is not a deployment detail. It is a decoding-time intervention that trades one axis of quality for another, and you should know which axis you traded. The teams who get this right run an eval on both formats before flipping the flag. The teams who skip it discover the tax six months later, when a user flags that the model "used to be smarter."
Measuring the Tax on Your Own Pipeline
The practical ask is small. Pick your top five production prompts. Run each against three variants: free prose output, strict JSON output, and dual-pass (prose then JSON extraction). Measure the metric you actually care about — task accuracy, user satisfaction, downstream-action success rate — not format compliance. Compare.
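A minimal harness for that comparison, assuming you already have one runner per variant and a task-level score function; every name here is a hypothetical placeholder for your own pipeline.

```python
from statistics import mean
from typing import Callable

def measure_format_tax(
    prompts: list[str],
    runners: dict[str, Callable[[str], str]],  # e.g. {"prose": ..., "strict_json": ..., "dual_pass": ...}
    score: Callable[[str, str], float],        # the metric you actually care about, not format compliance
) -> dict[str, float]:
    """Average quality score per output-format variant over the same prompts."""
    return {
        name: mean(score(run(prompt), prompt) for prompt in prompts)
        for name, run in runners.items()
    }
```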
Two things usually happen. On classification and extraction, the three variants are within noise of each other, and JSON wins on cost. On reasoning-heavy tasks, dual-pass wins decisively and strict JSON is the worst of the three. You now have data, and the decision of whether to flip JSON mode stops being ideological.
The format tax is the kind of latent cost that looks like a free lunch at deploy time and accrues quietly in every eval regression you can't quite explain. Paying it deliberately, on tasks where it doesn't hurt, is good engineering. Paying it universally, because JSON feels "more professional," is the thing to stop doing.
- https://arxiv.org/abs/2408.02442
- https://arxiv.org/abs/2411.10541
- https://arxiv.org/abs/2501.10868
- https://www.improvingagents.com/blog/best-nested-data-format/
- https://checksum.ai/blog/does-output-format-actually-matter-an-experiment-comparing-json-xml-and-markdown-for-llm-tasks
- https://blog.dottxt.ai/coalescence.html
- https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/use-xml-tags
- https://openai.com/index/introducing-structured-outputs-in-the-api/
