Markdown Beats JSON: The Output Format Tax You're Paying Without Measuring
Most teams flip JSON mode on the day they ship and never measure what it costs them. The assumption is reasonable: structured output is a correctness win, so why wouldn't you take it? The answer is that strict JSON-mode constrained decoding routinely shaves 5–15% off reasoning accuracy on math, symbolic, and multi-step analysis tasks, and nobody notices because the evals were run before the format flag was flipped — or the evals measure parseability, not quality.
The output format is a decoding-time constraint, and like every constraint it warps the model's probability distribution. The warp is invisible when you look at logs: the JSON is valid, the schema matches, the field types line up. What you cannot see in the logs is the reasoning that the model would have produced in prose but could not fit inside the grammar you gave it. The format tax is real, well-documented in the literature, and almost universally unmeasured in production.
This post is about when to pay it, how to stop paying it when you don't have to, and what a format-choice decision tree actually looks like for engineers who want structured output and accuracy at the same time.
The Mechanism: Why Constrained Decoding Warps Quality
At each decoding step, a language model produces a probability distribution over its vocabulary. Constrained decoding applies a logit mask: any token that would violate the target grammar gets its probability zeroed out, and the remaining tokens are renormalized. Structurally this is clean — the output is guaranteed to parse. Statistically it is a lie the model is being told about its own distribution, and the lie compounds.
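To make the mechanism concrete, here is a toy sketch of one constrained decoding step. The names are illustrative, not any library's API; real engines such as Outlines compute the allowed-token set from a compiled grammar or FSM at every step.

```python
import numpy as np

def constrained_step(logits: np.ndarray, allowed_token_ids: list[int]) -> np.ndarray:
    """One decoding step under a grammar constraint (toy illustration).

    Tokens outside the grammar's allowed set get their logits pushed to -inf,
    then the survivors are renormalized. The output always parses; the
    distribution the model actually learned is silently reshaped.
    """
    mask = np.full_like(logits, -np.inf)
    mask[allowed_token_ids] = 0.0
    masked = logits + mask
    probs = np.exp(masked - masked.max())  # softmax over the surviving tokens only
    return probs / probs.sum()
```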
Two specific failure modes show up in practice. The first is the tokenization seam. Models are trained on specific tokenizations of common strings — the word "because" tokenizes one way in prose, another way after a JSON opening quote, and the probability mass the model learned during training sits on the first tokenization, not the second. When the grammar forces the second tokenization path, the model is now operating on a sequence it rarely saw during pretraining. Output quality degrades subtly — grammatically valid, semantically thinner.
The second is the field-ordering trap. A JSON schema like {"answer": string, "reasoning": string} looks innocuous and is catastrophic: the model is forced to emit the answer token before it has generated any reasoning tokens. Chain-of-thought that the model would have used to check its work is now either absent (if the model treats the answer field as terminal) or fabricated post-hoc to match a decision that was already committed. The "Let Me Speak Freely?" paper (arXiv:2408.02442) documented substantial drops on GSM8K, Last Letter, and Shuffled Objects under stricter format constraints for exactly this reason.
The Empirical Gap Across Formats
Format choice is not neutral across tasks. Several measured results from 2024–2026 are worth internalizing before you pick a default:
- Reasoning tasks degrade under strict JSON. The "Let Me Speak Freely?" study showed that moving from natural-language prompting to format-restricting JSON instructions produced significant accuracy drops on reasoning benchmarks. The two-step "NL-to-Format" alternative — generate prose, then convert — recovered performance to near-unrestricted levels.
- Classification and extraction can improve under JSON. The same paper found that on DDXPlus and similar classification datasets, JSON-mode performed competitively and sometimes better, because the constraint collapses the output space in exactly the way the task wants.
- Format preference is model-dependent. The prompt-format study at arXiv:2411.10541 found GPT-3.5-turbo varied by up to 40% across templates on code translation, preferring JSON, while GPT-4 preferred markdown and was more robust to format variation overall. Larger models absorb the format tax better; smaller models wear it on their sleeves.
- Token cost is not a tie. Markdown uses roughly 34–38% fewer tokens than JSON for the same nested content, and XML needs about 80% more tokens than markdown. On output you pay this tax twice on every request: once in dollars, once in latency.
- Constrained decoding is not universally bad. JSONSchemaBench showed that on tasks with minimal structure like GSM8K, constrained decoding can actually improve downstream performance by up to 4% by keeping the model on-rails, and can speed generation by ~50%. The picture is nuanced: the damage is specific to tasks where the grammar blocks reasoning paths the model would otherwise take.
If you take one calibration point away, take this: the format tax is heavier on small models, heavier on reasoning-heavy tasks, and heavier when the schema puts the answer before the reasoning. It is lighter on extraction, lighter on large frontier models, and lightest when the schema cooperates with the task.
The Three Formats, Ranked for What They Are Actually Good At
Markdown for reasoning and prose
When the model needs to think, markdown is the lowest-friction output format available. It imposes no grammar, no logit masks, no tokenization boundaries the model hasn't seen a billion times in pretraining. Bullets, headers, and code fences are native to the training distribution. For any task where the primary quality metric is the correctness or completeness of reasoning — agent planning, multi-step debugging, customer support that requires weighing context — markdown should be the default.
Markdown's weakness is machine consumption. Parsing markdown into typed fields downstream is fragile. If the next step in your pipeline is a human reading the output, markdown wins. If the next step is a function that needs {city: string, temperature: number}, markdown loses.
XML tags for extraction and boundary-setting
XML tags occupy a useful middle ground. They are structured enough to delimit fields reliably — <quote>...</quote>, <answer>...</answer>, <reasoning>...</reasoning> — but lightweight enough that they don't force constrained decoding. The model is writing prose with markers, not navigating a grammar.
Anthropic specifically trained Claude to recognize XML tags as a prompt-organizing mechanism, and the pattern works in both directions: as input structure and as output structure. For extraction tasks ("pull all the dates mentioned in this document") and for tasks with mixed prose + structured content (a reasoning block followed by a structured summary), XML tags give you 90% of the parseability of JSON at roughly the quality of markdown.
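Parsing that output is correspondingly simple. A minimal sketch, assuming you only need the first occurrence of each tag; production code should handle missing or repeated tags explicitly.

```python
import re

def extract_tag(text: str, tag: str) -> str | None:
    """Return the contents of the first <tag>...</tag> block, or None if absent."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
    return match.group(1).strip() if match else None

# Usage:
#   reasoning = extract_tag(response_text, "reasoning")
#   answer = extract_tag(response_text, "answer")
```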
The weakness is verbosity. XML is the most token-hungry of the three formats. If you are paying per-token on output, an XML-heavy response is the most expensive choice — about twice the cost of the markdown equivalent.
Strict JSON (schema-enforced) for machine-to-machine handoffs
Strict JSON mode — whether OpenAI's structured outputs, Anthropic's tool-calling, or an FSM-based library like Outlines — is the right choice when the output must flow directly into code without a human in the loop. API responses, tool arguments, database writes, inter-service messages. The downstream consumer is a parser, and the cost of a malformed field is higher than the cost of a few percent of reasoning quality.
The trick is not to reach for strict JSON any earlier than that. A system prompt that says "respond in JSON" with a loose example is a very different animal from strict schema-enforced decoding with logit masking. The loose version pays most of the format tax (the model is still biased toward JSON-shaped outputs) with few of the guarantees. The strict version pays the full tax and gets the full guarantee. The middle — "JSON mode" without schema — is often the worst of both worlds.
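For reference, the strict end of that spectrum looks roughly like the sketch below, using the OpenAI Python SDK's json_schema response format; the model name, prompt, and schema are placeholders, and other providers expose the equivalent through tool definitions.

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{"role": "user", "content": "Report the city and temperature from this sensor log: ..."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "weather_reading",
            "strict": True,  # turns on schema-enforced (constrained) decoding
            "schema": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                    "temperature": {"type": "number"},
                },
                "required": ["city", "temperature"],
                "additionalProperties": False,
            },
        },
    },
)

print(response.choices[0].message.content)  # guaranteed to parse against the schema
```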
The Dual-Pass Pattern: Have Both
The NL-to-Format pattern is the single highest-ROI prompt-engineering move most teams haven't made. It works in two calls:
- Pass one: generate freely. Prompt the model to produce the answer in markdown or free prose, with whatever chain-of-thought it wants. No format constraint. Use a reasoning-capable model if the task warrants it.
- Pass two: convert. Feed the markdown output into a second, cheaper model call with a strict JSON schema. The task is extraction, not reasoning, and extraction is exactly the task strict JSON is good at. Both passes are sketched in code below.
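A minimal sketch of the two calls, again assuming the OpenAI Python SDK; the model names, prompt wording, and schema are placeholders for whatever your pipeline actually uses.

```python
import json
from openai import OpenAI

client = OpenAI()

def answer_then_extract(question: str) -> dict:
    # Pass one: unconstrained reasoning in prose. No schema, no logit mask.
    prose = client.chat.completions.create(
        model="gpt-4o",  # illustrative: your strongest reasoning model
        messages=[{"role": "user", "content": question}],
    ).choices[0].message.content

    # Pass two: pure extraction with a cheaper model under a strict schema.
    structured = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative: a small, cheap extraction model
        messages=[{
            "role": "user",
            "content": f"Extract the final answer from the analysis below.\n\n{prose}",
        }],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "final_answer",
                "strict": True,
                "schema": {
                    "type": "object",
                    "properties": {"final_answer": {"type": "string"}},
                    "required": ["final_answer"],
                    "additionalProperties": False,
                },
            },
        },
    ).choices[0].message.content

    return json.loads(structured)
```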
Three things make this pattern attractive in production:
- The reasoning pass runs against the model's strongest decoding path, recovering the accuracy lost under constrained decoding.
- The conversion pass is cheap. It's a small model, short output, trivial task. The added latency is usually 200–500ms; the added cost is a fraction of the reasoning pass.
- Failure modes become observable. If the first pass produces bad reasoning, you can see it. If the second pass drops a field, you can retry it against the cached prose output without re-running the expensive call.
Teams sometimes object on latency grounds. The objection is worth engaging with (a second hop does add time), but it is rarely the right bottleneck to optimize. If reasoning quality is the actual constraint on your product, losing 10% of it to save 300ms of latency is a bad trade. If latency is the actual constraint, the first question is whether you need reasoning at all, not whether to shave a few hundred milliseconds off a pipeline that reasons poorly.
When the Field Order Alone Fixes It
If you insist on a single-pass structured output — and for many latency-sensitive applications that's the right call — the cheapest fix is schema field order. Put the reasoning field first:
{
  "reasoning": "...",
  "final_answer": "..."
}
not
{
  "final_answer": "...",
  "reasoning": "..."
}
This is the same advice OpenAI gives in their structured outputs documentation, and it works because the model literally generates tokens in order. A reasoning field placed before the answer gives the model room to think inside the structured output. A reasoning field placed after the answer is a fiction — the model is writing a justification for a decision it already committed to.
This one change recovers most of the reasoning-quality loss for free. It doesn't require a second call, doesn't change your pipeline, and is a one-line schema edit. The teams who most often miss it are the ones who designed their schema to match their downstream type shape, where the answer naturally comes first, and never thought to reorder it. Reorder it.
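If your schema comes from a typed model, the fix is the declaration order. A minimal sketch assuming Pydantic, whose generated JSON schema preserves field order, and a structured-output backend that emits keys in schema order, which strict-mode implementations typically do.

```python
from pydantic import BaseModel

class Verdict(BaseModel):
    reasoning: str     # declared first: the model writes its working before committing
    final_answer: str  # declared last: emitted only after the reasoning tokens exist

# Verdict.model_json_schema() lists "reasoning" before "final_answer", so a
# backend that follows schema order generates the fields in that order too.
```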
A Decision Tree You Can Use Monday
The question to ask before picking a format is "what does the next consumer of this output need?"
- Human reader, reasoning-heavy: markdown, single pass.
- Human reader with mixed structured + prose: XML tags, single pass.
- Machine consumer, reasoning-light (classification, extraction, tool arguments): strict JSON, single pass, reasoning field first if any.
- Machine consumer, reasoning-heavy (planning, multi-step analysis, code generation where the output is parsed): two-pass — markdown reasoning, JSON extraction.
- Unsure which category: default to markdown + cheap conversion pass. The downside is small; the upside is recovering any reasoning quality the format would have cost. (The whole tree collapses into a small function, sketched below.)
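For teams who like their defaults executable, the tree collapses into a few lines; the category names are this post's shorthand, not any library's API.

```python
def pick_output_format(consumer: str, reasoning_heavy: bool) -> str:
    """Map the decision tree above onto a default format choice."""
    if consumer == "human":
        # Swap in XML tags when the output mixes prose with structured fields.
        return "markdown, single pass"
    if not reasoning_heavy:
        return "strict JSON, single pass, reasoning field first"
    # Machine consumer with heavy reasoning, or unsure: take the dual-pass default.
    return "dual-pass: markdown reasoning, then JSON extraction"
```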
The meta-point is that output format is not a deployment detail. It is a decoding-time intervention that trades one axis of quality for another, and you should know which axis you traded. The teams who get this right run an eval on both formats before flipping the flag. The teams who skip it discover the tax six months later, when a user flags that the model "used to be smarter."
Measuring the Tax on Your Own Pipeline
The practical ask is small. Pick your top five production prompts. Run each against three variants: free prose output, strict JSON output, and dual-pass (prose then JSON extraction). Measure the metric you actually care about — task accuracy, user satisfaction, downstream-action success rate — not format compliance. Compare.
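A minimal harness for that comparison, assuming you already have one runner per variant and a task-level score function; every name here is a hypothetical placeholder for your own pipeline.

```python
from statistics import mean
from typing import Callable

def measure_format_tax(
    prompts: list[str],
    runners: dict[str, Callable[[str], str]],  # e.g. {"prose": ..., "strict_json": ..., "dual_pass": ...}
    score: Callable[[str, str], float],        # the metric you actually care about, not format compliance
) -> dict[str, float]:
    """Average quality score per output-format variant over the same prompts."""
    return {
        name: mean(score(run(prompt), prompt) for prompt in prompts)
        for name, run in runners.items()
    }
```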
Two things usually happen. On classification and extraction, the three variants are within noise of each other, and JSON wins on cost. On reasoning-heavy tasks, dual-pass wins decisively and strict JSON is the worst of the three. You now have data, and the decision of whether to flip JSON mode stops being ideological.
The format tax is the kind of latent cost that looks like a free lunch at deploy time and accrues quietly in every eval regression you can't quite explain. Paying it deliberately, on tasks where it doesn't hurt, is good engineering. Paying it universally, because JSON feels "more professional," is the thing to stop doing.
- https://arxiv.org/abs/2408.02442
- https://arxiv.org/abs/2411.10541
- https://arxiv.org/abs/2501.10868
- https://www.improvingagents.com/blog/best-nested-data-format/
- https://checksum.ai/blog/does-output-format-actually-matter-an-experiment-comparing-json-xml-and-markdown-for-llm-tasks
- https://blog.dottxt.ai/coalescence.html
- https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/use-xml-tags
- https://openai.com/index/introducing-structured-outputs-in-the-api/
