
The Context Format Decision Most Teams Make Accidentally: JSON vs Markdown vs Plain Text

9 min read
Tian Pan
Software Engineer

Most teams pick a context format once, early in development, and never revisit it. A developer reaches for JSON because it looks structured and machine-readable. Another grabs markdown because it's what they use in README files. Plain text gets chosen when nothing else seems necessary. These are not engineering decisions — they're habits. And they silently shape how your model reasons.

The format you pass to an LLM is not inert packaging. It is an instruction. Structured JSON context primes the model toward schema-following behavior. Markdown encourages hierarchical synthesis. Plain text opens up more flexible inference. Choosing the wrong format for a task can degrade accuracy by 40% or more, and the failure never shows up in your logs as an error.

Format Changes How Models Think, Not Just What They Output

The most consequential finding from recent empirical work on prompt formatting is that format affects reasoning mode, not just output structure. When you feed a model JSON-encoded context, you are implicitly asking it to operate in schema-extraction mode. That's useful when you need structured data back, but it actively suppresses the flexible inference that makes LLMs good at tasks requiring synthesis, ambiguity resolution, or code generation.

Benchmark results across GPT-3.5 and GPT-4 models show variance exceeding 300% on specific tasks depending solely on format choice. On code generation benchmarks, GPT-4 performs optimally with plain text context and degrades significantly with JSON encoding of the same information. On table comprehension tasks, Markdown key-value pairs outperform CSV by nearly 20 percentage points — not because the underlying data differs, but because the format gives the model different structural priming.

The effect varies by model. GPT-4 favors markdown and shows relatively stable behavior across format changes. GPT-3.5 is more format-sensitive and tends to perform better with JSON. Open-source models at smaller parameter counts (around 3B) are largely format-agnostic, showing less than 5% variance across JSON, YAML, XML, and markdown — but also achieving lower absolute accuracy, which means the format question matters more as you move to more capable models.

JSON Is Not the Safe Default You Think It Is

JSON feels rigorous. It enforces structure, it's machine-parseable, and it signals that your system is "doing things properly." But this intuition leads teams into two distinct failure modes.

The first is reasoning constraint. When structured JSON context is passed to a model on a task that requires flexible reasoning (creative synthesis, open-ended analysis, code generation), the model's token distribution gets pulled toward schema-consistent completions. It fills fields instead of reasoning freely. This is the same mechanism that makes structured outputs useful: the model stays on the rails you defined. But when the rails constrain rather than guide, you get outputs that are syntactically valid and semantically flat.

The second failure mode is schema fragility. JSON contexts without explicit schema enforcement produce outputs where field names drift, types coerce silently, and required fields go missing. Research on agentic systems shows that malformed JSON — missing braces, incorrect escaping, type mismatches — accounts for more than 60% of agent failures in production. The model doesn't know your downstream parser expects an integer where it wrote a string. The fix is explicit schema validation at the boundary, with error feedback that forces self-correction. Enforced JSON schema validation can bring output compliance from under 40% to near 100%. But that's not a reason to prefer JSON; it's a reason to treat JSON output as a reliability engineering problem rather than a default assumption.

A practical pattern that works: use JSON strictly for outputs that require machine parsing, enforce it with explicit schema validation and correction loops, and never use it as the input format for context that the model needs to reason over.
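
Here is a minimal sketch of that validation boundary, using the jsonschema library; `call_model` is a hypothetical stand-in for your model client, and the schema itself is illustrative:

```python
import json
import jsonschema

# Illustrative contract for what the downstream parser expects.
ORDER_SCHEMA = {
    "type": "object",
    "properties": {
        "order_id": {"type": "integer"},
        "status": {"type": "string", "enum": ["pending", "shipped", "delivered"]},
    },
    "required": ["order_id", "status"],
    "additionalProperties": False,
}

def generate_validated(prompt, call_model, max_retries=3):
    """Request JSON output, validate at the boundary, feed violations back."""
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_retries):
        raw = call_model(messages)  # hypothetical: returns the model's text reply
        try:
            parsed = json.loads(raw)
            jsonschema.validate(parsed, ORDER_SCHEMA)
            return parsed  # only schema-compliant output crosses the boundary
        except (json.JSONDecodeError, jsonschema.ValidationError) as err:
            # Error feedback instead of silent failure: the model self-corrects.
            messages.append({"role": "assistant", "content": raw})
            messages.append({
                "role": "user",
                "content": f"Invalid JSON: {err}. Reply with corrected JSON only.",
            })
    raise ValueError("no schema-compliant output after retries")
```

The loop is what moves compliance from "usually" to "nearly always": violations go back into the conversation as corrections rather than dying in a log.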

Markdown's Hidden Advantage

Markdown performs well in most reasoning tasks for a non-obvious reason: it's the format LLMs were trained on most heavily. GitHub repositories, Jupyter notebooks, documentation sites, and Stack Overflow answers are all markdown-heavy. When you write context in markdown, you are writing in the model's native register.

This shows up in benchmarks. For table data, markdown key-value format achieves around 61% accuracy compared to CSV's 44% on the same comprehension tasks. For RAG retrieval, markdown chunking with heading-based boundaries improves retrieval accuracy by up to 35% over unstructured text. For chain-of-thought reasoning, markdown with numbered steps outperforms JSON-wrapped reasoning chains.
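
To make the difference concrete, here is one way to render the same records both ways; this is a sketch of the markdown key-value style those benchmarks describe, not their exact prompt format:

```python
records = [
    {"city": "Oslo", "population": 709000, "country": "Norway"},
    {"city": "Bergen", "population": 291000, "country": "Norway"},
]

def to_csv(rows):
    # Header row, then positional values.
    header = ",".join(rows[0])
    lines = [",".join(str(v) for v in row.values()) for row in rows]
    return "\n".join([header] + lines)

def to_markdown_kv(rows):
    # One heading per record, attribute name restated next to every value.
    blocks = []
    for i, row in enumerate(rows, 1):
        lines = [f"### Record {i}"] + [f"- {k}: {v}" for k, v in row.items()]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)
```

The CSV version is denser, but binding a value to its meaning requires positional alignment with the header row; the keyed version restates the attribute name beside every value, which is the structure the model actually exploits.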

Markdown also has a token efficiency advantage over JSON. For equivalent information, markdown typically uses 30–40% fewer tokens. JSON's overhead from property names, quotes, and structural punctuation adds up. At scale, choosing JSON over markdown for context encoding adds roughly 40–70% to token costs without any accuracy gain on most tasks.
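
You can measure the overhead directly with a tokenizer. A sketch using OpenAI's tiktoken; the exact ratio depends on your data and tokenizer:

```python
import json
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

record = {"city": "Oslo", "population": 709000, "country": "Norway"}
as_json = json.dumps(record, indent=2)
as_markdown = "\n".join(f"- {k}: {v}" for k, v in record.items())

print(len(enc.encode(as_json)))      # quotes, braces, and commas all cost tokens
print(len(enc.encode(as_markdown)))  # same information, fewer tokens
```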

The case against markdown is straightforward: it has no native schema enforcement, and if your downstream system requires structured output, you need a separate parsing layer. That's not a reason to avoid markdown in the context itself — it's a reason to keep output format separate from context format, which most systems should do anyway.

When Plain Text Wins

The most counterintuitive finding in the literature is that plain text sometimes outperforms both JSON and markdown on tasks where you'd expect structure to help.

On code generation benchmarks, both GPT-3.5 and GPT-4 score higher with plain text context than with markdown or JSON formatting of the same underlying information. The effect is particularly strong for GPT-4 on the HumanEval benchmark. The hypothesis is that structured formatting constrains the model's token generation in the early decoding steps, before it has committed to an approach. Plain text leaves more room for the model to work through the problem before committing to output structure.

Plain text also handles factual retrieval tasks robustly. When you're passing a dense block of reference material and asking the model to extract a specific fact, plain text often works as well as more structured alternatives, provided the material is coherent and the retrieval question is clear. Where plain text fails predictably is multi-entity disambiguation: when the context contains multiple entities with similar attributes, and the model needs to track which property belongs to which entity, unstructured text produces confabulation at a higher rate than keyed formats.

The practical heuristic: use plain text when you need the model to reason over prose freely, and switch to a keyed format (markdown-KV, YAML, or minimal JSON) when the context contains multiple entities or structured records that need unambiguous attribution.
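
As a sketch of that switch, here are the same two easily-confused entities rendered both ways; the records are made up for illustration:

```python
people = [
    {"name": "A. Chen", "role": "reviewer", "location": "Oslo"},
    {"name": "A. Cheng", "role": "author", "location": "Osaka"},
]

def as_prose(entities):
    # Unstructured: attribute-to-entity binding rests entirely on sentence structure.
    return " ".join(
        f"{e['name']} is the {e['role']}, based in {e['location']}." for e in entities
    )

def as_keyed(entities):
    # Keyed: every attribute is restated under its entity's own heading.
    blocks = []
    for e in entities:
        lines = [f"### {e['name']}"] + [f"- {k}: {v}" for k, v in e.items() if k != "name"]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)
```

With two near-identical names in play, the keyed form leaves no room for the model to swap attributes between them.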

The Mixing Problem

Most real agent systems don't pick one format — they mix them. A system prompt in markdown, retrieved context in JSON from an API response, tool results in structured text, and conversation history in plain text. This mixture is where format-induced confabulation becomes a real risk.

When context contains inconsistently formatted sections, models have a tendency to apply the reasoning mode from one section to another. JSON-formatted API results mixed with markdown instructions can cause the model to treat the markdown as data to be extracted rather than instructions to follow. Switching between list-based and numbered formatting within a single prompt introduces inconsistencies that, in practice, increase hallucination rates.

The engineering discipline that addresses this is format consistency within each logical layer. System instructions in a consistent style (markdown or plain text, your choice, but not both). Retrieved context in a consistent format with explicit section headers that tell the model what kind of content follows. Tool results handled the same way every time, with validation before they enter the context window.

The broader principle: the cost of mixing formats is a reasoning load on the model. Every format switch is an implicit signal that the model has to interpret before proceeding. Minimize that load by standardizing within each context layer, and always provide explicit transition cues ("Below is structured data from the API:", "The following are instructions:") when mixing is unavoidable.
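
A sketch of that discipline as a context assembler; the layer boundaries and cue strings here are illustrative, not a standard:

```python
import json

def build_context(instructions, api_payload, history):
    """Assemble the prompt from layers, one consistent format per layer,
    with an explicit cue at every format transition."""
    sections = [
        "The following are instructions:",
        instructions.strip(),               # one style throughout: markdown or plain text
        "Below is structured data from the API:",
        json.dumps(api_payload, indent=2),  # JSON stays quarantined in its own section
        "The following is the conversation so far:",
        "\n".join(history),                 # plain text, one line per turn
    ]
    return "\n\n".join(sections)
```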

A Decision Framework That Works in Practice

Most teams should think about format across three distinct surfaces: context inputs, chain-of-thought workspace, and structured outputs.

For context inputs — the information you pass to orient the model — markdown is the best default for prose-based context, markdown key-value for structured records, and YAML when you need schema-like structure with more token efficiency than JSON. Reserve JSON input for cases where the data came from a machine and you're passing it through with minimal transformation.
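
For that middle ground, here is the same record in both encodings using PyYAML; a sketch, and whether the token savings materialize depends on your data:

```python
import json
import yaml  # PyYAML

record = {"ticket": 4211, "severity": "high", "tags": ["billing", "refund"]}

print(json.dumps(record, indent=2))             # braces, quotes, and commas on every line
print(yaml.safe_dump(record, sort_keys=False))  # same structure, lighter punctuation
```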

For chain-of-thought workspace — the reasoning scratchpad, if your system uses one — plain text or markdown with numbered steps consistently outperforms JSON-wrapped reasoning. The model needs freedom to explore before committing.
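
For example, a numbered-steps scratchpad instruction might be as simple as this; the wording is illustrative:

```python
SCRATCHPAD_INSTRUCTION = """Work through the problem before answering.

1. Restate the question in your own words.
2. List the facts from the context that bear on it.
3. Reason step by step toward an answer.
4. Give the final answer on the last line, prefixed with "Answer:"."""
```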

For structured outputs — what the model hands off to downstream systems — JSON with explicit schema enforcement is the right call. Use a validation gate that returns schema violations as feedback, not silent failures. This is where the structure actually serves a purpose: downstream code expects a contract, and the schema is that contract.

The format selection discipline that works is to make the choice explicit for each surface, test it with your specific model and tasks before committing it to production, and treat format as a hyperparameter that can be measured — not a stylistic preference that doesn't affect outcomes.

Evaluating Format Choice Before You Lock It In

Format effects are highly task-specific and model-specific. The only reliable way to know which format serves your production case is to measure it. Set up a minimal eval that runs your representative prompts in each candidate format against your target model. Measure accuracy, latency, token cost, and output compliance. Run it before you commit a format to production.
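
A minimal version of that harness might look like this; `render`, `call_model`, and `is_correct` are hypothetical hooks for your own prompt templates, model client, and output checker:

```python
import time
import statistics

FORMATS = ["markdown", "json", "plain"]

def eval_formats(cases, render, call_model, is_correct):
    """Run every representative case in every candidate format."""
    results = {}
    for fmt in FORMATS:
        latencies, correct = [], 0
        for case in cases:
            prompt = render(case, fmt)      # hypothetical: encode the case in this format
            start = time.perf_counter()
            output = call_model(prompt)     # hypothetical: your model client
            latencies.append(time.perf_counter() - start)
            correct += is_correct(case, output)  # hypothetical: accuracy/compliance check
        results[fmt] = {
            "accuracy": correct / len(cases),
            "median_latency_s": statistics.median(latencies),
        }
    return results
```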

This is not a one-time exercise. Model updates change format sensitivity. The transition from GPT-4 to newer reasoning models (o3, o4) has reduced format sensitivity for some tasks — but not eliminated it. Baseline your format performance on each model upgrade, especially if you observe quality regressions that don't have an obvious explanation.

The teams that find format-induced regressions tend to find them accidentally — after a model upgrade, after switching between API providers, or after adding a new data source that changes the context composition. The teams that don't get surprised are the ones that treat format as a measurable system property with a benchmark attached to it, not a historical accident inherited from whoever wrote the first prototype.

Format is not neutral. Treat it like the engineering decision it is.
