
Beyond JSON Mode: Getting Reliable Structured Outputs from LLMs in Production

9 min read
Tian Pan
Software Engineer

You deploy a pipeline that extracts customer intent from support tickets. You've tested it extensively. It works great. Three days after launch, an alert fires: the downstream service is crashing on KeyError: 'category'. The model started returning ticket_category instead of category — no prompt change, just a model update your provider rolled out silently.

This is the structured output problem. And JSON mode doesn't solve it.

Getting LLMs to produce machine-readable output that reliably conforms to a specific shape is one of those problems that looks trivially easy — "just tell it to return JSON" — until it breaks in production at 3am. The failure modes are subtle, the solutions are layered, and the tradeoffs between approaches matter enormously depending on whether you're running cloud APIs or self-hosted inference.

Why "Return JSON" Is Not a Strategy

The instinct to solve structured output with a prompt like "Respond only with valid JSON in the following format: ..." is understandable. It works often enough in testing to feel solved. But prompt-only JSON extraction has failure rates of 5–20% depending on schema complexity, and those failures cluster in the worst possible ways.

The most common failures:

  • Preamble contamination. The model outputs "Sure! Here's the JSON you requested: {...}" — syntactically broken for any JSON parser.
  • Hallucinated keys. The model invents field names that weren't in the schema. You asked for status, you get current_state. Both make semantic sense to the model; only one is correct for your parser.
  • Missing required fields. Instead of returning null for fields it doesn't know, the model simply omits them.
  • Type drift. You get "42" (a string) where you needed 42 (an integer).
  • Truncation mid-structure. On long outputs that approach the token limit, generation is cut off abruptly, leaving unclosed braces and a payload no parser can recover.

JSON mode (response_format: { type: "json_object" }) solves the first problem — it prevents obviously malformed JSON — but does nothing about the other four. It promises valid JSON syntax, not schema compliance.
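Those remaining four failures are exactly what you have to check yourself when relying on JSON mode alone. A minimal stdlib sketch of those post-parse checks, using a hypothetical two-field ticket schema:

```python
import json

# Hypothetical schema: required keys and their expected Python types.
REQUIRED = {"category": str, "priority": int}

def parse_ticket(raw: str) -> dict:
    """The checks JSON mode does NOT do: required keys, no extras, types."""
    data = json.loads(raw)                    # JSON mode only guarantees this line
    missing = REQUIRED.keys() - data.keys()   # omitted required fields
    extra = data.keys() - REQUIRED.keys()     # hallucinated keys
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")
    for key, typ in REQUIRED.items():
        if not isinstance(data[key], typ):    # type drift, e.g. "42" vs 42
            raise TypeError(f"{key}: expected {typ.__name__}")
    return data

parse_ticket('{"category": "billing", "priority": 2}')               # passes
# parse_ticket('{"ticket_category": "billing", "priority": "2"}')    # raises
```

In practice you would use Pydantic rather than hand-rolled checks, but the point stands: everything below the `json.loads` line is your responsibility under plain JSON mode.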

The Four Generations of Solutions

The field has evolved through four distinct approaches, each solving a different layer of the problem.

Generation 1: Prompt Engineering

The baseline. Works at low volume with simple schemas and forgiving consumers. Fails at scale, with complex schemas, and after model updates. The correct mental model: prompt-based approaches treat schema compliance as a soft instruction to the model, not as a hard constraint. The model can and will deviate.

Generation 2: Function Calling and Tool Use

Starting in 2023, major providers began fine-tuning models to generate function arguments that conform to a JSON schema. When you define a tool or function with a schema, the model produces syntactically valid output that closely follows the schema. This isn't a mathematical guarantee — the model is fine-tuned to follow schemas as instructions — but reliability jumps dramatically compared to prompt-only approaches.

Function calling is currently the practical default for most cloud API users. It works across OpenAI, Anthropic, Cohere, and Groq with largely consistent behavior, though edge cases differ.
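A tool definition is just a JSON Schema attached to a function name. This sketch uses the OpenAI-style `tools` payload shape (other providers use similar but not identical formats); the tool name and fields are hypothetical:

```python
import json

# Hypothetical tool for classifying support tickets. The model is fine-tuned
# to emit arguments conforming to "parameters" when it calls this tool.
record_intent_tool = {
    "type": "function",
    "function": {
        "name": "record_intent",
        "description": "Record the classified intent of a support ticket.",
        "parameters": {
            "type": "object",
            "properties": {
                "category": {
                    "type": "string",
                    "enum": ["billing", "bug", "feature_request", "other"],
                },
                "urgency": {"type": "integer", "minimum": 1, "maximum": 5},
            },
            "required": ["category", "urgency"],
        },
    },
}

# The model returns tool arguments as a JSON string; you still parse it.
# e.g. tool_call.function.arguments might look like this:
args = json.loads('{"category": "billing", "urgency": 2}')
```

Because the schema is fine-tuned-in rather than enforced, you should still validate `args` before trusting it, exactly as with prompt-only extraction.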

Generation 3: Native Schema-Enforced APIs

In mid-2024, both OpenAI and Google shipped API-level schema enforcement. OpenAI's response_format: { type: "json_schema" } with strict: true and Google Gemini's response_schema parameter guarantee 100% schema compliance for accepted schemas. These APIs use constrained decoding under the hood — invalid tokens are masked at each generation step, making schema violations literally impossible.

There are meaningful limitations. OpenAI's implementation accepts only a subset of JSON Schema: minLength, maxLength, minItems, maxItems, and complex regex patterns are excluded. Schemas must set additionalProperties: false on all objects, or the model may generate extra keys. Neither provider's implementation supports the full breadth of real-world JSON Schema.
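For reference, a strict-mode payload looks like the sketch below (field names hypothetical). Note the two strict-mode requirements in action: `additionalProperties: false` and every property listed under `required`:

```python
# Sketch of an OpenAI json_schema response_format with strict mode enabled.
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "ticket_intent",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "category": {
                    "type": "string",
                    "enum": ["billing", "bug", "other"],
                },
                "summary": {"type": "string"},
            },
            # Strict mode: all properties required, no extra keys allowed.
            "required": ["category", "summary"],
            "additionalProperties": False,
        },
    },
}
```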

Generation 4: Constrained Decoding for Self-Hosted Models

For teams running local inference, constrained decoding operates at the inference engine level. Instead of prompting or fine-tuning, the engine modifies token probability distributions at generation time — masking every token that would produce output violating the target grammar.

Two algorithmic approaches dominate:

Finite State Machine (FSM), pioneered by Outlines, compiles a JSON Schema or regex into a state machine. At each generation step, only tokens that keep the FSM in a valid state are allowed. Compilation is the expensive operation — complex schemas can take 8–60 seconds to compile; pathological schemas with large enum unions have been measured at over 10 minutes. Generation itself is cheap, and grammars are cacheable.

Pushdown Automaton / Context-Free Grammar, used by XGrammar (from CMU's MLC team), handles context-free grammars with stack-based state tracking. XGrammar is now the default structured output backend in vLLM, with benchmarks showing up to 100x speedup over naive grammar-constrained methods and near-zero per-token latency overhead. llguidance, Microsoft's grammar library, powers OpenAI's own implementation and achieves 6–9ms per-token latency versus 15–46ms for Outlines-based approaches.

Choosing a Library

The library landscape is mature and opinionated. Your choice should depend on your deployment model.

For cloud APIs (OpenAI, Anthropic, etc.): Instructor is the pragmatic default. It wraps 15+ providers with a consistent interface, integrates Pydantic for schema definition and validation, and includes configurable retry logic. With over 3 million monthly downloads, it's battle-tested across diverse production workloads. The API feels like a thin wrapper over the native client rather than an abstraction — you keep control.
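The typical Instructor pattern, sketched below under some assumptions: Pydantic v2, the `instructor.from_openai` entry point, a placeholder model name, and an API key in the environment. The imports are deferred so the Pydantic model itself needs no API client installed:

```python
from pydantic import BaseModel, Field

class TicketIntent(BaseModel):
    category: str = Field(description="e.g. billing, bug, feature_request")
    urgency: int = Field(ge=1, le=5)

def extract_intent(text: str) -> TicketIntent:
    # Lazy imports; requires `instructor` and `openai` installed and
    # OPENAI_API_KEY set. Model name is a placeholder.
    import instructor
    from openai import OpenAI

    client = instructor.from_openai(OpenAI())
    return client.chat.completions.create(
        model="gpt-4o-mini",
        response_model=TicketIntent,   # Instructor validates and retries
        messages=[{"role": "user",
                   "content": f"Extract the intent from: {text}"}],
    )
```

The `response_model` argument is the whole point: you get back a validated `TicketIntent` instance or an exception after retries, never a raw string to parse.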

For local high-throughput inference: Use vLLM with XGrammar (the default since vLLM 0.6.x) or Outlines as the fallback. Both provide token-level guarantees. Outlines additionally has a clean Python API for defining schemas as Pydantic models.
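A guided-decoding sketch for vLLM, under assumptions: a GPU host with vLLM installed, a placeholder model name, and the `GuidedDecodingParams(json=...)` API as shipped in recent vLLM releases (the exact import path may differ in your version). Imports are deferred so the schema definition runs anywhere:

```python
import json

INTENT_SCHEMA = json.dumps({
    "type": "object",
    "properties": {
        "category": {"type": "string", "enum": ["billing", "bug", "other"]},
    },
    "required": ["category"],
})

def classify_local(prompt: str) -> str:
    # Requires vLLM on a GPU host; model name is a placeholder.
    from vllm import LLM, SamplingParams
    from vllm.sampling_params import GuidedDecodingParams

    llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
    params = SamplingParams(
        max_tokens=64,
        # Token-level enforcement: output is guaranteed to match the schema.
        guided_decoding=GuidedDecodingParams(json=INTENT_SCHEMA),
    )
    return llm.generate([prompt], params)[0].outputs[0].text
```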

For maximum schema coverage on local models: Guidance (backed by llguidance) supports a broader subset of JSON Schema than other frameworks and achieves the lowest per-token latency in independent benchmarks — important when throughput is a constraint.

For the simplest possible API on OpenAI: Marvin offers a cast() / extract() / classify() interface that requires minimal boilerplate. Good for internal tooling; too opinionated for multi-provider production use.

A 2025 benchmark of 10,000 real-world JSON schemas from GitHub and Kubernetes production configs found that the best-performing constrained decoding framework supports roughly twice as many schemas as the worst-performing one — a meaningful gap if your application uses non-trivial schemas.

Schema Design Matters More Than You Think

Even with perfect enforcement at the syntax level, poorly designed schemas create runtime pain. A few principles that hold across all approaches:

Keep it flat. Deeply nested schemas confuse both prompt-guided models and constrained decoders. If you have a schema that's 5 levels deep, consider whether it genuinely needs to be — or whether denormalization would serve your consumers better.

Name keys intuitively. Models are trained on real-world JSON. Keys that match common conventions (first_name, status, created_at) produce more semantically correct fill-in than opaque abbreviations (fn, st, ca). With constrained decoding, key names don't affect syntactic compliance — but they affect whether the content is correct.

Use enums aggressively. For any field with a finite set of valid values, use an enum type. It benefits both approaches: constrained decoders can enumerate allowed tokens, and prompt-guided models see a clear constraint.

Mark all fields as required explicitly. Don't rely on models inferring optionality. If a field can be absent, give it a nullable type with an explicit null option rather than making the field optional.

Set additionalProperties: false everywhere. This is required by OpenAI's strict mode and is good practice universally. Without it, models will invent keys.

Avoid anyOf and oneOf on complex schemas. Union types dramatically increase FSM compilation complexity. Where possible, use a discriminated union (a type field that determines the shape) or flatten the union into separate calls.

Add description fields. Even with constrained decoding guaranteeing structure, the model still fills in content. Schema property descriptions help the model understand what to put in each field — particularly useful for fields where the key name is ambiguous.
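Several of these principles fall out naturally from a single Pydantic v2 model (field names hypothetical): enums for finite values, nullable-but-required fields instead of optional ones, `extra="forbid"` to emit `additionalProperties: false`, and descriptions on every field:

```python
from enum import Enum
from typing import Optional
from pydantic import BaseModel, ConfigDict, Field

class Category(str, Enum):          # enum for a finite value set
    billing = "billing"
    bug = "bug"
    other = "other"

class TicketIntent(BaseModel):
    # extra="forbid" puts additionalProperties: false in the JSON schema
    model_config = ConfigDict(extra="forbid")

    category: Category = Field(description="Primary topic of the ticket")
    # Nullable but required: present in every output, null when unknown,
    # rather than an optional field the model may silently omit.
    order_id: Optional[str] = Field(
        description="Order ID if one is mentioned, else null")

schema = TicketIntent.model_json_schema()
```

Passing `schema` (or the model itself, via Instructor) to any of the approaches above gives you a flat, enum-constrained, closed-world schema by construction.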

The Ceiling Every Approach Hits

There's a limitation that no enforcement technique can solve: constrained decoding guarantees syntactic conformance, not semantic correctness.

A system with perfect schema enforcement can reliably produce {"sentiment": "positive"} — valid JSON, correct type, valid enum value. Whether the sentiment label is actually correct for the input text is a completely separate question. The schema cannot express "the sentiment should be accurate." That's a content quality problem, and it requires evals and LLM-as-judge tooling, not schema enforcement.

This distinction matters when teams celebrate "100% structured output reliability" without measuring whether the structured values are actually right. Syntactic reliability is necessary but not sufficient for production-quality outputs.

Production Patterns That Hold Up

Retry with error context. When operating without constrained decoding, catch parse failures and re-prompt with the error message and a compact inline schema example. Most models can self-correct on a second attempt. Three retries are usually sufficient; beyond that, the model is unlikely to recover on this input.
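The retry loop is a few lines of stdlib Python. This sketch uses a stand-in `call_model` callable and a fake model that fails once, then self-corrects:

```python
import json

def retry_structured(call_model, schema_hint: str, prompt: str,
                     max_retries: int = 3) -> dict:
    """Re-prompt with the parse error and a compact schema example on failure."""
    message = prompt
    for _ in range(max_retries):
        raw = call_model(message)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as e:
            # Feed the error and the expected shape back to the model.
            message = (f"{prompt}\n\nYour previous reply failed to parse "
                       f"({e}). Respond with ONLY JSON matching: {schema_hint}")
    raise ValueError("model did not produce parseable JSON")

# Fake model: first reply has a preamble (unparseable), second is clean.
replies = iter(['Sure! {"category": "billing"}', '{"category": "billing"}'])
result = retry_structured(lambda _: next(replies),
                          '{"category": "<string>"}', "Classify the ticket.")
```

In production, `call_model` is your LLM client call and the failure path should also emit a metric, so you notice when retry rates climb after a model update.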

json_repair as a buffer. Before attempting to parse, run the response through json_repair — a library that patches common minor syntax errors (trailing commas, unquoted keys, missing closing braces). This handles a significant fraction of generation failures without a full retry round-trip.
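To show the idea without the dependency, here is a crude stdlib approximation of two of json_repair's fixes — stripping preamble/trailing prose and closing braces left open by truncation. The real library handles far more cases (unquoted keys, trailing commas inside objects, and so on):

```python
import json

def crude_repair(raw: str) -> str:
    """Toy approximation of json_repair: strip surrounding prose, close
    unbalanced braces. Naive (ignores braces inside strings)."""
    start = raw.find("{")
    if start == -1:
        raise ValueError("no JSON object found")
    raw = raw[start:]
    depth, end = 0, len(raw)
    for i, ch in enumerate(raw):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        if depth == 0:            # full object found; drop trailing prose
            end = i + 1
            break
    raw = raw[:end].rstrip().rstrip(",")
    return raw + "}" * max(depth, 0)   # close truncated braces

json.loads(crude_repair('Sure! {"a": 1} Hope that helps!'))  # preamble case
json.loads(crude_repair('{"a": {"b": 1,'))                   # truncation case
```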

Cache compiled grammars. If you're running Outlines or XGrammar, compilation is expensive and generation is cheap. Compile schemas at startup and reuse them. A schema that takes 10 seconds to compile costs nothing on subsequent requests.
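The caching pattern itself is trivial if you key on the canonical schema string. A sketch with a stand-in for the expensive compile step (in real code that would be, e.g., Outlines' schema-to-FSM build):

```python
from functools import lru_cache

@lru_cache(maxsize=128)
def compiled_grammar(schema_json: str):
    # Stand-in for an expensive compile (seconds); keyed on the canonical
    # schema string so identical schemas share one compiled object.
    return ("compiled-grammar-for", schema_json)

g1 = compiled_grammar('{"type": "object"}')
g2 = compiled_grammar('{"type": "object"}')
assert g1 is g2   # second call is a cache hit: no recompilation
```

The one caveat: make sure the schema string is truly canonical (stable key ordering, no incidental whitespace differences), or logically identical schemas will miss the cache and recompile.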

Schema versioning. Treat schemas as versioned interfaces, not incidental JSON blobs. Adding a required field is a breaking change. Changing a field type is a breaking change. When schemas evolve, version them explicitly and migrate consumers deliberately.

Monitor for semantic drift, not just parse failures. After solving syntactic compliance, the remaining risk is that structurally valid values are semantically wrong — extracted entities that don't exist in the source text, classifications that shifted after a model update. Set up downstream consistency checks that catch these silently incorrect outputs before they propagate.

Where to Start

If you're using cloud APIs and currently relying on raw JSON mode: adopt Instructor with strict: true and Pydantic models for your schemas. The migration is mechanical and the reliability improvement is immediate.

If you're running local inference at meaningful scale: move to vLLM with XGrammar. The infrastructure overhead pays for itself in reduced retry latency and eliminated parser error handling.

Either way, start with flat schemas, explicit required fields, and additionalProperties: false. Those three decisions eliminate the majority of production structured output failures before any library or API change.

The "just ask for JSON" era is over. The tooling to do this correctly is mature, well-documented, and in most cases requires minimal code changes to adopt.
