
Structured Outputs in Production: Engineering Reliable JSON from LLMs

· 10 min read
Tian Pan
Software Engineer

LLMs are text generators. Your application needs data structures. The gap between those two facts is where production bugs live.

Every team building with LLMs hits this wall. The model works great in the playground — returns something that looks like JSON, mostly has the right fields, usually survives JSON.parse. Then you ship it, and your parsing layer starts throwing exceptions at 2am. The response had a trailing comma. Or a markdown code fence. Or the model decided to add an explanatory paragraph before the JSON. Or it hallucinated a field name.

The industry has spent three years converging on solutions to this problem. This is what that convergence looks like, and what still trips teams up.

The Three Maturity Levels

There's a clear progression in how teams approach structured output, and each level has a real reliability ceiling.

Level 1: Prompt engineering. You write "respond only with valid JSON in this format:" and show an example. This works 80–95% of the time on simple schemas. The failure modes are subtle: the model adds a preamble on complex prompts, wraps JSON in a code block when the schema gets long, or silently omits optional fields. You add a regex cleanup step and a try/catch, and you convince yourself it's fine.

It is not fine for anything serious. A 95% parse success rate sounds high until you have a 10-step agent chain: 0.95^10 ≈ 0.60. Four in ten agent runs fail to complete. The math is unforgiving.
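The compounding is easy to verify. A minimal sketch, with the per-step rates here chosen as illustrative assumptions:

```python
# Success probability of an n-step chain where every step must parse:
# the whole run fails if any single step returns unparseable output.
def chain_success(per_step: float, steps: int) -> float:
    return per_step ** steps

# A 95% per-step parse rate collapses quickly as chains grow.
print(round(chain_success(0.95, 1), 2))    # single call: 0.95
print(round(chain_success(0.95, 10), 2))   # 10-step chain: 0.6
print(round(chain_success(0.999, 10), 2))  # near-perfect steps leave headroom
```

This is why per-step reliability matters far more in agent chains than in single-shot calls.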

Level 2: Function calling / tool use. All major providers expose an API where you define a JSON schema and the model is supposed to fill it. This gets you to 95–99% reliability. The catch: the schema is a hint, not a constraint. The model sees the schema as part of its context and learns to follow it — but nothing in the decoding process prevents it from generating invalid tokens. Providers can still return a malformed payload, especially with complex schemas or edge-case inputs.

Level 3: Native structured output with constrained decoding. This is where 100% schema validity becomes mathematically guaranteed. The inference engine builds a finite state machine from your schema and masks invalid tokens at every generation step. The model literally cannot produce output that fails to parse. OpenAI's response_format with json_schema, Gemini's response_schema, and open-source frameworks like Outlines use this approach.

If you're building anything that needs reliable downstream parsing — classification pipelines, agent tool calls, data extraction — you want Level 3.

How Constrained Decoding Actually Works

The implementation is worth understanding because it shapes what schemas you can and can't use.

At each generation step, the model produces a probability distribution over its entire vocabulary (50,000+ tokens). Normally, you sample from that distribution. With constrained decoding, you first build a finite state machine representing every valid path through your JSON schema. Before sampling, you compute a token mask: a boolean vector where false means "this token cannot appear here given the current state in the FSM." You zero out those logits and sample from what remains.

The result: the model can only ever produce tokens that advance toward valid completion of the schema. It's not post-processing — it's baked into every single decoding step.
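The mask-then-sample step can be sketched in a few lines. This is a toy illustration with a hypothetical seven-token vocabulary, not a real engine — production implementations compute the mask over a 50,000+ token vocabulary in microseconds:

```python
import math

# Toy constrained decoding step: mask logits so only tokens that are
# legal in the current FSM state can be sampled.
vocab = ['{', '}', '"', ':', ',', 'name', '42']
logits = [0.1, 2.5, 1.2, 0.3, 0.9, 1.8, 2.1]

# Suppose the FSM says: at the start of a JSON object, only '{' is legal.
allowed = {'{'}
mask = [tok in allowed for tok in vocab]

# Set every grammar-forbidden logit to -inf, then softmax what remains.
masked = [l if ok else -math.inf for l, ok in zip(logits, mask)]
exps = [math.exp(l) for l in masked]  # exp(-inf) == 0.0
total = sum(exps)
probs = [e / total for e in exps]

# All probability mass lands on grammar-legal tokens, so even greedy
# sampling cannot produce an invalid character here.
next_token = vocab[probs.index(max(probs))]
print(next_token)  # {
```

Note that '}' had the highest raw logit (2.5), but the mask makes it unreachable — that is the whole mechanism.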

The practical overhead was a concern early on. Building the initial FSM for a complex schema can take 50–200ms. But engines like XGrammar (from the MLC team) achieve token mask generation in under 40 microseconds per token, and subsequent requests reuse a cached FSM with near-zero overhead. For simple schemas, the latency impact is under 5%. For deeply nested schemas with large enum sets, it can reach 30–60% — which is a real signal to simplify your schema.

Schema Design: Where Teams Go Wrong

Even with constrained decoding enforcing syntactic validity, bad schema design causes semantic failures. These are the patterns that bite most teams:

Put reasoning before conclusions. If your schema has a reasoning field and a classification field, put reasoning first. LLMs generate tokens left to right. When the model writes out its reasoning before committing to a classification, it produces better classifications. If you put the answer field first, the model commits to a label before thinking, then rationalizes in the reasoning field. This sounds like an LLM quirk but it consistently shifts accuracy by several percentage points.
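In Pydantic terms, the ordering is just field declaration order, since the generated JSON schema preserves it. A minimal sketch with a hypothetical sentiment schema:

```python
from pydantic import BaseModel

# Field order controls generation order: the model emits `reasoning`
# tokens first, then produces `label` conditioned on that reasoning.
class Classification(BaseModel):
    reasoning: str  # generated first: the model "thinks" here
    label: str      # generated second, after the reasoning is written

# Pydantic preserves declaration order in the schema's properties,
# which is the order the model fills fields in.
print(list(Classification.model_json_schema()["properties"]))
```

Swapping the two declarations is all it takes to get the answer-first anti-pattern.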

Flatten your schema. Nesting is the enemy of reliability. OpenAI's native structured output caps at 5 levels of nesting and 100 total properties. Beyond that, grammar compilation time spikes and per-token overhead grows. More importantly, deeply nested schemas with 4+ levels have measurably higher error rates even with constrained decoding — the model has more opportunities to lose track of context. If your schema is deeply nested, ask whether the nesting reflects actual data hierarchy or just organizational preference.

Describe every field. Pydantic's Field(description=...) values are passed to the model as inline instructions. Without descriptions, the model infers semantics from field names alone. confidence: float — is that 0–1 or 0–100? status: str — what are the valid values? Field descriptions are not documentation; they are prompt instructions that directly affect output quality.
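A sketch of how descriptions flow into the schema the model sees (the field semantics here are illustrative):

```python
from pydantic import BaseModel, Field

# Field descriptions travel into the generated JSON schema, so they
# act as per-field prompt instructions, not just documentation.
class Review(BaseModel):
    confidence: float = Field(description="Calibrated confidence in [0, 1], not a percentage.")
    status: str = Field(description='One of "open", "triaged", or "closed".')

schema = Review.model_json_schema()
print(schema["properties"]["confidence"]["description"])
```

Without those descriptions, the model would be guessing the scale of confidence and the vocabulary of status from the names alone.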

Handle optionality explicitly. OpenAI's strict structured output requires every field in the schema to appear in the output — a field cannot simply be omitted. If a value can genuinely be absent, make the type nullable (Optional[str] with a default of None) so the model can return null instead of dropping the key. Providers handle this distinction differently, and getting it wrong produces cryptic "Invalid schema" errors at runtime.
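A sketch of nullable modeling in Pydantic (the extraction schema here is hypothetical):

```python
from typing import Optional
from pydantic import BaseModel

# A nullable field: the key is always present in the output, but the
# model may return null -- the shape strict structured output expects.
class Extraction(BaseModel):
    company: str
    ticker: Optional[str] = None  # null for private companies

# Pydantic emits the nullable union in the generated schema.
props = Extraction.model_json_schema()["properties"]["ticker"]
print(props["anyOf"])  # string-or-null union
```

The anyOf union with a null branch is what tells a strict-mode provider that absence means null, not a missing key.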

Avoid complex patterns. Regex-constrained fields with complex patterns, oneOf with many branches, and recursive schemas create combinatorial explosions in the FSM. If you need "one or more items matching a pattern," consider splitting the problem into multiple sequential calls rather than expressing it in a single schema.

The Provider Landscape in Practice

Each major provider has a different API surface, and the abstractions do not translate cleanly across providers.

OpenAI offers the most mature implementation. Use client.beta.chat.completions.parse() with a Pydantic model — it handles schema conversion and gives you a typed Python object back. The response_format approach with raw JSON schemas also works but requires manual schema construction. The .parse() method is the right default.

Anthropic does not have a dedicated structured output API. The idiomatic pattern is to force tool use: define your schema as a tool, then set tool_choice to force the model to call it. Without tool_choice: {type: "tool", name: "your_tool"}, the model may choose not to use the tool at all. This isn't constrained decoding — it's still Level 2 — but it's significantly more reliable than prompt engineering.
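The pattern can be sketched as raw request payloads. The tool name and schema below are hypothetical; the field names follow Anthropic's Messages API:

```python
# A schema expressed as a tool definition for Anthropic's Messages API.
extract_tool = {
    "name": "record_classification",
    "description": "Record the sentiment classification for the input text.",
    "input_schema": {
        "type": "object",
        "properties": {
            "reasoning": {"type": "string"},
            "label": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        },
        "required": ["reasoning", "label"],
    },
}

# Forcing the tool is the critical part; without it the model may
# answer in prose and never call the tool.
tool_choice = {"type": "tool", "name": "record_classification"}

# With the anthropic SDK this becomes roughly:
#   client.messages.create(model=..., max_tokens=1024, tools=[extract_tool],
#                          tool_choice=tool_choice, messages=[...])
# and the structured payload arrives in a tool_use content block.
```

The payload in the tool_use block then gets validated against your real schema on your side, exactly as with any other Level 2 approach.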

Google Gemini offers response_schema with constrained decoding, similar to OpenAI's approach. The API takes a raw JSON schema rather than a Pydantic model, so you'll need schema conversion tooling.

For teams working across multiple providers, the Instructor library abstracts the differences. It provides a consistent client.chat.completions.create(response_model=YourPydanticModel) interface across OpenAI, Anthropic, Gemini, and others. Instructor also handles automatic retries on validation failures — if the model returns something that fails Pydantic validation, it re-prompts with the error message and tries again.

The Validation Sandwich

Even when using native structured output, always add a validation layer on top. This isn't paranoia — it's defense against semantic failures that syntactic constraints can't catch.

from openai import OpenAI
from pydantic import BaseModel, field_validator

class ClassificationResult(BaseModel):
    reasoning: str
    label: str
    confidence: float

    @field_validator("confidence")
    @classmethod
    def confidence_must_be_normalized(cls, v):
        if not 0.0 <= v <= 1.0:
            raise ValueError(f"confidence must be between 0 and 1, got {v}")
        return v

    @field_validator("label")
    @classmethod
    def label_must_be_valid(cls, v):
        valid_labels = {"positive", "negative", "neutral"}
        if v not in valid_labels:
            raise ValueError(f"label must be one of {valid_labels}, got {v}")
        return v

client = OpenAI()
result = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[...],
    response_format=ClassificationResult,
)

# result.choices[0].message.parsed is already a ClassificationResult,
# and Pydantic validators run during construction, so they've already fired

The schema enforces structure. Pydantic validators enforce semantics. You need both.

Constrained decoding guarantees syntactic validity, not semantic correctness. A model can return confidence: 1.7 in a float field and satisfy the schema. It can return a label from the schema's enum that's semantically wrong for the input. Validators catch the former; evals catch the latter.

Structured Output in Agent Chains

The reliability math gets worse in multi-step workflows. Each tool call that returns structured data is a step where schema validation could fail. With Instructor's retry behavior, failures are retried with error context — but retries cost tokens and latency, and some failure modes loop.

Two patterns help here:

Narrow your schemas at every step. Don't carry a large, complex schema through every tool call. At each step, extract only the data you need for the next step. Smaller schemas have lower failure rates and less overhead.

Log schema versions with every call. Schemas evolve, and bugs often come from a schema change that wasn't propagated everywhere. Log the schema version alongside the prompt and response. When something fails, you can replay the exact inputs against the schema that was live at the time.
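One lightweight way to do this is to fingerprint the schema itself, so the log entry pins the exact structure that was live. A sketch, with a hypothetical step schema:

```python
import hashlib
import json
from pydantic import BaseModel

class StepOutput(BaseModel):
    summary: str
    next_action: str

def schema_fingerprint(model: type[BaseModel]) -> str:
    """Stable short hash of a model's JSON schema, for logging."""
    canonical = json.dumps(model.model_json_schema(), sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

# Attach this to every request/response pair you log; when a call
# fails later, replay its inputs against the schema with this hash.
log_record = {
    "schema": "StepOutput",
    "schema_hash": schema_fingerprint(StepOutput),
}
print(log_record["schema_hash"])
```

Any change to a field name, type, or description changes the hash, so silent schema drift shows up in the logs immediately.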

What Still Doesn't Work

Constrained decoding solves the parsing problem, not the modeling problem. A few failure modes persist regardless of schema enforcement:

Hallucinated enum values. If your schema allows enum: ["gpt-4", "claude-3-5-sonnet", "gemini-2-0-flash"] and you add a new model but forget to update the schema, the model will be forced to return one of the valid values — but it may return the wrong one confidently. Schema constraints don't make models accurate; they make them parseable.

Semantic drift in long chains. In multi-step pipelines, structured output from step N feeds the prompt for step N+1. Errors in meaning — not format — compound in ways that parsing checks can't detect. This is where evals and spot-checking matter more than tooling.

Schema mismatch between callers. In production systems with multiple services, it's common to see the schema definition in the calling service diverge from what the downstream consumer expects. Treat your Pydantic models as the source of truth and share them as a package, not as copy-pasted dictionaries.

The Default Should Be Level 3

The engineering argument for native structured output is straightforward: prompt engineering adds retry complexity, function calling adds validation complexity, and both add failure modes that are annoying to debug at 2am. Native structured output with a Pydantic validation layer gives you the strongest guarantees available and eliminates an entire class of production incidents.

The tooling is mature. XGrammar makes constrained decoding fast enough that latency is rarely a concern for simple schemas. The Instructor library removes provider-specific boilerplate. There's no good reason to ship a new LLM pipeline with Level 1 parsing in 2025.

The one real cost is schema design discipline. Flat schemas, described fields, explicit optionality, reasoning-before-conclusions ordering — these aren't complex requirements, but they take intentionality. That discipline is what separates LLM features that work in demos from pipelines that run reliably in production.
