
Structured Output in Production: Getting LLMs to Return Reliable JSON

· 8 min read
Tian Pan
Software Engineer

At some point in production, every LLM-powered application needs to stop treating model output as prose and start treating it as data. The moment you try to reliably extract a JSON object from a language model — and feed it downstream into a database, API call, or UI — you discover just how many ways this can go wrong. The model wraps JSON in markdown fences. It generates a valid object but omits required fields. It formats dates inconsistently across calls. It hallucinates enum values. Any one of these failures silently corrupts downstream state.

Structured output has evolved from an afterthought into a first-class concern for production LLM systems. This post covers the three main mechanisms for enforcing it, where each breaks down, and how to design schemas that keep quality high under constraint.

Three Ways to Enforce Structure (and Their Failure Modes)

1. Prompt-Only JSON Mode

The simplest approach: ask the model to respond in JSON, either freeform or with a schema example in the prompt. Most providers offer a "JSON mode" that guarantees syntactically valid JSON without enforcing any specific schema — the model just can't generate malformed output.

This works until it doesn't. JSON mode eliminates parse errors, but it gives you no guarantees about:

  • Required fields being present
  • Types being correct ("score": "high" instead of "score": 8)
  • Enum values being constrained to your allowed set
  • Nested structures matching your expected shape

JSON mode is the right tool when you genuinely don't know the output shape ahead of time — exploratory extraction, open-ended summarization. For everything else, you need something stronger.
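When you are relying on prompting alone, without a provider-side JSON mode, even the parse step needs defensive handling. A minimal sketch of the classic fix for fenced output (the helper name is mine, not from any library):

```python
import json
import re


def extract_json(raw: str) -> dict:
    """Parse a model response that should contain a JSON object.

    Handles the common failure where the model wraps the object in
    markdown code fences, which breaks a naive json.loads().
    """
    text = raw.strip()
    # Strip ```json ... ``` or ``` ... ``` fences if present.
    fence = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text, re.DOTALL)
    if fence:
        text = fence.group(1)
    return json.loads(text)
```

This only papers over one failure mode; it does nothing about missing fields or wrong types, which is exactly why the stronger mechanisms below exist.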

2. Constrained Decoding (Native Structured Output)

Modern LLM providers — OpenAI, Anthropic, Google — now offer schema-enforced structured output that works by modifying the model's generation process at the token level. Before each token is generated, tokens that would violate the schema are masked out of the probability distribution. The model physically cannot generate output that doesn't conform to your Pydantic or Zod schema.

OpenAI's Structured Outputs (with `strict: true`), introduced in August 2024, uses this approach. The promise: 100% schema compliance, not "usually correct." For data extraction pipelines and agent tool calls, this is a meaningful guarantee.

But constrained decoding has a subtle cost that's easy to miss: quality degradation under tight constraints.

The mechanism is that the model generates tokens one at a time, without knowing what it will generate next. When schema constraints eliminate all the high-probability tokens, the model must choose from lower-ranked alternatives. The output stays structurally valid, but semantically it reads like a bad translation — technically correct but slightly off.
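The masking mechanism can be illustrated with a toy single decoding step (the probabilities and function name are invented for illustration; real implementations mask logits over the full vocabulary):

```python
def constrained_pick(token_probs: dict[str, float], allowed: set[str]) -> str:
    """Toy version of one constrained decoding step: drop every token the
    schema forbids, then take the argmax of what survives."""
    masked = {tok: p for tok, p in token_probs.items() if tok in allowed}
    if not masked:
        raise ValueError("schema constraints eliminated every candidate token")
    return max(masked, key=masked.get)


# The model's top choices may all be schema-invalid; it is then forced
# onto a low-probability token, which is where quality quietly degrades.
probs = {"great": 0.60, "good": 0.25, "fine": 0.10, "ok": 0.05}
picked = constrained_pick(probs, allowed={"fine", "ok"})  # top two are masked
```

The output is always valid, but when the mask removes the high-probability candidates, the model is effectively forced to say something it "didn't want" to say.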

A concrete example: if you constrain a model to output one of exactly two string values for a classification field, and the correct answer isn't one of them, the model picks the closest option rather than signaling uncertainty. No error is raised. The output is schema-valid and semantically wrong.

This means constrained decoding doesn't eliminate the need for human-readable reasoning in your pipeline — it just changes the shape of failures from crashes to silent misclassifications.

3. Validation with Retry Loops

The third approach treats the LLM as an untrusted source and validates its output against your schema after the fact, automatically retrying on failure with the error message included as context. Libraries like instructor (Python) implement exactly this: wrap your LLM call, validate the response, and if validation fails, send the error back in a follow-up message so the model can self-correct.

This works well for semantic constraints that constrained decoding can't express: "the start date must be before the end date," "this field is required when another field is set," "a low confidence score must be accompanied by an explanation." You define these as Pydantic validators, and the retry loop handles them.

The tradeoff is latency and cost. Each retry is another LLM call. For most applications, one retry catches the vast majority of failures. The important design question is what to do when retries are exhausted — fail closed, fall back to a default, or flag for human review.
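The core loop is small enough to sketch without any library, using plain callables in place of the LLM client (names and signatures here are illustrative, not instructor's actual API):

```python
import json


def call_with_retries(generate, validate, max_retries: int = 1) -> dict:
    """Validate-and-retry loop around an LLM call.

    `generate(feedback)` returns a raw JSON string; `validate(obj)` raises
    ValueError with a descriptive message on failure. The error text is fed
    back so the model can self-correct on the next attempt.
    """
    feedback = None
    for _attempt in range(max_retries + 1):
        raw = generate(feedback)
        try:
            obj = json.loads(raw)
            validate(obj)
            return obj
        except (json.JSONDecodeError, ValueError) as err:
            feedback = f"Previous output was invalid: {err}. Please correct it."
    # Retries exhausted: fail closed here; a fallback default or a
    # human-review queue are the other reasonable choices.
    raise RuntimeError("validation failed after all retries")
```

The terminal branch is where the design question from above lives: raising is failing closed, and swapping it for a default value or a review queue is a one-line policy change.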

Schema Design: The Hidden Lever

Even with perfect enforcement, poor schema design degrades output quality. A few principles that make a measurable difference:

Reason first, commit second. LLMs generate left-to-right. If your schema has a classification field as the first property, the model must commit to a value before it has worked through the available evidence. Move free-text reasoning fields earlier in the schema so the model's chain of thought precedes constrained choices.

Make optional fields actually optional. If a piece of information might not exist in the input, marking a field as required forces the model to hallucinate a value. Use null or optional types wherever data might be absent. A missing field is honest; a hallucinated value is corrupting.
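The first two principles can be combined in one schema. A sketch using a raw JSON Schema dict (the field names and the ticket-triage framing are hypothetical):

```python
# Hypothetical extraction schema applying both principles: the free-text
# "reasoning" field comes first so the model reasons before committing,
# and a field that may be absent in the input is nullable, not forced.
TICKET_SCHEMA = {
    "type": "object",
    "properties": {
        # Generated first: the model works through the evidence in prose.
        "reasoning": {"type": "string"},
        # Only then does it commit to a constrained choice.
        "priority": {"type": "string", "enum": ["low", "medium", "high"]},
        # May legitimately be missing from the input, so allow null
        # instead of forcing the model to hallucinate a value.
        "customer_id": {"type": ["string", "null"]},
    },
    "required": ["reasoning", "priority", "customer_id"],
    "additionalProperties": False,
}
```

Property order matters because JSON objects are generated left to right: the order you declare fields is the order the model emits them, so `reasoning` genuinely precedes the `priority` decision.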

Keep schemas flat and focused. Deeply nested schemas (4+ levels) and wide schemas (50+ fields) both degrade quality. The model must maintain more state as it generates, and constraint pressure compounds through nesting. If you need to extract a large number of fields, split into multiple focused extraction calls.

Use descriptive field names. The model uses field names as implicit prompt context. A field named category could mean anything. A field named content_moderation_category with a description narrows the model's interpretation. Enum values also benefit from names that are semantically unambiguous — avoid abbreviations and internal codes.

Avoid unsupported JSON Schema features. Provider implementations don't support the full JSON Schema spec. OpenAI's Strict Mode, for instance, requires all object properties to be listed in required and doesn't support additionalProperties: true. Test your schemas against actual provider behavior, not the specification.
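A cheap pre-flight check can catch the two strict-mode requirements named above before a schema ever hits the API. A sketch covering only that subset (this is not an exhaustive compatibility linter):

```python
def strict_mode_issues(schema: dict, path: str = "$") -> list[str]:
    """Flag schema patterns OpenAI's strict mode rejects.

    Illustrative subset only: every property must appear in `required`,
    and `additionalProperties` must be explicitly false on every object.
    """
    issues = []
    if schema.get("type") == "object":
        props = schema.get("properties", {})
        required = set(schema.get("required", []))
        for name in set(props) - required:
            issues.append(f"{path}.{name} is not listed in 'required'")
        if schema.get("additionalProperties") is not False:
            issues.append(f"{path} must set additionalProperties: false")
        for name, sub in props.items():
            issues.extend(strict_mode_issues(sub, f"{path}.{name}"))
    elif schema.get("type") == "array" and "items" in schema:
        issues.extend(strict_mode_issues(schema["items"], f"{path}[]"))
    return issues
```

Running something like this in CI on every schema is far cheaper than discovering an incompatibility at request time; the real source of truth is still the provider's behavior, per the advice above.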

The Validation Layer You Still Need

Even with constrained decoding guarantees, you should validate. Here's why:

First, provider APIs change. A feature that guarantees 100% compliance today may have edge cases under load, during model updates, or for unusually long outputs. Defense in depth means your application doesn't break silently when a guarantee weakens.

Second, semantic correctness is orthogonal to structural correctness. A well-formed JSON object with plausible field values can still be semantically wrong — dates in the past that should be future, scores outside valid business ranges, contradictory flags set simultaneously. These failures are invisible to constrained decoding.
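These semantic checks are the kind of thing the validation layer exists for. A minimal sketch (the field names `start_date`, `end_date`, `confidence`, `is_refund`, and `amount` are hypothetical stand-ins for your own schema):

```python
from datetime import date


def semantic_errors(obj: dict) -> list[str]:
    """Cross-field checks a schema-valid object can still fail:
    impossible date ordering, out-of-range scores, contradictory flags."""
    errors = []
    if date.fromisoformat(obj["start_date"]) >= date.fromisoformat(obj["end_date"]):
        errors.append("start_date must be before end_date")
    if not 0.0 <= obj["confidence"] <= 1.0:
        errors.append("confidence must be within [0, 1]")
    if obj["is_refund"] and obj["amount"] > 0:
        errors.append("refund records must not have a positive amount")
    return errors
```

Each returned message doubles as the error context for a retry loop and as a precise observability signal when logged.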

Third, validation gives you observability. When validation fails, you know exactly which field failed and why. Without it, you're waiting for downstream errors to surface the problem — often much later in a pipeline where the root cause is hard to trace.

A practical production pattern:

  • Use provider-native structured output (constrained decoding) as the first line of defense for schema compliance
  • Use Pydantic/Zod validators for semantic constraints and cross-field validation
  • Add a retry loop with error context for cases where the first attempt fails
  • Log all validation failures with input context for monitoring and prompt improvement

Choosing the Right Approach

The choice isn't binary. Most production systems layer all three mechanisms:

| Scenario | Recommended Approach |
| --- | --- |
| Data extraction pipelines | Constrained decoding + validation |
| Agent tool calls | Constrained decoding (schema must match tool signature exactly) |
| Classification with confidence scoring | Constrained decoding + reasoning field first |
| Complex cross-field validation | Validation with retry loop |
| Exploratory or open-ended extraction | JSON mode + post-processing |

The one approach to avoid in production is relying solely on prompting. "Always respond in valid JSON" produces good results in demos and bad results under distribution shift — new input patterns the model hasn't seen, long inputs that trigger different behavior, or minor model updates that change output format defaults.

Monitoring Structured Output in Production

Structured output failures are deceptive because they often don't surface as errors. The model returns a 200 with valid JSON, validation passes, and a subtly wrong value propagates into your system. The failures that matter most are hard to detect without deliberate instrumentation.

Metrics worth tracking:

  • Validation failure rate per schema field (which fields fail most often)
  • Retry rate (what percentage of calls require a second attempt)
  • Schema coverage rate (what percentage of optional fields are populated vs. null — sudden changes signal input drift)
  • Constrained vs. unconstrained token distributions (available if you control the serving stack)

Schema coverage in particular is an underused signal. If a field that's usually populated starts returning null more often, something upstream changed — either the input content stopped containing that information, or the model started hedging differently. Either is worth investigating before it becomes a data quality incident.
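Computing this signal takes only a few lines once validation results are logged. A sketch comparing a recent window against a baseline window (the field name and the 15-point threshold are arbitrary choices, not recommendations):

```python
def null_rate(records: list[dict], field: str) -> float:
    """Fraction of logged outputs where an optional field came back null."""
    values = [r.get(field) for r in records]
    return sum(v is None for v in values) / len(values)


def coverage_drift(baseline: list[dict], recent: list[dict],
                   field: str, threshold: float = 0.15) -> bool:
    """Flag when a field's null rate jumps versus a baseline window:
    a cheap proxy for input drift or changed model hedging."""
    return null_rate(recent, field) - null_rate(baseline, field) > threshold
```

Run per optional field over a sliding window; a single firing is worth a look, and several fields firing together usually means the inputs themselves changed.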

The Bottom Line

Structured output is now table stakes for production LLM applications, and the ecosystem has matured enough that there's no excuse for relying on regex-parsing model responses. Native constrained decoding eliminates whole categories of failure. But it introduces its own failure mode — silent quality degradation — that requires careful schema design and a validation layer to catch.

The engineers who get this right treat structured output as a contract between the LLM and the rest of the system. Define the contract precisely, enforce it at multiple layers, monitor for drift, and build retry logic for the cases where the first attempt doesn't hold up. The goal isn't just valid JSON — it's JSON you can trust.
