Structured Generation: Making LLM Output Reliable in Production

· 10 min read
Tian Pan
Software Engineer

There is a silent bug lurking in most LLM-powered applications. It doesn't show up in unit tests. It doesn't trigger on the first thousand requests. It waits until a user types something with a quote mark in it, or until the model decides — for no apparent reason — to wrap its JSON response in a markdown code block, or to return the field "count" as the string "three" instead of the integer 3. Then your production pipeline crashes.

The gap between "LLMs are text generators" and "my application needs structured data" is where most reliability problems live. Bridging that gap is not a prompt engineering problem. It's an infrastructure problem, and in 2026 we finally have the tools to solve it correctly.

Why Regex and JSON.parse() Are Not Enough

The naive approach looks like this: ask the model to return JSON, then JSON.parse() the response. When that breaks, add a regex to strip markdown code blocks. When that breaks, add more regex. When that breaks, add a try/catch and retry.
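In Python, the treadmill ends up looking something like this (a hypothetical `extract_json` helper, shown only to illustrate the pattern being criticized — each branch was typically added after a separate production failure):

```python
import json
import re

def extract_json(raw: str) -> dict:
    """Naive LLM-output extraction, accreted one patch at a time."""
    # Attempt 1: hope the response is bare JSON.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # Attempt 2: strip a markdown code fence the model "helpfully" added.
    fenced = re.search(r"```(?:json)?\s*(.*?)```", raw, re.DOTALL)
    if fenced:
        return json.loads(fenced.group(1))
    # Attempt 3: grab everything between the first "{" and the last "}".
    start, end = raw.find("{"), raw.rfind("}")
    if start != -1 and end > start:
        return json.loads(raw[start : end + 1])
    raise ValueError("could not extract JSON from model output")
```

Each branch papers over one observed failure mode; none of them anticipates the next one.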

This approach makes six assumptions that LLMs violate regularly:

  1. The output is valid JSON
  2. All required fields are present
  3. Field types match what you declared
  4. Values fall within acceptable ranges
  5. No unexpected fields appear
  6. The format is consistent across inputs

In practice, models wrap JSON in markdown fences (```json), add preamble text ("Here is the structured data you requested:"), return JSONL instead of JSON, silently omit optional fields that your downstream code treats as required, or produce type mismatches — a timestamp as a human-readable string when you needed an ISO 8601 format, or a numeric count as a word.

The "works for 10,000 requests, fails on 10,001" failure mode is real. User inputs with unescaped quotes, apostrophes, or special Unicode characters have a way of breaking naive extraction at the worst possible time. The more inputs your system processes, the more certain it becomes that you'll encounter an edge case that your regex doesn't handle.

Patching these failures one by one is a treadmill. The right answer is to treat structured output as an infrastructure concern from the start.

Three Levels of Output Control

Not all structured output techniques offer equal guarantees. It helps to think in terms of reliability tiers.

Level 1: Prompt Engineering

You describe the format you want in the system prompt. "Return a JSON object with fields name (string), score (integer 0–100), and reasoning (string)." This works most of the time — roughly 80–95% of requests come back in the right shape. The failure mode is silent: when the model deviates, your application either crashes or silently corrupts data.

This tier is fine for prototyping or for low-stakes tasks where the occasional malformed response can be discarded. It is not acceptable for production pipelines where reliability matters.

Level 2: Function Calling / Tool Use

Most LLM providers now expose schema-based function calling. You define a JSON Schema and the model returns an object that conforms to it — or at least tries to. This gets you to 95–99% reliability. The remaining failures are semantic rather than structural: a field of type string will always be a string, but it might be the wrong string. An integer field will be an integer, but it might be out of your expected range.

Function calling also adds some overhead (tool definitions consume prompt tokens) and requires your code to handle tool use patterns rather than treating the LLM call as a simple text completion. The tradeoff is usually worth it for production use.
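For concreteness, here is what a schema-based tool definition looks like in the OpenAI-style `tools` format; the function name and fields are illustrative, not from any particular codebase:

```python
# An OpenAI-style tool definition. The model is asked to "call" this
# function, and the provider constrains the arguments to the JSON Schema.
score_tool = {
    "type": "function",
    "function": {
        "name": "record_score",
        "description": "Record a scored evaluation of the input text.",
        "parameters": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "score": {"type": "integer", "minimum": 0, "maximum": 100},
                "reasoning": {"type": "string"},
            },
            "required": ["name", "score", "reasoning"],
            "additionalProperties": False,
        },
    },
}
```

The guarantee covers structure: `score` will be an integer with the declared fields present. Nothing here prevents it from being the wrong integer, which is exactly the semantic gap this level leaves open.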

Level 3: Native Structured Output / Constrained Decoding

This is where you get mathematical guarantees instead of statistical ones. The mechanism is constrained decoding: at each token generation step, the model's next-token probability distribution is masked so that only tokens that keep the output on a valid path through your schema remain selectable. Invalid tokens get zero probability — the model literally cannot generate them.

When a JSON opening brace is required, only { and whitespace tokens remain valid. When you're inside a field that declares type: integer, non-numeric characters are masked. The output cannot violate the schema because the schema is enforced at the generation level.
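A toy sketch of the masking step makes the mechanism concrete. Real engines operate on logit tensors over a full tokenizer vocabulary; this miniature version uses a plain dict, but the operation is the same:

```python
import math

def mask_logits(logits: dict[str, float], valid_tokens: set[str]) -> dict[str, float]:
    """Send every grammar-forbidden token to -inf before sampling."""
    return {
        tok: (score if tok in valid_tokens else -math.inf)
        for tok, score in logits.items()
    }

# Suppose the schema requires an integer next: only digit tokens survive.
logits = {"3": 1.2, "three": 2.5, "{": 0.1, "7": 0.8}
masked = mask_logits(logits, valid_tokens={"3", "7"})
# After softmax, "three" and "{" carry probability zero, so the model
# cannot emit them, even though "three" had the highest raw score.
```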

Every major provider reached this capability between 2024 and 2026. OpenAI's Strict Mode (released August 2024) uses a finite state machine compiled from your JSON Schema. Gemini followed with equivalent capability. Anthropic implemented native structured output in late 2025. Self-hosted inference stacks can use libraries like Outlines or the llguidance engine (which OpenAI later credited as foundational to their implementation).

Constrained Decoding: What It Actually Does

The underlying mechanism is worth understanding because it shapes what you can and cannot express in your schemas.

When you provide a JSON Schema, the inference engine compiles it into a finite state machine. Each state represents a position in the grammar — inside an object, expecting a field name, inside a string value, etc. At each token generation step, the engine identifies which states are reachable from the current state and masks the probability distribution to allow only tokens corresponding to valid transitions.

For simple, flat schemas this adds negligible latency. For deeply nested schemas with many optional fields, the state machine can be large, and compilation overhead can matter. Practical recommendations: keep schemas as flat as reasonably possible, avoid deeply nested optional arrays, and pre-compile schemas rather than compiling them on each request.

There's an important limitation: constrained decoding as described above is only available when you have access to the model's token probability distribution. For self-hosted models (via Transformers, llama.cpp, vLLM), this is always available. For API-hosted models (OpenAI, Anthropic, Google), you're dependent on the provider exposing structured output as a first-class feature. The provider handles the constrained decoding on their end — you just specify the schema. You cannot apply token-level constraints to provider APIs yourself.

Grammar-based approaches (using context-free grammars in GBNF format) are more expressive than regex-based approaches and handle recursive structures like nested JSON or variable-length arrays more cleanly. They're the preferred mechanism for self-hosted inference where you control the serving stack.
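As a small illustration of the format, here is a minimal GBNF grammar that constrains output to a flat object with a single integer field. This is a sketch written for this article, not taken from any project:

```
# A flat object {"count": <integer>} in llama.cpp's GBNF notation.
root    ::= "{" ws "\"count\"" ws ":" ws integer ws "}"
integer ::= [0-9]+
ws      ::= [ \t\n]*
```

Because GBNF rules can reference each other recursively, the same notation scales to nested objects and variable-length arrays, which is where regex-based constraints break down.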

The Validation Sandwich

Native structured output guarantees that your output is structurally valid — all required fields present, all types correct. It does not guarantee that the values are semantically valid. A timestamp field might be structurally valid as a string but semantically wrong (say, "yesterday" when you need ISO 8601). An email field might be structurally a string but not a valid email address.

For production systems, add a validation layer after the structured output layer. The pattern looks like this:

  1. Schema enforcement at generation time — use native structured output or function calling to guarantee structural validity
  2. Semantic validation at the application boundary — run the structured output through Pydantic (Python) or Zod (TypeScript) models that enforce business rules: valid ranges, valid enum values, cross-field constraints
  3. Error handling that distinguishes structural from semantic failures — structural failures (rare with native structured output) typically call for a retry with a clearer prompt; semantic failures call for prompt refinement or validation feedback loops
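In Python the semantic layer is usually a Pydantic model; the following is a dependency-free sketch of the same idea so the shape of the check is explicit. The `Review` type and its business rules are illustrative assumptions, not from the libraries named above:

```python
import json
from dataclasses import dataclass

@dataclass
class Review:
    name: str
    score: int
    reasoning: str

def parse_review(raw: str) -> Review:
    """Structural parse first, then business rules the schema can't express."""
    data = json.loads(raw)          # layer 1: structural validity
    review = Review(**data)         # required fields must all be present
    # Layer 2: semantic validation (ranges, non-emptiness, cross-field rules).
    if not isinstance(review.score, int) or not 0 <= review.score <= 100:
        raise ValueError(f"score out of range: {review.score!r}")
    if not review.reasoning.strip():
        raise ValueError("reasoning must be non-empty")
    return review
```

With Pydantic or Zod, the range and emptiness checks become declarative field constraints instead of hand-written `if` statements, but the layering is identical.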

Libraries like Instructor (Python) make this pattern ergonomic. It wraps your LLM provider with Pydantic validation and handles retry logic automatically: if validation fails, it passes the validation error back to the model as context and retries. This turns what would be a crash or a silent corruption into a self-correcting feedback loop.

The key insight is to treat LLM output the same way you'd treat input from an untrusted external API. You wouldn't call a third-party REST endpoint and blindly trust its response without validation. Apply the same discipline to model output.

Choosing Your Approach for Different Scenarios

The right tool depends on your deployment context:

API-hosted models (OpenAI, Anthropic, Gemini): Use native structured output with Strict Mode where available. Layer Pydantic/Zod validation on top. Use Instructor or equivalent to handle retry logic. This gets you to effectively 100% reliability for structural compliance plus strong semantic validation.

Self-hosted models (vLLM, llama.cpp, Ollama): Use Outlines or llguidance for token-level constrained decoding. You get the same mathematical guarantees as native structured output, with the ability to enforce arbitrary context-free grammars — useful for domain-specific formats beyond JSON.

Mixed or multi-provider setups: Libraries like BAML take an error-tolerant parsing approach — they use a Rust-based parser that handles malformed JSON gracefully and works uniformly across providers. The tradeoff is that error tolerance doesn't protect you from semantic violations the way constrained decoding does.

High-throughput pipelines: Pre-compile schemas and cache them. Schema compilation (especially for complex grammars) has non-trivial latency. In a pipeline processing thousands of requests per minute, paying compilation overhead on every request is expensive. Most inference frameworks support schema caching.
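Caching the compiled artifact can be a one-liner with `functools.lru_cache`. In this sketch, `compiled_grammar` stands in for whatever expensive compilation step your inference framework performs; the counter just demonstrates that the cost is paid once per distinct schema:

```python
import functools
import json

compile_count = 0

@functools.lru_cache(maxsize=None)
def compiled_grammar(schema_json: str):
    """lru_cache needs a hashable key, so pass the schema serialized
    with sorted keys rather than as a dict."""
    global compile_count
    compile_count += 1
    # Stand-in for the real (expensive) grammar/FSM compilation.
    return f"<fsm for {schema_json}>"

schema = {"type": "object", "properties": {"count": {"type": "integer"}}}
key = json.dumps(schema, sort_keys=True)
for _ in range(1000):                 # a hot request loop
    grammar = compiled_grammar(key)   # compiled exactly once
```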

The Schema Ordering Problem

One subtlety that catches teams by surprise: models generate tokens left to right, and they don't know your constraints in advance. This means the order of fields in your schema affects output quality, not just output structure.

When a model reaches a constrained field late in a long JSON object, it has already committed to its upstream reasoning. If the constrained field forces a value that conflicts with what the model has already said, the result can be internally inconsistent — structurally valid but semantically incoherent.

The practical rule: put fields that anchor the reasoning first. If your schema has an intent field that determines what other fields should contain, put intent before the fields that depend on it. This mirrors how the model naturally reasons — it forces the model to decide the high-level answer before filling in the details.
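Concretely, generation typically walks the properties in the order the schema lists them, so the ordering rule is expressed in the schema itself. The field names here are illustrative:

```python
# "intent" first: the model commits to the high-level decision before
# it generates the fields that depend on that decision.
good_order = {
    "type": "object",
    "properties": {
        "intent": {"type": "string", "enum": ["refund", "question", "complaint"]},
        "priority": {"type": "integer", "minimum": 1, "maximum": 5},
        "response_draft": {"type": "string"},
    },
    "required": ["intent", "priority", "response_draft"],
}
# The same schema with "intent" listed last is structurally identical,
# but the model would have to draft a response before deciding what
# kind of message it is handling, inviting internally inconsistent output.
```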

Similarly, always document your schema in your system prompt. The model generates predictions solely from the tokens it has seen; if it doesn't know what schema to expect, constrained decoding will force it into valid tokens that may not reflect its best answer. Show the model the shape of the expected output before asking it to produce it.

What This Means for Teams Building on LLMs

Unstructured LLM output is a production liability. The engineering discipline around structured generation has matured significantly over the past two years, and there's no longer a good reason to handle raw model responses with ad-hoc parsing in any serious application.

The investment is modest: define your schemas once, pick the right enforcement mechanism for your provider and deployment context, add a validation layer, and handle retries properly. In exchange, you eliminate an entire class of production failures — the silent data corruptions, the unexpected crashes, the "we need to add another regex" tickets that slowly accumulate in every team that skips this step.

The underlying model doesn't change. Your prompts don't change. What changes is the contract between your application and the model — and enforcing that contract at the infrastructure level means you stop fighting format bugs and start focusing on what the model actually knows.
