
Structured Output Is Not Structured Thinking: The Semantic Validation Layer Most Teams Skip

· 11 min read
Tian Pan
Software Engineer

A medical scheduling system receives a valid JSON object from its LLM extraction layer. The schema passes. The types check out. The required fields are present. Then a downstream job tries to book an appointment and finds that the end_time is three hours before the start_time. Both fields are correctly formatted ISO timestamps. Neither violates the schema. The booking silently fails, and the patient gets no appointment — no error surfaced, no alert fired.

This is what it looks like when schema validation is mistaken for correctness validation. The model followed the format. It did not follow the logic.

What Constrained Generation Actually Guarantees

Modern structured output APIs — JSON mode, tool-calling schemas, grammar-constrained generation — make a specific promise: the output will conform to the structure you specified. Token masks or finite-state machine constraints ensure that the model can only generate tokens that produce valid JSON matching your schema. This is a syntactic guarantee. It is not a semantic one.

When you specify a schema with start_date and end_date as ISO 8601 strings, the constraint system guarantees you get two strings in that format. It cannot guarantee that end_date is after start_date. When you define a confidence field as a float, the constraint ensures you get a number. It cannot guarantee that number falls in the zero-to-one range you expect. When your extraction schema includes a status field with an enum of ["pending", "active", "completed"], the constraint ensures you get one of those three values. It cannot guarantee the value is semantically appropriate for the extracted record.

The constraint system operates at the token level, enforcing grammar rules. Your application operates at the meaning level, enforcing domain rules. These are different problems requiring different solutions.
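To make the distinction concrete, here is a minimal standard-library sketch. The `schema_valid` and `semantically_valid` helpers are illustrative names standing in for the two validation levels, and the payload mirrors the scheduling example above:

```python
import json
from datetime import datetime

# A response any structural check would accept: both fields are present,
# both are strings, both parse as ISO 8601 timestamps.
payload = json.loads(
    '{"start_time": "2024-03-20T14:00:00", "end_time": "2024-03-20T11:00:00"}'
)

def schema_valid(record: dict) -> bool:
    """Structural check only: required fields exist and parse as timestamps."""
    try:
        for field in ("start_time", "end_time"):
            datetime.fromisoformat(record[field])
        return True
    except (KeyError, ValueError):
        return False

def semantically_valid(record: dict) -> bool:
    """Domain check: the interval must run forward in time."""
    return datetime.fromisoformat(record["end_time"]) > datetime.fromisoformat(
        record["start_time"]
    )

print(schema_valid(payload))        # the constraint system is satisfied
print(semantically_valid(payload))  # the appointment is impossible
```

The first check is what constrained generation gives you for free; the second is the layer this article argues you have to build yourself.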

The Taxonomy of Schema-Valid Failures

Production systems see a predictable set of semantic failures once teams start looking for them.

Temporal contradictions are the most common. An agent extracts event records where start_time and end_time are both valid timestamps but in the wrong order. A contract parser returns an effective date in 2025 and an expiration date in 2024. A scheduling system produces duration_minutes: 90 alongside timestamps that span 45 minutes. Each field is individually valid; the combination is nonsensical.

Out-of-range numeric outputs appear regularly in extraction and scoring pipelines. Confidence scores come back as 1.7. Percentages exceed 100. Priority rankings intended for a 1–5 scale return as 8 or 0. Monetary amounts arrive as negative numbers in contexts where only positive values are valid. The schema type is satisfied — these are all numbers — but the application breaks when downstream code assumes valid ranges.

Mutually exclusive field combinations show up in classification tasks. A document gets tagged as both requires_human_review: true and auto_approved: true. An order status reads "completed" while fulfillment_date is null. A user account is marked verified: true with verification_token still populated — a field that should be cleared on verification. Individually, each value looks fine. Together, they represent an impossible state.

Silent type coercion failures matter most at tool boundaries. An LLM returns customer_id: 12345 as a number when the downstream tool signature expects a string. JSON parsers often accept this silently; the tool receives the wrong type, behaves unexpectedly, and the calling system has no indication that anything went wrong. The schema said customer_id should exist, and it does; it said nothing about the exact type contract the tool requires.

Hallucinated field values that pass structural validation cause the subtlest failures. A name extraction returns "N/A" when no name was found — technically a string, schema-valid, but treated as a literal name downstream. An address field returns an empty string instead of null, which the schema permits but the application rejects. A date field returns "9999-12-31" as a sentinel for "no date" — a valid ISO timestamp that downstream date arithmetic converts into nonsense.
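One way to see how little a structural check asserts: the toy `type_check` below (a stand-in for what schema validation effectively does) accepts every failure class in this taxonomy. The records are illustrative, with field names mirroring the examples above:

```python
# Each record satisfies a plausible schema, yet encodes an impossible
# or misleading state.
records = [
    {"confidence": 1.7},                                    # out of [0, 1] range
    {"requires_human_review": True, "auto_approved": True},  # mutually exclusive
    {"customer_id": 12345},                                  # tool expects a str
    {"name": "N/A"},                                         # sentinel posing as data
    {"expiration": "9999-12-31"},                            # valid ISO, nonsense date
]

def type_check(record: dict) -> bool:
    """Stand-in for schema validation: types are right, fields are present."""
    return all(isinstance(v, (bool, int, float, str)) for v in record.values())

print(all(type_check(r) for r in records))  # every record passes
```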

Why Teams Don't Build the Semantic Layer

Most teams ship schema validation and stop there. The reasons are understandable: once you get structured output APIs working and see that JSON is no longer failing to parse, it feels like the problem is solved. The schema catches type errors. The LLM is "following instructions." What could go wrong?

The answer is that schema validation and semantic validation solve different problems, and the schema problem is the easier one to see. Schema failures produce immediate, hard errors — JSON parse exceptions, type coercion failures, missing required field exceptions. These are loud and they happen in development. Semantic failures are quieter. The appointment that doesn't book. The order that enters an impossible state. The dashboard metric that drifts wrong over days. These surface in production, often attributed to something else.

There's also a mental model problem. Teams building with structured output APIs tend to think of the schema as a contract with the model. If the model follows the contract, the output is correct. But the schema is only a partial contract — it describes structure, not meaning. The complete contract includes business rules, domain constraints, cross-field invariants, and application-level semantics that schemas can't express.

Building the Semantic Validation Layer

The validation layer sits between the model's output and your application logic. Its job is to catch everything that schema validation cannot.

Layer 1: Schema validation. This is what you already have. It runs immediately on the raw model output, catches structural failures, and is cheap — pure in-memory computation with no API calls. It should fail fast and hard on structural issues before anything else runs.

Layer 2: Domain constraint validation. This is rule-based validation of field values and cross-field relationships. Write these as explicit validators against your output model. Date range checks. Numeric range assertions. Mutual exclusivity rules. Status transition validity. These run synchronously on the output object and fail with specific error messages that identify exactly which constraint was violated and why.

Pydantic validators are the natural home for this in Python applications. A validator on an output model can inspect all fields simultaneously, not just the field being typed. When you define a check_date_range validator that fires on the end_date field, it can reference start_date from the same model and raise a ValueError with the message "end_date must be after start_date" if the constraint fails. That error message is the key — it becomes the feedback that enables recovery.
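A sketch of that validator, assuming Pydantic v2 (`ContractRecord` is an illustrative model name; `check_date_range` reads the already-validated `start_date` through `ValidationInfo.data`):

```python
from datetime import date
from pydantic import BaseModel, ValidationInfo, field_validator

class ContractRecord(BaseModel):
    start_date: date
    end_date: date

    @field_validator("end_date")
    @classmethod
    def check_date_range(cls, v: date, info: ValidationInfo) -> date:
        # Fields validate in definition order, so start_date is already
        # available in info.data when end_date is checked.
        start = info.data.get("start_date")
        if start is not None and v <= start:
            raise ValueError("end_date must be after start_date")
        return v
```

Constructing `ContractRecord(start_date="2024-03-20", end_date="2024-01-15")` raises a ValidationError carrying that exact message, which is what the retry loop below the fold needs.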

Layer 3: Business rule validation. Some constraints require external context: database state, current system configuration, live inventory counts, user permissions. These are more expensive and should run only after Layers 1 and 2 pass. Gate the expensive checks behind the cheap ones. A record that fails a date range check doesn't need a database lookup to determine it's invalid.
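The gating can be a few lines. In this sketch, `validate_layers`, `check_dates`, and `check_inventory` are hypothetical names, and the `LOOKUPS` counter stands in for a database round-trip:

```python
LOOKUPS = {"count": 0}  # stand-in counter for external calls performed

def check_dates(record):
    """Cheap Layer 2 check: pure in-memory comparison of ISO date strings."""
    if record.get("end_date", "") <= record.get("start_date", ""):
        return "end_date must be after start_date"

def check_inventory(record):
    """Stand-in for an expensive Layer 3 check (database or service call)."""
    LOOKUPS["count"] += 1
    return None

def validate_layers(record, cheap_checks, expensive_checks):
    """Run cheap checks first; gate expensive checks behind them."""
    errors = [err for check in cheap_checks if (err := check(record))]
    if errors:
        return errors  # no external lookups for an already-invalid record
    return [err for check in expensive_checks if (err := check(record))]
```

A record with a reversed date range comes back with the date error and `LOOKUPS["count"]` untouched; only records that survive the cheap layer cost you I/O.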

Layer 4: Semantic plausibility checks. Some failures require judgment rather than rules — content that is structurally correct and domain-valid but semantically implausible for the input context. A sentiment extraction that marks a clearly negative review as highly positive. A category classifier that assigns a technical document to a clearly unrelated domain. Rule-based checks can't catch these. LLM-as-judge can: a lightweight grading call that takes the input, the output, and your evaluation criteria, and flags implausible results for retry or human review.
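Since the judge is just another model call, the reusable parts are the prompt and the verdict parser. A sketch with `build_judge_prompt` and `parse_verdict` as hypothetical helpers; the actual model client and its API are out of scope here:

```python
def build_judge_prompt(input_text: str, output_json: str, criteria: str) -> str:
    """Assemble the grading prompt for a lightweight judge call."""
    return (
        "You are grading a structured extraction for semantic plausibility.\n\n"
        f"Original input:\n{input_text}\n\n"
        f"Extracted output:\n{output_json}\n\n"
        f"Evaluation criteria: {criteria}\n\n"
        "Reply PASS if the output is plausible for this input, otherwise "
        "reply FAIL followed by a one-line reason."
    )

def parse_verdict(judge_response: str) -> bool:
    """True when the judge's reply begins with PASS."""
    return judge_response.strip().upper().startswith("PASS")
```

A FAIL verdict routes the record to retry or human review rather than into the application.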

The Validate-Retry Loop

Structured validation failures should trigger a recovery loop, not an immediate error. The model generated a constraint violation. Tell it exactly what went wrong and ask it to fix it.

This is where specific error messages pay off. "Validation failed" is not useful feedback. "end_date (2024-01-15) is before start_date (2024-03-20). Generate new dates where end_date falls after start_date" gives the model the exact information it needs to correct the output. One retry resolves the majority of semantic validation failures when the feedback is this specific.

The loop structure is: generate → validate → if failed, retry with error context → validate again → if still failed after N retries, escalate. Frameworks like Instructor implement this loop automatically, feeding validation errors back as context for the next generation attempt. What matters is that the retry includes not just the original task prompt but the specific violation and what correction is needed.

Cap retries at two or three. If the model is consistently violating the same constraint after multiple attempts, the problem is likely in your prompt or schema design, not in a transient generation issue. Unlimited retries against a structural prompt problem will produce unlimited failures.
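The loop described above fits in a few lines. In this sketch, `generate` and `validate` are stand-ins for your model client and validator stack: `generate(prompt)` returns a candidate output, and `validate(candidate)` returns a specific error message or None:

```python
MAX_RETRIES = 2  # persistent violations point at prompt or schema design

def generate_with_validation(generate, validate, task_prompt):
    """Generate, validate, and retry with the violation as feedback."""
    prompt = task_prompt
    for _ in range(1 + MAX_RETRIES):
        candidate = generate(prompt)
        error = validate(candidate)
        if error is None:
            return candidate
        # Retry with the original task plus the exact violation to fix.
        prompt = (
            f"{task_prompt}\n\n"
            f"Your previous output failed validation: {error}\n"
            "Regenerate the output and fix exactly this issue."
        )
    raise ValueError(f"output still invalid after {MAX_RETRIES} retries: {error}")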

Schema Design as a Semantic Aid

The schema itself can do more work to prevent semantic failures before they happen. A few patterns help consistently:

Use optional fields deliberately. If a field should be null when data is absent, make it Optional[str] and say so explicitly. Models given a required string field with no valid value to fill in will hallucinate one. Give them the null path.

Add description fields to your schema properties. These descriptions become part of the prompt context when you use schema-driven tool calling. A field described as "confidence score from 0.0 to 1.0, where 1.0 is maximum confidence" is less likely to produce out-of-range values than a field described only as "confidence score."

Prefer enums over strings for categorical outputs. An enum of ["active", "inactive", "pending"] eliminates an entire class of semantic failure — the model cannot generate a value that isn't in the set. Where business logic further constrains which enum values are valid in which contexts, encode that logic in your validator rather than relying on the model to infer it.

Add explicit sentinel handling. If your application needs to distinguish "not found" from empty, add a boolean not_found field rather than expecting the model to leave a string field empty or null in the right circumstances. Explicit beats implicit at every layer.
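The four patterns combine naturally in one output model. A Pydantic sketch, assuming v2 (`ExtractedCustomer` and its fields are illustrative; `ge`/`le` push the range constraint into the schema itself):

```python
from enum import Enum
from typing import Optional
from pydantic import BaseModel, Field

class Status(str, Enum):
    active = "active"
    inactive = "inactive"
    pending = "pending"

class ExtractedCustomer(BaseModel):
    # Optional with an explicit null path: the model is not forced to invent a name.
    name: Optional[str] = Field(
        default=None, description="Customer full name, or null if not present"
    )
    # Description doubles as prompt context; ge/le bound the range.
    confidence: float = Field(
        description="Confidence score from 0.0 to 1.0, where 1.0 is maximum confidence",
        ge=0.0,
        le=1.0,
    )
    # Enum closes off invented categories entirely.
    status: Status
    # Explicit sentinel: distinguishes "not found" from a legitimate empty value.
    not_found: bool = False
```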

The Failure Mode When You Skip This Layer

The systemic failure mode from skipping semantic validation isn't a single catastrophic error. It's slow contamination. Schema-valid but semantically invalid records accumulate in your database. Reports run on that data produce wrong numbers. Downstream logic that assumed valid ranges hits edge cases no one anticipated. Users discover incorrect outputs and lose trust in the system. By the time the cause is identified, there's a backfill problem.

The LLM portion of your pipeline gets blamed for unreliability. The actual cause is that your application accepted invalid outputs because they were syntactically valid. The model's error was in content; your system's error was in accepting content it should have rejected.

What the Validation Layer Is Not

Semantic validation is not a substitute for evals. Evals measure whether your system produces correct outputs across a representative distribution. Validation catches specific constraint violations at runtime. Both are necessary; they operate at different layers of the problem.

Semantic validation is also not a substitute for good prompt engineering. If your model consistently produces date range violations, the first question is whether your prompt makes the constraint explicit. Validation catches what prompt engineering misses, but it shouldn't be your first line of defense against predictable failures.

The validation layer is a production reliability mechanism. Its job is to ensure that the application layer only receives semantically valid inputs, that invalid inputs are corrected or escalated rather than silently accepted, and that when something goes wrong, the error is specific enough to drive a fix.

The Minimum Viable Version

If you have no semantic validation layer today, start here: enumerate the top five constraint violations your application would fail on if they appeared in structured output, and write validators for those five. Date ordering for any system with time ranges. Numeric ranges for any system with scores or percentages. Status field combinations for any system with state machines. These five validators will catch the majority of semantically invalid outputs your system is currently silently accepting.
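A starter version of those validators in plain Python (function names and field keys are illustrative; each returns a specific message, or None when the constraint holds):

```python
from datetime import datetime

def v_date_order(record):
    """Date ordering for any system with time ranges."""
    if "start_time" in record and "end_time" in record:
        if datetime.fromisoformat(record["end_time"]) <= datetime.fromisoformat(
            record["start_time"]
        ):
            return "end_time must be after start_time"

def v_score_range(record):
    """Numeric ranges for scores and percentages."""
    confidence = record.get("confidence")
    if confidence is not None and not 0.0 <= confidence <= 1.0:
        return f"confidence {confidence} outside [0.0, 1.0]"

def v_status_combo(record):
    """Status field combinations for state machines."""
    if record.get("status") == "completed" and record.get("fulfillment_date") is None:
        return "status 'completed' requires a fulfillment_date"

VALIDATORS = [v_date_order, v_score_range, v_status_combo]

def semantic_errors(record):
    """Return every violated constraint, with a specific message for each."""
    return [msg for check in VALIDATORS if (msg := check(record)) is not None]
```

Each message is specific enough to feed straight back into a retry prompt, which is where the layer starts paying for itself.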

Schema compliance is a prerequisite, not a finish line. The finish line is outputs that your application can trust to mean what they say — not just in the format your schema describes, but in the domain sense your business logic requires.
