JSON Mode Is a Dialect, Not a Standard: The Silent Breakage in Your Fallback Path
The first time I watched a fallback router cause a worse incident than the outage it was trying to mitigate, the postmortem document had a header that read: "Primary degraded for 11 minutes. Fallback degraded our parser for 6 days." Nobody had written code wrong. Nobody had skipped the schema review. The integration tests against the secondary provider had been green when the fallback was wired up, eighteen months earlier. What had happened in between was that one of the two providers had quietly tightened its enum coercion policy, and the contract our downstream parsers had been written against — a contract we believed was "JSON Schema, more or less" — had drifted from a shared standard into two slightly incompatible dialects.
This is the failure mode I keep seeing, and it keeps surprising teams that should know better. "JSON mode" sounds like a feature you turn on. It is not. It is a contract you maintain — separately, against every provider you might route to — and the contract drifts every quarter as vendors evolve their structured-output stacks. The "drop-in replacement" your provider docs gestured at when you signed the contract is, in production, a maintained translation layer whose absence converts your fallback path into a paper compliance artifact: present in the architecture diagram, broken on the day you needed it.
The contract you didn't realize you were signing
Until about 2024, "JSON mode" across providers really did mean roughly the same thing: the model would emit syntactically valid JSON, no schema enforcement, you parsed it and prayed. That world is gone. As of 2026, every major provider has shipped a stricter, schema-aware mode under different names — OpenAI calls it Structured Outputs (with the older json_object mode now explicitly labeled "legacy"), Anthropic shipped Structured Outputs as a public beta on Claude Sonnet 4.5 and Opus 4.1 in November 2025, Google built JSON Schema support into the Gemini API on top of its earlier OpenAPI-3.0-flavored Schema object, and xAI, Mistral, Cohere, and Bedrock each have their own variant. The marketing is identical: "guaranteed JSON, conforms to your schema." The semantics are not.
The dialect differences fall into roughly four buckets, and each one bites differently.
Required-vs-optional defaults. OpenAI requires every field at every level of nesting to be declared required; optionality is expressed as a union with null. Gemini is the opposite: every field is required by default, and you express optionality through a separate optionalProperties array (in the Firebase AI Logic SDK) or by manually toggling the nullable flag in Vertex. The same schema shape, ported between the two, means two completely different things — and the "ported" version compiles and runs without an error, so you only notice the divergence when downstream code starts seeing fields that used to always be present arrive as null, or vice versa.
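To make the divergence concrete, here is the same two-field object sketched in both dialects. The field names are invented and the exact Gemini keywords vary by SDK and version, so treat this as a shape comparison rather than copy-paste config:
```python
# Minimal sketch: one logical object ("title" always present, "due_date" optional),
# expressed the way each dialect wants it. Keyword details are illustrative.

# OpenAI strict mode: every property listed in "required";
# optionality is expressed as a union with null.
openai_strict_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "due_date": {"type": ["string", "null"]},  # "optional" means nullable here
    },
    "required": ["title", "due_date"],
    "additionalProperties": False,
}

# Gemini-style (Vertex-flavored) schema: per the dialect described above,
# fields are required by default and optionality is a per-field nullable flag.
gemini_style_schema = {
    "type": "OBJECT",
    "properties": {
        "title": {"type": "STRING"},
        "due_date": {"type": "STRING", "nullable": True},
    },
    # no additionalProperties keyword; extra keys are not rejected up front
}
```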
additionalProperties handling. OpenAI demands additionalProperties: false at every object level under strict mode; if you forget, the request is rejected at the API boundary. Anthropic's structured-output mode is more permissive — it does not require the flag, and a schema with looser bounds will run, but the model is allowed to emit extra keys your parser never accounted for. The team that designs its schema for OpenAI and tests it against Anthropic in CI will see green builds and a slowly growing rate of "unexpected key in payload" warnings in production logs that nobody triages because, individually, they aren't errors.
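One defensive habit is to normalize the schema in your gateway before it ever leaves, rather than trusting whoever wrote it to remember the flag. A minimal sketch, assuming plain JSON-Schema dicts:
```python
def close_all_objects(schema: dict) -> dict:
    """Recursively set additionalProperties: false on every object node.
    OpenAI's strict mode requires it; Anthropic merely tolerates it."""
    if not isinstance(schema, dict):
        return schema
    out = dict(schema)
    if out.get("type") == "object" or "properties" in out:
        # setdefault: only fill it in where the author forgot it
        out.setdefault("additionalProperties", False)
    for key in ("properties", "$defs", "definitions"):
        if key in out:
            out[key] = {k: close_all_objects(v) for k, v in out[key].items()}
    for key in ("items", "anyOf", "allOf", "oneOf"):
        if key in out:
            v = out[key]
            out[key] = [close_all_objects(x) for x in v] if isinstance(v, list) else close_all_objects(v)
    return out
```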
Subset-of-JSON-Schema coverage. None of the providers implements the full JSON Schema specification. OpenAI's strict mode silently drops minLength, maxLength, minItems, maxItems, and complex regex patterns — your schema may declare them, the API will accept the schema, and the model will produce output that violates them with no error. Gemini implements a different subset, with different gaps. Bedrock has yet another. The "send the same schema to all three" pattern almost works, until your validation logic on the consumer side starts asserting properties the producer was never enforcing.
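The cheap mitigation is to re-validate on the consumer side against the full schema, since the producer was only ever enforcing a subset. A sketch using the jsonschema package; the function name is mine:
```python
import json
from jsonschema import Draft202012Validator  # pip install jsonschema

def validate_full_schema(raw_response: str, schema: dict) -> list[str]:
    """Re-check the provider's output against the *full* schema, including
    keywords (minLength, minItems, pattern, ...) that the provider may have
    accepted in the request but never enforced during generation."""
    instance = json.loads(raw_response)
    validator = Draft202012Validator(schema)
    return [f"{'/'.join(map(str, e.path))}: {e.message}"
            for e in validator.iter_errors(instance)]
```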
Failure-on-violation behavior. What happens when the model can't satisfy the schema is the most consequential difference and the least documented. OpenAI's strict mode aborts generation. Some providers fall back to producing degraded prose. Some silently coerce — turning an enum value the model wanted to emit ("urgent_billing") into a near-neighbor that's in the allowed set ("urgent"), with no signal to the caller that coercion happened. Your downstream code can't tell the difference between a successful structured response and a coerced one, and that ambiguity is where the silent corruption lives.
The failover trap
If you only ever talk to one provider, the dialect problem is annoying but bounded — you learn the dialect, you ship the workaround, you move on. The trap snaps shut when you wire up a fallback router. That is the precise moment your single-provider integration becomes a multi-provider integration, and almost every team I've seen wire one up has done it as a one-day project and then never revisited it.
The architecture that breaks looks like this. A request comes in. Your gateway prepares the tool definitions and schemas once, normalizes them for the primary model, and passes the prepared payload into a runner. If the primary fails — rate-limit, timeout, 5xx, content-policy refusal, doesn't matter — the runner re-issues the same prepared payload against the secondary provider. The schema preparation is upstream of the failover; the failover is downstream of the preparation. So the schema that was carefully normalized for OpenAI's strict-mode constraints (no minLength, additionalProperties: false everywhere, optional fields as nullable unions) hits a Gemini endpoint that interprets nullable as "this field will sometimes literally be the JSON null," or hits an Anthropic endpoint that doesn't enforce additionalProperties: false at all and starts emitting a key your downstream parser doesn't recognize.
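In code, the trap is a one-line difference in where the dialect translation lives. Every name below is hypothetical; the point is the shape, not the API:
```python
from typing import Any, Callable

class ProviderError(Exception): ...
class AllProvidersFailed(Exception): ...

# BROKEN shape: the schema is normalized once, upstream of the failover decision.
def handle_broken(schema: dict, prepare: Callable, call: Callable) -> Any:
    payload = prepare(schema, dialect="openai-strict")   # one dialect, frozen
    try:
        return call("primary", payload)
    except ProviderError:
        return call("secondary", payload)                 # same payload, wrong dialect

# FIXED shape: dialect translation happens per provider, inside the failover loop.
DIALECT_BY_PROVIDER = {"primary": "openai-strict", "secondary": "gemini"}

def handle_fixed(schema: dict, prepare: Callable, call: Callable, normalize: Callable) -> Any:
    for provider in ("primary", "secondary"):
        payload = prepare(schema, dialect=DIALECT_BY_PROVIDER[provider])
        try:
            return normalize(call(provider, payload))      # responses normalized too
        except ProviderError:
            continue
    raise AllProvidersFailed("all providers failed")
```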
Sometimes the secondary provider rejects the payload outright with a schema-validation error and your fallback fails closed — annoying, visible, fixable. The much worse version is when it accepts the payload, generates output that looks fine to a human eye, and your parser quietly produces records that violate invariants downstream code depends on. A field that was an integer becomes the string "$500,000" because the secondary provider's coercion is more permissive. An enum acquires a value never seen in production traffic. A list field that should never have been empty arrives as [] because the secondary's interpretation of "required" with a nullable item type produced a different default. Six weeks later, an analyst notices that all the high-value contracts in the database show zero risk exposure, and you discover that your parser silently turned every coerced string into 0.
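That last corruption usually traces back to a lenient parser default. The fix is boring: fail loudly on anything that isn't the declared type, as in this sketch (field name invented):
```python
def parse_exposure(value) -> int:
    """Reject currency-formatted strings instead of silently mapping them to 0.
    A secondary provider whose coercion is looser than the primary's trips this
    immediately, rather than six weeks later in an analyst's dashboard."""
    if isinstance(value, int) and not isinstance(value, bool):
        return value
    raise ValueError(f"expected integer exposure, got {type(value).__name__}: {value!r}")
```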
The team that designed the primary path is rarely the team that wakes up at 3 AM during the failover. The schema discipline that works on the happy path is exactly the discipline that erodes against drift when nobody is watching the secondary.
What the dialect-mapping discipline actually looks like
There is no clever tooling that solves this. Every team that handles it well has converged on roughly the same four practices, and the practices are unsexy.
Maintain a documented dialect matrix. A table — not a wiki page that decays, an actual artifact in your repo — with columns for each provider on your fallback list and rows for each schema feature you rely on: additionalProperties handling, required-default behavior, supported keyword subset, enum coercion policy, failure-on-violation behavior, root-type constraints (OpenAI requires the root to be an object and disallows top-level anyOf; others don't), maximum nesting depth, maximum property count. The act of writing the matrix surfaces the dialect deltas your team has been carrying as tribal knowledge. Update it the first week of every quarter; it changes.
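The format matters less than the fact that it is checked in and diffable. A sketch of what such an artifact might look like; every value here is a placeholder to be filled in from the providers' current docs and re-verified quarterly, not a vendor fact:
```python
# dialect_matrix.py -- hypothetical checked-in artifact; values are placeholders.
DIALECT_MATRIX = {
    "openai-strict": {
        "additional_properties": "must be false at every object level",
        "required_default": "every field must be listed as required; optional = null union",
        "unsupported_keywords": ["minLength", "maxLength", "minItems", "maxItems"],
        "violation_behavior": "aborts generation",
        "root_type": "object only, no top-level anyOf",
    },
    "gemini": {
        "additional_properties": "not enforced",
        "required_default": "required by default; nullable flag / optionalProperties for optional",
        "unsupported_keywords": ["verify against current docs"],
        "violation_behavior": "verify against current docs",
        "root_type": "verify against current docs",
    },
}
```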
Contract tests against every provider on the fallback list. Not a test that asserts "the model returned JSON" — a test that exercises the edge cases of your specific schema shapes against every provider you might route to. Deeply nested unions. Optional discriminators. Enums where the obvious near-neighbor is also a valid value. Empty-list cases for required arrays. Numeric fields where the model is tempted to emit a currency-formatted string. Run these on a schedule, not just on schema changes, because the test you care about is "did the provider's behavior drift?" — and provider behavior drifts independently of your code.
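A sketch of the shape these tests take, using pytest; call_structured and RISK_SCHEMA stand in for your own gateway client and schema:
```python
import json
import pytest
from jsonschema import Draft202012Validator

# Hypothetical wiring: swap these for your real gateway client and schema.
RISK_SCHEMA = {
    "type": "object",
    "properties": {"exposure": {"type": "integer"},
                   "urgency": {"enum": ["low", "normal", "urgent"]}},
    "required": ["exposure", "urgency"],
    "additionalProperties": False,
}

def call_structured(provider: str, schema: dict, scenario: str) -> str:
    raise NotImplementedError("wire this to your gateway's structured-output call")

PROVIDERS = ["primary", "secondary"]            # everyone on the fallback list
EDGE_CASES = [
    "enum value with a tempting near-neighbor",
    "required array the model wants to leave empty",
    "numeric field the model wants to format as currency",
]

@pytest.mark.parametrize("provider", PROVIDERS)
@pytest.mark.parametrize("case", EDGE_CASES)
def test_schema_contract_holds(provider, case):
    raw = call_structured(provider, schema=RISK_SCHEMA, scenario=case)
    errors = list(Draft202012Validator(RISK_SCHEMA).iter_errors(json.loads(raw)))
    assert not errors, f"{provider} drifted on {case!r}: {[e.message for e in errors]}"
```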
A normalization layer between provider and consumer. Every structured response, regardless of provider, runs through one internal type system before it touches downstream code. Pydantic, Zod, your own DTO layer — the specific tool matters less than the constraint that no consumer ever sees the raw provider payload. The normalization layer is where coercion is explicit and logged: if Gemini emitted a field that OpenAI would have refused, you transform it to your canonical shape and emit a metric so the divergence is visible. Without this layer, dialect differences leak into business logic, and the business logic accidentally codifies the primary provider's quirks as system invariants.
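A minimal sketch of that layer using Pydantic v2 syntax, with invented fields; the important properties are that coercion is explicit and that it emits a log line you can alert on:
```python
import logging
from pydantic import BaseModel, Field, field_validator

log = logging.getLogger("structured_normalizer")

class Contract(BaseModel):
    """Canonical internal shape: no consumer ever sees the raw provider payload."""
    risk_exposure: int
    urgency: str = Field(pattern="^(low|normal|urgent)$")

    @field_validator("risk_exposure", mode="before")
    @classmethod
    def handle_currency_strings(cls, v):
        # Coercion is explicit and logged, never silent.
        if isinstance(v, str):
            log.warning("provider emitted string for risk_exposure: %r", v)
            return int(v.replace("$", "").replace(",", ""))
        return v
```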
Shadow your primary against the secondary, on a schedule. Once a week — not once a quarter, not "when we remember" — replay a sample of real production requests against the secondary provider and diff the structured outputs. Not the prose-similarity diff (that's noise); the schema-shape diff. Did the secondary start producing a key the primary doesn't? Did an enum distribution shift? Did the rate of "field X is null" change? Shadowing is the canary that surfaces dialect drift before failover does, and the cost is small relative to the cost of a 6-day silent corruption window.
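The diff itself doesn't need to be clever. Flattening each response into (path, type) pairs and comparing the two catches most of the drift described above; a sketch:
```python
import json
from collections import Counter

def shape_of(payload: dict, prefix: str = "") -> Counter:
    """Flatten a structured response into (path, json-type) counts --
    the thing worth diffing, as opposed to prose similarity."""
    shape = Counter()
    for key, value in payload.items():
        path = f"{prefix}.{key}" if prefix else key
        shape[(path, type(value).__name__)] += 1
        if isinstance(value, dict):
            shape += shape_of(value, path)
    return shape

def shadow_diff(primary_raw: str, secondary_raw: str) -> dict:
    """Weekly job: which keys/types each provider produced that the other didn't."""
    p, s = shape_of(json.loads(primary_raw)), shape_of(json.loads(secondary_raw))
    return {"only_in_primary": p - s, "only_in_secondary": s - p}
```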
The economics nobody computes honestly
There is a frame I keep watching teams get backwards: the assumption that the cheapest secondary provider on paper is the cheapest secondary provider in practice. Per-token pricing is the line item that shows up on the cloud bill and the line item that gets compared in the procurement deck. Integration debt is the line item that doesn't show up anywhere until an incident, and integration debt is roughly inverse to dialect proximity to your primary.
If your primary is OpenAI and you choose a secondary whose dialect is closest to OpenAI's strict-mode contract, your integration is mostly free — the same prepared schema mostly works, your normalization layer has fewer transformations to write, your contract tests have fewer edge cases to cover. If you choose a secondary whose dialect is materially different — a different required-vs-optional default, a different subset of supported keywords, a different coercion policy — you are paying the integration cost up front (in engineer-quarters of dialect mapping and contract tests) or, more commonly, deferring it into the eventual incident.
The honest cost equation isn't price per million tokens. It's price per million tokens plus amortized engineer-quarters of dialect maintenance plus expected cost of one dialect-related incident per N quarters, and the third term dominates everything else for any team whose AI feature is on the critical path. Teams that make this calculation explicitly tend to either (a) pay slightly more for a secondary whose dialect is closer to the primary, or (b) commit to the engineering investment to maintain a real translation layer. Teams that don't make the calculation explicitly tend to choose the cheapest paper-cost option and then absorb the incident as a one-time event, repeatedly.
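Written as a formula it's trivial, which is part of why nobody writes it down. A sketch, where every input is your own estimate rather than anything a procurement deck will hand you:
```python
def true_secondary_cost(price_per_mtok: float, monthly_mtok: float,
                        engineer_quarters_per_year: float, cost_per_quarter: float,
                        incident_cost: float, incidents_per_year: float) -> float:
    """Annualized cost of a secondary provider. The token bill is usually
    the smallest of the three terms; the expected-incident term dominates
    for anything on the critical path."""
    token_bill = price_per_mtok * monthly_mtok * 12
    dialect_maintenance = engineer_quarters_per_year * cost_per_quarter
    expected_incidents = incident_cost * incidents_per_year
    return token_bill + dialect_maintenance + expected_incidents
```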
The architectural realization
Here's the version of this I want every AI-platform team to internalize: structured-output portability is not a property of "JSON." JSON itself is portable. JSON Schema, the spec, is portable. What is not portable is the specific, dialect-bounded subset of JSON Schema that each provider has chosen to implement, combined with the specific behavior each provider exhibits when generation can't satisfy that schema.
A fallback path is a translation layer, whether you wrote one or not. If you didn't write one, the translation is implicit, lives in the divergence between what your primary emits and what your secondary emits, and surfaces as bugs in downstream consumers. If you wrote one — and maintain it, and shadow against it, and run contract tests through it — the translation is explicit, owned, and the failure modes are visible to the team that owns them.
The teams I've watched handle this best treat their structured-output layer the way mature distributed-systems teams treat their RPC framing: as a contract that has to be specified, versioned, tested across implementations, and monitored for drift. The teams that handle it worst treat it as "JSON" — an interchange format they assume is universal, until the day a fallback router proves it isn't.
If your platform has a fallback path, the question to ask this quarter is not "do we have a secondary provider configured?" The question is: when did anyone last run a structured request through that secondary and diff the schema shape against the primary? If the answer is "I don't know" or "at integration time," your fallback path is a paper artifact, and the next outage is the test you didn't write.
Sources
- https://developers.openai.com/api/docs/guides/structured-outputs
- https://openai.com/index/introducing-structured-outputs-in-the-api/
- https://platform.claude.com/docs/en/build-with-claude/structured-outputs
- https://ai.google.dev/gemini-api/docs/structured-output
- https://blog.google/technology/developers/gemini-api-structured-outputs/
- https://www.glukhov.org/post/2025/10/structured-output-comparison-popular-llm-providers
- https://docs.litellm.ai/docs/completion/json_mode
- https://docs.litellm.ai/docs/proxy/reliability
- https://www.promptfoo.dev/docs/guides/evaluate-json/
- https://docs.cohere.com/docs/structured-outputs
- https://docs.x.ai/developers/model-capabilities/text/structured-outputs
