
The Structured Output Schema Two Models Interpret Differently

· 9 min read
Tian Pan
Software Engineer

The first time your fallback route fires in production is the wrong time to discover that your two providers do not agree on what your schema means. The JSON Schema looks identical in both client configurations. The validator passes on both outputs. The downstream code reads the field by name and gets a value. And then a billing total comes out as a string of digits instead of an integer, or a list of length one arrives as a bare object instead of a single-element array, and a code path that has been green for six months silently returns the wrong answer.

The seductive thing about structured output is that it removes a class of bugs — unparseable JSON, hallucinated fields, missing keys — and so it feels like it removes the parsing problem entirely. What it actually does is move the parsing problem one layer up, from the lexer to the type system, where it is much harder to see. Two providers can both honor a JSON Schema and still produce outputs that are not interchangeable, because "honor" has at least four distinct meanings in this corner of the ecosystem and your schema does not specify which one you wanted.

The schema you wrote is not the schema each provider enforces

The first divergence is at the API boundary, before a single token is generated. OpenAI's strict mode requires every property in properties to also appear in required, requires additionalProperties: false on every object, and rejects schemas that omit either. Anthropic's established route is not a json_schema response format at all: you express the contract as a tool's input_schema and force the model to call it. Gemini accepts a JSON Schema subset but historically required the items of every array to declare an explicit type, where the spec allows items: {} to mean "any type." Each provider has documented opinions about what subset of JSON Schema it implements, and those opinions do not fully overlap.
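To make the strict-mode rules above concrete, here is a minimal sketch of a helper that rewrites a schema so every declared property is required and every object forbids extra keys. The name to_strict_schema is hypothetical, not part of any SDK, and the walk only covers the keywords listed in the code. It is an illustration under those assumptions, not a complete JSON Schema transformer.

```python
from copy import deepcopy
from typing import Any


def to_strict_schema(schema: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `schema` with the two strict-mode constraints applied."""
    out = deepcopy(schema)  # never mutate the caller's schema
    _tighten(out)
    return out


def _tighten(node: Any) -> None:
    if not isinstance(node, dict):
        return
    props = node.get("properties")
    if isinstance(props, dict):
        # Every declared property becomes required; extra keys are forbidden.
        node["required"] = list(props)
        node["additionalProperties"] = False
        for sub in props.values():
            _tighten(sub)
    items = node.get("items")
    if isinstance(items, dict):
        _tighten(items)
    # Assumption: these are the only combinators and definitions in use.
    for key in ("anyOf", "oneOf", "allOf"):
        for sub in node.get(key, []):
            _tighten(sub)
    for sub in node.get("$defs", {}).values():
        _tighten(sub)


schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "address": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
        },
    },
    "required": ["name"],
}
strict = to_strict_schema(schema)
print(strict["required"])  # ['name', 'address']
```

Note the side effect: address was optional in the original schema and is required in the rewritten one. A field that must stay optional is usually re-declared as a union with null, which changes what "no value" looks like on that route.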

This means the same schema, dropped into three providers, takes three different paths through the request pipeline. One path enforces it at decode time with constrained generation. One path validates the schema at the API edge and rejects it before any token is generated. One path treats the schema as a prompt-time hint and applies looser enforcement. The compliance rates differ — OpenAI's strict mode reports parse failure rates under 0.1%, Anthropic's tool use reports under 0.2%, Gemini's response schema under 0.3% — but the more dangerous gap is not the failure rate. It is what each provider does with the cases the schema did not pin down.

The four boundary cases that always diverge

Practitioners who have actually ported a structured output pipeline between providers report the same shortlist of cases that quietly behave differently.

Null versus missing. A field declared as {"type": ["string", "null"]} can be returned as "field": null or it can be omitted from the JSON entirely. Both pass validation whenever the field is not listed in required, but they parse to different things in your downstream code: one yields a key with a None value, the other raises a KeyError or returns a default. One provider, on the same input, may consistently emit explicit nulls; another may consistently omit. Code that does payload["address"] works for one and crashes for the other. The schema does not distinguish "the model knows there is no address" from "the model has nothing to say about address," but your runtime does.
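One way to keep that distinction visible is a sentinel that separates "key absent" from "key present with null", plus a normalizer that gives both routes the same shape. This is a sketch; MISSING, field_state, and normalize_nullable are illustrative names, not library APIs.

```python
from typing import Any, Iterable

MISSING = object()  # sentinel: distinguishes "no key" from "key is None"


def field_state(payload: dict[str, Any], key: str) -> str:
    """Classify how a provider expressed an optional field."""
    value = payload.get(key, MISSING)
    if value is MISSING:
        return "missing"
    if value is None:
        return "null"
    return "present"


def normalize_nullable(payload: dict[str, Any], nullable_keys: Iterable[str]) -> dict[str, Any]:
    """Return a copy where every nullable key exists, defaulting to None."""
    out = dict(payload)
    for key in nullable_keys:
        out.setdefault(key, None)
    return out


explicit_null = {"name": "Ada", "address": None}  # one provider's style
omitted = {"name": "Ada"}                         # another provider's style

print(field_state(explicit_null, "address"))  # null
print(field_state(omitted, "address"))        # missing
print(normalize_nullable(omitted, ["address"]) == explicit_null)  # True
```

Logging field_state per route before normalizing makes it possible to see which provider omits and which emits null, rather than discovering it from a KeyError.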

Single-element array versus bare value. A field declared as an array of strings will sometimes be emitted as ["one item"] and sometimes as "one item" when only one item is appropriate. Strict mode catches this; non-strict mode often does not, especially for tool-arguments paths where the model is interpolating between formats it has seen in training data. A downstream for item in payload["tags"] then either iterates over the array or iterates over the characters of the string, and the latter takes a long time to notice if your tags happen to be reasonable English words.
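The character-iteration failure is easy to reproduce, and a small guard at the boundary removes it. The helper below is a sketch with a hypothetical name, as_string_list. It wraps a bare string and rejects anything that is neither a string nor a list of strings.

```python
from typing import Any


def as_string_list(value: Any) -> list[str]:
    """Accept a list of strings or a bare string; reject everything else."""
    if isinstance(value, str):
        return [value]  # bare value: wrap instead of iterating characters
    if isinstance(value, list) and all(isinstance(v, str) for v in value):
        return value
    raise TypeError(f"expected a string or list of strings, got {type(value).__name__}")


payload = {"tags": "urgent"}  # bare value where the schema declared an array

# Unguarded iteration walks the characters of the string.
print([item for item in payload["tags"]])   # ['u', 'r', 'g', 'e', 'n', 't']

# Guarded iteration sees one tag.
print(as_string_list(payload["tags"]))      # ['urgent']
```

Whether to wrap silently or raise is a design choice. Wrapping keeps the fallback route working, while raising surfaces the divergence immediately, so it may be worth counting how often the wrap branch fires.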

Integer versus string-of-digits. When the schema says integer, models trained on web-scale text have an enduring temptation to emit "42" instead of 42, especially for fields that "feel like" identifiers or amounts. Some providers' constrained decoding catches this at generation time. Others rely on a post-hoc validator. Validators in different SDKs make different coercion choices: Pydantic in its default lax mode will silently turn "42" into 42, while a stricter validator will reject it. Your pipeline absorbs the coercion on one route and rejects the value on the other, and the divergence is invisible until you compare outputs side by side.
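The two coercion policies can be sketched with the standard library alone. The functions lax_int and strict_int are hypothetical stand-ins for a coercing and a non-coercing validator; they are not the API of Pydantic or any other SDK.

```python
from typing import Any, Callable


def lax_int(value: Any) -> int:
    """Coercing policy: accept ints and strings that parse as ints."""
    if isinstance(value, bool):  # bool is a subclass of int; exclude it
        raise TypeError("bool is not an integer here")
    if isinstance(value, int):
        return value
    if isinstance(value, str):
        try:
            return int(value)
        except ValueError:
            raise TypeError(f"not an integer string: {value!r}") from None
    raise TypeError(f"expected int, got {type(value).__name__}")


def strict_int(value: Any) -> int:
    """Non-coercing policy: accept only real ints."""
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError(f"expected int, got {type(value).__name__}")
    return value


def accepts(validator: Callable[[Any], int], value: Any) -> bool:
    """True if the validator lets the value through."""
    try:
        validator(value)
        return True
    except TypeError:
        return False


print(accepts(lax_int, "42"))     # True
print(accepts(strict_int, "42"))  # False
```

Running both policies over the same sample of outputs and comparing the accept sets is one cheap way to find fields where the routes disagree before the fallback fires.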

Empty array versus absent field. When the model has nothing to return for a list-typed field, it may emit [], omit the field, or — worst case — emit null for a field declared as a non-nullable array. Downstream code that treats absence as "no items" works for two of three cases and breaks on the third. There is an open issue in at least one major inference engine about structured generation failing to produce empty arrays for array-typed fields, which means even the same provider can flip behavior between releases.
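A reader for list-typed fields can return the items and also record which of the three spellings the provider used, so a flip between releases shows up in metrics instead of in a stack trace. This is a sketch; read_list and the shape labels are illustrative names.

```python
from typing import Any


def read_list(payload: dict[str, Any], key: str) -> tuple[list[Any], str]:
    """Return (items, shape), where shape records how 'no items' was spelled."""
    if key not in payload:
        return [], "absent"
    value = payload[key]
    if value is None:
        # A schema violation for a non-nullable array, tolerated but labeled.
        return [], "null"
    if not isinstance(value, list):
        raise TypeError(f"expected a list for {key!r}, got {type(value).__name__}")
    shape = "populated" if value else "empty"
    return value, shape


for sample in ({"tags": []}, {}, {"tags": None}, {"tags": ["a"]}):
    items, shape = read_list(sample, "tags")
    print(shape, items)
```

Downstream code consumes items and never sees the difference; the shape label is only for counting, for example one counter per provider and field.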

None of these are bugs in any particular provider. They are points the spec does not pin down, on which each provider has made a defensible choice — and the defensible choices do not match.

Why the fallback route is the moment it surfaces
