The Structured-Output Retry Loop Is Your Hidden Compute Waste

· 11 min read
Tian Pan
Software Engineer

Pull up your structured-output dashboard. The number it proudly shows is something like "98.4% schema compliance." That's the success rate — the fraction of requests that produced a valid JSON object on the first try. The team built a retry wrapper for the other 1.6%, shipped it, and moved on. Two quarters later, the inference bill is up 15% on a request volume that grew by 4%. The CFO wants a story. The engineers don't have one, because the dashboard that tracks structured-output success doesn't track structured-output cost.

Here's the part the dashboard is hiding: the failure path is not a single retry. The first re-prompt fixes the missing enum field but introduces a malformed nested array. The second re-prompt fixes the array but drops a required key. The third pass finally validates, but by then the request has burned three full re-generations on top of the original attempt — four inference calls in total — and your per-request token meter shows the sum, not the loop. From the meter's perspective it's one expensive request. From the cost line's perspective it's a stochastic loop you never priced.

This post is about what that loop actually does to your compute budget, why your existing observability can't see it, and the disciplines that make it visible and bounded.

The 2% That Costs 15%

The arithmetic is brutal once you write it down. Suppose your structured-output failure rate is 2% and your retry wrapper averages three attempts before giving up or falling through. The 98% of clean requests cost you 1× tokens each. The 2% that fail cost you 4× tokens each (one initial attempt plus three retries). Doing the math: 0.98 × 1 + 0.02 × 4 = 1.06× the baseline cost — a 6% premium across the whole fleet for a "98% success rate."

But that's the floor, not the ceiling. The retry path uses longer prompts than the success path, because the wrapper appends the previous broken output and a "fix this" instruction. So each retry's input tokens are 1.5–2× the original. The retry path also tends to land on harder inputs — the ones that triggered the failure in the first place — which means the model is generating more cautious, longer outputs. By the time you account for the retry-prompt overhead and the longer outputs, the 2% of failed requests are routinely consuming 12–18% of the compute budget.
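The back-of-envelope math above fits in a few lines. A minimal sketch — the 2× retry-prompt and 1.5× retry-output factors are the rough estimates from the paragraph above, treated here as tunable parameters, not measured constants:

```python
def fleet_cost_multiplier(fail_rate, attempts_on_failure,
                          retry_prompt_factor=1.0, retry_output_factor=1.0):
    """Expected per-request cost, relative to a clean first-attempt call."""
    retries = attempts_on_failure - 1
    # Each retry is inflated by the longer "fix this" prompt and the
    # longer, more cautious output on hard inputs.
    per_retry_cost = retry_prompt_factor * retry_output_factor
    failed_request_cost = 1 + retries * per_retry_cost
    return (1 - fail_rate) * 1.0 + fail_rate * failed_request_cost

# The floor: 2% failures, 4 total attempts, no overhead.
print(round(fleet_cost_multiplier(0.02, 4), 2))             # 1.06
# With retry prompts ~2x and retry outputs ~1.5x the original:
print(round(fleet_cost_multiplier(0.02, 4, 2.0, 1.5), 2))   # 1.18
```

At the overhead-adjusted 1.18×, the failed 2% of requests account for 0.02 × 10 / 1.18 ≈ 17% of total spend — squarely inside the 12–18% range above.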

The team that costs structured output as a single inference call is paying for a stochastic loop they never priced. The dashboard says "structured output works." The bill says it works some of the time, expensively, in ways nobody has labeled.

The Failure Path Doesn't Fail Cleanly

If retries always converged, this would be a pricing problem, not an architectural one. They don't.

Modern constrained-decoding APIs (OpenAI's Structured Outputs, Anthropic's tool-use schemas, vLLM's XGrammar, Outlines' FSM-based decoding) push schema-compliance failure rates well below 1% for in-distribution inputs. The remaining failures are concentrated on long-tail content: deeply nested schemas, large enum unions, free-text fields that the model wants to format with code fences, multilingual values, and edge cases where the tokenizer's subword boundaries fight the schema's character-level constraints. For these inputs, the model isn't failing at random. It's failing at the same structural feature, attempt after attempt.

The retry loop, in turn, is fighting a model that's fighting its own decoding distribution. What you actually see in production traces:

  • Attempt 1 omits a required field on a 14-key nested schema.
  • Attempt 2, given the broken output and a "fix the missing field" instruction, fixes the missing field but moves a different field outside its parent object — a structural drift the wrapper can't easily catch.
  • Attempt 3 fixes the structural drift but produces an enum value that's a typo of a valid one ("INVTL" instead of "INVAL").
  • Attempt 4 either gets it right or falls through to a degraded code path.

Each attempt is a fresh generation conditioned on a slightly different prompt. There's no monotonic convergence. Two of the four attempts could pass schema validation; that doesn't help, because the wrapper accepts the first one to pass, and that one might still be semantically wrong (one field corrected, another silently truncated). The "structured output is a constrained generation contract the model can fail in retry-amplifying ways" framing is more accurate than "structured output works 98% of the time."

Iteration Caps Are the Wrong Unit

Most retry wrappers cap iterations: "retry up to 3 times." This is the wrong knob. Iterations are a discrete count; cost is a continuous quantity. A request that fails and retries with a small, deterministic schema costs maybe 200 extra tokens per attempt. A request that fails on a 6,000-token document with a deeply nested schema costs 8,000+ tokens per attempt because the wrapper is feeding the previous broken output back in.

Capping at three iterations means your worst-case failure path on a long document is 24,000+ extra tokens — roughly 4× the cost of the success path on that same document. The cap is doing exactly nothing to bound the cost. It's bounding attempts, which is a proxy that doesn't track with the metric the bill cares about.

The fix is to cap the loop in tokens, not iterations: "this request is allowed to spend up to 2× the budget of a successful request before falling through." Implementation is straightforward — track input + output tokens across attempts, abort when the cumulative spend crosses a threshold, route to the degraded path. The threshold itself becomes a tunable knob you can set per-feature: a low-stakes summarization tolerates a 1.2× budget; a high-stakes extraction over a long document gets 3×; a debugging mode gets unbounded with a paging alert.
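A minimal sketch of that token-denominated budget, assuming a hypothetical `generate(prompt)` that returns `(text, tokens_used)` and a `validate(text)` that raises `ValueError` on schema violations — illustrative names, not a real API:

```python
def structured_call(prompt, generate, validate, budget_multiplier=2.0):
    """Retry until valid, but cap cumulative spend in tokens, not attempts."""
    attempt, first_cost = generate(prompt)
    budget = first_cost * budget_multiplier  # e.g. 2x a successful request
    spent = first_cost
    while True:
        try:
            return validate(attempt), spent
        except ValueError as err:
            if spent >= budget:
                # Budget exhausted: fall through to the degraded path
                # instead of looping.
                return None, spent
            # The retry prompt carries the broken output back in, which is
            # exactly why retries are the expensive part of the loop.
            retry_prompt = (f"{prompt}\n\nPrevious output:\n{attempt}\n\n"
                            f"Fix this validation error: {err}")
            attempt, cost = generate(retry_prompt)
            spent += cost
```

The `budget_multiplier` is the per-feature knob described above: 1.2 for low-stakes summarization, 3.0 for high-stakes extraction, effectively unbounded (plus an alert) for debugging.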

This single discipline — token-denominated retry budgets — is the difference between "we have a retry policy" and "we have priced our retry policy."

The Schema-Failure Dashboard You Don't Have

A dashboard that says "98.4% structured-output success" is averaging across every schema your team ships. That's the wrong granularity for fixing the loop, because the failures aren't uniformly distributed. One or two schemas account for most of the retry traffic, and within those schemas one or two fields account for most of the failures.

What you want, and what almost no team builds by default, is a dashboard sliced by which field failed validation. For each schema in production, log: which top-level field caused the failure, whether it was a missing-required, an enum mismatch, a type coercion, or a structural error, and the input characteristics (length, language, presence of code blocks). After two weeks of this you'll find that maybe three fields across your whole prompt library account for 70% of the retry budget. They tend to be:

  • An optional-with-defaults field that the model keeps emitting as null against an enum constraint.
  • A free-text field that the model wants to wrap in markdown fences.
  • A nested object whose required keys collide with a similarly-named field in the model's training distribution (the model emits the wrong one).

Once those fields are visible by name, you can fix them — usually with schema redesign rather than a prompt edit. Make the enum accept both null and the sentinel. Strip code fences in post-processing instead of forcing the model to suppress them. Rename the colliding key. None of this is exotic; the only reason it doesn't happen is that the dashboard hasn't surfaced the offenders. The team is debugging "structured output failed" without knowing which field, in which schema, on which input shape.
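As a minimal, stdlib-only sketch of what each logged record looks like — a real pipeline would pull field paths and error classes out of its validator's error objects (e.g. jsonschema's `iter_errors`); this toy version handles only flat schemas and two error classes, and the record shape is the point:

```python
import json

def classify_failures(schema_name, schema, raw_output, input_meta):
    """Emit one dashboard record per violation: which field, which class."""
    try:
        instance = json.loads(raw_output)
    except json.JSONDecodeError:
        return [{"schema": schema_name, "field": None,
                 "error_class": "not_json", **input_meta}]
    records = []
    # Missing required keys: the bulk of retry traffic in practice.
    for field in schema.get("required", []):
        if field not in instance:
            records.append({"schema": schema_name, "field": field,
                            "error_class": "missing_required", **input_meta})
    # Enum mismatches, including the near-miss typos described above.
    for field, spec in schema.get("properties", {}).items():
        if field in instance and "enum" in spec \
                and instance[field] not in spec["enum"]:
            records.append({"schema": schema_name, "field": field,
                            "error_class": "enum_mismatch", **input_meta})
    return records
```

Aggregate these records for two weeks, group by `(schema, field, error_class)`, and the three offending fields surface by name.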

Falling Through Beats Looping Forever

The third discipline is the one teams resist hardest, because it feels like accepting failure. It's also the one that keeps the bill honest: a fall-through path that prefers degraded structured output over infinite retries when the loop diverges.

Concretely: if a request hits its retry-token budget without producing a valid object, the wrapper should not keep trying. It should commit to one of three options, depending on the feature:

  1. Return the best partial output — the attempt with the fewest schema violations, paired with a flag that downstream services can use to decide whether to consume it.
  2. Return a typed null with a reason code — empty object plus { "_status": "schema_failed", "_field": "summary" }, so the caller can choose to degrade gracefully (skip the feature, show a fallback UI, escalate to a human).
  3. Return the unconstrained generation — the model's freeform answer, with a clear marker that it didn't satisfy the contract. Often more useful than three more retries that won't converge.
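The three options can share one fall-through function. A sketch — the `_status` and `_field` names follow the post, while the attempt-record shape is an assumption for illustration:

```python
def fall_through(attempts, mode):
    """Commit to a degraded result once the retry-token budget is spent.

    attempts: list of dicts with keys 'parsed' (dict or None),
    'violations' (int), 'raw' (str), and 'failed_field' (str or None).
    """
    if mode == "best_partial":
        # Option 1: the attempt with the fewest schema violations, flagged
        # so downstream services can decide whether to consume it.
        best = min(attempts, key=lambda a: a["violations"])
        return {**(best["parsed"] or {}), "_status": "schema_failed_partial"}
    if mode == "typed_null":
        # Option 2: empty payload plus a reason code the caller can branch on.
        last = attempts[-1]
        return {"_status": "schema_failed", "_field": last["failed_field"]}
    if mode == "unconstrained":
        # Option 3: the freeform answer, clearly marked as off-contract.
        return {"_status": "schema_failed_raw", "_raw": attempts[-1]["raw"]}
    raise ValueError(f"unknown fall-through mode: {mode}")
```

Which mode a feature uses is a product decision, not an infrastructure one — which is exactly why the failure has to be explicit rather than buried in retries.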

Teams resist this because "we don't return broken data." But you already do; the broken data is just hidden behind three extra inference calls and a misleading success rate. Making the fall-through explicit moves the cost from invisible (compute budget) to visible (a flag on a small percentage of responses). That's a strictly better bargain — you can decide what to do with explicit failures; you can't decide what to do with bills you don't understand.

The fall-through also creates a forcing function for schema improvement. If 0.5% of your responses come back with _status: schema_failed, that's a number a product manager can react to. "We're spending 15% of compute on retries" is not a number anyone reacts to, because compute is a line item, not a user-facing artifact.

Per-Call Attribution: First-Attempt Tokens vs. Retry Tokens

The final piece of the discipline is the cheapest to implement and the most embarrassing to skip: split your token attribution between first-attempt and retry tokens at the per-call level, and surface both in your cost reporting.

Most LLM gateways log a single tokens_in and tokens_out per request. If your wrapper is retrying internally, those numbers are sums — they hide the loop. The fix is to emit one log entry per attempt, tagged with attempt_index and a stable request_id, and to roll up the per-attempt cost into a per-request cost report that explicitly shows the retry overhead.
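A sketch of that rollup, assuming each attempt was logged with the `request_id`, `attempt_index`, `tokens_in`, and `tokens_out` fields named above:

```python
from collections import defaultdict

def rollup(attempt_logs):
    """Split per-request token spend into first-attempt vs. retry tokens.

    attempt_logs: iterable of dicts with request_id, attempt_index,
    tokens_in, tokens_out (one record per attempt, not per request).
    """
    per_request = defaultdict(lambda: {"first_attempt_tokens": 0,
                                       "retry_tokens": 0})
    for rec in attempt_logs:
        total = rec["tokens_in"] + rec["tokens_out"]
        bucket = ("first_attempt_tokens" if rec["attempt_index"] == 0
                  else "retry_tokens")
        per_request[rec["request_id"]][bucket] += total
    return dict(per_request)
```

The fleet-wide ratio `sum(retry_tokens) / sum(first_attempt_tokens)` is the single number this post argues should be on the dashboard, sliced by schema and failing field.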

When you do this, two things become possible:

  • Honest unit economics. You can answer "what does a structured-output call to feature X actually cost?" with a number that includes the retry tail. Today, that number is somewhere between "the marketing claim" and "the bill" with no way to reconcile them.
  • Per-tenant or per-feature cost attribution. If one customer's input distribution triggers more retries than another's, you'll see it. If one feature's schema is paying a 4× retry tax compared to its peers, you'll see that too. This is exactly the data needed to decide where to invest schema-redesign effort.

The implementation cost is one structured-log field and one rollup query. The cost of not doing it is that every conversation about "is structured output expensive?" starts from anecdote.

Treat It Like a Stochastic Loop, Not a Single Call

The headline architectural realization is this: structured output is a constrained generation contract that the model can fail in retry-amplifying ways. The single-inference-call mental model is a fiction that survives because the gateway aggregates the loop into one number.

Once you accept that you're running a stochastic loop, the engineering follows mechanically. You don't bound a stochastic loop by counting iterations; you bound it by spend. You don't debug a stochastic loop in aggregate; you slice it by where the divergence happens. You don't recover from a stochastic loop by looping harder; you build a fall-through that turns silent compute waste into explicit, observable failures. And you don't price a stochastic loop by the success path; you price it by the tail.

The teams getting structured output right in 2026 have stopped treating the retry wrapper as a quality fix and started treating it as a cost-control surface. The dashboard that matters isn't the schema-compliance percentage — it's the ratio of retry tokens to first-attempt tokens, sliced by schema and by failing field. When that ratio drifts up, somebody's prompt or schema needs to change. When it drifts down, you can spend the headroom somewhere it's actually noticed.

The 2% is not the problem. The 2% multiplied by an unbounded, unattributed loop is the problem. Bound it, attribute it, and let it fail visibly when it diverges. Your bill will start telling you the truth, which is usually the first step toward fixing anything.