
The Structured-Output Retry Loop Is Your Hidden Compute Waste

· 11 min read
Tian Pan
Software Engineer

Pull up your structured-output dashboard. The number it proudly shows is something like "98.4% schema compliance." That's the success rate — the fraction of requests that produced a valid JSON object on the first try. The team built a retry wrapper for the other 1.6%, shipped it, and moved on. Two quarters later, the inference bill is up 15% on a request volume that grew by 4%. The CFO wants a story. The engineers don't have one, because the dashboard that tracks structured-output success doesn't track structured-output cost.

Here's the part the dashboard is hiding: the failure path is not a single retry. The first re-prompt fixes the missing enum field but introduces a malformed nested array. The second re-prompt fixes the array but drops a required key. The third re-prompt finally validates, but by then the request has burned three full retry calls on top of the original generation, and your per-request token meter shows the sum, not the loop. From the meter's perspective it's one expensive request. From the cost line's perspective it's a stochastic loop you never priced.

This post is about what that loop actually does to your compute budget, why your existing observability can't see it, and the disciplines that make it visible and bounded.

The 2% That Costs 15%

The arithmetic is brutal once you write it down. Suppose your structured-output failure rate is 2% and your retry wrapper averages three attempts before giving up or falling through. The 98% of clean requests cost you 1× tokens each. The 2% that fail cost you 4× tokens each (one initial attempt plus three retries). Doing the math: 0.98 × 1 + 0.02 × 4 = 1.06× the baseline cost — a 6% premium across the whole fleet for a "98% success rate."
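The same arithmetic as a small Python sketch. It assumes, as the paragraph does, that every attempt costs the same as the original call and that every failing request uses all of its retries; the function name is illustrative only.

```python
def expected_cost_multiplier(failure_rate: float, retries: int) -> float:
    """Fleet-wide cost relative to a baseline where every request costs 1x.

    Assumption: each retry costs the same as the original call, and every
    failing request consumes all `retries` attempts.
    """
    clean = (1.0 - failure_rate) * 1.0          # requests that pass first time
    failed = failure_rate * (1 + retries)       # original call plus retries
    return clean + failed


multiplier = expected_cost_multiplier(failure_rate=0.02, retries=3)
print(f"{multiplier:.2f}x baseline")  # 1.06x baseline
```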

But that's the floor, not the ceiling. The retry path uses longer prompts than the success path, because the wrapper appends the previous broken output and a "fix this" instruction. So each retry's input tokens are 1.5–2× the original. The retry path also tends to land on harder inputs (the ones that triggered the failure in the first place), which means the model is generating more cautious, longer outputs. By the time you account for the retry-prompt overhead and the longer outputs, the 2% of requests that fail are routinely consuming 12–18% of the compute budget.
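To see how a 2% failure rate can reach that range, here is a rough sketch that folds the longer retry prompts and longer outputs into a single per-retry cost multiplier. That multiplier is a hypothetical lumped parameter, not a measured value, and the model still assumes every failing request uses all three retries.

```python
def failed_share_of_compute(
    failure_rate: float, retries: int, retry_cost_multiplier: float
) -> float:
    """Fraction of total compute spent on requests that fail at least once.

    Assumption: a clean request costs 1 unit, and each retry costs
    `retry_cost_multiplier` units (longer prompt plus longer output).
    """
    clean = 1.0 - failure_rate
    failed = failure_rate * (1.0 + retries * retry_cost_multiplier)
    return failed / (clean + failed)


for m in (1.0, 2.0, 3.0):
    share = failed_share_of_compute(failure_rate=0.02, retries=3, retry_cost_multiplier=m)
    print(f"retry cost {m:.0f}x -> {share:.1%} of compute")
```

Under these assumptions, retries that cost the same as the original call put the failing 2% at about 7.5% of compute. Retries that cost 2× to 3× the original put it at roughly 12.5% to 17%.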

The team that costs structured output as a single inference call is paying for a stochastic loop they never priced. The dashboard says "structured output works." The bill says it works some of the time, expensively, in ways nobody has labeled.

The Failure Path Doesn't Fail Cleanly

If retries always converged, this would be a pricing problem, not an architectural one. They don't.

Modern constrained-decoding APIs and libraries (OpenAI's Structured Outputs, Anthropic's tool-use schemas, XGrammar as integrated into vLLM, Outlines' FSM-based decoding) push schema-compliance failure rates well below 1% for in-distribution inputs. The remaining failures are concentrated on long-tail content: deeply nested schemas, large enum unions, free-text fields that the model wants to format with code fences, multilingual values, and edge cases where the tokenizer's subword boundaries fight the schema's character-level constraints. For these inputs, the model isn't failing at random. It's failing at the same structural feature, attempt after attempt.

The retry loop, in turn, is fighting a model that's fighting its own decoding distribution. What you actually see in production traces:

  • Attempt 1 omits a required field on a 14-key nested schema.
  • Attempt 2, given the broken output and a "fix the missing field" instruction, fixes the missing field but moves a different field outside its parent object, a structural drift the wrapper's error message can't easily capture.
  • Attempt 3 fixes the structural drift but produces an enum value that's a typo of a valid one ("INVTL" instead of "INVAL").
  • Attempt 4 either gets it right or falls through to a degraded code path.

Each attempt is a fresh generation conditioned on a slightly different prompt. There's no monotonic convergence. Even if two of the four attempts pass schema validation, that doesn't help: the wrapper accepts the first one to pass, and that one might still be semantically wrong (one field corrected, another silently truncated). A more accurate framing than "structured output works 98% of the time" is that structured output is a constrained-generation contract the model can fail in retry-amplifying ways.

Iteration Caps Are the Wrong Unit

Most retry wrappers cap iterations: "retry up to 3 times." This is the wrong knob. Iterations are a discrete count; cost is a continuous quantity. A request that fails and retries with a small, deterministic schema costs maybe 200 extra tokens per attempt. A request that fails on a 6,000-token document with a deeply nested schema costs 8,000+ tokens per attempt because the wrapper is feeding the previous broken output back in.

Capping at three iterations means your worst-case failure path on a long document is 24,000+ extra tokens (3 × 8,000), roughly 4× the cost of a clean pass over that same 6,000-token document. The cap is doing exactly nothing to bound the cost. It's bounding attempts, which is a proxy that doesn't track the metric the bill cares about.

The fix is to cap the loop in tokens, not iterations: "this request is allowed to spend up to 2× the budget of a successful request before falling through." Implementation is straightforward — track input + output tokens across attempts, abort when the cumulative spend crosses a threshold, route to the degraded path. The threshold itself becomes a tunable knob you can set per-feature: a low-stakes summarization tolerates a 1.2× budget; a high-stakes extraction over a long document gets 3×; a debugging mode gets unbounded with a paging alert.
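A minimal sketch of such a token-denominated budget follows. The `call_model` and `validate` callables, the `RetryResult` type, and the fix-up prompt wording are all hypothetical stand-ins rather than any particular SDK's API. The stub model exists only so the example runs.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical interfaces, not a real SDK:
#   ModelCall: prompt -> (raw_output, input_tokens, output_tokens)
#   Validator: raw_output -> None if valid, otherwise an error message
ModelCall = Callable[[str], Tuple[str, int, int]]
Validator = Callable[[str], Optional[str]]


@dataclass
class RetryResult:
    output: Optional[str]  # None means "route to the degraded path"
    attempts: int
    tokens_spent: int


def generate_with_token_budget(
    prompt: str,
    call_model: ModelCall,
    validate: Validator,
    baseline_tokens: int,
    budget_multiplier: float = 2.0,
) -> RetryResult:
    """Retry until the output validates or cumulative tokens cross the budget."""
    budget = baseline_tokens * budget_multiplier
    spent = 0
    attempts = 0
    current_prompt = prompt
    while True:
        raw, tokens_in, tokens_out = call_model(current_prompt)
        if tokens_in + tokens_out <= 0:
            raise ValueError("call_model must report a positive token count")
        attempts += 1
        spent += tokens_in + tokens_out
        error = validate(raw)
        if error is None:
            return RetryResult(raw, attempts, spent)
        if spent >= budget:
            # Budget crossed: stop retrying and let the caller degrade.
            return RetryResult(None, attempts, spent)
        # Feed the broken output back in, as the wrapper described above does.
        current_prompt = (
            f"{prompt}\n\nPrevious output:\n{raw}\n\n"
            f"Validation error: {error}\nReturn corrected JSON only."
        )


def json_validator(raw: str) -> Optional[str]:
    """Stand-in validator: checks JSON syntax only, not a full schema."""
    try:
        json.loads(raw)
    except json.JSONDecodeError as exc:
        return str(exc)
    return None


# Stub model for illustration: first reply is truncated, second is valid.
_canned = iter(['{"status": "INVAL"', '{"status": "INVAL"}'])


def stub_model(prompt: str) -> Tuple[str, int, int]:
    return next(_canned), 100, 50


result = generate_with_token_budget(
    "Extract the status as JSON.", stub_model, json_validator,
    baseline_tokens=150, budget_multiplier=2.0,
)
print(result.attempts, result.tokens_spent, result.output is not None)
```

Because the check runs after each call, the loop can overshoot the budget by up to one attempt. A stricter variant would estimate the next prompt's size before sending it.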

This single discipline — token-denominated retry budgets — is the difference between "we have a retry policy" and "we have priced our retry policy."

The Schema-Failure Dashboard You Don't Have

A dashboard that says "98.4% structured-output success" is averaging across every schema your team ships. That's the wrong granularity for fixing the loop, because the failures aren't uniformly distributed. One or two schemas account for most of the retry traffic, and within those schemas one or two fields account for most of the failures.
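One way to get that granularity is to log a record per validation failure and aggregate by schema and by field. The sketch below assumes a hypothetical `FailureEvent` record. The schema names, field paths, and token counts are made-up sample data, not measurements.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class FailureEvent:
    schema: str        # hypothetical schema identifier
    field: str         # path of the field that failed validation
    retry_tokens: int  # tokens spent on the retry this failure triggered


def summarize(events: Iterable[FailureEvent]) -> Tuple[Counter, Counter]:
    """Return (retry tokens per schema, failure count per (schema, field))."""
    tokens_by_schema: Counter = Counter()
    failures_by_field: Counter = Counter()
    for event in events:
        tokens_by_schema[event.schema] += event.retry_tokens
        failures_by_field[(event.schema, event.field)] += 1
    return tokens_by_schema, failures_by_field


# Illustrative sample data only.
events = [
    FailureEvent("invoice_v3", "line_items[].tax_code", 8200),
    FailureEvent("invoice_v3", "line_items[].tax_code", 7900),
    FailureEvent("invoice_v3", "currency", 400),
    FailureEvent("ticket_v1", "priority", 250),
]

tokens_by_schema, failures_by_field = summarize(events)
for schema, tokens in tokens_by_schema.most_common():
    print(f"{schema}: {tokens} retry tokens")
for (schema, field), count in failures_by_field.most_common(3):
    print(f"{schema}.{field}: {count} failures")
```

Sorting by retry tokens rather than failure count keeps the view tied to cost: a schema that fails rarely but on long documents can still dominate the bill.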
