The Logprobs Field Your Provider Removed That Silently Broke Your Confidence Router
The most expensive line in the postmortem was the one nobody wrote: a 200 OK with a missing field. The router that was supposed to escalate hard questions to the stronger model had been escalating zero percent of traffic for six weeks. The cost dashboard was celebrating. The quality dashboard was sliding, but only on the hard-question slice the standing eval set underweighted. Everything looked like a win until a customer complained about a specific kind of question the system used to handle correctly.
The cause was a response shape change one tier up the contract stack. The provider's mid-tier plan had dropped per-token logprobs as part of what the release notes called a "tier-specific feature parity adjustment." The client still received valid JSON. The HTTP status was still 200. The model identifier in the response matched the model identifier in the request. The only thing that changed was that the field the router consumed to make its escalation decision was no longer there, and the defensive default added during an incident a year earlier had quietly become the production default for every request.
The Defensive Fallback That Became The Production Default
Confidence routing is one of the better cost-control patterns in modern LLM stacks. You ask a cheap model first. You look at how confident the model was in its answer. If the confidence is below a threshold, you escalate the question to a stronger, more expensive model. Done right, you get most of the quality of the strong model at a fraction of the cost. Cascading routing with confidence thresholds is widely cited as cutting inference cost by 45 to 85 percent while preserving the bulk of answer quality.
The confidence signal usually comes from logprobs. Each token the model emits carries a log probability indicating how certain the model was about that token. Aggregating those values across the answer gives you a single number that correlates well with whether the answer is right. The exact aggregation differs by team — minimum logprob over the answer span, mean logprob, perplexity, or a constrained sum over a specific answer fragment — but the shape of the contract is the same. The router reads a field on the response. It applies a threshold. It either returns the cheap answer or it re-asks the expensive model.
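The aggregations named above can be sketched in a few lines. This is a minimal illustration, not any particular team's router: the function names and the threshold of -0.5 are assumptions chosen for the example, and the input is assumed to be a plain list of per-token log probabilities already extracted from the response.

```python
import math
from typing import Sequence


def min_logprob(token_logprobs: Sequence[float]) -> float:
    """Least-confident token in the answer span."""
    return min(token_logprobs)


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Average per-token log probability."""
    return sum(token_logprobs) / len(token_logprobs)


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the negative mean logprob; 1.0 means fully certain."""
    return math.exp(-mean_logprob(token_logprobs))


def should_escalate(token_logprobs: Sequence[float],
                    threshold: float = -0.5) -> bool:
    """Escalate when the mean logprob falls below the threshold.

    The threshold is a hypothetical value; a real one would be
    tuned against labelled traffic.
    """
    return mean_logprob(token_logprobs) < threshold
```

Whichever aggregation is used, note that every function here assumes the list exists and is non-empty. That assumption is the part of the contract the rest of this story turns on.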
This is where the failure mode hides. The router code has to handle a missing field. The two natural choices are to treat absence as "uncertain, escalate" or as "certain, accept." During the incident a year earlier, the provider had a brief outage in which partial responses came back without logprobs. The team's router treated the absence as low confidence and escalated every request to the expensive tier, quadrupling spend for the duration of the incident. The fix was a defensive fallback: if logprobs are absent, treat the answer as high confidence and skip escalation. The decision was sound in context: the provider was failing, escalation was expensive, and the cheap answer was usually good enough to ride out an outage. Nobody wrote down the assumption that the fallback was only meant to apply during partial-failure conditions, because at the time those were the only conditions under which logprobs could be missing.
A year later the provider revised the tier definition. Logprobs were now a feature of the high tier and not the mid tier. The client kept calling the mid tier. The response shape changed. The defensive fallback, designed to handle a transient failure mode, became the steady-state behavior of the routing decision. Every answer was treated as high confidence. Escalation went to zero. The strong model stopped seeing the hard questions. The mid model started answering them by itself, with the quality you would expect.
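A toy reconstruction makes the collapse concrete. The sketch below is an assumption about what such a router might look like, not the team's actual code: `route_with_fallback`, the threshold, and the sample logprob values are all invented for illustration.

```python
from typing import List, Optional, Sequence

THRESHOLD = -0.5  # hypothetical escalation threshold on mean logprob


def route_with_fallback(token_logprobs: Optional[Sequence[float]]) -> str:
    """Return "accept" or "escalate" for one cheap-model response."""
    if not token_logprobs:
        # Defensive fallback added after the outage:
        # absent logprobs are treated as high confidence.
        return "accept"
    mean = sum(token_logprobs) / len(token_logprobs)
    return "escalate" if mean < THRESHOLD else "accept"


def escalation_rate(batch: List[Optional[Sequence[float]]]) -> float:
    """Fraction of responses in the batch that would be escalated."""
    decisions = [route_with_fallback(lp) for lp in batch]
    return decisions.count("escalate") / len(decisions)


# Made-up traffic: the same four requests before and after the tier change.
before_tier_change = [[-0.1, -0.2], [-1.5, -2.0], [-0.05], [-0.9, -1.1]]
after_tier_change = [None, None, None, None]

print(escalation_rate(before_tier_change))  # 0.5
print(escalation_rate(after_tier_change))   # 0.0
```

Nothing in the second call raises, logs, or returns an error. The function does exactly what it was written to do.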
The general shape of this bug is worth naming explicitly. A defensive default written for one failure mode silently becomes the production behavior of an unrelated decision when the upstream contract changes. The router was never asked to reason about a world where the field was deliberately removed. The fallback was written for "the field is sometimes missing because the call is failing," not "the field is always missing because the provider decided this tier doesn't include it." The semantics of the absence had flipped, and the code path had no way to tell.
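One way to give the code path a chance to tell the two cases apart is to track how often the field is missing, so that sustained absence surfaces as a signal instead of disappearing into the fallback. The sketch below is an assumption, not something the team in this story had: `AbsenceMonitor`, the window size, and the fraction limit are hypothetical, and a count-based window is only a rough proxy for "transient versus permanent."

```python
from collections import deque


class AbsenceMonitor:
    """Tracks whether recent responses carried logprobs."""

    def __init__(self, window: int = 1000,
                 max_missing_fraction: float = 0.5) -> None:
        # Keeps only the most recent `window` observations.
        self._seen = deque(maxlen=window)
        self._max_missing = max_missing_fraction

    def record(self, has_logprobs: bool) -> None:
        self._seen.append(has_logprobs)

    def missing_fraction(self) -> float:
        if not self._seen:
            return 0.0
        return self._seen.count(False) / len(self._seen)

    def sustained_absence(self) -> bool:
        """True once a full window is mostly missing the field."""
        window_full = len(self._seen) == self._seen.maxlen
        return window_full and self.missing_fraction() >= self._max_missing


monitor = AbsenceMonitor(window=4, max_missing_fraction=0.75)
for _ in range(4):
    monitor.record(has_logprobs=False)

if monitor.sustained_absence():
    # In a real system this would page someone or disable the fallback.
    print("logprobs missing on most recent responses")
```

A router could call `record` on every response and keep the fallback for isolated gaps, while treating `sustained_absence()` as an alert condition and not as a routing input.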
Provider Tier As A Moving Contract Your Finance Team Negotiated
The harder lesson is about where the contract boundary actually lives. Most teams treat the API surface — request shape, response shape, error vocabulary — as the contract. They write code against it, they run integration tests against it, and they trust it to change only when something visible changes, like a major version bump or a public migration guide. What they miss is that the response shape your client receives is a function of two things, not one: the provider's API definition, and the tier of service your account is on.
The tier of service is negotiated by finance. The API definition is consumed by engineering. There is no shared system of record between those two boundaries. Finance sees a quarterly invoice and a contract addendum that talks about per-token rates, throughput ceilings, and feature inclusions described in marketing language. Engineering sees a JSON schema and a status code. When the provider revises a tier — adds a feature here, removes one there — the contract addendum lands in finance's inbox. The JSON schema doesn't change for anyone, because the field that was removed is now described as optional and tier-dependent rather than guaranteed.
This is how a pricing-tier reshuffle becomes a silent client breakage. The provider's release notes flag the change in language designed for procurement readers: "tier-specific feature parity adjustment," "consolidating advanced response fields under the premier tier," "tier alignment with the broader product lineup." Engineering, if it reads the release notes at all, reads them as marketing prose. Finance, if it reads the release notes at all, reads them as a cost optimization opportunity, possibly even as a reason to negotiate a downgrade. Neither party connects the change to the response field the router has been consuming since the system was built.
A clear pattern for closing this gap is to treat the provider's tier policy as part of the contract that engineering owns. Every response field your code consumes for any decision — routing, retry, cost accounting, auditing — needs an inventory entry that names the tier guarantee that field depends on. When finance evaluates a tier change, the inventory becomes the artifact that names the engineering systems whose behavior depends on the tier, and the change can't ship until each owning team has signed off. This sounds heavy, and it is, but it is the only honest way to handle a contract whose definition the counterparty controls and revises inside the term you committed to.
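As a rough sketch of what such an inventory could look like in code, the example below records each consumed field alongside the decision and team that depend on it, then lists the teams whose sign-off a proposed tier would need. Every name here (`FieldDependency`, the field names, the team names) is hypothetical, and the set of fields a tier guarantees is assumed to be maintained by hand from the contract.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class FieldDependency:
    field: str        # response field the code reads
    decision: str     # what the field is used to decide
    owning_team: str  # who must sign off if the guarantee changes


# Hypothetical inventory entries.
INVENTORY = [
    FieldDependency("logprobs", "routing", "routing-team"),
    FieldDependency("usage", "cost accounting", "finops-eng"),
]


def required_signoffs(inventory: Iterable[FieldDependency],
                      fields_guaranteed_by_new_tier: Set[str]) -> List[str]:
    """Teams that own a decision depending on a field the new tier drops."""
    return sorted({
        dep.owning_team
        for dep in inventory
        if dep.field not in fields_guaranteed_by_new_tier
    })


# A proposed tier that still guarantees usage data but not logprobs.
print(required_signoffs(INVENTORY, {"usage"}))  # ['routing-team']
```

The value is less in the code than in forcing the tier's guarantees to be written down in a form both finance and engineering can check a proposed change against.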
