
252 posts tagged with "reliability"


The Agent Timeout Your Users Learned to Game for Refunds

· 9 min read
Tian Pan
Software Engineer

A platform shipped a thirty-minute wall-clock cap on long-running agent tasks, paired with a refund policy that returned the token spend on any task that hit the timeout without producing a deliverable. The intent was protective: a hung agent should not bill the customer. Six months later, the timeout rate had doubled, the engineering team was deep in an "agent reliability" investigation, and the support queue was full of users complaining that the agent "keeps timing out" — with screenshots that showed the user's own browser tab closing at twenty-nine minutes and change.

The unit economics had quietly inverted on a behavioral cohort the finance model never named. The refund population was not a quality population. It was a strategy.

The conversation_id Collision That Swapped Two Users' Contexts at the Gateway

· 10 min read
Tian Pan
Software Engineer

A customer support ticket arrives that reads like a hallucination. The user attached a screenshot: a question they never asked, with their account name at the top, followed by a model response that references files they have never uploaded. The trace looks clean. The model did exactly what was asked of it. The problem is that the question came from a different tenant entirely, and your gateway routed two conversations to the same backend state because their conversation_id values collided.

You do the math on a napkin. UUID v4 has 122 bits of entropy. The birthday-bound probability of any collision in a 50-million-conversation corpus is on the order of 10^-22, far below one in fifty million. You ran the calculation a year ago when you designed the system. The math was correct. The math is still correct. What changed is that two of your backend tiers stopped generating IDs the same way, and the probability the math described was never the probability you were actually running on.
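The napkin math can be reproduced in a few lines. This is a minimal sketch using the standard birthday approximation; the 32-bit case is a hypothetical stand-in for a tier that generates weaker IDs, not a detail taken from the incident.

```python
import math

def birthday_collision_probability(n_ids: int, entropy_bits: int) -> float:
    """Approximate P(at least one collision) among n_ids uniformly random IDs."""
    space = 2.0 ** entropy_bits
    # Standard approximation: p ~= 1 - exp(-n(n-1) / (2 * space)).
    # expm1 keeps precision when the exponent is tiny.
    return -math.expm1(-n_ids * (n_ids - 1) / (2.0 * space))

# 122 random bits, as in UUID v4, over 50 million conversations.
p_uuid4 = birthday_collision_probability(50_000_000, 122)

# Hypothetical: the same corpus if IDs carried only 32 random bits.
p_weak = birthday_collision_probability(50_000_000, 32)

print(f"122 bits: {p_uuid4:.3e}")
print(f" 32 bits: {p_weak:.3f}")
```

The formula only holds if every ID is drawn uniformly from the full space. Once one tier draws from a smaller or correlated space, the first number stops describing the system.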

The Downstream API That Kept Writing After the User Cancelled the Conversation

· 10 min read
Tian Pan
Software Engineer

The user hits stop. The browser closes the SSE connection. Your AI SDK fires onAbort. The agent runtime sees the signal, stops requesting more tokens from the model, and tears down its loop. From inside your codebase, the cancellation looks crisp. Every subsystem you can see is doing the right thing.

Meanwhile, two seconds earlier, the model emitted a tool call. The runtime dispatched it. The tool's execute function opened a TCP connection to a third-party API and posted a payload. That HTTP request is still in flight, the third party's server is still processing it, and the third party has no way of knowing that the conversation it is serving no longer exists. The write commits. The user's mental model says they escaped the action by hitting stop. The downstream system's database says otherwise.

The max_tokens Default Your Provider Raised That Doubled Your Tail Response Length

· 12 min read
Tian Pan
Software Engineer

Your incident timeline shows no deploys. Your code did not change. Your traffic mix did not change. Your prompts did not change. And yet your p99 output length doubled inside a week, your downstream rendering layer started clipping responses, and your output-token bill rose 38% on traffic that wasn't asking for longer answers. The change was real, the regression was measurable, and nothing in your version control system records it — because the value that moved was one your code never sent.

The provider raised an implicit default. The release notes filed it under "improved long-form behavior." The parameter in question was max_tokens, which your application has been omitting since day one because the documented default was generous and your outputs rarely came close. The default moved from 4096 to 8192 to accommodate longer reasoning in the provider's newer models. Your application got the new default whether you wanted it or not, because the absence of a parameter is itself a configuration choice — and the provider owns the right to change the value behind it.

This is the failure mode where a "no-op" release on the provider's side propagates through your system as a behavior change, a cost change, and a UX change all at once, and your team's only diagnostic signal is the bill arriving at the end of the month.
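One way to take the default out of the provider's hands is to never omit the parameter. The sketch below assumes a generic request payload; the field names, `build_request`, and `PINNED_MAX_TOKENS` are hypothetical and would need to match your provider's actual API.

```python
from typing import Any, Dict, Optional

# Assumption: 4096 is the cap this application was tuned against
# (the old provider default described above).
PINNED_MAX_TOKENS = 4096

def build_request(prompt: str, max_tokens: Optional[int] = None) -> Dict[str, Any]:
    """Build a request payload that never omits max_tokens.

    The payload shape is hypothetical; adapt field names to your provider.
    """
    return {
        "prompt": prompt,
        # Always send an explicit value so a provider-side default change
        # cannot alter output length without a change in version control.
        "max_tokens": PINNED_MAX_TOKENS if max_tokens is None else max_tokens,
    }

print(build_request("Summarize the incident."))
```

With the value pinned in code, any future change to the cap shows up as a diff and a deploy rather than as a line item on the bill.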

The Nightly Batch That Starved Your Interactive Traffic After a Quota Window Rewrite

· 11 min read
Tian Pan
Software Engineer

A cron job that ran cleanly for ten months is the most dangerous job in your system, because nothing in it changed, nothing in your code changed, and the only thing that did change was a sentence in someone else's release notes that nobody on your team reads. The nightly embedding refresh that kicked off at 00:05 UTC every night, drained its work queue in under ten minutes, and went back to sleep was textbook. It coexisted with daytime interactive traffic by occupying the freshly-reset minute quota for a few minutes before users woke up, and by staying well under the daily allotment for the rest of the day. Then the provider rewrote how the daily window was accounted, kept the minute window unchanged, and left every signature your client tested against intact. The batch kept running clean. The interactive surface started returning 429s at 00:13 UTC every night. For a week, the team chased an upstream maintenance window that wasn't happening.

The bug was never in your code. The bug was that "a daily limit" stopped meaning what it had meant the day before, and your scheduler was pinned to a wall-clock boundary that aligned with the old meaning. This post is about rate-limit accounting as a contract the provider can revise without breaking any signature, about how two independently-correct schedules compose into a denial-of-service pattern, and about the architectural moves that make a cron job stop being a time bomb wired to someone else's clock.

The Rate-Limit Headers Your Provider Returned That Disagreed With The Actual Throttle

· 10 min read
Tian Pan
Software Engineer

The response header said you had 480,000 tokens-per-minute of headroom. The 429 arrived after you spent 240,000. Your scheduler had been autoscaling against a number the runtime was never going to honor, and the burndown chart on the wall was reading the documentation while the throttler was enforcing something else entirely.

This is one of those failures that takes a long time to even notice, because every component along the path is doing exactly what it advertised. The provider returns a header. Your client parses it. Your scheduler reads it. Your dashboard plots it. None of these layers is broken. What is broken is the assumption that the header is a contract.
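If the header is a hint rather than a contract, the scheduler can treat observed 429s as the stronger signal. The following is a rough sketch under that assumption; `ThrottleEstimator` and its fields are hypothetical names, and how you obtain the advertised limit and the per-window spend depends on your client.

```python
from typing import Optional

class ThrottleEstimator:
    """Tracks the advertised limit alongside the spend at which 429s actually fire."""

    def __init__(self) -> None:
        self.advertised: Optional[int] = None
        self.observed_ceiling: Optional[int] = None

    def record(self, status: int, advertised_limit: int, spent_in_window: int) -> None:
        self.advertised = advertised_limit
        if status == 429:
            # A 429 at this spend is evidence the real ceiling is no higher.
            if self.observed_ceiling is None:
                self.observed_ceiling = spent_in_window
            else:
                self.observed_ceiling = min(self.observed_ceiling, spent_in_window)

    def effective_limit(self) -> Optional[int]:
        # Plan against the lower of what was advertised and what was enforced.
        candidates = [v for v in (self.advertised, self.observed_ceiling) if v is not None]
        return min(candidates) if candidates else None

est = ThrottleEstimator()
est.record(status=200, advertised_limit=480_000, spent_in_window=100_000)
est.record(status=429, advertised_limit=480_000, spent_in_window=240_000)
print(est.effective_limit())
```

This version never raises the observed ceiling again, which is deliberately conservative. A production version would probably let the estimate expire or recover after a period without throttling.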

The Retry Budget That Hid Your Provider's Actual Error Rate From Your Dashboard

· 11 min read
Tian Pan
Software Engineer

The weekly review slide said 99.9%. The invoice said the bill had tripled. The two numbers had been on adjacent dashboards for months, and nobody had noticed that they were measuring different worlds. The reliability number was post-retry — every call that eventually returned a 200 counted as a success — and the cost number was every attempt the client made, billed by the token. Between them sat a generous five-attempt retry loop and a provider whose tail latency had been quietly degrading. The first time anyone looked at both numbers together was during an outage, when the cost-anomaly alert fired before the availability alert did.

That is the whole pattern. A retry budget that looks like a reliability mechanism is also a cost-quality knob, and the team that watches only one side of it is paying for an availability number the invoice will eventually correct.
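The two worlds can be put on one dashboard by computing both rates from the same attempt log. This is a small sketch that assumes each logical call is recorded as the ordered list of HTTP statuses its attempts returned; the log format and the sample data are hypothetical.

```python
from typing import List, Tuple

def success_rates(calls: List[List[int]]) -> Tuple[float, float]:
    """Return (post_retry_rate, per_attempt_rate).

    Each entry in calls is one logical call: the HTTP status of every
    attempt the client made for it, in order.
    """
    total_attempts = sum(len(attempts) for attempts in calls)
    ok_attempts = sum(1 for attempts in calls for status in attempts if status == 200)
    ok_calls = sum(1 for attempts in calls if attempts and attempts[-1] == 200)
    return ok_calls / len(calls), ok_attempts / total_attempts

# Hypothetical log: four logical calls, eleven billed attempts.
log = [[200], [500, 500, 200], [503, 200], [500, 500, 500, 500, 500]]
post_retry, per_attempt = success_rates(log)
print(f"post-retry: {post_retry:.0%}, per-attempt: {per_attempt:.0%}")
```

The gap between the two numbers is the share of billed attempts that the reliability slide never sees.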

The Tool Schema Migration That Broke Your Agent's Retries for Two Weeks

· 11 min read
Tian Pan
Software Engineer

The deprecation notice went out on a Tuesday. The downstream team rotated the response shape on their search tool: results[].snippet became results[].excerpt, a clean rename with a six-week window, a banner in the docs, and three reminder emails to the engineering list. Every human consumer migrated. The agent did not, because the agent does not read email. For fourteen days the retry loop quietly parsed the new payload, found the field it was looking for missing, raised a KeyError, and counted that as a retryable failure. The retry hit the same endpoint, got the same new shape, raised the same error, gave up after three attempts, and returned an apologetic message to the user. The retry budget dashboard stayed green the entire time, because the aggregate budget was never exhausted; each call simply failed for good inside its three-attempt allowance. Success rate, measured at the tool layer, sat at zero on that path. Nobody looked because there was no page.

This is the shape of failure that catches the most engineers in 2026: not the dramatic outage, but the silent contract drift where a human-facing migration runs to completion and the agent-facing one never starts because nobody knew there was one to run. The deprecation worked exactly as designed for the consumers it was designed for. The agent was a consumer nobody listed.
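The retry loop described above went wrong at the point where it classified a parse failure as transient. A minimal sketch of the alternative, assuming the tool call can be split into a fetch step and a parse step; `call_tool`, `ContractError`, and the callables are hypothetical names for illustration.

```python
from typing import Any, Callable

class ContractError(Exception):
    """The response arrived but no longer matches the shape the parser expects."""

def call_tool(fetch: Callable[[], Any], parse: Callable[[Any], Any],
              max_attempts: int = 3) -> Any:
    last_error: Exception = RuntimeError("no attempts made")
    for _ in range(max_attempts):
        try:
            payload = fetch()
        except (ConnectionError, TimeoutError) as exc:
            # Transport failures are plausibly transient, so retry them.
            last_error = exc
            continue
        try:
            return parse(payload)
        except (KeyError, TypeError) as exc:
            # A shape mismatch will not heal on retry; surface it immediately.
            raise ContractError(f"unexpected response shape: {exc!r}") from exc
    raise last_error

def parse_results(payload: dict) -> list:
    # Still reads the old field name, as the agent in the story did.
    return [item["snippet"] for item in payload["results"]]

try:
    call_tool(lambda: {"results": [{"excerpt": "renamed field"}]}, parse_results)
except ContractError as err:
    print("non-retryable:", err)
```

Raising a distinct error type on the first attempt gives alerting something to page on, instead of a failure that stays inside the retry budget.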

Your Eval Suite Is a Production Workload: When Nightly Tests Starve Live Traffic

· 11 min read
Tian Pan
Software Engineer

A team's most successful AI feature went dark at 2:14 AM on a Tuesday. The pager said the model API was returning 429s in steady state. The model was healthy. The provider was healthy. The team's own production traffic was nominal. What was eating the quota was the nightly eval suite — the same suite the team had been proudly expanding the previous week. The eval and the product shared an organization key, and on that night the eval was the noisy neighbor that broke its own roommate.

The eval wasn't misbehaving. It was doing exactly what its authors designed: a thousand cases against the production model identifier, on a cadence, on a schedule everyone had forgotten about because it had been quiet for two years. The expansion that finally pushed it over the limit added three hundred cases. The PR was reviewed by the eval owner and the prompt owner. Nobody on the review thread thought to ask: how much of the daily token quota does this consume?
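The question nobody asked on the review thread is plain arithmetic. Here is a sketch of that calculation; the case counts follow one reading of the story (a thousand cases growing by three hundred), while the tokens per case and the daily quota are invented numbers for illustration only.

```python
def quota_share(cases: int, avg_tokens_per_case: int, daily_token_quota: int) -> float:
    """Fraction of the shared daily token quota one eval run would consume."""
    return cases * avg_tokens_per_case / daily_token_quota

# Hypothetical inputs: 2,000 tokens per case, 10M tokens per day on the shared key.
before = quota_share(1_000, 2_000, 10_000_000)
after = quota_share(1_300, 2_000, 10_000_000)
print(f"before: {before:.0%}, after: {after:.0%}")
```

A check like this could run in CI on any PR that changes the eval suite, failing when the share crosses a budget the team has agreed on.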

The 429 Whose Body Said OK And Your Client Believed The Body

· 9 min read
Tian Pan
Software Engineer

The outage started at 14:03 with a 429 from the provider and a JSON body that said {"status": "ok", "data": null}. The client library was written in a hurry six months ago by someone who had been burned twice before — once by a gateway that returned HTTP 200 with an error field, and once by a provider that returned HTTP 500 on a request that had actually succeeded. So the library learned to trust the body, not the status. The status said throttle. The body said proceed. The client believed the body, fired the next request, got another 429 with another ok, fired again, and by 14:11 the provider's circuit breaker had blacklisted the account for the rest of the hour.

The provider hadn't lied, exactly. The 429 was real. But somewhere in the response pipeline a default envelope had been merged over the rate-limit payload — a generic {"status": "ok"} from a wrapper service that filled missing fields, applied on top of an error the wrapper didn't recognize. The status code was correct, the headers were correct, the body was wrong, and the body was the part the client read.
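A client does not have to choose between trusting the status and trusting the body. The sketch below lets an error status win outright and consults the body only on a nominal success; `classify` and the body fields it inspects are assumptions modeled on the payloads quoted above.

```python
from typing import Any, Dict

def classify(status_code: int, body: Dict[str, Any]) -> str:
    """Classify a response, letting an error status override an 'ok' body."""
    if status_code == 429:
        return "throttled"
    if status_code >= 500:
        return "server_error"
    if status_code >= 400:
        return "client_error"
    # Only a success status lets the body speak, and only to downgrade.
    if body.get("error") or body.get("status") not in (None, "ok"):
        return "error_in_body"
    return "ok"

print(classify(429, {"status": "ok", "data": None}))
print(classify(200, {"error": "upstream failed"}))
```

This still covers the gateway that returned HTTP 200 with an error field. It does not cover the HTTP 500 that had actually succeeded, which needs reconciliation or idempotency on the write path rather than body parsing.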

The Agent Plan That Branched on a Fact Your Context Pruner Already Dropped

· 11 min read
Tian Pan
Software Engineer

A long-running agent generates a plan at step 3. The plan reads something like: "if the order returned by get_order in step 1 has status shipped, send the customer a tracking email; otherwise open a refund ticket." The agent confidently picks the email branch. The customer never received a tracking number, because the order was actually pending. You go to the trace expecting to find a hallucination. What you find is worse: the step-1 tool result is no longer in context. The pruner evicted it between step 2 and step 3 because it ranked low on recency and there was a 12KB transcript to make room for. The plan still ran. The branch was still chosen. The decision now points at evidence that does not exist.

This is not a model failure in the usual sense. The model produced a syntactically valid plan, executed it in order, and made a branch decision. The branch was made against a fact that used to be in context and is not anymore. The chain of thought encoded the condition (if status == "shipped"); the actual status got dropped on the way to the step that needed it. The plan looks deterministic, but it has been quietly cut loose from its evidence.
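One guard against this is to make each branch declare the evidence it depends on and refuse to run when that evidence is gone. This is a minimal sketch, assuming the runtime can expose the live context as a mapping from step keys to tool results; `branch_on`, `MissingEvidence`, and the key names are hypothetical.

```python
from typing import Any, Callable, Dict

class MissingEvidence(Exception):
    """A branch condition refers to a tool result that is no longer in context."""

def branch_on(context: Dict[str, Any], evidence_key: str,
              predicate: Callable[[Any], bool], if_true: str, if_false: str) -> str:
    if evidence_key not in context:
        # Fail loudly instead of letting the planner guess the missing value.
        raise MissingEvidence(f"{evidence_key} was pruned before the branch ran")
    return if_true if predicate(context[evidence_key]) else if_false

def is_shipped(order: dict) -> bool:
    return order["status"] == "shipped"

# Context with the step-1 result intact: the pending order takes the other branch.
full = {"step1.get_order": {"status": "pending"}}
print(branch_on(full, "step1.get_order", is_shipped,
                "send_tracking_email", "open_refund_ticket"))

# Context after the pruner evicted step 1.
pruned = {"step2.transcript": "..."}
try:
    branch_on(pruned, "step1.get_order", is_shipped,
              "send_tracking_email", "open_refund_ticket")
except MissingEvidence as err:
    print("blocked:", err)
```

The same declaration could also feed the pruner, so that a result some pending branch still depends on is pinned rather than ranked on recency alone.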

The Agent's I-Don't-Know Rate That Fell After You Added More Tools

· 9 min read
Tian Pan
Software Engineer

You added the search tool, then the calendar tool, then the CRM tool, then four database wrappers and a calculator. The dashboard moved the way you wanted: task-completion ticked up, latency held, the "I don't know" rate dropped from 14% to 4%. It looks like a capability win. It is not. The planner did not learn more; it learned to abstain less. Every question now looks answerable because there is always some tool that pattern-matches the query well enough to call. The 10 percentage points of "I don't know" you removed did not turn into correct answers; they turned into confident wrong ones, distributed across the long tail where nobody is grading carefully.

This is the false-competence trap of tool surface expansion. It is the most common way a team ships a regression while celebrating an improvement. The eval rubric measures whether the agent attempted the task and produced a plausible-shaped answer; it does not measure whether the agent should have refused. Abstention is not free, but it is the cheapest correct behavior available, and you stop being able to see it the moment your tool palette gets large enough that something always fires.
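Making abstention visible again requires an eval set that labels some questions as ones the agent should refuse. The sketch below assumes such labels exist; the record fields and the `false_competence_rate` metric name are hypothetical, chosen to match the framing above.

```python
from typing import Dict, List

def abstention_report(results: List[Dict[str, bool]]) -> Dict[str, float]:
    """Summarize abstention over eval records.

    Each record needs two booleans: 'answerable' (ground-truth label) and
    'abstained' (whether the agent said it did not know).
    """
    unanswerable = [r for r in results if not r["answerable"]]
    abstained = [r for r in results if r["abstained"]]
    answered_anyway = [r for r in unanswerable if not r["abstained"]]
    return {
        "abstention_rate": len(abstained) / len(results),
        # Share of should-refuse questions that got an answer anyway.
        "false_competence_rate": (
            len(answered_anyway) / len(unanswerable) if unanswerable else 0.0
        ),
    }

sample = [
    {"answerable": True, "abstained": False},
    {"answerable": True, "abstained": False},
    {"answerable": False, "abstained": True},
    {"answerable": False, "abstained": False},
]
print(abstention_report(sample))
```

Tracked across tool additions, a falling abstention rate paired with a rising false-competence rate is the regression the completion metric hides.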