3 posts tagged with "retries"

The 429 Whose Body Said OK And Your Client Believed The Body

June 2, 2026 · 9 min read

Software Engineer

The outage started at 14:03 with a 429 from the provider and a JSON body that said {"status": "ok", "data": null}. The client library was written in a hurry six months ago by someone who had been burned twice before — once by a gateway that returned HTTP 200 with an error field, and once by a provider that returned HTTP 500 on a request that had actually succeeded. So the library learned to trust the body, not the status. The status said throttle. The body said proceed. The client believed the body, fired the next request, got another 429 with another ok, fired again, and by 14:11 the provider's circuit breaker had blacklisted the account for the rest of the hour.

The provider hadn't lied, exactly. The 429 was real. But somewhere in the response pipeline a default envelope had been merged over the rate-limit payload — a generic {"status": "ok"} from a wrapper service that filled missing fields, applied on top of an error the wrapper didn't recognize. The status code was correct, the headers were correct, the body was wrong, and the body was the part the client read.

The Retry Budget Your Agent Learned to Plan Against

June 2, 2026 · 10 min read

Tian Pan

Software Engineer

The most uncomfortable lesson from running agents in production isn't that they fail — it's that they learn. Not in any deep sense; the weights aren't moving. But within a session, within a trajectory, the policy implied by the model adapts to the substrate it runs on. And if your substrate quietly absorbs failure on the agent's behalf, the agent eventually notices, and starts planning as if that absorption were free compute.

The cleanest example is the retry layer. You added it for reliability — the SDK retries failed tool calls three times before surfacing an error, your middleware wraps each step in exponential backoff, your loop catches malformed JSON and re-prompts the model to fix it. None of this was wrong. But every one of those mechanisms is a side effect the agent can observe, generalize from, and exploit. Once it does, your reliability layer stops being a safety net and starts being a planning primitive.

Your Retry Logic Is Teaching the Agent the Wrong Lesson

May 18, 2026 · 10 min read

Tian Pan

Software Engineer

A tool call fails. Your agent framework retries it three times with exponential backoff. The third attempt goes through. The trace shows a green checkmark. Nobody gets paged, no error counter increments, the user gets their answer. By every dashboard you have, the system worked.

It didn't. The tool failed because the agent passed a malformed argument, and the only reason the third try succeeded is that the agent — sampling differently each time — happened to phrase the call correctly on attempt three. You didn't recover from a transient fault. You ran a slot machine until it paid out, then logged the payout and threw away the two pulls that told you the agent was broken.

This is the quiet way retry logic rots an agent system. Retries were designed for a world where the caller is correct and the network is flaky. Agents invert that assumption: the network is mostly fine, and the caller is the unreliable part. When you point a retry policy built for the first world at the second one, it stops being a recovery mechanism and becomes a way to launder bugs into green checkmarks.

About Tian Pan