Your APIs Assumed One Human at a Time. Parallel Agents Broke the Contract.

April 26, 2026 · 12 min read

Software Engineer

A backend engineer I know spent a Tuesday afternoon staring at a Datadog graph that had never spiked before: the per-user 429 counter on their internal calendar service. The customer complaining had not changed their behavior. They had simply turned on the assistant feature, which now spawned eight planning threads in parallel against the same calendar API every time the user said "find me time next week." The rate limiter — a perfectly reasonable 60 requests per minute per user, written years ago against a UI that physically could not click that fast — was firing within the first three seconds of every request and silently corrupting half the assistant's responses.

The rate limit was not the bug. The contract was the bug. That backend, like most internal services written before 2024, had a quietly enforced assumption baked into every layer: one user means one stream of activity, paced by a human's reaction time, with one cookie jar, one CSRF token, and one set of credentials that could be re-prompted if anything went wrong. Agents shred all five of those assumptions at once, and the failures show up as a constellation of unrelated incidents — 429 storms, last-write-wins corruption, audit logs you can't subpoena, re-auth loops that hang headless workers — that nobody connects until the pattern is named.

The shorthand I have been using with platform teams is this: every backend you own has an undocumented contract with its callers, and that contract was negotiated with humans. Agents are now showing up to renegotiate. You can either do the renegotiation deliberately, in code review, or you can do it during your next incident.

The five assumptions agents break (and why they're invisible until they fire)

Single-user-at-human-speed is not one assumption. It is at least five, layered through different parts of the stack, owned by different teams, each defensible in isolation:

Rate limits are shaped for steady human cadence. A real person clicks, reads, types, clicks again. A 100 req/min limit is generous for that. An agent fan-out planner can dispatch 500 requests in ten seconds and then be silent for five minutes. Averaged over the full cycle, that is still under 100 requests per minute, yet the limiter fires on every burst, because the cadence is wrong, not the volume.
Idempotency is treated as the client's problem. "If you double-submit, that's on you" works when "you" is a human who notices the double-charge and complains. When "you" is a planning agent that retries on a transient 502 by re-running its tool call from the top, the server will quietly create two of everything and the agent will report success. The Idempotency-Key header has been an IETF draft since 2021 and most internal APIs still treat it as optional.
Sessions and CSRF tokens assume one cookie jar. Single-page-app session models lean on a per-browser cookie and a CSRF token bound to that session. Spawn ten parallel agent workers against the same logical user and you have ten cookie jars or one shared jar with ten concurrent writers — both modes break things the original auth designer never tested.
Audit logs record an action, not a chain of authority. "User U updated record R at timestamp T" was sufficient when U was a person who could be asked what happened. When U is an OAuth principal acting on behalf of a human acting on behalf of a service account that the human authorized last quarter, "user U did it" is a lie of omission that compliance will eventually catch.
Locking semantics are last-write-wins because two-tab humans were rare. A user opening the same record in two browser tabs and editing both was an edge case worth ignoring. Three agents writing to the same record in the same second is now the modal case, and your "we'll just use last-write-wins" decision from 2019 is now silently dropping data.
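One common alternative to last-write-wins is an optimistic version check, where a write is rejected if the record changed since the writer read it. The following is a minimal sketch under stated assumptions: `VersionedStore`, `Record`, and `VersionConflict` are hypothetical names, the store is an in-memory stand-in for a table with a version column, and it is not thread-safe as written.

```python
from dataclasses import dataclass


class VersionConflict(Exception):
    """Raised when a write was based on a stale version of the record."""


@dataclass
class Record:
    value: dict
    version: int = 0


class VersionedStore:
    """Hypothetical in-memory stand-in for a table with a version column."""

    def __init__(self):
        self._records = {}

    def read(self, key):
        rec = self._records.setdefault(key, Record(value={}))
        return dict(rec.value), rec.version

    def write(self, key, new_value, expected_version):
        rec = self._records.setdefault(key, Record(value={}))
        if rec.version != expected_version:
            # Another writer got in first; reject instead of overwriting.
            raise VersionConflict(
                f"expected v{expected_version}, found v{rec.version}"
            )
        rec.value = dict(new_value)
        rec.version += 1
        return rec.version


store = VersionedStore()
value_a, ver_a = store.read("event-1")  # agent A reads v0
value_b, ver_b = store.read("event-1")  # agent B also reads v0

# Agent A writes first; the record moves to v1.
store.write("event-1", {**value_a, "title": "Planning"}, ver_a)

try:
    # Agent B's write is based on v0, so it is rejected.
    store.write("event-1", {**value_b, "room": "4B"}, ver_b)
    conflict = False
except VersionConflict:
    conflict = True  # B has to re-read and merge instead of clobbering A
```

With last-write-wins, agent B's write would have silently erased the title agent A had just set. Here B gets an explicit conflict it can react to.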

None of these are exotic. Each one is something a senior engineer would defend on its own. The problem is that all five hold simultaneously, and an agent workload tests all five at once on the very first day it is enabled.
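To make the audit-log item concrete, here is one possible shape for an event that records the chain of authority rather than a single user. This is only an illustrative sketch: `Actor`, `AuditEvent`, the field names, and every identifier in the example are assumptions, not a schema taken from any real system.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Actor:
    kind: str  # e.g. "agent", "human", "service_account"
    id: str


@dataclass(frozen=True)
class AuditEvent:
    action: str
    resource: str
    timestamp: str
    # Ordered from the immediate caller back to the originating authority.
    authority_chain: tuple


def render(event):
    """Serialize the event, including every link in the chain, as JSON."""
    return json.dumps(asdict(event), sort_keys=True)


event = AuditEvent(
    action="update",
    resource="record/R",
    timestamp="2026-01-01T00:00:00Z",  # placeholder value
    authority_chain=(
        Actor("agent", "planner-worker-3"),
        Actor("human", "user-U"),
        Actor("service_account", "svc-calendar"),
    ),
)
line = render(event)
```

The point of the design is that "who did it" becomes a list that can be walked, so the log can answer both which worker made the call and on whose authority.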

The rate-shape mismatch is not a tuning problem

The first instinct when 429s spike is to raise the limit. This is wrong, and it is wrong in a way that costs more than it saves.

Consider what the rate limiter is actually for. Two jobs, mostly: protect the backend from a single tenant exhausting capacity, and constrain abuse from compromised credentials. Both of those jobs are denominated in the same unit — requests per minute per principal — because for a human user, requests per minute is a reasonable proxy for resource consumption and for "is this account behaving like an account, or like an exfiltration script."

Agents decouple those two units. Resource consumption per minute is now bursty and high; behavioral signal is now meaningless because every account looks like a script. Raising the per-minute limit to accommodate the burst means your abuse heuristic is gone and your capacity protection is wishful thinking.
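The shape-versus-volume point can be checked with a small simulation. This sketch assumes a textbook token bucket whose capacity equals the per-minute limit, and it reuses the numbers from the fan-out example earlier (a 100 req/min limit, 500 requests spread evenly over ten seconds, then five minutes of silence). `TokenBucket` and `simulate_burst` are hypothetical names, and a real limiter may be configured differently.

```python
class TokenBucket:
    """Textbook token bucket driven by an explicit clock (seconds)."""

    def __init__(self, capacity, refill_per_sec):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)  # start full
        self.last = 0.0

    def allow(self, now):
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def simulate_burst(limit_per_min=100, burst_requests=500,
                   burst_seconds=10.0, quiet_seconds=300.0):
    bucket = TokenBucket(capacity=limit_per_min,
                         refill_per_sec=limit_per_min / 60.0)
    spacing = burst_seconds / burst_requests
    # Requests arrive evenly spaced during the burst window.
    allowed = sum(bucket.allow(i * spacing) for i in range(burst_requests))
    cycle_minutes = (burst_seconds + quiet_seconds) / 60.0
    return {
        "allowed": allowed,
        "rejected": burst_requests - allowed,
        "average_per_min": burst_requests / cycle_minutes,
    }


result = simulate_burst()
```

Under these assumptions the average over the whole cycle stays below 100 requests per minute, yet most of the burst is rejected, because the bucket can only serve its starting capacity plus the little it refills in ten seconds.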

The redesign is to split the budget into two dimensions. A concurrency budget caps how many requests can be in flight at once for a given principal. This is what protects the backend, because in-flight requests map directly to thread pools, database connections, and downstream API quotas. A token bucket still caps work over time, but you set it generously, because you have already capped the worst-case fan-out via concurrency. A planner trying to spawn 500 parallel threads against a service with concurrency cap 8 will either queue, get fast 503s with retry hints, or (best) get a 429 with a Retry-After header that the agent's executor knows how to honor. The graph stops being a saw-tooth of false positives.
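A minimal sketch of the concurrency dimension, assuming an in-process counter per principal: `ConcurrencyBudget` and `admit` are hypothetical names, the one-second retry hint is an arbitrary choice, and a real deployment would typically keep this state somewhere shared and run the generous token bucket check alongside it.

```python
import threading
from collections import defaultdict


class ConcurrencyBudget:
    """Caps the number of in-flight requests per principal."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self._in_flight = defaultdict(int)
        self._lock = threading.Lock()

    def try_acquire(self, principal):
        with self._lock:
            if self._in_flight[principal] >= self.max_in_flight:
                return False
            self._in_flight[principal] += 1
            return True

    def release(self, principal):
        with self._lock:
            self._in_flight[principal] -= 1


def admit(budget, principal, retry_after_seconds=1):
    """Return (status, headers). On 200 the caller must release() when done."""
    if budget.try_acquire(principal):
        return 200, {}
    # Over the cap: reject quickly with a hint the agent's executor can honor.
    return 429, {"Retry-After": str(retry_after_seconds)}


budget = ConcurrencyBudget(max_in_flight=8)

# A planner fans out 500 requests before any of them has finished.
statuses = [admit(budget, "user-U")[0] for _ in range(500)]
admitted = statuses.count(200)
rejected = statuses.count(429)

budget.release("user-U")  # one in-flight request completes
status_after_release, _ = admit(budget, "user-U")
```

Only eight requests are admitted no matter how wide the fan-out is, and a slot opens as soon as one of them completes. The backend's worst case is bounded by the cap rather than by the planner's ambition.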

The second piece is per-tool quotas separate from per-principal quotas. Tool catalogs have wildly different blast radii: a search call costs a millisecond against a public-facing list endpoint, while a "send email" call costs an external API charge and puts your deliverability reputation at risk. Treating both as "1 unit per request" against the same per-user budget is exactly the abstraction failure that lets a buggy agent burn through your transactional email quota in fifteen minutes.
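One way to express this is to charge each tool a weighted cost against the principal's budget and, separately, cap calls to high-blast-radius tools. The sketch below is illustrative only: the tool names, the weights in `TOOL_COST`, the caps in `TOOL_CALL_CAP`, and the `WindowBudget` class are all assumptions, and resetting the budget at the end of each window is left out.

```python
from collections import defaultdict

# Hypothetical weights: units charged per call, reflecting blast radius.
TOOL_COST = {"search": 1, "send_email": 50}
# Hypothetical per-tool caps on calls per window, independent of the shared budget.
TOOL_CALL_CAP = {"send_email": 5}


class WindowBudget:
    """Tracks spend for one budget window for each principal."""

    def __init__(self, units_per_principal):
        self.units_per_principal = units_per_principal
        self._units_spent = defaultdict(int)
        self._tool_calls = defaultdict(int)

    def charge(self, principal, tool):
        cost = TOOL_COST[tool]
        cap = TOOL_CALL_CAP.get(tool)
        # Per-tool quota is checked first, regardless of remaining units.
        if cap is not None and self._tool_calls[(principal, tool)] >= cap:
            return False, "tool quota exhausted"
        if self._units_spent[principal] + cost > self.units_per_principal:
            return False, "principal budget exhausted"
        self._units_spent[principal] += cost
        self._tool_calls[(principal, tool)] += 1
        return True, "ok"


budget = WindowBudget(units_per_principal=1000)

# A buggy agent tries to send 20 emails in one window.
email_results = [budget.charge("user-U", "send_email") for _ in range(20)]
emails_sent = sum(ok for ok, _ in email_results)

# Cheap calls are still allowed after the email quota is gone.
search_ok, _ = budget.charge("user-U", "search")
```

The email quota runs out after five calls while the principal still has plenty of units left for search, so a runaway loop on one tool does not take the whole assistant down with it.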

Idempotency is now a contract, not a feature

A pattern I keep seeing in postmortems: an agent gets a 502 from a backend, retries from the top of its planning loop instead of the failed call, and the backend ends up with two of whatever was being created. The fix is always the same — make the endpoint accept an Idempotency-Key header and store the result of the first attempt — and the response is always the same: "we'll add it to the backlog."

That backlog item should be a P1, because the absence of idempotency is no longer a latent risk. With human users, double-submit was a sometimes-thing that the user noticed. With agent users, retry-on-error is the default behavior of every agent framework on the market. Stripe figured this out a decade ago for payments because the cost of getting it wrong was money; backends that touch any kind of external state — sending a notification, creating a calendar event, modifying a record — are about to learn the lesson on their own time.
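The fix described above, accepting an Idempotency-Key and storing the result of the first attempt, can be sketched in a few lines. This assumes an in-memory dictionary and a single process; `IdempotentEndpoint` and `create_event` are hypothetical names. A production version would persist and expire keys, and would likely also verify that a retried request carries the same payload as the original.

```python
import threading
import uuid


class IdempotentEndpoint:
    """Replays the stored result when the same idempotency key is seen again."""

    def __init__(self, create_fn):
        self._create_fn = create_fn
        self._results = {}
        self._lock = threading.Lock()

    def handle(self, idempotency_key, payload):
        # Holding the lock across the create call keeps the sketch simple:
        # two concurrent retries cannot both run the side effect.
        with self._lock:
            if idempotency_key in self._results:
                return self._results[idempotency_key]
            result = self._create_fn(payload)
            self._results[idempotency_key] = result
            return result


created = []


def create_event(payload):
    """Hypothetical side effect: creates a calendar event."""
    event = {"id": f"evt-{len(created) + 1}", **payload}
    created.append(event)
    return event


endpoint = IdempotentEndpoint(create_event)

key = str(uuid.uuid4())  # the agent generates one key per logical operation
first = endpoint.handle(key, {"title": "Sync"})
retry = endpoint.handle(key, {"title": "Sync"})  # retry after a transient 502
```

The retry returns the original event instead of creating a second one. The important discipline on the agent side is that the key belongs to the logical operation, so a retry from the top of the planning loop has to reuse it rather than mint a new one.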

About Tian Pan
