Agent Traffic Is Not Human Traffic: Designing APIs for Two Species of Caller

· 11 min read
Tian Pan
Software Engineer

The API you shipped two years ago was designed for a single species of caller: a person, behind a browser or a mobile client, clicking once and waiting for a response. That assumption now describes only about half the traffic on any interesting endpoint. The other half is agents — your own, your customers', third-party integrations using your endpoints as tools — and they have different physics. They burst. They retry forever. They parallelize. They parse error strings literally. They act on behalf of a human who will not be available to clarify intent when something breaks.

Most of the production weirdness landing in postmortems this year traces back to one architectural mistake: treating both species as the same caller class. Rate limits sized for human pacing get blown apart by an agent's parallel fanout. Error messages designed to be human-readable get parsed wrong by an agent that retries forever on a 400. Idempotency assumptions that humans satisfy by default get violated when an agent retries the same payload from a recovered checkpoint. Auth logs lose the ability to distinguish "the user did this" from "the user's agent did this on the user's behalf."

The fix is not a smarter WAF or a bigger rate-limit bucket. It is a deliberate API design that names two caller classes, treats their traffic as different shapes, and records the delegation chain so accountability survives the indirection.

The two species have different traffic physics

A human session is a slow, sparse, mostly-coherent walk through a UI. A request rate of two or three per second is a power user. Errors get read; retries are manual; the next request usually depends on the last one in a way the user can articulate. Bursts come from page loads or pagination, not from autonomy.

An agent session is the opposite. A single autonomous task can chain ten to twenty sequential or parallel API calls — a tool lookup, a retrieval-augmented generation query, a permission check, a write, a verification read, a follow-up enrichment — all inside a few seconds. When the agent decides it needs to enrich a list of fifty items, you get fifty parallel calls in the same window. When the agent is interrupted and resumes, you may get the same call again with the same payload because its checkpoint did not record completion.

This is not a defect of the agent. It is the agent doing its job. But the gateway that enforces "100 requests per minute per API key" was sized against human pacing, and it now treats a legitimate agent workflow as abuse. Industry data from the past year suggests AI-driven traffic has crossed the 50% line on many public endpoints, and a noticeable share of organizations have admitted they are essentially blind to non-human traffic — they cannot tell which of their callers is a robot, much less whether the robot is theirs or somebody else's.

The first design move is to stop pretending the two distributions overlap. They do not. An agent's burst-then-silent shape and a human's steady drip are not noise around the same mean; they are different distributions, and a single rate-limit policy is the average of two things you should be measuring separately.

Caller-class identification belongs at the gateway

Before any other discipline can land, the gateway must know which species is calling. The cheapest mechanism is a typed credential: API keys or service tokens are tagged at issuance with a caller_class of human, agent, or service, and the gateway reads that tag on every request. A slightly richer mechanism is an OAuth claim — modern providers already distinguish user principals from service principals, and you can lift that distinction into a header your downstream services can read without re-doing auth.

The honest version of this design also captures a third axis: who the agent is acting for. An agent operating autonomously on behalf of an internal job is not the same caller as an agent operating on behalf of a specific customer. The OAuth idiom for this is the on-behalf-of (OBO) flow, and the audit-relevant fact is the pair (agent_id, human_principal_id) rather than either one alone. A token that says "this is agent X" is half the answer; the other half is "delegated by user Y at time Z with scope S."

A few practical consequences:

  • Caller-class header on every internal hop. The gateway extracts the class from the credential and forwards an X-Caller-Class (or equivalent) header to downstream services. Internal services should never have to re-parse the original token to know what shape of caller they are serving.
  • Cryptographically verifiable agent credentials. The 2026 reality is that user-agent strings get spoofed at scale. Recent reporting suggests a meaningful share of traffic claiming to be well-known AI agents does not originate from those operators' infrastructure. If your trust decisions hinge on "is this really ChatGPT," you need a signed credential, not a string match.
  • Provisioning and rotation matter. Agent credentials need to be issuable, scopable, and rotatable without human ceremony. A workflow where a human operator generates a key from a UI does not survive a fleet of a thousand agents.
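The gateway-side move can be sketched in a few lines. This is a minimal illustration, not a real gateway's API: the credential schema, the in-memory store, and the header names (`X-Caller-Class`, `X-Delegated-By`) are all assumptions standing in for whatever your credential issuance and routing layer actually provide.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Credential:
    key_id: str
    caller_class: str            # "human", "agent", or "service" — tagged at issuance
    delegated_by: Optional[str]  # human principal for on-behalf-of agents, else None

# Toy in-memory credential store, keyed by API key.
CREDENTIALS = {
    "key-human-1": Credential("key-human-1", "human", None),
    "key-agent-7": Credential("key-agent-7", "agent", delegated_by="user-42"),
}

def annotate_request(api_key: str, headers: dict) -> dict:
    """Resolve the credential and stamp caller-class headers for downstream hops."""
    cred = CREDENTIALS.get(api_key)
    if cred is None:
        raise PermissionError("unknown credential")
    headers = dict(headers)
    headers["X-Caller-Class"] = cred.caller_class
    if cred.delegated_by:
        # The audit-relevant pair: (agent_id, human_principal_id).
        headers["X-Delegated-By"] = cred.delegated_by
    return headers
```

Downstream services read the stamped headers and never touch the original token, which is the property the bullet above is asking for.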

Rate limits, idempotency, and error envelopes — designed for the burst-and-retry shape

Once the caller class is legible, three policy surfaces start making sense.

Rate limits sized to consumption, not requests. The unit "requests per minute" was a proxy for cost back when requests were cheap and roughly uniform. An agent that hits a single retrieval endpoint can cost orders of magnitude more compute than a human GET on the same path, because the retrieval call fans out across an embedding model, a vector store, a re-ranker, and a generation step. Token-based limits — counting input tokens, output tokens, or compute units — track the actual resource consumption. Pair them with burst capacity that explicitly permits the agent's fanout shape: a sustained limit and a short-window burst limit are not the same policy.
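The sustained-plus-burst pairing can be made concrete with a small cost-based limiter. A sketch under stated assumptions: fixed windows rather than sliding ones, an injected clock for testability, and "cost" as an abstract compute-unit figure your metering layer would supply.

```python
import time

class DualWindowLimiter:
    """Cost-based limiter with a sustained window and a short burst window.
    Units are arbitrary 'compute units' (e.g. tokens), not request counts."""

    def __init__(self, sustained_limit, sustained_secs,
                 burst_limit, burst_secs, clock=time.monotonic):
        self.sustained_limit, self.sustained_secs = sustained_limit, sustained_secs
        self.burst_limit, self.burst_secs = burst_limit, burst_secs
        self.clock = clock
        now = clock()
        self.sustained_start = self.burst_start = now
        self.sustained_used = self.burst_used = 0.0

    def allow(self, cost: float) -> bool:
        now = self.clock()
        # Reset each window independently as it expires.
        if now - self.sustained_start >= self.sustained_secs:
            self.sustained_start, self.sustained_used = now, 0.0
        if now - self.burst_start >= self.burst_secs:
            self.burst_start, self.burst_used = now, 0.0
        if self.sustained_used + cost > self.sustained_limit:
            return False  # over the long-run budget
        if self.burst_used + cost > self.burst_limit:
            return False  # fanout exceeded the short-window allowance
        self.sustained_used += cost
        self.burst_used += cost
        return True
```

Note the two refusals are distinct conditions: an agent can be within its sustained budget and still trip the burst window, which is exactly the case a single requests-per-minute bucket cannot express.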

Idempotency required, not optional. Humans achieve idempotency by accident — they see a spinner, they wait, they don't double-click. Agents achieve idempotency only if the protocol forces them to. Every mutating endpoint in an agent-callable surface should require an Idempotency-Key header, store the request hash and response on first execution, and replay the stored response on duplicate keys. The cost is one row in a key-value store with a short TTL; the benefit is that an agent's recovered checkpoint cannot quietly create the same order twice. Make this required for caller_class=agent and optional for caller_class=human if you want to keep the human surface unchanged.
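The store-hash-and-replay mechanic described above fits in one small class. A minimal sketch, assuming an in-memory dict stands in for the key-value store and that payloads are JSON-serializable; the `ValueError` here models the typed `invalid_idempotency_key` error a real endpoint would return.

```python
import hashlib
import json
import time

class IdempotencyStore:
    """Replay stored responses on duplicate Idempotency-Key values."""

    def __init__(self, ttl_seconds=86400, clock=time.monotonic):
        self._rows = {}  # key -> (payload_hash, response, stored_at)
        self._ttl = ttl_seconds
        self._clock = clock

    def execute(self, key: str, payload: dict, handler):
        digest = hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode()
        ).hexdigest()
        row = self._rows.get(key)
        if row is not None and self._clock() - row[2] < self._ttl:
            stored_hash, response, _ = row
            if stored_hash != digest:
                # Same key, different payload: a protocol violation, not a retry.
                raise ValueError("invalid_idempotency_key")
            return response  # replay; the handler is never re-executed
        response = handler(payload)
        self._rows[key] = (digest, response, self._clock())
        return response
```

The payload-hash check is the part worth keeping even if everything else changes: replaying a stored response for a key that arrives with a *different* body would silently hide a real bug in the agent's checkpointing.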

Error responses an agent can act on. A 400 with a string body that says "we couldn't process this request, please try again later" is a trap. The agent will try again, and again, because the prose suggests transience but the status code says "your request is wrong." RFC 9457 (Problem Details for HTTP APIs) defines a structured envelope — type, title, status, detail, plus extension fields — that is parseable by an agent without an LLM in the loop. The minimum upgrade is to put a machine-actionable next step in the response: a typed error code (invalid_idempotency_key, rate_limited_burst_window, requires_user_consent), a retry_after if the error is genuinely transient, and a retry: false if it is not. An agent that knows "this is permanent, do not retry" stops blowing up your error budget.
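A small builder makes the envelope shape concrete. The `code`, `retry`, and `retry_after` members are extension fields, which RFC 9457 explicitly permits; the URI namespace under `type` is illustrative.

```python
def problem(status: int, code: str, title: str,
            detail: str = None, retry: bool = False, retry_after: int = None) -> dict:
    """Build an RFC 9457 problem-details body with typed retry hints."""
    body = {
        "type": f"https://example.com/errors/{code}",  # illustrative URI namespace
        "title": title,
        "status": status,
        "code": code,       # extension: machine-matchable error identifier
        "retry": retry,     # extension: explicit "is this worth retrying at all"
    }
    if detail is not None:
        body["detail"] = detail
    if retry_after is not None:
        body["retry_after"] = retry_after  # extension: seconds, mirrors Retry-After
    return body
```

Serve it with `Content-Type: application/problem+json` and an agent can branch on `code` and `retry` with a string comparison, no LLM in the loop.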

The Retry-After header deserves its own line. When you set it, agents that follow the spec will honor it and your storm dies down. When you don't, agents pick their own backoff, and a thundering-herd retry from a thousand callers becomes a self-inflicted DDoS. Jitter on the agent side helps; an explicit Retry-After from the server side is what makes the whole system convergent.
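The agent-side half of that convergence is a one-liner worth writing down. A sketch of the standard pattern: server-provided Retry-After wins outright, and in its absence the client falls back to full-jitter exponential backoff (the base, cap, and injectable `rng` are illustrative parameters, not anyone's published defaults).

```python
import random

def next_delay(attempt: int, retry_after: float = None,
               base: float = 0.5, cap: float = 30.0,
               rng=random.random) -> float:
    """Seconds to sleep before the next retry.

    A server-supplied Retry-After value is authoritative; otherwise use
    full-jitter exponential backoff so a thousand callers do not
    synchronize into a thundering herd."""
    if retry_after is not None:
        return float(retry_after)
    return rng() * min(cap, base * (2 ** attempt))
```

Pair this with the `retry: false` signal from the error envelope: when the server says the failure is permanent, the right delay is no retry at all.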

Audit and observability: the delegation chain has to survive

The hard part of agent observability is not "log more." It is "log enough to reconstruct who decided what." For a human-only API, the audit log answers one question: which user did this? For an agent-mediated API, it has to answer three: which user authorized this, which agent acted on the user's behalf, and which model and prompt produced the decision.

A practical audit record for an agent-driven mutation looks something like:

  • actor.human_principal_id — the user whose authority is being exercised
  • actor.agent_id — the agent identity that issued the call
  • actor.delegation_id — the OBO grant or session that ties them together, with scope and expiry
  • actor.model_version and actor.prompt_id — what was actually deciding (frozen, not aliased)
  • request.idempotency_key, request.parent_trace_id — for replay and chain-of-custody
  • decision.policy_version — which version of the authorization policy approved this
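Typed as a record, the shape above is small. A minimal sketch: field names follow the bullets, flattened into one frozen dataclass, and everything here is an illustrative schema rather than any standard's required field set.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class AgentAuditRecord:
    """One immutable audit row per agent-driven mutation."""
    human_principal_id: str  # actor: whose authority is being exercised
    agent_id: str            # actor: the agent identity that issued the call
    delegation_id: str       # actor: OBO grant/session, with scope and expiry
    model_version: str       # actor: what was deciding (frozen, not aliased)
    prompt_id: str
    idempotency_key: str     # request: for replay
    parent_trace_id: str     # request: chain-of-custody link
    policy_version: str      # decision: which authz policy version approved this
```

Frozen matters: an audit record that application code can mutate after the fact is not an audit record.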

When something goes wrong six weeks later, you want to be able to answer "did the user click the thing, or did the agent decide based on a prompt-injected document," and the only way to answer is to have recorded both. The current generation of audit-logging guidance for agentic systems is converging on this shape: trigger identity (the human who initiated the workflow run) is the chain-of-custody anchor, and every downstream action references its delegation lineage so an auditor can walk back to the originating principal.

Observability dashboards have to make the same split. Latency SLOs that average human and agent traffic are calibrated against neither. Anomaly detection that does not know agent traffic is bursty by nature will fire on every legitimate workflow. The minimum useful split is two top-level dashboards — one per caller class — with the per-endpoint cardinality you already have. Most teams discover the split was free; they were already emitting the caller-class label, they just were not grouping by it.

Agents are the new inside attacker

An agent compromised by a prompt injection is not a confused user. It is an inside attacker with the user's full scope, executing tools the user is authorized to call, with a credential the user delegated. OWASP's 2025 list put prompt injection at the top for a reason: it shows up in the majority of audited LLM deployments, and the agent layer is exactly where it cashes out.

The API design implication is that least privilege has to extend down to the agent's credential, not stop at the user's. A user who can read and write across their entire workspace is fine for a UI session — the human is in the loop on every action. The same scope handed to an agent means a single injected document can issue a delete that the user never intended. Two changes flow from this:

  • Agent credentials should be narrower than the user's. Issue a per-task delegation with scope limited to the resources the task actually needs. The OBO flow with explicit scope and short TTL is the right shape. The audit record then reflects that the agent only had the scope it was supposed to have, which both contains the blast radius and makes incident response tractable.
  • High-risk endpoints should require a human-in-the-loop confirmation step that the agent cannot satisfy alone. Wire transfers, account deletions, permission grants — these belong behind a confirmation token that an agent can request but only a human can complete. A 4xx response with a requires_human_confirmation typed error code, and a redirect URL the user can resolve, is a perfectly reasonable answer.
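The confirmation-gate bullet reduces to a guard an endpoint can call before executing. A sketch under stated assumptions: the function names, the confirmation URL pattern, and the error-body fields are all illustrative, and a real version would mint a signed, expiring confirmation token rather than echo the action id.

```python
def guard_high_risk(caller_class: str, action_id: str, confirmed: bool = False):
    """Return (status, body) for a high-risk mutation.

    Agent callers get a typed 403 they can surface to the user; they
    cannot satisfy the confirmation step themselves."""
    if caller_class == "agent" and not confirmed:
        return 403, {
            "type": "https://example.com/errors/requires_user_consent",
            "title": "Human confirmation required",
            "status": 403,
            "code": "requires_human_confirmation",
            "retry": False,  # permanent for the agent; only a human can unblock it
            "confirmation_url": f"https://example.com/confirm/{action_id}",
        }
    return 200, {"executed": action_id}
```

The `retry: False` field does double duty here: it tells a well-behaved agent that looping on this endpoint is pointless, and the `confirmation_url` gives it the one productive move left, handing the decision back to the human.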

The audit chain matters here too. "The agent did it" is not a postmortem answer. "The agent did it under delegation D, prompted by document X retrieved from store Y at time Z, against model version M" is, and only a delegation chain recorded with enough fidelity to replay the decision gets you there.

The architectural realization

The API you ship today is serving two species of caller with different traffic shapes, different failure modes, different security threats, and different recovery patterns. Pretending they are one is the source of every "why is this endpoint suddenly so flaky" investigation, every rate-limit incident that started as a feature launch, every audit gap that surfaces a quarter after a customer asks who deleted what.

The work is not glamorous. A caller-class header at the gateway, idempotency keys on mutating endpoints, RFC 9457 error envelopes with typed codes and retry hints, audit records that capture the full delegation chain, dashboards split by caller class, and least-privilege scopes on agent credentials. None of it is novel; all of it is correct. The teams that ship it now stop firefighting the agent traffic they already have. The teams that don't will keep discovering, one incident at a time, that "the API" became two APIs while they were not looking.
