Pagination Is a Tool-Catalog Discipline: Why Agents Burn Context on List Returns
Every well-designed HTTP API in your stack returns paginated results. Nobody loads a million rows into memory and hopes for the best. Yet the tools your agent calls return the entire list, and the agent dutifully reads it, because the function signature says list_orders() -> Order[] and the agent has no protocol for "give me the next page" the way a human user has scroll-and-load-more.
The agent burns tokens on rows it could have skipped. The long-tail customer with 50K records hits context-window failures the median customer never sees. The tool author cannot tell from the trace whether the agent needed all those rows or simply could not ask for fewer. And somewhere in your eval suite, the regression that would have flagged this never runs because every test fixture has fewer than 100 records.
Pagination is not a UI affordance. It is a load-shedding primitive — and the agent that consumes a tool without it is reimplementing every SELECT * FROM orders mistake the API designers in your company spent a decade learning to avoid.
The Function Signature Is Lying to the Model
A tool whose signature promises Order[] is a contract the agent has to honor on faith. The model sees an unbounded array type and a one-line description and assumes the implementation is reasonable. It is not the model's job to ask whether "reasonable" means 50 rows or 50,000.
What the agent actually receives, when it calls a tool that returns a list, is whatever the underlying API decided to give back. For most tools, that is the full result set, because the engineer who wrapped the API was thinking like an integration developer and not like a context budget owner. They wrote the function the way they would write any other server-side helper: take the inputs, return the data, let the caller deal with size.
The caller is now a probabilistic system whose attention drops off in the middle of long inputs and whose token bill scales linearly with whatever the tool decides to emit. That is a different caller. It needs a different contract.
The honest function signature is something like list_orders(cursor?: string, limit?: number) -> { items: Order[], next_cursor?: string, total?: number }. It says: I will give you a window onto the data, you choose the size, and you can come back for more. It treats the agent as a streaming consumer, not a batch receiver.
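Spelled out in TypeScript, that contract looks something like the sketch below. Order and the field names are illustrative, not any particular framework's API.

```typescript
// A minimal sketch of the paginated contract from above. The names
// are illustrative, not a real library's API.

interface Order {
  id: string;
  status: string;
  amount_cents: number;
}

interface Page<T> {
  items: T[];
  next_cursor?: string; // opaque continuation token; absent on the last page
  total?: number; // total matches, so the agent can plan before paging
}

// The tool hands the caller a window onto the data and lets it come
// back for more, instead of returning an unbounded Order[].
declare function list_orders(params: {
  tenant_id: string;
  cursor?: string;
  limit?: number; // default tuned to the agent's typical task, not the API max
}): Promise<Page<Order>>;
```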
What "Tools Return Whole Lists" Actually Costs
Consider a customer-success agent that needs to find a specific refund among a tenant's recent activity. It calls list_orders(tenant_id="acme-co"). For 90% of tenants, this returns 50–200 rows and the agent finds the refund in one or two reasoning steps. For the top 1% of tenants, this returns 50,000 rows, the tool response blows past the context budget, and the agent either truncates and loses the refund somewhere in the middle, or hits a hard provider error and returns nothing.
The team running aggregate eval metrics sees a 99% success rate and ships. The 1% segment opens support tickets that engineering cannot reproduce, because the staging tenant has 12 orders in it. The retrospective discovers that the failure rate inside that 1% is closer to 80%, which is the kind of segment-level catastrophe that hides perfectly inside a global average.
Even when the response fits, you are paying for it. A tool returning 20,000 tokens of JSON when the agent needed 200 is a 100x markup on every invocation, before you count the slower time-to-first-token, the eviction of useful prior context, and the increased likelihood that the model's attention will land on the wrong row. The cost is not just the tokens you bought; it is the tokens you needed to keep and could not.
The Datadog team writing about their MCP server found that switching from JSON to YAML cut tabular tool output by roughly 20%, and that paginating by token budget rather than by record count let them fit five times more records in the same context. The interesting part of that story is not the number; it is the realization that "page size" is not a count, it is a budget.
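A budget-based pager is small enough to sketch. The version below assumes a rough four-characters-per-token heuristic in place of a real tokenizer, which is the part you would swap for the model's actual tokenizer in production.

```typescript
// Sketch: page by token budget instead of record count. estimateTokens
// is a stand-in heuristic, not a real tokenizer.

function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4); // rough rule of thumb: ~4 chars per token
}

function packPage<T>(records: T[], tokenBudget: number): { page: T[]; rest: T[] } {
  const page: T[] = [];
  let spent = 0;
  for (const record of records) {
    const cost = estimateTokens(JSON.stringify(record));
    // Always include at least one record so the pager makes progress.
    if (page.length > 0 && spent + cost > tokenBudget) break;
    page.push(record);
    spent += cost;
  }
  return { page, rest: records.slice(page.length) };
}
```

A wide row costs more of the page than a narrow one, which is the point: the unit being conserved is context, not cardinality.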
A Pagination Convention for the Tool Catalog
If pagination is going to work as a load-shedding primitive, it has to be a convention, not a per-tool decision. Tools that return lists need a shared protocol the model can learn once and apply everywhere. That means cursor and limit parameters with documented semantics, opaque cursor strings the model is not tempted to decode, and a default limit that is tuned to the agent's typical task — not the API's max.
A few things tend to break when the convention is missing.
Ad-hoc pagination per tool means the model has to reverse-engineer the page-token semantics of every list-returning function. Some tools call it cursor, others call it next_token, others return a page_id, and the model conflates them in the most embarrassing places. The MCP specification's choice to standardize on opaque cursor strings exists precisely because models hallucinate cursor formats when given the option — they will base64-decode an opaque token and try to increment whatever they find inside. A sketch of one way to enforce that opacity follows below.
Defaults that match the API's max page size mean the first call to the tool always blows the budget. The API's default of 1,000 is fine for an integration script that streams results to disk. It is wrong for an agent whose typical task only needs the first 20. The default limit should be set to what the agent normally needs, with the option to widen on demand.
A response with no top-level total count leaves the agent no way to plan. A response that returns 50 items alongside a total: 50000 field tells the agent it is on the wrong path before it pages 999 more times. A response that just returns 50 items lets the agent confidently iterate into a wall.
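One way to make cursor opacity enforceable rather than aspirational is to never put decodable state in the token at all. A sketch, using an in-memory map where a real server would use Redis or a database table with a TTL:

```typescript
// Sketch: hand the model a random key and keep the real paging state
// server-side. The in-memory Map is for brevity; real cursors need a
// shared store and an expiry.

import { randomUUID } from "node:crypto";

interface CursorState {
  offset: number;
  filters: Record<string, string>;
}

const cursorStore = new Map<string, CursorState>();

function issueCursor(state: CursorState): string {
  const key = randomUUID();
  cursorStore.set(key, state);
  return key; // nothing inside for the model to decode or increment
}

function resolveCursor(key: string): CursorState {
  const state = cursorStore.get(key);
  if (!state) throw new Error("invalid or expired cursor");
  return state;
}
```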
Summarize First, Drill In Second
The deeper pattern beneath cursor-and-limit is what some teams have started calling "summarize-then-drill-in." The first call against a list-returning tool should not return rows at all. It should return counts, aggregates, and shape: how many records exist, what their distribution looks like across the dimensions the agent might filter on, what the smallest useful slice is. The second call retrieves the rows the agent actually picked.
This is the same shape as the SQL EXPLAIN-then-SELECT discipline a senior engineer would use against an unfamiliar table. You do not start by reading every row; you start by understanding the shape of the data, then you ask a question that is small enough to be useful.
For an agent, this turns into a tiered tool catalog. count_orders(filters) returns the cardinality. summarize_orders(filters, group_by) returns the buckets. list_orders(filters, cursor, limit) returns the rows, and only after the agent has narrowed the filters enough that the result will be useful. The Datadog team noted that adding a SQL-like query interface — letting the agent ask for specific fields and rows rather than pulling samples and inferring trends — made evaluations roughly 40% cheaper, because the agent stopped fishing and started asking.
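From the agent's side, the tiers compose into a short decision loop. All three signatures in the sketch below are assumptions standing in for a real tiered catalog rather than quoting one.

```typescript
// Sketch of the summarize-then-drill-in flow. All three declarations
// are assumed signatures, not a real catalog's API.

declare function count_orders(p: { tenant_id: string; status?: string }): Promise<{ count: number }>;
declare function summarize_orders(p: { tenant_id: string; group_by: string }): Promise<{ key: string; count: number }[]>;
declare function list_orders(p: { tenant_id: string; status?: string; limit?: number }): Promise<{ items: object[] }>;

async function find_refunded_orders(tenant_id: string) {
  // Tier 1: cardinality. No rows committed to context yet.
  const { count } = await count_orders({ tenant_id });
  if (count > 200) {
    // Tier 2: shape. Find a slice small enough to read.
    const buckets = await summarize_orders({ tenant_id, group_by: "status" });
    const refunded = buckets.find((b) => b.key === "refunded");
    if (refunded && refunded.count > 200) {
      throw new Error("still too broad: narrow by date range before listing");
    }
  }
  // Tier 3: rows, and only after the slice is known to be small.
  return list_orders({ tenant_id, status: "refunded", limit: 20 });
}
```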
The reason this pattern is worth the extra surface area is that it gives the agent something to reason against before it commits the context budget. The agent can say "this filter has 50,000 matches, that is too many, let me narrow further" before it pays for any of them. Without that intermediate step, the agent has only two states: blind and full.
Budget Checks Belong in the Runtime
Even with pagination conventions and a tiered catalog, an individual tool call can still blow the budget. The customer with the long display name, the order with the 200KB notes field, the trace with the deeply nested error payload — all of these can turn a "give me 20 rows" call into a 30,000-token response. The convention bounds the row count, not the row size.
The runtime that sits between the model and the tools is the right place to enforce the byte budget. A simple guardrail — if a tool's response exceeds N% of the remaining context, reject it and return a structured error telling the agent to narrow the request — keeps the worst case from leaking into the model. The error message is the important part. "Tool response was 28K tokens, exceeding the 8K per-call budget. Try reducing limit to 5 or filtering by status='refunded'" lets the agent recover. A bare "context exceeded" forces a guess.
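A minimal version of that guardrail fits in one function. The chars-per-token estimate and the error shape below are assumptions, not any provider's format.

```typescript
// Sketch of a per-call budget check in the runtime layer. The token
// estimate and the error shape are placeholder assumptions.

interface ToolBudgetError {
  error: "response_too_large";
  responseTokens: number;
  budgetTokens: number;
  hint: string; // the part that lets the agent recover instead of guess
}

function enforceBudget(
  toolName: string,
  result: unknown,
  budgetTokens: number
): unknown | ToolBudgetError {
  const responseTokens = Math.ceil(JSON.stringify(result).length / 4);
  if (responseTokens <= budgetTokens) return result;
  return {
    error: "response_too_large",
    responseTokens,
    budgetTokens,
    hint:
      `${toolName} returned ~${responseTokens} tokens, over the ` +
      `${budgetTokens}-token per-call budget. Reduce limit or add filters and retry.`,
  };
}
```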
The Memory Pointer pattern is a related idea worth knowing about: when a tool genuinely does need to return a large blob, it stores the blob in external state and returns a short reference key. Subsequent tools can operate on the pointer — search, summarize, extract — without ever putting the full blob into the model's context. The blob lives in the runtime, the model navigates by reference. This is how a streaming consumer behaves; it is not how a list_orders() -> Order[] consumer behaves.
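A sketch of the pointer half of that pattern, with an in-memory store standing in for whatever external state the runtime actually uses:

```typescript
// Sketch of the Memory Pointer pattern: the blob stays in runtime
// state; the model only ever sees a short reference plus a preview.

const blobStore = new Map<string, string>();
let blobCounter = 0;

function storeBlob(content: string): { ref: string; bytes: number; preview: string } {
  const ref = `blob:${++blobCounter}`;
  blobStore.set(ref, content);
  return { ref, bytes: content.length, preview: content.slice(0, 200) };
}

function searchBlob(ref: string, pattern: RegExp): string[] {
  const content = blobStore.get(ref);
  if (!content) throw new Error(`unknown ref: ${ref}`);
  // The search runs in the runtime; only matching lines reach the model.
  return content
    .split("\n")
    .filter((line) => pattern.test(line))
    .slice(0, 20);
}
```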
The runtime is also where you put the tool-result clearing logic. A tool call that returned 5,000 tokens of search results 12 turns ago is almost certainly not load-bearing for the agent's current decision. Replace it with a stub — "list_orders returned 200 items, summary preserved" — and free the tokens. Anthropic's published guidance on context engineering treats this as a default discipline rather than an optimization.
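The clearing logic itself can be a few lines. In the sketch below, the message shape is simplified to a generic turn record rather than any provider's wire format.

```typescript
// Sketch of tool-result clearing: results older than maxAge turns are
// swapped for a one-line stub, freeing the tokens they occupied.

interface ToolTurn {
  role: "tool";
  name: string;
  content: string;
  turnIndex: number;
}

function clearStaleToolResults(
  history: ToolTurn[],
  currentTurn: number,
  maxAge = 10
): ToolTurn[] {
  return history.map((turn) =>
    currentTurn - turn.turnIndex > maxAge
      ? { ...turn, content: `[${turn.name} result cleared; summary preserved]` }
      : turn
  );
}
```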
The Eval Slice Nobody Has Yet
If your eval suite does not have a long-tail-input slice, the pagination problem will not show up in CI. It will show up in support tickets, three months after the customer with 50K records signs up, and the team will spend a sprint trying to reproduce it on their staging tenant of 12.
The fix is not glamorous. Build a fixture set that includes tenants at the 50th, 95th, and 99.9th percentile of every list-returning dimension your agent touches: orders per tenant, files per repository, messages per thread, rows per table. Run the agent against all of them on every release. Track success rate per percentile, not just globally. The first time the 99.9th percentile drops by 30 points while the global average barely moves, you will have found a pagination regression — the kind that used to ship to production before anyone noticed.
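In CI terms, the fixture matrix can be as plain as the sketch below, where seedTenant and runAgentTask are hypothetical stand-ins for the team's own fixture loader and eval harness, and the slice sizes are illustrative.

```typescript
// Sketch of per-percentile eval reporting. seedTenant and runAgentTask
// are hypothetical; the slice sizes are illustrative, not benchmarks.

declare function seedTenant(orderCount: number): Promise<string>; // returns a tenant id
declare function runAgentTask(tenantId: string): Promise<boolean>; // true = task succeeded

const slices = [
  { name: "p50", orders: 120 },
  { name: "p95", orders: 2_500 },
  { name: "p99.9", orders: 50_000 },
];

async function evalByPercentile(runsPerSlice = 20): Promise<void> {
  for (const slice of slices) {
    const tenantId = await seedTenant(slice.orders);
    let passes = 0;
    for (let i = 0; i < runsPerSlice; i++) {
      if (await runAgentTask(tenantId)) passes++;
    }
    // Report per slice; a p99.9 collapse must not hide in a global mean.
    console.log(`${slice.name}: ${((100 * passes) / runsPerSlice).toFixed(0)}% pass`);
  }
}
```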
The other half of the discipline is tracing. The trace for a tool call needs to record both the count of items returned and the bytes consumed, separately, so you can see when the agent is paying for size rather than count. A tool that returns 10 items but 40,000 tokens is doing the wrong thing, and the only way to spot it across a fleet of agents is to have the trace data and the alerting on it.
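The trace record only needs two extra fields to make that query possible. A sketch of the shape and the alert predicate, with the per-item threshold as an assumption to tune:

```typescript
// Sketch: record items and bytes separately per tool call, then flag
// calls whose per-item token cost is out of line. Threshold is a guess.

interface ToolCallTrace {
  tool: string;
  itemsReturned: number;
  bytesReturned: number;
  approxTokens: number;
}

function flagSizeNotCount(
  traces: ToolCallTrace[],
  tokensPerItemLimit = 400
): ToolCallTrace[] {
  // "10 items but 40,000 tokens" shows up here even though the item
  // count alone looks healthy.
  return traces.filter(
    (t) => t.itemsReturned > 0 && t.approxTokens / t.itemsReturned > tokensPerItemLimit
  );
}
```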
Pagination Is the Contract, Not a Feature
The summary the team needs to internalize: an agent that consumes a list-returning tool without a pagination protocol is consuming an unsafe API. The fact that it works on staging is not evidence it is safe; it is evidence that staging happens to be small. The fact that the global accuracy metric is high is not evidence it is reliable; it is evidence that the average customer is small.
Pagination is the contract that says "the tool will hand you a window, you decide how big the window is, and you can come back for more." It is a load-shedding primitive that the API designers in your company already know how to write. The job is to extend that discipline to the tool catalog the agent calls — the one that today is full of list_* functions returning unbounded arrays because nobody on the team has been the on-call engineer for the customer with 50K records.
The agent that gets pagination right scales to whatever segment shows up. The one that does not is a product whose reliability is determined by the size of its customers, which is the same as saying its reliability is determined by who happens to file the next ticket.
- https://www.datadoghq.com/blog/engineering/mcp-server-agent-tools/
- https://blog.jetbrains.com/ruby/2026/02/rubymine-mcp-and-the-rails-toolset/
- https://modelcontextprotocol.io/specification/2025-03-26/server/utilities/pagination
- https://github.com/microsoft/mcp-for-beginners/blob/main/04-PracticalImplementation/pagination/README.md
- https://www.mrphilgames.com/blog/mcp-is-wasting-your-context-window
- https://arxiv.org/html/2511.22729v1
- https://redis.io/blog/context-window-overflow/
- https://platform.claude.com/cookbook/tool-use-context-engineering-context-engineering-tools
- https://github.com/vercel/ai/discussions/8193
- https://arxiv.org/html/2511.17006v1
- https://www.lunar.dev/post/why-is-there-mcp-tool-overload-and-how-to-solve-it-for-your-ai-agents
