The Share-Nothing Agent: Designing AI Agents for Horizontal Scalability
Your load balancer assigns an incoming agent request to replica 3. But the user's conversation history lives in memory on replica 7. Replica 3 has no idea what has happened in the last six turns, so it starts over, confuses the user, and your on-call engineer gets paged at 2 AM. You add sticky sessions. Now all requests for that user route to replica 7 forever. You've traded a correctness bug for a scalability ceiling.
This is the moment teams realize that "horizontal scaling" for AI agents is not the same problem as horizontal scaling for web servers. The fixes are different, and the naive paths fail in predictable ways.
The root cause is implicit state. Most agent implementations accumulate context inside the process — conversation history, scratch files, intermediate tool outputs, checkpoint data — with no plan for what happens when that process disappears or a second process tries to handle the same user. This is fine for a prototype. It breaks badly in production.
The fix comes from distributed systems, not AI research: the share-nothing architecture principle. Design each agent replica as if it has no memory of prior work. Bring all required context in through external systems. Write all output back to those systems. Make every operation safely retryable. When you do this consistently, you get proper horizontal scalability — add replicas, distribute load, handle failures gracefully. This post walks through how to get there.
Why Agents Accumulate State (and Why That Matters)
LLMs are stateless by design. Every inference call takes a complete prompt and produces a response. There is no hidden session object inside the model. This is why LLM providers can serve millions of users: each request is independent.
Agents break this property. The agent framework wraps LLM calls with a control loop that maintains:
- Conversation history: the sequence of turns needed for context
- Tool outputs: results from previous calls that inform current reasoning
- Working files: documents, logs, data files written during execution
- Checkpoint data: enough state to resume if the loop is interrupted
When this data lives in process memory or on the local filesystem, you've created a stateful process. Adding a second replica doesn't help — it doesn't have access to the first replica's memory. You need sticky sessions, which means one replica handles all requests from a given user or session. You've bought yourself operational headaches with no scaling benefit.
The problem compounds as agents get more capable. An agent handling a 20-step research task might accumulate hundreds of kilobytes of intermediate context. An agent that's been running for 90 minutes has built up tool outputs, partial results, and reasoning steps that exist nowhere except on the machine where it's running. When that machine restarts, everything is lost.
The Share-Nothing Principle Applied to Agents
Share-nothing is a distributed systems architecture where each node operates without assumptions about shared state. Nodes don't share memory. They don't share local disk. They coordinate only through explicit message passing — APIs, queues, databases. The classic implementation is the web request: a request arrives at any server in the pool, that server fetches whatever it needs from a database, processes the request, writes results back, and returns. The next request can go to any server.
For agents, the equivalent principle is: every agent invocation must be able to run on any available replica. This requires three things:
All context must live outside the process. Conversation history, tool results, and checkpoint data go into external storage — Redis for fast reads, databases for durability, object storage for large artifacts. The agent fetches what it needs at the start of each turn and writes results back at the end.
Context reconstruction must be possible from metadata. You shouldn't need to replay the entire history to resume mid-task. Instead, store metadata about each step — what happened, what decisions were made, pointers to stored artifacts — so a new replica can reconstruct sufficient context in milliseconds.
All tool calls must be safely retryable. When a tool call fails partway through, or when an agent restarts from a checkpoint, it may call the same tools again. Tools that are not idempotent corrupt state on retry.
These three requirements can be addressed independently, but you need all three to achieve full horizontal scalability.
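Put together, a single turn on any replica becomes a fetch–run–write cycle with no state surviving in the process between turns. A minimal sketch, where `store` is a stand-in for an external session store (Redis, a database) and `call_llm` is a placeholder for the model call — both names are illustrative, not from any particular framework:

```python
# Sketch of a share-nothing turn: all state enters and leaves
# through an external store. `store` stands in for Redis/DB;
# `call_llm` is a placeholder for the stateless model call.

def run_turn(store: dict, session_id: str, user_message: str, call_llm) -> str:
    # 1. Fetch all required context from external storage.
    history = store.get(session_id, [])

    # 2. Run the stateless model call on a complete prompt.
    prompt = history + [{"role": "user", "content": user_message}]
    reply = call_llm(prompt)

    # 3. Write all output back before returning, so any
    #    replica can handle the next turn.
    store[session_id] = prompt + [{"role": "assistant", "content": reply}]
    return reply
```

Because nothing survives in the function's own scope, any replica holding a handle to the same store can serve the next turn.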
External State: The Three Tiers
Production agent systems typically organize external state into three tiers based on access speed and cost.
Hot context (Redis): The current turn's working context — conversation history for the active session, recent tool outputs, the agent's current plan. Redis lookups run in 10–50ms, which matters when you're targeting sub-200ms response times for interactive agents. This is the data that every turn needs immediately. Redis also provides TTL-based expiration, so session state cleans up automatically when a conversation ends.
Warm context (vector databases and relational stores): Historical context, retrieved facts, prior session summaries. This isn't needed on every call — only when the agent decides it needs background information. Fetching takes 50–200ms, which is acceptable as an on-demand operation. The access pattern here is retrieval-based: the agent queries for relevant prior context rather than loading everything.
Cold storage (object storage): Large artifacts — documents the agent processed, full tool output logs, archived session data. These are rarely accessed, and when they are, a few seconds of latency is fine. The agent stores a pointer in hot context rather than the artifact itself: instead of embedding a 50KB tool response in the context window, store it in S3 and keep a URL.
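The pointer pattern from the cold tier can be sketched as a size check at write time: small outputs stay inline in hot context, large ones move to cold storage and leave only a reference behind. Plain dicts stand in for Redis (`hot`) and object storage (`cold`), and the threshold is an illustrative value, not a recommendation:

```python
import hashlib

# Sketch of the pointer pattern: large tool outputs go to cold
# storage; hot context keeps only a small reference. Plain dicts
# stand in for Redis (`hot`) and object storage (`cold`).

SIZE_THRESHOLD = 4096  # bytes; tune against your context-window budget

def record_tool_output(hot: dict, cold: dict, session_id: str, output: str) -> dict:
    payload = output.encode("utf-8")
    if len(payload) <= SIZE_THRESHOLD:
        entry = {"inline": output}
    else:
        # Content-addressed key; a real system would store an S3 URL here.
        key = hashlib.sha256(payload).hexdigest()
        cold[key] = output
        entry = {"ref": key, "size": len(payload)}
    hot.setdefault(session_id, []).append(entry)
    return entry
```

The context window then carries a short reference instead of the full artifact, and any replica can dereference it on demand.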
The practical split between these tiers varies by use case. Interactive agents with many short sessions use Redis heavily and rarely touch object storage. Batch processing agents running 30-minute workflows use more relational storage for durability and more object storage for intermediate artifacts.
The key design discipline: nothing important lives only in process memory. When an agent writes a file to /tmp, that's technical debt. When tool results accumulate only in the in-memory conversation object, that's a scaling liability.
Idempotent Tool Design
Agents retry tool calls more often than engineers expect — somewhere between 15% and 30% of calls get retried due to timeouts, validation errors, or model uncertainty. When an agent is restarted from a checkpoint and resumes a partially completed workflow, it will re-call tools it already called. When a network error drops the response after the server-side operation completed, the agent doesn't know the call succeeded.
Non-idempotent tools make these scenarios dangerous. A tool that creates a database record, sends an email, or charges a credit card will corrupt state when retried. The consequences range from duplicate data to financial errors.
The fix is idempotency keys, the same pattern used in payment APIs. The agent generates a unique key for each logical tool call — typically a hash of the session ID, the step number, and the tool's input parameters. The tool server checks whether it has already processed a request with that key. If yes, it returns the stored result without re-executing. If no, it executes, stores the result, and returns it.
This provides two properties: safety under retry (the operation runs at most once, even if the caller sends it multiple times) and an audit trail (every completed tool call is recorded with its result, which helps enormously during debugging).
The idempotency key also solves a subtler problem: model hallucination. Language models occasionally forget they've already called a tool and try to call it again in the same turn. With idempotency keys, this is handled transparently — the second call returns the same result as the first without any new side effects.
A practical note on classification: most tools break into three categories. Read-only tools are naturally idempotent — query a database, fetch a URL, read a file. Write tools that create resources can be made idempotent by checking for existence first or by using idempotency keys. A small fraction of tools have irreversible effects (sending an external message, for example) that require explicit idempotency infrastructure. Start with the first two categories and handle the third with explicit key management.
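The check-execute-store cycle described above can be sketched in a few lines. This is a simplified single-process version, with a dict standing in for the tool server's durable result store; a real implementation needs an atomic check-and-set (e.g. Redis `SETNX` or a unique database constraint) to avoid a race between the check and the write:

```python
import hashlib
import json

# Sketch of server-side idempotency: the key is derived from the
# session, step, and canonicalized input parameters; completed
# results are cached so retries never re-execute. `results`
# stands in for a durable store.

def idempotency_key(session_id: str, step: int, params: dict) -> str:
    canonical = json.dumps(params, sort_keys=True)
    return hashlib.sha256(f"{session_id}:{step}:{canonical}".encode()).hexdigest()

def execute_once(results: dict, key: str, operation):
    if key in results:        # already processed: return the stored
        return results[key]   # result, no new side effects
    result = operation()      # first time: execute the real operation
    results[key] = result     # record it — this is also the audit trail
    return result
```

Note that the stored results double as the audit trail mentioned above: every completed call is recorded with its key and outcome.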
Reproducible Context Reconstruction
State migration between replicas fails at scale. Serializing a full agent state object and shipping it across the network is slow, fragile, and expensive. A better approach is context reconstruction: store enough metadata that any replica can rebuild sufficient context for the current turn without migrating the full state.
This looks like event sourcing in practice. After each major step, the agent appends an event to a durable log: what decision was made, which tools were called and returned what, what the current plan is, pointers to artifacts. The event log is an immutable record that any replica can replay — not by re-executing tool calls, but by reading stored results and reconstructing the agent's current understanding.
When a new replica takes over a session, it doesn't start from scratch and it doesn't need the original replica's memory. It reads the event log, loads recent context from Redis, fetches any needed artifacts from object storage, and reconstructs a working context window. This typically takes a few hundred milliseconds — fast enough for a retry, slow enough that you don't want to do it on every turn.
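The event-log replay described above can be sketched as a fold over append-only metadata records. The event kinds and field names here are illustrative, not a standard schema — the point is that replay reads stored results rather than re-executing anything:

```python
# Sketch of context reconstruction from an event log. Events are
# append-only metadata records; replay rebuilds a working view of
# the session without re-executing any tool. Field names are
# illustrative assumptions.

def append_event(log: list, kind: str, **data) -> None:
    log.append({"step": len(log), "kind": kind, **data})

def reconstruct(log: list) -> dict:
    """Rebuild the context any replica needs for the next turn."""
    ctx = {"plan": None, "tool_results": {}, "artifacts": []}
    for event in log:
        if event["kind"] == "plan":
            ctx["plan"] = event["plan"]
        elif event["kind"] == "tool_result":
            # A stored result, not a re-execution.
            ctx["tool_results"][event["tool"]] = event["result"]
        elif event["kind"] == "artifact":
            ctx["artifacts"].append(event["ref"])
    return ctx
```

A replica taking over a session calls `reconstruct` on the log, then dereferences any artifact pointers it actually needs for the current turn.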
The practical implication: summarization is not just a token-saving optimization, it's a scalability enabler. As conversations grow, a raw replay of all prior turns consumes too many tokens and takes too long. Periodically summarizing the conversation into a compact representation — stored in the warm context tier — keeps reconstruction fast and keeps context windows bounded.
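One way to implement this is a threshold-triggered compaction pass: once history grows past a limit, older turns collapse into a single summary entry while recent turns stay verbatim. The thresholds are illustrative, and `summarize` is a placeholder for a model-backed summarization call:

```python
# Sketch of periodic summarization: once history exceeds a
# threshold, older turns collapse into one summary entry (which
# would be stored in the warm tier). `summarize` is a placeholder
# for a model call; thresholds are illustrative.

MAX_TURNS = 20    # compact once history exceeds this many turns
KEEP_RECENT = 6   # keep this many recent turns verbatim

def maybe_compact(history: list, summarize) -> list:
    if len(history) <= MAX_TURNS:
        return history
    old, recent = history[:-KEEP_RECENT], history[-KEEP_RECENT:]
    summary = summarize(old)  # compact representation of older turns
    return [{"role": "system",
             "content": f"Summary of earlier turns: {summary}"}] + recent
```

Run on every turn, this keeps both reconstruction time and context-window size bounded regardless of conversation length.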
Failure Modes That Are Invisible Without This Architecture
The most dangerous failure mode of stateful agents isn't a crash — it's silent quality degradation. When an agent accumulates incorrect assumptions in its memory, or when context drifts from reality over many turns, or when state synchronization fails between replicas, the result is not an error. No exception is thrown. No alert fires. The agent just starts producing worse results, and users start routing around it.
This is hard to detect precisely because it looks like model quality variation, not a system failure. Teams often attribute it to the model being non-deterministic, or to their prompt being imprecise, when the actual cause is corrupted state that's been accumulating for hours.
Share-nothing architecture prevents this class of failure because there's no persistent in-process state to corrupt. Each turn fetches clean context from external systems, runs the model, and writes results back. Context drift doesn't accumulate because context isn't accumulated by default — it's reconstructed explicitly.
A related failure is sticky session saturation. When all traffic from a high-volume user routes to one replica, that replica becomes a bottleneck. The user experiences high latency or timeouts. Other replicas sit idle. Load balancers are designed for stateless services; sticky sessions break the load balancing assumption and create uneven load distributions that are difficult to reason about.
The Practical Migration Path
Most teams don't start with share-nothing agents — they start with prototypes that become production systems before the architecture changes. Migration is possible incrementally.
Start by externalizing conversation history. Replace in-memory conversation state with a Redis-backed session store. This alone allows you to remove sticky sessions and run multiple replicas correctly. It's usually the highest-leverage change.
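This first migration step can be as small as the following sketch: a session store that serializes history to JSON and writes it with a TTL. The `client` only needs `get`/`setex`, which matches redis-py's `redis.Redis`; the stub-friendly interface and TTL value are assumptions, and the test below uses an in-memory stub so no server is required:

```python
import json

# Sketch of a Redis-backed session store for conversation history.
# `client` is anything exposing get/setex (redis-py's redis.Redis
# qualifies); the key prefix and TTL are illustrative choices.

SESSION_TTL = 3600  # seconds; idle sessions expire automatically

class SessionStore:
    def __init__(self, client):
        self.client = client

    def load(self, session_id: str) -> list:
        raw = self.client.get(f"session:{session_id}")
        return json.loads(raw) if raw else []

    def save(self, session_id: str, history: list) -> None:
        # SETEX gives TTL-based expiration, so ended conversations
        # clean themselves up.
        self.client.setex(f"session:{session_id}", SESSION_TTL,
                          json.dumps(history))
```

With history loaded and saved on every turn, any replica can serve any request and the sticky-session routing rule can be deleted.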
Next, add idempotency to write tools. Audit your tool set for tools that have side effects. Add idempotency key support to those tools and update the agent to generate and pass keys. This makes your agent resilient to retries and model hallucination.
Finally, add checkpoint events for long-running workflows. After each major decision point, write a checkpoint event to a durable log. This enables safe resume from mid-task and provides the audit trail needed for debugging.
Each step is independently valuable. The full share-nothing architecture is the destination, but the path is incremental.
What This Doesn't Solve
Share-nothing architecture solves horizontal scalability for agent compute. It does not solve the cost of external storage at scale — Redis clusters for millions of concurrent sessions get expensive. It doesn't solve the cold reconstruction problem: the first request after a long idle period requires fetching context from slower storage, which adds latency. And it doesn't solve the fundamental token cost of long conversations: even summarized context requires tokens, and very long interactions eventually hit practical limits.
These are real constraints, but they're engineering problems with known solutions — storage tiering, TTL policies, proactive cache warming, aggressive summarization. They're preferable to the scalability ceiling imposed by stateful agents, which has no known solution other than redesigning the architecture.
The stateless web server pattern took years to become the default for web applications. The share-nothing agent pattern is going through the same transition now. Teams that design for it from the start will find it straightforward; teams that retrofit it onto stateful systems will find it painful but necessary.
The principle is simple: build your agents like you build your APIs — stateless by default, with explicit external state management. Everything else follows from that.
