The Share-Nothing Agent: Designing AI Agents for Horizontal Scalability
Your load balancer assigns an incoming agent request to replica 3. But the user's conversation history lives in memory on replica 7. Replica 3 has no idea what has happened in the last six turns, so it starts over, confuses the user, and your on-call engineer gets paged at 2 AM. You add sticky sessions. Now all requests for that user route to replica 7 forever. You've traded a correctness bug for a scalability ceiling.
This is the moment teams realize that "horizontal scaling" for AI agents is not the same problem as horizontal scaling for web servers. The fixes are different, and the naive paths fail in predictable ways.
The root cause is implicit state. Most agent implementations accumulate context inside the process — conversation history, scratch files, intermediate tool outputs, checkpoint data — and never think about what happens when that process disappears or a second process tries to handle the same user. This is fine for a prototype. It breaks badly in production.
The fix comes from distributed systems, not AI research: the share-nothing architecture principle. Design each agent replica as if it has no memory of prior work. Bring all required context in through external systems. Write all output back to those systems. Make every operation safely retryable. When you do this consistently, you get proper horizontal scalability — add replicas, distribute load, handle failures gracefully. This post walks through how to get there.
Why Agents Accumulate State (and Why That Matters)
LLMs are stateless by design. Every inference call takes a complete prompt and produces a response. There is no hidden session object inside the model. This is why LLM providers can serve millions of users: each request is independent.
Agents break this property. The agent framework wraps LLM calls with a control loop that maintains:
- Conversation history: the sequence of turns needed for context
- Tool outputs: results from previous calls that inform current reasoning
- Working files: documents, logs, data files written during execution
- Checkpoint data: enough state to resume if the loop is interrupted
When this data lives in process memory or on the local filesystem, you've created a stateful process. Adding a second replica doesn't help — it doesn't have access to the first replica's memory. You need sticky sessions, which means one replica handles all requests from a given user or session. You've bought yourself operational headaches with no scaling benefit.
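The anti-pattern is easy to write by accident. A minimal sketch (class and field names are illustrative, and a string stands in for the LLM call) of the in-process version that forces sticky sessions:

```python
# Anti-pattern: conversation state lives in the replica's memory.
# A request routed to a different replica sees an empty history.
class StatefulAgent:
    def __init__(self):
        self.history = []  # conversation turns, lost on restart

    def handle(self, user_msg: str) -> str:
        self.history.append({"role": "user", "content": user_msg})
        reply = f"({len(self.history)} turns of context)"  # stand-in for an LLM call
        self.history.append({"role": "assistant", "content": reply})
        return reply

replica_a, replica_b = StatefulAgent(), StatefulAgent()
replica_a.handle("turn 1")
replica_a.handle("turn 2")
# Load balancer routes turn 3 to replica_b: the context is gone.
print(len(replica_b.history))  # 0 — replica_b knows nothing
```

Nothing here is wrong for a single process. It only fails when a second replica exists, which is why the bug tends to surface in production rather than in development.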
The problem compounds as agents get more capable. An agent handling a 20-step research task might accumulate hundreds of kilobytes of intermediate context. An agent that's been running for 90 minutes has built up tool outputs, partial results, and reasoning steps that exist nowhere except on the machine where it's running. When that machine restarts, everything is lost.
The Share-Nothing Principle Applied to Agents
Share-nothing is a distributed systems architecture where each node operates without assumptions about shared state. Nodes don't share memory. They don't share local disk. They coordinate only through explicit message passing — APIs, queues, databases. The classic implementation is the web request: a request arrives at any server in the pool, that server fetches whatever it needs from a database, processes the request, writes results back, and returns. The next request can go to any server.
For agents, the equivalent principle is: every agent invocation must be able to run on any available replica. This requires three things:
All context must live outside the process. Conversation history, tool results, and checkpoint data go into external storage — Redis for fast reads, databases for durability, object storage for large artifacts. The agent fetches what it needs at the start of each turn and writes results back at the end.
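The turn-level pattern looks like this sketch, where a dict stands in for the external store (Redis, a database) so the example is self-contained:

```python
# Share-nothing turn: fetch context from external storage, do the
# work, write everything back. Any replica can handle the next turn.
import json

store = {}  # stand-in for an external KV store such as Redis

def run_turn(session_id: str, user_msg: str) -> str:
    # 1. Bring all required context in from outside the process.
    history = json.loads(store.get(session_id, "[]"))
    history.append({"role": "user", "content": user_msg})

    # 2. Do the work (stand-in for the LLM + tool loop).
    reply = f"reply with {len(history)} messages of context"
    history.append({"role": "assistant", "content": reply})

    # 3. Write all output back before returning.
    store[session_id] = json.dumps(history)
    return reply

run_turn("s1", "turn 1")        # could run on replica 3
out = run_turn("s1", "turn 2")  # could run on replica 7
```

The function holds no state between calls; everything it needs arrives via the store, and everything it produces leaves via the store.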
Context reconstruction must be possible from metadata. You shouldn't need to replay the entire history to resume mid-task. Instead, store metadata about each step — what happened, what decisions were made, pointers to stored artifacts — so a new replica can reconstruct sufficient context in milliseconds.
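One way to sketch this is a per-step checkpoint record with a summary and artifact pointer, so a fresh replica rebuilds a compact context instead of replaying turns. The field names below are illustrative, not a standard schema:

```python
# Checkpoint metadata instead of full history: each step records what
# happened and where artifacts live, so a new replica can reconstruct
# working context without replaying every turn.
import json
import time

checkpoints = {}  # stand-in for durable external storage

def checkpoint(task_id, step, summary, artifact_ptr=None):
    entry = {
        "step": step,
        "summary": summary,        # what the agent did / decided
        "artifact": artifact_ptr,  # pointer, not the artifact itself
        "ts": time.time(),
    }
    checkpoints.setdefault(task_id, []).append(json.dumps(entry))

def resume(task_id):
    """Reconstruct a compact context string for a new replica."""
    steps = [json.loads(e) for e in checkpoints.get(task_id, [])]
    return "\n".join(f"step {s['step']}: {s['summary']}" for s in steps)

checkpoint("t1", 1, "searched docs", artifact_ptr="s3://bucket/results-1.json")
checkpoint("t1", 2, "drafted outline")
context = resume("t1")  # milliseconds, no full replay
```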
All tool calls must be safely retryable. When a tool call fails partway through, or when an agent restarts from a checkpoint, it may call the same tools again. Tools that are not idempotent corrupt state on retry.
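A common way to get retry safety is an idempotency key derived from the task, step, and arguments: a retry finds the recorded result and skips the side effect. A sketch, with a hypothetical `send_email` tool and a dict standing in for an external result store:

```python
# Idempotent tool calls via an idempotency key: retries return the
# recorded result instead of re-executing the side effect.
import hashlib
import json

results = {}  # stand-in for an external result store

def idempotency_key(task_id, step, tool, args):
    payload = json.dumps([task_id, step, tool, args], sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def call_tool(task_id, step, tool, args, execute):
    key = idempotency_key(task_id, step, tool, args)
    if key in results:       # retry after a crash or checkpoint resume:
        return results[key]  # return the prior result, don't re-run
    out = execute(**args)
    results[key] = out
    return out

calls = []
def send_email(to):          # hypothetical side-effecting tool
    calls.append(to)
    return f"sent to {to}"

call_tool("t1", 3, "send_email", {"to": "a@example.com"}, send_email)
call_tool("t1", 3, "send_email", {"to": "a@example.com"}, send_email)  # retry
# The side effect ran exactly once.
```

The key must include the step identifier: the same tool called with the same arguments at a *different* step is a new call, not a retry.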
Each of these requirements can be adopted on its own and delivers value on its own, but full horizontal scalability requires all three.
External State: The Three Tiers
Production agent systems typically organize external state into three tiers based on access speed and cost.
Hot context (Redis): The current turn's working context — conversation history for the active session, recent tool outputs, the agent's current plan. Redis lookups run in 10–50ms, which matters when you're targeting sub-200ms response times for interactive agents. This is the data that every turn needs immediately. Redis also provides TTL-based expiration, so session state cleans up automatically when a conversation ends.
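The TTL behavior is what makes the hot tier self-cleaning. The sketch below models it with a dict plus expiry timestamps so it is self-contained; in production this is a single `setex(key, ttl, value)` call on a Redis client:

```python
# TTL-based expiry for hot session state, as Redis SETEX provides.
# A dict with expiry timestamps stands in for Redis here.
import time

_hot = {}  # key -> (expires_at, value)

def set_hot(key, value, ttl_seconds):
    _hot[key] = (time.monotonic() + ttl_seconds, value)

def get_hot(key):
    entry = _hot.get(key)
    if entry is None:
        return None
    expires_at, value = entry
    if time.monotonic() >= expires_at:  # session over: auto-clean
        del _hot[key]
        return None
    return value

set_hot("session:42:history", "[...]", ttl_seconds=0.05)
assert get_hot("session:42:history") == "[...]"
time.sleep(0.06)
assert get_hot("session:42:history") is None  # expired and removed
```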
Warm context (vector databases and relational stores): Historical context, retrieved facts, prior session summaries. This isn't needed on every call — only when the agent decides it needs background information. Fetching takes 50–200ms, which is acceptable as an on-demand operation. The access pattern here is retrieval-based: the agent queries for relevant prior context rather than loading everything.
Cold storage (object storage): Large artifacts — documents the agent processed, full tool output logs, archived session data. These are rarely accessed, and when they are, a few seconds of latency is fine. The agent stores a pointer in hot context rather than the artifact itself: instead of embedding a 50KB tool response in the context window, store it in S3 and keep a URL.
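The store-a-pointer move can be sketched as follows, with a dict standing in for the object store; the bucket name, URI scheme, and the 4KB inline threshold are all illustrative:

```python
# Store-a-pointer pattern: large tool output goes to object storage,
# and only a small reference enters the context window.
import hashlib

object_store = {}  # stand-in for S3 / GCS

def store_artifact(data: bytes) -> str:
    key = hashlib.sha256(data).hexdigest()[:16]
    object_store[key] = data
    return f"s3://agent-artifacts/{key}"  # hypothetical bucket

def record_tool_result(context: list, raw_output: bytes):
    if len(raw_output) > 4096:  # threshold is illustrative
        context.append({"tool_result_ref": store_artifact(raw_output)})
    else:
        context.append({"tool_result": raw_output.decode()})

ctx = []
record_tool_result(ctx, b"x" * 50_000)    # 50KB response -> pointer
record_tool_result(ctx, b"small result")  # small -> inline
```

A useful side effect of content-addressed keys (hash of the data) is that storing the same artifact twice is itself idempotent.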
The practical split between these tiers varies by use case. Interactive agents with many short sessions use Redis heavily and rarely touch object storage. Batch processing agents running 30-minute workflows use more relational storage for durability and more object storage for intermediate artifacts.
The key design discipline: nothing important lives only in process memory. When an agent writes a file to /tmp, that's technical debt. When tool results accumulate only in the in-memory conversation object, that's a scaling liability.
