Multi-User Shared Agent State: The Concurrency Primitives You Actually Need
Every agent tutorial starts with a single user, a single session, and a single context window. The agent reads state, reasons, acts, writes back. Clean. Deterministic. Completely wrong for anything teams actually use.
Real collaborative products—shared planning boards, multi-user support queues, document co-pilots, team project assistants—require multiple users to interact with the same agent simultaneously. When two people give the agent contradictory instructions within the same second, one of their changes disappears. The agent doesn't tell them. It doesn't even know it happened.
This is the multi-user shared agent state problem, and it's a distributed systems problem dressed in an AI costume.
The Assumption That Breaks at Scale
The root cause is an implicit architecture decision made early in most agent frameworks: agent state is a single mutable object, and the session is the unit of isolation.
That design works perfectly for one user at a time. It breaks for shared workspaces because it treats the agent's working memory like a single-threaded variable. When User A reads the current planning doc state, modifies it via the agent, and writes back—and User B does the same thing 400 milliseconds later—B's write silently overwrites A's. No conflict notification, no merge logic, no error. The agent processed both instructions correctly, from its own perspective.
This is the classic last-write-wins problem. It's solved in distributed databases. It's largely unsolved in AI agent infrastructure. Most teams discover it in production when users complain that their changes "keep getting lost," and it takes weeks to connect that complaint to a race condition rather than a model failure.
The fix is not smarter prompting. The fix is treating agent state updates the same way you'd treat concurrent writes to a distributed datastore: with proper concurrency control.
Optimistic Locking: The Minimum Viable Safety Net
The cheapest correctness improvement is optimistic concurrency control. Instead of locking state before reading, you read freely and then validate at write time that nothing changed since you read it.
The mechanics are simple: every piece of agent state carries a version number. When User A reads the state, they get {..., version: 42}. When their agent action tries to write back, it includes that version number: "apply this change only if the current version is still 42." If User B's write already bumped the state to version 43, A's write is rejected, and the system can retry with fresh state.
In practice this means:
- The agent's tool calls include a `state_version` parameter alongside the actual update payload.
- The state store (Redis, Postgres, your document DB) performs a conditional write: `UPDATE agent_state SET ... WHERE id = ? AND version = ?` and checks that exactly one row was affected.
- On version mismatch, the agent retries the full reasoning step with the current state — not just the write.
That last part matters. If you only retry the write without re-reading and re-reasoning, you'll apply a stale decision to fresh state, which is arguably worse than last-write-wins. The retry must go all the way back to the context-building step.
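The full loop can be sketched as follows, using SQLite as a stand-in state store; the `agent_state` table and helper names are illustrative, and the string concatenation stands in for re-running the agent's actual reasoning step:

```python
# Optimistic concurrency sketch for agent state. Assumes a versioned row per
# state object; the "reasoning step" here is a placeholder transformation.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE agent_state (id TEXT PRIMARY KEY, doc TEXT, version INTEGER)")
conn.execute("INSERT INTO agent_state VALUES ('plan-1', 'initial plan', 42)")

def read_state(state_id):
    doc, version = conn.execute(
        "SELECT doc, version FROM agent_state WHERE id = ?", (state_id,)
    ).fetchone()
    return doc, version

def write_state(state_id, new_doc, expected_version):
    # Conditional write: succeeds only if nobody bumped the version since our read.
    cur = conn.execute(
        "UPDATE agent_state SET doc = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_doc, state_id, expected_version),
    )
    return cur.rowcount == 1  # exactly one row affected means the write was accepted

def apply_instruction(state_id, instruction, max_retries=3):
    for _ in range(max_retries):
        doc, version = read_state(state_id)          # re-read fresh state
        new_doc = f"{doc} | {instruction}"           # placeholder for re-running the reasoning step
        if write_state(state_id, new_doc, version):  # the retry loops back to the read
            return new_doc
    raise RuntimeError("too much contention; escalate instead of retrying forever")
```

Note that the retry in `apply_instruction` goes back to `read_state`, not just to `write_state`: the agent re-reasons over the fresh state before attempting the write again.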
The practical limit of optimistic locking is contention under high concurrency. If ten users are hammering the same state object simultaneously, retry rates become unacceptable. For most shared workspace use cases — where users coordinate more than they conflict — optimistic locking is sufficient. For hot objects with genuinely concurrent writes, you need a different data model.
Event Sourcing: Immutability as the Foundation
The more durable solution is to stop treating agent state as a mutable object at all. Instead, model every state transition as an appended event in an immutable log.
Under event sourcing, when User A's agent instruction executes, it doesn't overwrite a document. It appends an event: {type: "section_updated", author: "user_a", timestamp: ..., payload: {...}}. User B's concurrent action appends its own event. The current state is always derived by replaying the event stream, not by reading a single record.
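A minimal sketch of that shape, with illustrative event types and a deliberately tiny reducer:

```python
# Event-sourced state sketch: state is derived by replaying an append-only
# log, never by mutating a record in place. Event shapes are illustrative.
from dataclasses import dataclass

@dataclass
class Event:
    type: str
    author: str
    payload: dict

class EventStream:
    def __init__(self):
        self.events: list[Event] = []  # immutable, append-only log

    def append(self, event: Event) -> None:
        self.events.append(event)

    def current_state(self) -> dict:
        # Derive state by replaying every event, in order.
        state: dict = {"sections": {}}
        for e in self.events:
            if e.type == "section_updated":
                state["sections"][e.payload["section"]] = e.payload["text"]
        return state

stream = EventStream()
stream.append(Event("section_updated", "user_a", {"section": "goals", "text": "ship v1"}))
stream.append(Event("section_updated", "user_b", {"section": "risks", "text": "scope creep"}))
# Both users' updates survive as distinct events; neither silently overwrites the other.
```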
This has several properties that matter directly for multi-user agents:
Conflicts become visible and deferrable. When two events arrive that affect the same state, you can detect the conflict explicitly rather than silently dropping one. You can surface it to users, queue it for a human to resolve, or apply a domain-specific merge strategy.
Replay enables debugging. When a user's change seems to have disappeared, you can replay the event stream and show exactly which subsequent event overwrote it. This transforms a mystery into a log entry.
Partial failures are recoverable. If an agent action writes an event but then crashes before updating derived state, you can reconstruct by replaying. Nothing is lost.
The standard implementation pattern: each agent maps to an aggregate in domain-driven design terms. Its event stream is the source of truth. Projections maintain read-optimized views for the agent's context window. Writes go through an event store with optimistic concurrency at the stream level — if two concurrent appends target the same stream at the same expected version, one wins and one retries.
The failure mode to watch for is unbounded event log growth. If every agent action — including intermediate reasoning steps and tool call retries — is stored as an event, the stream can become expensive to replay. Common mitigation: snapshot the aggregate state periodically, and only replay from the most recent snapshot.
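The snapshot mitigation can be sketched like this, assuming a simple reducer; the event shape and `rebuild` helper are illustrative:

```python
# Snapshot sketch: periodically persist the derived state plus the event
# offset it covers, then replay only the tail of the log.

def apply(state: dict, event: dict) -> dict:
    # Domain-specific reducer: here, each event sets one key on the state.
    new_state = dict(state)
    new_state[event["key"]] = event["value"]
    return new_state

def rebuild(events, snapshot=None):
    """Replay from the latest snapshot instead of from event zero."""
    if snapshot is None:
        state, start = {}, 0
    else:
        state, start = dict(snapshot["state"]), snapshot["offset"]
    for event in events[start:]:
        state = apply(state, event)
    return state

events = [{"key": "a", "value": i} for i in range(10_000)]
snapshot = {"state": rebuild(events[:9_000]), "offset": 9_000}  # taken periodically
state = rebuild(events, snapshot)  # replays only the last 1_000 events
```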
CRDTs for Inherently Mergeable State
Some agent state has a structure that makes merging mathematically safe without conflict detection at all. This is where conflict-free replicated data types (CRDTs) become applicable.
A CRDT is a data structure designed so that all concurrent modifications can be merged into a consistent final state regardless of the order in which they arrive. Collaborative editors like Apple Notes use CRDTs for sync, and Figma's multiplayer system uses a CRDT-inspired design for document elements. (Google Docs, the example usually cited here, actually relies on operational transformation.) The same property is useful when multiple users are updating agent state that has set or counter semantics.
Concretely: if your shared agent's state includes a list of active tasks that users can add to concurrently, model it as a grow-only set (G-Set). Any user can add a task, and all adds are automatically merged. No conflicts are possible because deletes don't exist in the model — soft deletion is handled at the application layer by tracking a separate set of "removed" items (this becomes a 2P-Set).
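A 2P-Set is small enough to sketch in full; the task strings are illustrative:

```python
# 2P-Set sketch: adds and removes are each grow-only sets, so concurrent
# replicas merge by set union with no conflict detection. A removed element
# can never be re-added, which is the 2P-Set's well-known limitation.

class TwoPhaseSet:
    def __init__(self):
        self.added: set = set()
        self.removed: set = set()  # tombstones; soft deletion lives here

    def add(self, item) -> None:
        self.added.add(item)

    def remove(self, item) -> None:
        self.removed.add(item)

    def contains(self, item) -> bool:
        return item in self.added and item not in self.removed

    def merge(self, other: "TwoPhaseSet") -> "TwoPhaseSet":
        # Union of both grow-only sets: commutative, associative, and
        # idempotent, so merge order across replicas does not matter.
        merged = TwoPhaseSet()
        merged.added = self.added | other.added
        merged.removed = self.removed | other.removed
        return merged

# Two users' replicas diverge, then merge deterministically.
a, b = TwoPhaseSet(), TwoPhaseSet()
a.add("write release notes")
b.add("review budget")
b.remove("review budget")
merged = a.merge(b)
```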
CRDTs are not a universal solution. They work when you can structure state as accumulators — things that grow, counters that increment, last-writer-wins registers for fields where one answer is always correct. They break down when the business logic requires strict ordering of operations or when two concurrent changes are semantically incompatible (e.g., "assign this task to Alice" and "assign this task to Bob" cannot both be true simultaneously).
The practical approach is to layer: use CRDTs for the parts of agent state that accumulate naturally (task lists, preference scores, capability flags), use event sourcing with optimistic locking for the parts that require strict consistency (current plan state, financial commitments, authorization decisions).
The Attribution Model: Who Drove What
Correctness is only half the problem. The other half is accountability. When a multi-user agent takes an action — sends an email, modifies a record, allocates a resource — which user bears responsibility for it?
Most agent implementations store "the agent did this" in their audit logs. That's insufficient. A proper attribution model requires tracking the full principal chain: which user gave the instruction, at which point in the conversation, that led to which agent decision, that triggered which tool call.
This is harder than it sounds because agent reasoning is non-local. User A's instruction in turn 5 may combine with User B's instruction in turn 8 to produce an action in turn 12. Attribution isn't a single pointer — it's a directed acyclic graph of contributing inputs.
A workable implementation approach:
- Assign every user instruction a unique instruction ID.
- When building the agent's context window, annotate each prior turn with its instruction ID and the user who provided it.
- When the agent emits an action, include the set of instruction IDs that were in its context window and marked as causally relevant.
- Store these as structured metadata alongside the action log.
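The four steps above can be sketched as infrastructure-level bookkeeping; the class and field names are illustrative, not from any particular framework:

```python
# Attribution sketch: track which instruction IDs were causally relevant
# when an action fired, so audits can walk actions back to users.
import uuid
from dataclasses import dataclass

@dataclass
class Turn:
    instruction_id: str
    user: str
    text: str

@dataclass
class ActionRecord:
    action: str
    contributing_instructions: list  # edges of the attribution DAG

class AttributedSession:
    def __init__(self):
        self.turns: list[Turn] = []
        self.action_log: list[ActionRecord] = []

    def record_instruction(self, user: str, text: str) -> str:
        iid = str(uuid.uuid4())  # every instruction gets a unique ID
        self.turns.append(Turn(iid, user, text))
        return iid

    def record_action(self, action: str, relevant_ids: list) -> None:
        # Structured metadata stored alongside the action, not inside the model.
        self.action_log.append(ActionRecord(action, relevant_ids))

    def who_drove(self, action_index: int) -> set:
        # Walk the DAG edges back to users, for audit queries.
        ids = set(self.action_log[action_index].contributing_instructions)
        return {t.user for t in self.turns if t.instruction_id in ids}

session = AttributedSession()
a_id = session.record_instruction("user_a", "plan the launch")
b_id = session.record_instruction("user_b", "also notify legal")
session.record_action("send_email", [a_id, b_id])  # both instructions contributed
```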
This doesn't require the agent to reason explicitly about attribution. It requires the infrastructure to maintain it. The agent doesn't need to know — the system does.
For compliance use cases (financial services, healthcare, HR systems) this attribution graph is mandatory. You need to be able to answer: "User B gave contradictory instructions to User A's existing plan — which instruction did the agent follow, and why?" If your event log can't answer that, you have a governance gap.
The Permission Boundary Problem
Multi-user agents introduce a permission conflict that single-user agents never face: what happens when two users have different authorization levels and the agent is operating in a shared context?
The naive answer is "the agent operates at the intersection of all users' permissions." In practice this is too restrictive — a team planning agent can't be crippled to the permissions of the lowest-privileged team member.
The more useful model is instruction-scoped permissions. Each user action is executed with that user's permission level. The agent maintains a permission context per user, not per session. When User A's instruction attempts an action, it's evaluated against User A's permissions. When the agent synthesizes a response that requires combining both users' delegated capabilities, it uses the union only where both users explicitly consented to delegate.
This requires the agent's tool execution layer to accept a principal parameter alongside each tool call, not just a session token. It also means your tool implementations need to enforce permissions per-call, not per-session.
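A minimal sketch of per-call enforcement, with a hypothetical permission table and tool registry:

```python
# Instruction-scoped permission sketch: every tool call carries the principal
# whose instruction caused it, and enforcement happens per call, not per
# session. Permission and tool names are illustrative.

PERMISSIONS = {
    "user_a": {"read_plan", "edit_plan", "send_email"},
    "user_b": {"read_plan"},
}

TOOLS = {
    "read_plan": lambda args: "current plan",
    "edit_plan": lambda args: f"plan updated: {args}",
}

def execute_tool(tool: str, args: str, principal: str) -> str:
    # The principal parameter rides alongside the call; a session token alone
    # is not enough to decide authorization in a shared context.
    if tool not in PERMISSIONS.get(principal, set()):
        raise PermissionError(f"{principal} is not authorized to call {tool}")
    return TOOLS[tool](args)
```

The same shared session can thus serve both users: User A's instructions may trigger `edit_plan`, while User B's instructions are limited to `read_plan`, without restricting the whole session to the intersection of their permissions.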
Building Multi-User State: A Decision Checklist
Before implementing, decide on these axes:
Conflict rate: How often do concurrent users actually modify the same state object simultaneously? Low contention makes optimistic locking sufficient. High contention requires event sourcing or CRDT-structured state.
Conflict semantics: When two users give contradictory instructions, is there a correct merge answer, or does a human need to decide? Mergeable semantics allow CRDTs. Incompatible semantics require explicit conflict surfacing.
Replay requirements: Do you need to reconstruct what happened and why? If yes, event sourcing is not optional.
Attribution depth: Do you need to know which user's instruction contributed to which action? Shallow attribution (last instruction wins) is cheap; causal attribution graphs require instrumentation at the context-assembly layer.
Permission model: Is it uniform across all users in a shared session, or does it vary per user? Per-user permission contexts require restructuring how your tool execution layer handles authorization.
The Structural Mistake to Avoid
The most common implementation mistake is bolting multi-user support onto a single-user agent architecture as an afterthought. Teams add a session ID field to their state object, figure concurrent users will rarely conflict, and ship.
The failure mode arrives three months later when a customer complains that a shared planning agent corrupted their team's project state. The root cause is a 200-millisecond race condition between two users' edits. It looks like a hallucination. It's not. It's an undetected concurrent write with last-write-wins resolution.
The fix retroactively requires migrating state from a mutable object to an event-sourced model, adding version tracking, and replumbing tool execution to carry principal context. None of that is hard to build — but it's substantially harder to migrate than to build from scratch.
If you know your agent will serve shared workspaces, design for it from the start. Add version numbers to your state schema before you need them. Append events rather than overwriting. Carry user identity through your tool call chain. The incremental cost at design time is small. The retrofit cost at production time is not.
Conclusion
Collaborative agents that serve multiple simultaneous users are not a specialization of single-user agent architecture — they're a fundamentally different concurrency problem. The distributed systems community spent decades solving concurrent writes to shared state: optimistic locking, event sourcing, CRDTs, vector clocks, attribution chains. These solutions port directly to AI agent infrastructure.
The translation work is understanding which distributed systems primitive applies to which layer of your agent. State consistency maps to optimistic locking or event sourcing. Conflict-free accumulation maps to CRDTs. Causal ordering maps to instruction attribution graphs. Permission enforcement maps to per-principal tool execution.
None of this requires new primitives. It requires recognizing that a multi-user agent is a distributed system, and designing it accordingly from the start.
