
Conversation State Is Not a Chat Array: Multi-Turn Session Design for Production

· 10 min read
Tian Pan
Software Engineer

Most multi-turn LLM applications store conversation history as an array of messages. It works fine in demos. It breaks in production in ways that take days to diagnose because the failures look like model problems, not infrastructure problems.

A user disconnects mid-conversation and reconnects to a different server instance—session gone. An agent reaches turn 47 in a complex task and the payload quietly exceeds the context window—no error, just wrong answers. A product manager asks "can we let users try a different approach from step 3?"—and the engineering answer is "no, not with how we built this." These are not edge cases. They are the predictable consequences of treating conversation state as a transient array rather than a first-class resource.

This post is about what conversation state actually needs to be, how production platforms model it, and the API design decisions that keep sessions reliable when sessions last hours and span multiple backend instances.

Why the Array Breaks

The appeal of the message array is that it maps directly to what LLM APIs accept: a list of {role, content} objects. Ship the array, get a response, append it, repeat. This works perfectly until it doesn't.

Context exhaustion is the first failure. At temperature=0 across a long task, your agent makes good decisions for dozens of turns—then the payload crosses the model's context window limit. You either hit a hard API error or, worse, the provider silently truncates older messages. When truncation happens, the model starts hallucinating about decisions it "made" earlier in the session that are now absent from its context. Nothing in your logs indicates this happened.

Attention degradation is a slower failure. Research on major frontier models shows an average 39% performance drop across multi-turn conversations before any context limit is reached. The mechanism is known as "lost in the middle": models attend strongly to the beginning and end of their context, neglecting the middle turns. By turn 20 or 30, earlier constraints and decisions are effectively invisible to the model even though they're still in the array.

Statelessness kills reconnections. A simple message array that lives in client memory—or worse, reconstructed client-side from local storage—cannot survive a server restart, a network partition, or a user switching devices mid-session. Every reconnect starts from scratch. For a short chatbot, users tolerate this. For an agent handling a multi-hour task, it is unacceptable.

Forking is architecturally impossible. Users naturally want to explore alternatives: "What if I take the other approach instead?" With an array, there is no branch point—you either lose the original path or you maintain multiple full copies of history, which is expensive and inconsistent.

Session State as a Resource

The fix is to stop treating conversation state as a payload parameter and start treating it as a server-side resource with its own identity, lifecycle, and operations.

A session resource carries more than the message array. At minimum, it needs:

  • A stable identifier that survives reconnects
  • An execution state machine (idle, running, waiting for tool result, requires human approval)
  • The full message history, append-only and immutable after write
  • Token accounting to let the application layer make explicit decisions about context management
  • A parent session reference to support forking
  • Timestamps for every message and state transition

The key insight is immutable append-only history. Once a message is written to a session, it never changes. New interactions create new messages. This enables time-travel semantics: given any point in history, you can replay forward from that checkpoint. It also prevents a class of bugs where retry logic re-writes a turn with slightly different content.
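As a sketch, a minimum-viable session resource covering the fields above might look like this in Python. Field names and the state enum are illustrative, not a standard:

```python
import time
import uuid
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class ExecState(Enum):
    IDLE = "idle"
    RUNNING = "running"
    WAITING_TOOL = "waiting_tool"
    NEEDS_APPROVAL = "needs_approval"

@dataclass(frozen=True)
class Message:
    """Frozen: once written, a message never changes."""
    role: str            # "user", "assistant", or "tool"
    content: str
    token_count: int
    created_at: float = field(default_factory=time.time)

@dataclass
class Session:
    id: str = field(default_factory=lambda: uuid.uuid4().hex)
    state: ExecState = ExecState.IDLE
    messages: list = field(default_factory=list)   # append-only history
    total_tokens: int = 0                          # explicit token accounting
    parent_id: Optional[str] = None                # set when this session is a fork
    locked_by: Optional[str] = None                # run ID holding the thread lock

    def append(self, msg: Message) -> None:
        """Only appends are allowed; retries create new messages, never rewrites."""
        self.messages.append(msg)
        self.total_tokens += msg.token_count
```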

The Thread Lock Pattern

When a session is actively running—the model is generating, or a tool call is in-flight—new inputs need to be rejected or queued. OpenAI's Assistants API formalizes this as a Thread lock: during an active Run, new messages cannot be appended and new Runs cannot be created. This prevents race conditions where two concurrent inputs corrupt the causal ordering of the conversation.

The pattern translates directly to your own implementation: the session resource has a locked_by field that takes the ID of the current run, and all write operations on the session check this field before proceeding. This is the same pattern as an advisory lock in database design, applied to conversation state.
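A minimal in-process sketch of the lock check. In production the acquire step would be a compare-and-set against Redis or Postgres rather than a local mutex:

```python
import threading

class LockedError(Exception):
    pass

class SessionLock:
    """Thread-lock sketch: a run must hold the lock to write to a session."""

    def __init__(self) -> None:
        self._locked_by: dict = {}       # session_id -> run_id
        self._mutex = threading.Lock()

    def acquire(self, session_id: str, run_id: str) -> bool:
        """Atomically claim the session for a run; fails if already claimed."""
        with self._mutex:
            if session_id in self._locked_by:
                return False             # another run is active: reject or queue
            self._locked_by[session_id] = run_id
            return True

    def check_writable(self, session_id: str, run_id: str) -> None:
        """Every write operation calls this before touching the session."""
        holder = self._locked_by.get(session_id)
        if holder is not None and holder != run_id:
            raise LockedError(f"session {session_id} locked by run {holder}")

    def release(self, session_id: str, run_id: str) -> None:
        with self._mutex:
            if self._locked_by.get(session_id) == run_id:
                del self._locked_by[session_id]
```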

Resumption, Forking, and Rollback

First-class session state enables three operations that the array model can't support cleanly.

Resumption means any application instance can continue a session given its ID. The session state lives in a shared store (Redis, Postgres, or a managed equivalent)—not in server memory. When the user reconnects, they retrieve the session by ID. The model sees the same history regardless of which instance handles the request. This is the only architecture that works when you scale horizontally.
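A sketch of the resumption path, with a module-level dict standing in for the shared store. The point is that no instance-local state is involved; any instance holding the session ID can continue:

```python
import json

# Stand-in for a shared store (Redis or Postgres in production).
SHARED_STORE: dict = {}

def save_session(session_id: str, messages: list) -> None:
    """Persist after every turn, so any instance can pick up from here."""
    SHARED_STORE[session_id] = json.dumps({"id": session_id, "messages": messages})

def resume_session(session_id: str) -> list:
    """Called by whichever instance handles the reconnect."""
    raw = SHARED_STORE.get(session_id)
    if raw is None:
        raise KeyError(f"unknown session {session_id}")
    return json.loads(raw)["messages"]
```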

Forking creates a new session whose parent is an existing session up to a specific message offset. The fork inherits all history up to that point, then diverges. Both the original and the fork share the immutable prefix; only the diverging messages are new. This halves storage costs compared to deep-copying the full history. More importantly, it makes "try a different approach from turn 5" a first-class product feature instead of an engineering impossibility. Research on conversation branching shows +43% improvement in user-measured quality and 58% reduction in context size for exploratory workflows.

Rollback is a degenerate case of forking: create a fork from an earlier checkpoint, then discard the original branch. For agents, this is essential for error recovery—when the model makes a wrong decision at step 7 of a 12-step task, you need to rewind to step 6 without losing the steps that came before.
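A sketch of prefix-sharing forks, with parent references resolved at read time. Field names are illustrative; the structural point is that the immutable prefix is referenced, never copied:

```python
import uuid

def fork_session(store: dict, parent_id: str, at_offset: int) -> str:
    """Create a fork sharing the parent's immutable prefix up to at_offset."""
    fork_id = uuid.uuid4().hex
    store[fork_id] = {
        "parent_id": parent_id,
        "fork_offset": at_offset,   # messages [0, at_offset) come from the parent
        "messages": [],             # only diverging messages are stored here
    }
    return fork_id

def full_history(store: dict, session_id: str) -> list:
    """Resolve the chain of parent references into a flat message list."""
    session = store[session_id]
    parent_id = session.get("parent_id")
    if parent_id is None:
        return list(session["messages"])
    prefix = full_history(store, parent_id)[: session["fork_offset"]]
    return prefix + session["messages"]
```

Rollback falls out for free: fork from the checkpoint before the bad step, continue on the fork, and ignore the original branch.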

Serialization Contracts

When conversation state is a resource, serialization becomes a contract, not an implementation detail. Concretely, this means:

Every session must be exportable to a portable JSON format. The format includes all messages (role, content, type, timestamp), the session's execution state at every transition, and version metadata so consumers can detect schema changes. Agent definitions—the system prompts, tool specifications, and model parameters—are stored separately and re-instantiated at load time. Credentials never enter the serialized state.
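A sketch of what such an export might look like. The schema version tag and field names are assumptions, not an established standard; the structural points are the version metadata and the deliberate absence of agent definitions and credentials:

```python
import json

SCHEMA_VERSION = "2024-01"   # hypothetical version tag for the export format

def export_session(session: dict) -> str:
    """Serialize to a portable JSON format. Agent definitions and credentials
    are deliberately excluded; they are re-bound at load time."""
    portable = {
        "schema_version": SCHEMA_VERSION,
        "session_id": session["id"],
        "model_version": session.get("model_version"),
        "messages": [
            {"role": m["role"], "content": m["content"],
             "type": m.get("type", "text"), "timestamp": m["timestamp"]}
            for m in session["messages"]
        ],
        "state_transitions": session.get("state_transitions", []),
    }
    return json.dumps(portable, indent=2)

def check_compatibility(exported: str) -> bool:
    """At resume time, detect schema drift before replaying the session."""
    return json.loads(exported).get("schema_version") == SCHEMA_VERSION
```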

The portability requirement has a practical consequence: if you want to move a session from one provider to another (Claude for reasoning, GPT-4o for speed), you need a canonical message schema that both providers can ingest. Today this means translating between provider-specific formats manually, but the message schema is converging on a common structure.

The version field matters more than it seems. When a base model update changes how it interprets tool call results, sessions created before the update may behave differently when resumed. Storing the model version and API schema version in the session lets you detect this at resume time and either warn the user or route the session to a compatible endpoint.

What Storage Tier to Use

The right storage for session state depends on your resumption SLA.

In-memory caches (Redis) give sub-millisecond reads for sessions created in the last hour, which is the dominant case for interactive applications. They lose data on restart unless you configure persistence, so treat them as a hot tier rather than durable storage.

Relational or document databases are the durable tier. Sessions that outlive the Redis TTL are cold-stored and fetched on demand. The append-only message structure is a natural fit for document stores; Postgres's JSONB column or MongoDB work well here.
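A sketch of the two-tier read path, with dicts standing in for Redis and Postgres. The durable write happens first; the hot tier is only ever a cache:

```python
import time

class TieredSessionStore:
    """Hot tier (Redis-like, TTL-bound) in front of a durable tier (Postgres-like)."""

    def __init__(self, hot_ttl_seconds: float = 3600) -> None:
        self.hot: dict = {}    # session_id -> (written_at, session)
        self.cold: dict = {}   # durable tier
        self.ttl = hot_ttl_seconds

    def put(self, session_id: str, session: dict) -> None:
        self.cold[session_id] = session                 # durable write first
        self.hot[session_id] = (time.time(), session)

    def get(self, session_id: str) -> dict:
        entry = self.hot.get(session_id)
        if entry is not None:
            written_at, session = entry
            if time.time() - written_at < self.ttl:
                return session                          # hot hit
            del self.hot[session_id]                    # expired entry
        session = self.cold[session_id]                 # cold fetch on demand
        self.hot[session_id] = (time.time(), session)   # re-warm the hot tier
        return session
```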

Workflow engines (Temporal, Prefect) are the right abstraction for long-running, fault-tolerant agentic tasks. The workflow encapsulates the conversation loop; state persistence and retries are handled by the engine. This is overkill for a chatbot but appropriate for multi-hour autonomous tasks where a server restart should not abort work in progress.

Avoid in-memory state on individual application servers. It requires sticky sessions at the load balancer, makes horizontal scaling painful, and guarantees data loss on deployment or crash.

The Token Budget Problem

Storing the full session server-side does not solve the context window problem—it just moves the management responsibility from the client to the server. Your session layer needs explicit token budget logic.

The simplest approach is threshold-based summarization: when the accumulated token count crosses a configurable threshold (say, 80% of the model's context window), the session layer automatically creates a compressed summary of older turns and replaces those turns with the summary in the active context sent to the model. The original messages remain in storage for auditing and replay; the model just doesn't see them.
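A sketch of threshold-based compaction. The 80% threshold, the number of recent turns kept verbatim, and the rough characters-per-token estimate are illustrative defaults; a real system would call the model to produce the summary:

```python
def compact_context(messages: list, context_limit: int, threshold: float = 0.8,
                    keep_recent: int = 4, summarize=None) -> list:
    """Build the active context sent to the model. The stored `messages`
    list is never modified; only the model's view is compacted."""
    total = sum(m["tokens"] for m in messages)
    if total <= threshold * context_limit:
        return list(messages)                 # under budget: send everything
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    if summarize is None:
        # Placeholder: a real system would call the model to summarize here.
        summary_text = f"[summary of {len(older)} earlier turns]"
    else:
        summary_text = summarize(older)
    summary = {"role": "system", "content": summary_text,
               "tokens": max(1, len(summary_text) // 4)}  # rough token estimate
    return [summary] + recent
```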

More sophisticated approaches use semantic relevance to decide which older turns to include when context is tight. Rather than strict chronological truncation, the session layer retrieves the turns most relevant to the current query. This requires maintaining embeddings for each turn, which adds latency but significantly improves model performance on tasks that require reasoning over earlier context.
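A sketch of relevance-based selection. A real system would score with embedding cosine similarity; word overlap stands in here to keep the sketch dependency-free:

```python
def relevance_select(turns: list, query: str, budget_tokens: int, score=None) -> list:
    """Fill the token budget with the turns most relevant to the current query,
    then restore chronological order before sending them to the model."""
    def overlap(a: str, b: str) -> float:
        # Jaccard word overlap as a stand-in for embedding similarity.
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / (len(wa | wb) or 1)

    scorer = score or (lambda turn: overlap(turn["content"], query))
    ranked = sorted(turns, key=scorer, reverse=True)
    selected, used = [], 0
    for turn in ranked:
        if used + turn["tokens"] <= budget_tokens:
            selected.append(turn)
            used += turn["tokens"]
    selected.sort(key=lambda t: t["index"])   # back to chronological order
    return selected
```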

The key design principle: token budget decisions belong in the session layer, not in the model's lap. The model cannot tell you that it's about to forget something important. The application layer must make explicit choices about what goes into context and surface those choices to the developer.

API Design

Translating these principles into a REST API, a session resource exposes:

  • POST /sessions — create a new session, optionally with initial system context
  • GET /sessions/{id} — fetch session metadata and current state
  • POST /sessions/{id}/messages — append a user message (rejected if session is locked)
  • POST /sessions/{id}/runs — trigger the model turn, returns a run ID
  • GET /sessions/{id}/runs/{run_id} — poll run status and stream output
  • POST /sessions/{id}/fork?at_message={message_id} — create a branch from a given message
  • POST /sessions/{id}/checkpoint — create an explicit named checkpoint for rollback
  • GET /sessions/{id}/export — serialize full session state for portability

Sessions are soft-deleted rather than hard-deleted. Hard deleting a session that a fork depends on would corrupt the fork's history. Retention policies and archiving are separate operations from user-visible deletion.

What This Changes in Practice

The shift from array to resource changes more than your API shape. It changes what you can debug, what your users can do, and what your infrastructure has to guarantee.

Sessions that are portable and replayable mean you can replay a failing production session locally against a new prompt version to verify a fix before deploying it. Sessions with explicit fork support mean product features like "explore a different approach" become straightforward to build. Sessions with token budget management mean the model behaves as predictably at turn 50 as it does at turn 2.

The array model will take you surprisingly far. It is fine for chatbots where sessions rarely exceed 20 turns, reconnects are uncommon, and users have no expectation of branching. But once you cross into agentic workflows—multi-hour tasks, tool-calling pipelines, workflows that outlive a browser tab—the limitations compound faster than you expect. Design conversation state as a resource from the start, and you avoid a painful refactor when scale reveals the seams.
