
Stateful Multi-Turn Conversation Infrastructure: Beyond Passing the Full History

· 11 min read
Tian Pan
Software Engineer

Every demo of a conversational AI feature does the same thing: pass a list of messages to the model and print the response. The happy path works, looks great in a Jupyter notebook, and gets you a green light to ship. Then you get to production, and your p99 latency starts creeping up during peak hours. A month later, a customer complains that the assistant "forgot" everything from earlier in the session. Six weeks after that, your session store hits its memory ceiling during a product launch.

The fundamental problem is that "pass the full conversation history" is not a session management strategy. It is the absence of one.

The Quadratic Trap

The attention mechanism that makes LLMs powerful has a cost: computational expense scales quadratically with input length. Double the conversation history, and the inference cost roughly quadruples. At a 500 requests-per-second load with a 1-second p95 target, any single component that hits 200ms p99 latency starts dominating your tail numbers. Conversation history that seemed manageable in testing becomes the bottleneck you didn't plan for.

The token explosion is faster than intuition suggests. A 20-turn conversation accumulates 5,000–10,000 raw tokens from message content, tool call results, and assistant reasoning. The actual information needed to ground the next response is typically 500–1,000 tokens: the user's current intent, a few recent exchanges establishing context, and any explicit commitments made earlier in the session. The rest is redundancy that costs latency and money.
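A back-of-the-envelope sketch makes the growth concrete (the 400 tokens-per-turn figure is illustrative): when every call resends the full history, the model attends over an arithmetic series of tokens, which is quadratic in the number of turns.

```python
def tokens_processed(turns, tokens_per_turn=400):
    """Total tokens the model attends over across a session when each
    call resends the full history: an arithmetic series, so the sum
    grows quadratically with the number of turns."""
    return sum(t * tokens_per_turn for t in range(1, turns + 1))

# Doubling the conversation length roughly quadruples cumulative cost.
cost_10 = tokens_processed(10)  # 400 * (1 + ... + 10) = 22,000
cost_20 = tokens_processed(20)  # 400 * (1 + ... + 20) = 84,000
```

At 40 turns the cumulative figure is already 328,000 tokens for a conversation whose useful grounding content may still fit in under 1,000.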

Research confirms that performance degradation follows a nonlinear curve. Frontier models show measurable accuracy drops starting around 1,000 tokens of context — well below advertised context windows — with some models failing on tasks that could fit in 100 tokens when buried inside a long history. The "Lost in the Middle" effect compounds this: models systematically favor tokens at the start and end of the input window, so older turns don't just cost money — they actively dilute the signal from recent turns.

Three Places State Can Live

Before choosing a compression strategy, you need to decide where your session state lives. These are not equivalent choices.

In-memory stores (Redis) give you sub-millisecond read latency and built-in TTL management. Redis's semantic session managers can even retrieve only contextually relevant message slices rather than the full history, using embedding similarity to select what matters for the current query. The cost is durability: if your session store restarts, active conversations die. For most chat applications this is acceptable; for anything with consequential multi-step workflows (booking, purchasing, multi-day research threads), it isn't.
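A minimal sketch of the TTL-managed store, modeled on Redis's SETEX/GET semantics. The `InMemoryClient` stand-in is hypothetical, included only so the sketch runs without a server; `redis.Redis(decode_responses=True)` exposes the same method shapes.

```python
import json
import time

class SessionStore:
    """Session turns with TTL, modeled on Redis SETEX/GET semantics.
    `client` is anything exposing setex(key, ttl, value) and get(key);
    redis.Redis(decode_responses=True) matches this shape."""

    def __init__(self, client, ttl_seconds=3600):
        self.client = client
        self.ttl = ttl_seconds

    def append_turn(self, session_id, role, content):
        key = f"session:{session_id}"
        raw = self.client.get(key)
        turns = json.loads(raw) if raw else []
        turns.append({"role": role, "content": content})
        # Refresh the TTL on every write so active sessions stay warm.
        self.client.setex(key, self.ttl, json.dumps(turns))
        return turns

class InMemoryClient:
    """Hypothetical stand-in so the sketch runs without a Redis server."""

    def __init__(self):
        self._data = {}

    def setex(self, key, ttl, value):
        self._data[key] = (value, time.time() + ttl)

    def get(self, key):
        item = self._data.get(key)
        if item is None or item[1] < time.time():
            return None
        return item[0]

store = SessionStore(InMemoryClient(), ttl_seconds=1800)
store.append_turn("u42", "user", "What's my deadline?")
turns = store.append_turn("u42", "assistant", "Friday, per your earlier note.")
```

The TTL refresh on write is the piece that matters operationally: sessions expire from inactivity, not from age, so an active two-hour conversation is never evicted mid-flight.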

Distributed databases (DynamoDB, Postgres) persist conversation records durably and handle horizontal scale better than a single Redis instance. The tradeoff is p99 latency: a DynamoDB read under normal load is fine, but at high concurrency with large session payloads, you will see tail spikes that don't appear in unit tests. If you're running session state retrieval on the hot path of every inference call, choose your data store based on p99 not average performance.

Hybrid approaches separate hot from cold state. The current session's recent turns live in Redis for fast access; older conversation history, extracted facts, and cross-session memory live in a durable store. Retrieval on the hot path hits the cache; background processes write-through to the persistent layer. This is more operational complexity, but it's what production deployments that need both durability and low latency actually use.

History Compression That Actually Works

The core decision is between truncation, compression, and retrieval.

Truncation is the default behavior of most frameworks: drop the oldest messages when the context window fills. It is simple, predictable, and wrong for most applications. Users lose context at exactly the moment a conversation is getting complex. The model "forgets" commitments made earlier in the session. Worse, the failure is silent — the model doesn't tell the user it no longer has access to the earlier part of the conversation.

Sliding window with rolling summary is the most pragmatic approach for most teams. Keep the last N turns verbatim — typically 8–15 exchanges, depending on average turn length — and maintain a compressed summary of everything older. A secondary model call summarizes the oldest batch of messages into 200–300 tokens of condensed context ("User is building a data pipeline for a healthcare client; wants HIPAA-compliant storage; has decided against Snowflake due to cost"). That summary gets prepended to the active window on each new turn. The quadratic explosion becomes manageable because the window stays bounded.

The operational catch: the summarization call adds latency and cost on every Nth turn. Budget for it. Make it async if your application can tolerate a brief period where the summary hasn't yet been updated.
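The window-plus-summary assembly can be sketched as a pure function; the summarization step itself (the secondary model call) is stubbed out here, and the window size of 10 is illustrative.

```python
def assemble_window(summary, history, window=10):
    """Sliding window with rolling summary: keep the last `window` turns
    verbatim; everything older is represented by the running summary.
    Returns (messages_to_send, turns_awaiting_summarization)."""
    recent = history[-window:]
    overflow = history[:-window]
    messages = []
    if summary:
        messages.append({
            "role": "user",
            "content": f"[Summary of earlier conversation] {summary}",
        })
    messages.extend(recent)
    return messages, overflow

# The overflow would feed the secondary summarization call, which folds
# it into the running summary -- ideally asynchronously, off the hot path.
history = [{"role": "user", "content": f"turn {i}"} for i in range(25)]
messages, overflow = assemble_window(
    "User is building a HIPAA-compliant pipeline; rejected Snowflake on cost.",
    history,
)
```

Because the verbatim window is bounded, per-turn inference cost stays flat no matter how long the session runs; only the summary grows, and it is capped by the summarization prompt.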

Selective retention takes a more deliberate approach. Before compressing, classify message content by importance:

  • Must-retain verbatim: explicit user preferences, key decisions, commitments from the assistant, anything the user has corrected
  • Can summarize: routine clarifying exchanges, repeated questions, context that's now superseded
  • Can discard: acknowledgments, filler turns, information that's retrievable from source (if you have RAG, you don't need to store the retrieved chunks — just the retrieval key)

Importance classification can be rule-based for simple applications or LLM-based for complex ones. A small classifier model running asynchronously costs far less than keeping every token in context.
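A rule-based classifier along these lines is a reasonable starting point. The patterns below are purely illustrative and would need tuning for a real domain; for harder cases, swap the whole function for a small LLM classifier run asynchronously.

```python
import re

# Illustrative patterns only -- tune to your domain.
RETAIN_PATTERNS = [
    r"\b(always|never|prefer|don't)\b",       # explicit preferences
    r"\b(decided|agreed|deadline|commit)\b",  # decisions and commitments
    r"\b(actually|correction|i meant)\b",     # user corrections
]
DISCARD_PATTERNS = [r"^(ok|okay|thanks|thank you|got it)[.!]?$"]

def classify(content):
    """Tier a message: retain verbatim, summarize, or discard."""
    text = content.strip().lower()
    if any(re.search(p, text) for p in DISCARD_PATTERNS):
        return "discard"
    if any(re.search(p, text) for p in RETAIN_PATTERNS):
        return "retain"
    return "summarize"
```

The default tier is deliberately "summarize", not "discard": misclassifying a filler turn into the summary costs a few tokens, while misclassifying a commitment into the discard pile costs a user-visible "the assistant forgot" incident.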

Retrieval-augmented history moves conversation state entirely into a vector store. Instead of injecting raw history, you embed the current query and retrieve the semantically relevant exchanges. The benefit: effectively unlimited conversation depth with bounded context cost. The cost: retrieval latency, embedding infrastructure, and the non-obvious failure mode where the vector search returns the wrong historical exchanges. A user asking "what did we decide about the deadline?" gets back the turn about the budget deadline when they meant the project deadline.
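The retrieval step reduces to a nearest-neighbor search over embedded turns. A toy sketch, assuming unit-length embeddings from whatever model you use (so cosine similarity collapses to a dot product); a real system would query a vector store rather than scan linearly.

```python
def retrieve_relevant(query_vec, indexed_turns, k=3):
    """Nearest-neighbor search over (embedding, turn) pairs. With
    unit-length embeddings, cosine similarity is a plain dot product."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    ranked = sorted(indexed_turns,
                    key=lambda pair: dot(query_vec, pair[0]),
                    reverse=True)
    return [turn for _, turn in ranked[:k]]

# The failure mode from the text: two historical topics that embed
# similarly ("budget deadline" vs. "project deadline") can swap ranks.
indexed = [
    ((1.0, 0.0), "turn about the budget deadline"),
    ((0.0, 1.0), "turn about the project deadline"),
]
top = retrieve_relevant((0.1, 0.9), indexed, k=1)
```

Mitigations are application-level: include recency as a ranking feature, retrieve more candidates than you inject and let a reranker filter, or fall back to the rolling summary when similarity scores are low.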

The p99 Session That Breaks Your System

Every production AI application has a p99 session that nobody planned for. It's the customer support conversation that goes 200 turns because the user has a genuinely complicated problem. It's the research assistant session that accumulates tool outputs across two hours of autonomous operation. It's the user who picks up a conversation three days later and expects the model to remember every detail.

The naive approach hits a hard limit and the API returns an error. The user sees a generic failure message and loses their work. This is the worst outcome.

The graceful outcome requires designing the session store with explicit budget accounting. Track token consumption per turn and maintain a running total. When the budget crosses a threshold — say, 60% of the context window — begin aggressive compression of older history. At 80%, switch from rolling summary to key-fact extraction: discard the narrative structure entirely and retain only the essential facts. At 90%, trigger a user notification that you're in a summarized context mode.
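The escalation policy above can be a pure function keyed to the budget ratio; the 60/80/90 percentages are the example thresholds from this section, not universal constants.

```python
def compression_mode(used_tokens, context_window):
    """Escalating compression policy keyed to budget thresholds.
    The 60/80/90 cutoffs are the article's example values; make them
    configuration, not code, in a real deployment."""
    ratio = used_tokens / context_window
    if ratio >= 0.90:
        return "key_facts_and_notify_user"
    if ratio >= 0.80:
        return "key_facts_only"
    if ratio >= 0.60:
        return "aggressive_summary"
    return "normal"
```

The point of making this a pure function of the running total is testability: the p99 session path gets exercised in unit tests instead of being discovered in production.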

The operational runbook for when a session store blows past its size budget starts with these questions:

  1. Is this a p99 outlier or a systematic shift? Check whether the budget overrun is isolated to one session type or affecting a cohort.
  2. Is the compression pipeline keeping up? Async summarization can fall behind during traffic spikes, causing sessions to grow faster than they're being compressed.
  3. Is the session store itself the bottleneck? High-cardinality concurrent sessions can exhaust Redis memory faster than eviction policies can keep up.

The answers to these three questions determine whether the fix is a policy change (adjust compression thresholds), an infrastructure change (scale the session store), or a product change (add a session-reset prompt for unusually long conversations).

What the LLM Message API Actually Expects From You

A detail that bites teams late: most LLM APIs are stateless. The Claude Messages API, OpenAI Chat Completions API, and similar interfaces do not maintain session state. You send the full conversation each time; they return the next message. The state management is entirely your responsibility.

This means your session infrastructure is not optional scaffolding — it's the production system you're building. The "messages" array in your API call is an output of your session management layer, not a passthrough of raw storage.

Concretely: the session store holds the authoritative record of what happened. Before each inference call, you run a retrieval-and-compression step that constructs the messages array: system prompt, compressed history of old turns, verbatim recent turns, and the current user message. The API sees a well-formed, budget-compliant context. The actual stored history may be far larger than what you're sending on any given turn.
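A sketch of that assembly step. The payload shape follows the common role/content convention, with the system prompt kept top-level as the Claude Messages API expects; other providers put it inside the messages list instead, so adjust accordingly.

```python
def build_request(system_prompt, summary, recent_turns, user_message):
    """Assemble the stateless API payload from the session layer's
    parts: compressed history, verbatim recent turns, and the new
    user message. The stored history may be far larger than this."""
    messages = []
    if summary:
        messages.append({
            "role": "user",
            "content": f"[Context from earlier in this session] {summary}",
        })
    messages.extend(recent_turns)
    messages.append({"role": "user", "content": user_message})
    return {"system": system_prompt, "messages": messages}

request = build_request(
    "You are a data-engineering assistant.",
    "User prefers Postgres; rejected Snowflake on cost.",
    [{"role": "user", "content": "hi"},
     {"role": "assistant", "content": "Hello -- picking up the pipeline work?"}],
    "What did we decide about storage?",
)
```

Keeping this as one seam in the codebase also gives you a single place to log exactly what the model saw on any given turn, which is invaluable when debugging "the model forgot" reports.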

Production Monitoring for Conversation State

Once you've built session management, you need to observe it. The metrics that matter:

Token consumption per turn tells you whether your sessions are growing predictably or spiking unexpectedly. A sudden increase suggests users are pasting large payloads, tool calls are returning verbose outputs, or a code change inadvertently changed what gets included in history.

Session size at percentile (p50, p90, p99) gives you headroom visibility. If your p99 session is at 60% of your designed budget, you have runway. If it's at 85%, you're one viral week away from a production incident.

Compression pipeline lag measures how far behind your async summarization process is falling. During traffic spikes, if compression falls behind and sessions grow faster than they're compressed, you'll see storage costs spike before you see accuracy issues.

History retrieval latency for hybrid architectures tracks the p99 cost of pulling session state on the hot path. Any regression here directly affects user-facing latency.

Emit these as structured telemetry — OpenTelemetry spans work well for tracing the session retrieval and compression steps as distinct operations. When you get paged at 2am because conversation quality has degraded, you want to know immediately whether the problem is in retrieval (wrong history being injected), compression (too much being dropped), or inference (the model behavior itself).
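The percentile metrics above are cheap to compute from raw samples. A nearest-rank sketch; a real deployment would emit these as OpenTelemetry metrics rather than hand-rolling the math.

```python
import math

def percentiles(samples, ps=(50, 90, 99)):
    """Nearest-rank percentiles over raw telemetry samples
    (per-session token totals, retrieval latencies, etc.)."""
    xs = sorted(samples)
    return {
        f"p{p}": xs[max(0, math.ceil(p / 100 * len(xs)) - 1)]
        for p in ps
    }

session_sizes = list(range(1, 101))  # stand-in for per-session token totals
stats = percentiles(session_sizes)
```

Track these per session *type*, not just globally: a p99 spike confined to tool-heavy agent sessions calls for a different fix than one spread across ordinary chat.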

The Conversation Topology You Didn't Design For

Two-way chat is the easy case. The failure modes get interesting when the conversation topology diverges from the simple user-assistant alternation:

Long tool-use chains where an agent makes dozens of tool calls in sequence inflate context with tool inputs and outputs that users never see. Each tool result gets added to the message history, and the raw payloads can be enormous. The fix: summarize tool outputs before storing them. Keep the result, not the raw JSON.
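One way to sketch that compaction: store a compact record with a truncated preview plus enough metadata to re-fetch the raw payload if needed. Field names here are hypothetical, and the preview step could equally be a small summarization model call.

```python
import json

def compact_tool_result(tool_name, raw_result, max_chars=500):
    """Replace a raw tool payload with a compact record before it
    enters the session history: a truncated preview plus metadata
    sufficient to re-fetch the full result on demand."""
    text = (raw_result if isinstance(raw_result, str)
            else json.dumps(raw_result))
    return {
        "tool": tool_name,
        "size_chars": len(text),
        "preview": text[:max_chars],
        "truncated": len(text) > max_chars,
    }

record = compact_tool_result("web_search", {"hits": ["result"] * 400})
```

The `size_chars` field doubles as a monitoring hook: a drift upward in average tool-output size is usually the first sign that an upstream API changed its response shape.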

Multi-user sessions — shared contexts where multiple humans talk to the same agent — have no standard infrastructure pattern. Most production AI products silently drop this use case. The fundamental challenge is conflict resolution: two users submitting messages simultaneously requires serialization, and the agent's state must remain consistent across concurrent writes.

Session resumption after days or weeks requires explicit handling that's different from within-session compression. When a user returns to a month-old conversation, the agent should inject a context-refresh summary rather than retrieving verbatim old turns. "Last time you were working on X and had decided Y" is more useful than injecting 200-turn history.

The Infrastructure You're Actually Signing Up For

A production conversation infrastructure that holds up has these components:

A session store with p99 read latency under 20ms and durable write confirmation. A compression pipeline that runs asynchronously and catches up during low-traffic windows. A turn-by-turn token accounting system that fires compression at configurable thresholds. A context assembly layer that constructs the messages array from compressed history, verbatim recent turns, and retrieved relevant facts. Monitoring with percentile-tracked metrics for session size, token consumption, and retrieval latency.

This is not complex infrastructure. But it is infrastructure, not application code, and teams that treat it as a 20-line helper function discover its limits under production load.

The toy demo works because your users are careful, your conversations are short, and nothing goes wrong. Production conversations are longer, messier, and more consequential. The session management layer is what turns a demo into a system.
