
Stateful Multi-Turn Conversation Infrastructure: Beyond Passing the Full History

11 min read
Tian Pan
Software Engineer

Every demo of a conversational AI feature does the same thing: pass a list of messages to the model and print the response. The happy path works, looks great in a Jupyter notebook, and gets you a green light to ship. Then you get to production, and your p99 latency starts creeping up during peak hours. A month later, a customer complains that the assistant "forgot" everything from earlier in the session. Six weeks after that, your session store hits its memory ceiling during a product launch.

The fundamental problem is that "pass the full conversation history" is not a session management strategy. It is the absence of one.

The Quadratic Trap

The attention mechanism that makes LLMs powerful has a cost: computational expense scales quadratically with input length. Double the conversation history, and the inference cost roughly quadruples. At a 500 requests-per-second load with a 1-second p95 target, any single component that hits 200ms p99 latency starts dominating your tail numbers. Conversation history that seemed manageable in testing becomes the bottleneck you didn't plan for.
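
A back-of-envelope way to see why this bites (the numbers are illustrative, not benchmarks):

```python
# Relative attention cost grows with the square of the context length.
# Illustrative numbers only; real costs depend on the model and serving stack.
def relative_attention_cost(tokens: int, baseline: int = 1_000) -> float:
    """Cost of one forward pass relative to a 1,000-token baseline."""
    return (tokens / baseline) ** 2

for tokens in (1_000, 2_000, 5_000, 10_000):
    print(f"{tokens:>6,} tokens -> {relative_attention_cost(tokens):5.0f}x baseline")
# 1,000 -> 1x   2,000 -> 4x   5,000 -> 25x   10,000 -> 100x
```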

The token explosion is faster than intuition suggests. A 20-turn conversation accumulates 5,000–10,000 raw tokens from message content, tool call results, and assistant reasoning. The actual information needed to ground the next response is typically 500–1,000 tokens: the user's current intent, a few recent exchanges establishing context, and any explicit commitments made earlier in the session. The rest is redundancy that costs latency and money.

Research confirms that performance degradation follows a nonlinear curve. Frontier models show measurable accuracy drops starting around 1,000 tokens of context — well below advertised context windows — with some models failing on tasks that could fit in 100 tokens when buried inside a long history. The "Lost in the Middle" effect compounds this: models systematically favor tokens at the start and end of the input window, so older turns don't just cost money — they actively dilute the signal from recent turns.

Three Places State Can Live

Before choosing a compression strategy, you need to decide where your session state lives. These are not equivalent choices.

In-memory stores (Redis) give you sub-millisecond read latency and built-in TTL management. Redis's semantic session managers can even retrieve only contextually relevant message slices rather than the full history, using embedding similarity to select what matters for the current query. The cost is durability: if your session store restarts, active conversations die. For most chat applications this is acceptable; for anything with consequential multi-step workflows (booking, purchasing, multi-day research threads), it isn't.
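
As a concrete shape, here is a minimal sketch of the in-memory pattern using redis-py; the key scheme and TTL are assumptions, not recommendations:

```python
import json
import redis  # pip install redis

r = redis.Redis(decode_responses=True)
SESSION_TTL_SECONDS = 30 * 60  # assumption: idle sessions expire after 30 minutes

def append_turn(session_id: str, role: str, content: str) -> None:
    """Append one message to the session list and refresh its TTL."""
    key = f"session:{session_id}:messages"  # hypothetical key scheme
    r.rpush(key, json.dumps({"role": role, "content": content}))
    r.expire(key, SESSION_TTL_SECONDS)

def load_history(session_id: str) -> list[dict]:
    """Read the full message list; sub-millisecond for typical session sizes."""
    return [json.loads(m) for m in r.lrange(f"session:{session_id}:messages", 0, -1)]
```

The TTL makes the durability tradeoff explicit: when the key expires or the instance loses its data, the conversation is gone.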

Distributed databases (DynamoDB, Postgres) persist conversation records durably and handle horizontal scale better than a single Redis instance. The tradeoff is p99 latency: a DynamoDB read under normal load is fine, but at high concurrency with large session payloads, you will see tail spikes that don't appear in unit tests. If you're running session state retrieval on the hot path of every inference call, choose your data store based on p99 latency, not average latency.

Hybrid approaches separate hot from cold state. The current session's recent turns live in Redis for fast access; older conversation history, extracted facts, and cross-session memory live in a durable store. Retrieval on the hot path hits the cache; background processes write-through to the persistent layer. This is more operational complexity, but it's what production deployments that need both durability and low latency actually use.
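
Reusing the helpers from the Redis sketch above, the hybrid read and write paths might look like this; the durable store is stubbed as a plain class to keep the example self-contained:

```python
RECENT_WINDOW = 12  # assumption: roughly the last dozen messages stay hot

class DurableStore:
    """Stand-in for DynamoDB/Postgres; replace with a real persistence layer."""
    def __init__(self) -> None:
        self._data: dict[str, list[dict]] = {}

    def get_messages(self, session_id: str) -> list[dict]:
        return self._data.get(session_id, [])

    def append(self, session_id: str, msg: dict) -> None:
        self._data.setdefault(session_id, []).append(msg)

durable_store = DurableStore()

def load_session_state(session_id: str) -> list[dict]:
    """Read-through: serve from Redis when hot, rebuild the window from cold storage."""
    hot = load_history(session_id)
    if hot:
        return hot
    recent = durable_store.get_messages(session_id)[-RECENT_WINDOW:]
    for msg in recent:
        append_turn(session_id, msg["role"], msg["content"])  # re-warm the cache
    return recent

def record_turn(session_id: str, role: str, content: str) -> None:
    """Write-through: the hot path writes Redis; in production the durable
    write would typically move to a background task."""
    append_turn(session_id, role, content)
    durable_store.append(session_id, {"role": role, "content": content})
```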

History Compression That Actually Works

The core decision is between truncation, compression, and retrieval.

Truncation is the default behavior of most frameworks: drop the oldest messages when the context window fills. It is simple, predictable, and wrong for most applications. Users lose context at exactly the moment a conversation is getting complex. The model "forgets" commitments made earlier in the session. Worse, the failure is silent — the model doesn't tell the user it no longer has access to the earlier part of the conversation.
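
Part of truncation's appeal is that the whole strategy fits in a line or two; the problem is that nothing marks the amputation:

```python
MAX_MESSAGES = 40  # illustrative cap
history = history[-MAX_MESSAGES:]  # oldest turns vanish; the model never says so
```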

Sliding window with rolling summary is the most pragmatic approach for most teams. Keep the last N turns verbatim — typically 8–15 exchanges, depending on average turn length — and maintain a compressed summary of everything older. A secondary model call summarizes the oldest batch of messages into 200–300 tokens of condensed context ("User is building a data pipeline for a healthcare client; wants HIPAA-compliant storage; has decided against Snowflake due to cost"). That summary gets prepended to the active window on each new turn. The quadratic explosion becomes manageable because the window stays bounded.
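
A sketch of the shape, with the summarization call stubbed out; a real version would prompt a small, cheap model, and WINDOW_TURNS and the prompt format here are assumptions:

```python
WINDOW_TURNS = 10     # assumption: keep the last 10 messages verbatim
SUMMARY_BUDGET = 300  # target token budget for the rolling summary

def summarize(previous_summary: str, old_messages: list[dict]) -> str:
    """Stand-in for the secondary-model call. A real version prompts a small
    LLM to fold old_messages into previous_summary within SUMMARY_BUDGET tokens."""
    folded = " ".join(m["content"][:80] for m in old_messages)  # crude placeholder
    return f"{previous_summary} {folded}".strip()

def on_new_turn(summary: str, history: list[dict], user_msg: dict):
    """Append the new message, compress the overflow, and build the prompt."""
    history = history + [user_msg]
    if len(history) > WINDOW_TURNS:
        overflow, history = history[:-WINDOW_TURNS], history[-WINDOW_TURNS:]
        summary = summarize(summary, overflow)  # every Nth turn; consider async
    prompt = [{"role": "system", "content": f"Conversation so far: {summary}"}]
    return summary, history, prompt + history
```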

The operational catch: the summarization call adds latency and cost on every Nth turn. Budget for it. Make it async if your application can tolerate a brief period where the summary hasn't yet been updated.

Selective retention takes a more deliberate approach. Before compressing, classify message content by importance:

  • Must-retain verbatim: explicit user preferences, key decisions, commitments from the assistant, anything the user has corrected
  • Can summarize: routine clarifying exchanges, repeated questions, context that's now superseded
  • Can discard: acknowledgments, filler turns, information that can be re-fetched from its source (if you have RAG, you don't need to store the retrieved chunks, just the retrieval key)

Importance classification can be rule-based for simple applications or LLM-based for complex ones. A small classifier model running asynchronously costs far less than keeping every token in context.
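
A rule-based version can be as blunt as keyword and metadata checks; the patterns and the "corrected" flag below are illustrative, not a vetted taxonomy:

```python
import re

MUST_RETAIN = re.compile(
    r"\b(always|never|prefer|decided|deadline|must|agreed|instead of)\b", re.I)
DISCARDABLE = re.compile(r"^(ok(ay)?|thanks?|got it|sounds good)\W*$", re.I)

def classify_turn(msg: dict) -> str:
    """Bucket a message: 'retain' verbatim, 'summarize', or 'discard'."""
    text = msg["content"].strip()
    if DISCARDABLE.match(text):
        return "discard"    # acknowledgments and filler
    if MUST_RETAIN.search(text) or msg.get("corrected"):
        return "retain"     # preferences, decisions, commitments, corrections
    return "summarize"      # routine context: fold into the rolling summary
```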

Retrieval-augmented history moves conversation state entirely into a vector store. Instead of injecting raw history, you embed the current query and retrieve the semantically relevant exchanges. The benefit: effectively unlimited conversation depth with bounded context cost. The cost: retrieval latency, embedding infrastructure, and the non-obvious failure mode where the vector search returns the wrong historical exchanges. A user asking "what did we decide about the deadline?" gets back the turn about the budget deadline when they meant the project deadline.
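
A sketch of the retrieval path; the hashing embedder below is a toy stand-in so the snippet runs on its own, and you would swap in a real embedding model:

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy hashing embedder; replace with a real embedding model in production."""
    v = np.zeros(dim)
    for token in text.lower().split():
        v[int(hashlib.md5(token.encode()).hexdigest(), 16) % dim] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

class ConversationIndex:
    """Embeds each exchange as it happens; retrieves the top-k for a new query."""
    def __init__(self) -> None:
        self.texts: list[str] = []
        self.vectors: list[np.ndarray] = []

    def add_exchange(self, text: str) -> None:
        self.texts.append(text)
        self.vectors.append(embed(text))

    def relevant_history(self, query: str, k: int = 4) -> list[str]:
        if not self.texts:
            return []
        sims = np.stack(self.vectors) @ embed(query)  # cosine: vectors are unit-norm
        return [self.texts[i] for i in np.argsort(sims)[-k:][::-1]]
```

Note that nothing in this path distinguishes "budget deadline" from "project deadline"; common mitigations include rewriting the query with recent context before embedding it, or always carrying the last few turns verbatim alongside whatever retrieval returns.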

The p99 Session That Breaks Your System

Every production AI application has a p99 session that nobody planned for. It's the customer support conversation that goes 200 turns because the user has a genuinely complicated problem. It's the research assistant session that accumulates tool outputs across two hours of autonomous operation. It's the user who picks up a conversation three days later and expects the model to remember every detail.

The naive approach hits a hard limit and the API returns an error. The user sees a generic failure message and loses their work. This is the worst outcome.
