Stateful Conversations at Database Scale: The Session Store Architecture Every Production Chat Feature Needs
Most engineers shipping chat features discover their session architecture is wrong in production, not in design review. The demo ran fine: you tested with five messages, the conversation history fit in memory, and the LLM responded coherently. Then you launched, and somewhere between the first thousand concurrent sessions and the first deployment rollout, users started experiencing forgotten context, partial responses, or conversations that reset without warning. The in-memory pattern that makes chat features trivial to prototype is precisely what makes them fragile to operate.
This is not a subtle architectural mistake. Conversation state is fundamentally different from request state. Request state lives for milliseconds; conversation state must survive pod restarts, horizontal scaling, deployment cycles, and mobile network interruptions — for minutes, hours, or days. Building on the wrong abstraction creates reliability debt that compounds as conversation length grows and user load increases.
The In-Memory Failure Taxonomy
When conversation history lives only in application process memory, it inherits the process lifetime. That means any of the following destroys it: a pod crash, an autoscaler scaling in, a deployment rolling pods, a serverless function timing out. Stateless infrastructure is the norm for good reason — but conversation history is state, and it needs to be treated as such.
The second failure mode is subtler: context window overflows. When you append raw message history to every LLM request without bound, you eventually hit the model's token limit. That can fail hard with an API error, or fail soft with silent truncation — where the LLM drops older messages without telling you, and the user gets responses that seem to have forgotten earlier context. Neither failure mode is visible in a staging environment with short test conversations.
The third failure mode runs counter to intuition: too much context degrades quality. Teams that naively persist everything and stuff the full history into every prompt find accuracy dropping to 40–45% on factual recall tasks — worse than no memory at all — because semantically distant old turns pollute the context. Full history retrieval consumes roughly 26,000 tokens per conversation; smart retrieval that selects relevant past exchanges consumes about 1,800 tokens and produces better results. More context is not always better context.
Finally, there is the race condition category: front-end state management bugs that send two concurrent POST requests to the same session — one with full context, one with only the latest message — causing non-deterministic context loss. This failure looks like intermittent model confusion but is actually a distributed state problem.
The Tiered Storage Pattern
The production solution that handles all these failure modes is tiered storage: hot memory for active context, persistent storage for durability, and archival for cost control.
The hot tier is Redis. It holds the last 20–50 messages for active sessions with a TTL aligned to your session timeout. Redis delivers sub-millisecond reads and writes, which matters because context assembly is on the critical path of every LLM request. Key structure: {session_id}:{user_id} mapping to a bounded message list, with a separate {user_id}:summary key for the accumulated summary of older turns.
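The bounded-window behavior can be sketched in a few lines. This is an in-memory stand-in for the Redis hot tier (class and helper names are illustrative, not from the article); with a real redis.Redis client the same shape maps to LPUSH + LTRIM on the message key and SETEX on the summary key.

```python
from collections import defaultdict, deque

HOT_WINDOW = 50  # keep at most the last 50 messages per session

class HotTier:
    """In-memory stand-in for the Redis hot tier (sketch only)."""

    def __init__(self):
        # deque(maxlen=...) evicts the oldest entry on overflow,
        # mirroring LPUSH + LTRIM on a Redis list
        self._lists = defaultdict(lambda: deque(maxlen=HOT_WINDOW))
        self._summaries = {}

    @staticmethod
    def message_key(session_id: str, user_id: str) -> str:
        return f"{session_id}:{user_id}"

    @staticmethod
    def summary_key(user_id: str) -> str:
        return f"{user_id}:summary"

    def append(self, session_id, user_id, message: dict):
        self._lists[self.message_key(session_id, user_id)].append(message)

    def recent(self, session_id, user_id):
        return list(self._lists[self.message_key(session_id, user_id)])

    def set_summary(self, user_id, text):
        self._summaries[self.summary_key(user_id)] = text

hot = HotTier()
for i in range(60):
    hot.append("sess1", "u42", {"role": "user", "content": f"msg {i}"})
window = hot.recent("sess1", "u42")  # only the most recent 50 survive
```

In a real deployment the TTL lives on the Redis key itself (EXPIRE), so idle sessions age out of the hot tier without any application logic.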
The persistent tier is a document store or relational database — DynamoDB, PostgreSQL, and MongoDB are all common choices. This is the source of truth. When Redis evicts a session or a pod restarts cold, the system reloads from here. Latency is 10–50ms, acceptable for cold-start context assembly.
The cold tier is object storage or archival database. Sessions older than 30–90 days live here at minimal storage cost. DynamoDB's TTL feature can automatically expire message items after 90 days, keeping the persistent tier lean.
When a request arrives and the Redis key is missing (cache miss), the system loads the most recent N messages plus the running summary from the persistent tier, warms Redis, and proceeds. The LLM only ever sees the reconstructed bounded context, not a growing raw transcript.
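The cache-miss path can be sketched as follows, using plain dicts as stand-ins for Redis and the persistent tier (names like `assemble_context` and `RECENT_N` are hypothetical):

```python
RECENT_N = 20  # bounded window loaded on a cold start

hot = {}  # stand-in for Redis: session_id -> {"messages": [...], "summary": str}
persistent = {  # stand-in for the source-of-truth store (DynamoDB/PostgreSQL)
    "sess1": {
        "messages": [{"role": "user", "content": f"msg {i}"} for i in range(100)],
        "summary": "User is debugging a Redis eviction issue.",
    }
}

def assemble_context(session_id: str) -> dict:
    entry = hot.get(session_id)
    if entry is None:  # cache miss: TTL eviction or cold pod
        record = persistent[session_id]
        entry = {
            "messages": record["messages"][-RECENT_N:],  # most recent N only
            "summary": record["summary"],
        }
        hot[session_id] = entry  # warm the hot tier for subsequent requests
    # the LLM always sees this bounded reconstruction, never the raw transcript
    return entry

ctx = assemble_context("sess1")
```

The key property is that the full 100-message transcript stays in the persistent tier; only the summary plus the recent window ever reaches the model.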
Compaction: Preventing Unbounded Growth
Even with tiered storage, you need a strategy for what happens as conversations accumulate. The naive answer — keep everything forever and load it all — produces the token cost and quality degradation problems described earlier. The production answer is compaction: systematically reducing older context while preserving meaningful information.
The most common approach is batch summarization. When a conversation accumulates 20 new messages, a background process triggers LLM summarization of that batch, writes the summary to the persistent tier, and removes the raw messages from the active window. New requests load the running summary plus the current working batch. The summary can itself be re-summarized hierarchically as it grows. This approach uses a smaller, cheaper model (such as GPT-4o-mini or an equivalent) for the summarization step, keeping costs low.
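A minimal sketch of the batch trigger, with the LLM summarization call replaced by a stub (the `Conversation` class and `BATCH_SIZE` constant are illustrative assumptions, not API from any library):

```python
BATCH_SIZE = 20  # compact after every 20 new messages

def summarize_batch(messages, prior_summary):
    """Stand-in for a cheap-model LLM call; here it just records
    that a batch was folded into the running summary."""
    return f"{prior_summary} [+{len(messages)} msgs summarized]".strip()

class Conversation:
    def __init__(self):
        self.summary = ""   # accumulated summary of older turns
        self.working = []   # current uncompacted batch

    def append(self, message):
        self.working.append(message)
        if len(self.working) >= BATCH_SIZE:
            # in production this runs as a background job, off the
            # request's critical path
            self.summary = summarize_batch(self.working, self.summary)
            self.working = []  # raw messages leave the active window

    def context(self):
        # what a new LLM request loads: summary + current batch
        return {"summary": self.summary, "messages": list(self.working)}

conv = Conversation()
for i in range(45):
    conv.append({"role": "user", "content": f"msg {i}"})
```

After 45 messages, two batches have been compacted and only 5 raw messages remain in the working window.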
More sophisticated systems use token-count thresholds rather than message counts. When the assembled context exceeds a configurable token budget, a compaction pipeline runs in order of aggressiveness:
- Collapse verbose tool call results (no LLM required, minimal information loss)
- Summarize the oldest spans (LLM call, moderate information loss, preserves intent)
- Apply a sliding window to keep only the most recent N turns (hard cutoff)
- Truncate oldest groups as a backstop (maximum information loss)
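The staged pipeline above can be sketched as a loop that stops as soon as the budget is satisfied. This is a simplified illustration (stage 2, LLM summarization, is omitted; the token estimate and 40-character threshold are placeholder assumptions):

```python
TOKEN_BUDGET = 100  # illustrative budget; real systems use model limits

def estimate_tokens(messages):
    # crude whitespace estimate; production uses the model's tokenizer
    return sum(len(m["content"].split()) for m in messages)

def collapse_tool_results(messages):
    # stage 1: shrink verbose tool output (no LLM call needed)
    out = []
    for m in messages:
        if m["role"] == "tool" and len(m["content"]) > 40:
            m = {**m, "content": m["content"][:40] + " [collapsed]"}
        out.append(m)
    return out

def sliding_window(messages, keep=10):
    # stage 3: hard cutoff on the most recent turns
    return messages[-keep:]

def compact(messages):
    # stage 2 (summarize oldest spans via an LLM) is omitted from this
    # sketch; each stage runs only if the budget is still exceeded
    for stage in (collapse_tool_results, sliding_window):
        if estimate_tokens(messages) <= TOKEN_BUDGET:
            break
        messages = stage(messages)
    return messages

history = []
for i in range(4):
    history.append({"role": "user", "content": f"question {i} " + "detail " * 5})
    history.append({"role": "tool", "content": "result " * 50})

compacted = compact(history)  # stage 1 alone gets under budget here
```

Because the gentlest stage already satisfies the budget in this example, no messages are dropped at all, which is exactly the point of ordering stages by aggressiveness.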
One constraint that trips teams up: you cannot remove a tool call message without removing its paired tool result, or vice versa. LLM APIs reject sequences with orphaned function calls. Compaction logic must treat these as atomic units or it will produce API errors that look like model failures.
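A sketch of the atomicity rule: when the cut point would strand a tool result whose call is being dropped, move the cut forward so the pair is removed together (function name and message shapes are illustrative; real messages pair via tool_call_id):

```python
def drop_oldest(messages, n):
    """Drop the oldest n messages, but never leave an orphaned tool
    result at the head of the remaining sequence."""
    cut = n
    while 0 < cut < len(messages) and messages[cut]["role"] == "tool":
        cut += 1  # its paired call is being dropped, so drop it as well
    return messages[cut:]

history = [
    {"role": "user", "content": "look this up"},
    {"role": "assistant", "content": "", "tool_calls": [{"id": "call_1"}]},
    {"role": "tool", "content": "lookup result", "tool_call_id": "call_1"},
    {"role": "user", "content": "thanks"},
]
```

Dropping two messages here would leave the tool result without its call, so the function drops three; dropping one leaves the call/result pair intact.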
Memory formation is a more selective alternative to summarization. Rather than summarizing everything, the system extracts durable facts — preferences, names, decisions, constraints — and stores them in a structured key-value or graph store. On future requests, these extracted facts are injected as a compact system prompt block rather than replayed as conversation history. Benchmark comparisons show memory formation achieves 80–90% token reduction and a 26% improvement in response quality compared to summarization alone, because facts are recalled precisely rather than paraphrased through multiple layers of compression.
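A toy sketch of the extract-and-inject shape. Production systems run an LLM extraction pass; the regex patterns, store layout, and function names here are purely illustrative stand-ins:

```python
import re

memory = {}  # user_id -> {fact_key: fact_value}; stand-in for a KV/graph store

def extract_facts(user_id, message):
    """Toy extractor for durable facts (a real system uses an LLM pass)."""
    m = re.search(r"my name is (\w+)", message, re.I)
    if m:
        memory.setdefault(user_id, {})["name"] = m.group(1)
    m = re.search(r"i prefer (\w+)", message, re.I)
    if m:
        memory.setdefault(user_id, {})["preference"] = m.group(1)

def memory_block(user_id):
    # injected as a compact system-prompt block, not replayed as history
    facts = memory.get(user_id, {})
    lines = [f"- {k}: {v}" for k, v in sorted(facts.items())]
    return "Known user facts:\n" + "\n".join(lines) if lines else ""

extract_facts("u1", "Hi, my name is Dana and I prefer Python.")
block = memory_block("u1")
```

The compact block costs a handful of tokens on every future request, versus replaying (or re-summarizing) the turns the facts came from.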
Data Model Details That Matter
The DynamoDB design used in production at scale demonstrates the data modeling constraints you will eventually hit. The hard 400 KB per-item size limit rules out the naive pattern of storing all messages as a nested list inside a single conversation item. Any non-trivial conversation exceeds this limit.
The correct model uses vertical partitioning: conversation metadata (user ID, title, created timestamp, settings) lives in one item class; each individual message lives in its own item. This means writing "hi" costs exactly one write capacity unit regardless of the total conversation size. Message items use a ULID (Universally Unique Lexicographically Sortable Identifier) as the sort key — a 26-character, lexicographically sortable identifier that enables efficient range queries over message sequences without secondary indexes.
Schema:
Conversation: PK=<UserID> SK=CONV#<ConversationID>
Message: PK=<UserID> SK=CHAT#<ConversationID>#MSG#<ULID>
The access patterns this supports: list all conversations for a user (query on PK prefix), load the most recent 20 messages for a session (query on PK + SK range, reverse sort, limit 20), append a new message (put item). All primary-key access, no scans.
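The key construction can be shown without any AWS dependency. Below is a simplified ULID generator (48-bit millisecond timestamp plus 80 random bits, Crockford base32) and the sort-key builders from the schema above; the helper names are illustrative:

```python
import os
import time

CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"  # no I, L, O, U

def _b32(value: int, length: int) -> str:
    # fixed-width Crockford base32; fixed width means lexicographic
    # string order matches numeric order
    chars = []
    for _ in range(length):
        chars.append(CROCKFORD[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))

def ulid() -> str:
    # 48-bit ms timestamp (10 chars) + 80 random bits (16 chars) = 26 chars
    ts = int(time.time() * 1000)
    rand = int.from_bytes(os.urandom(10), "big")
    return _b32(ts, 10) + _b32(rand, 16)

def message_sk(conversation_id: str, msg_ulid: str) -> str:
    return f"CHAT#{conversation_id}#MSG#{msg_ulid}"

def conversation_sk(conversation_id: str) -> str:
    return f"CONV#{conversation_id}"

first = ulid()
time.sleep(0.002)          # later timestamp => lexicographically later key
second = ulid()
```

Because later ULIDs sort after earlier ones as plain strings, "load the most recent 20 messages" becomes a reverse-sorted range query on the SK prefix, with no secondary index.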
For PostgreSQL users: the analogous pattern is a conversations table and a messages table with a foreign key, indexed on (user_id, conversation_id, created_at). OpenAI's ChatGPT session store runs on a single PostgreSQL primary with approximately 50 geo-distributed read replicas, handling hundreds of millions of users at low double-digit millisecond p99 read latency. The workload is heavily read-biased — many context lookups per message written — which maps naturally to a single primary with replica fan-out. No exotic distributed database required.
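The relational shape can be sketched end to end with Python's built-in sqlite3 so the example is self-contained; the DDL and the recent-messages query translate directly to PostgreSQL (table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE conversations (
    id         INTEGER PRIMARY KEY,
    user_id    TEXT NOT NULL,
    title      TEXT,
    created_at TEXT NOT NULL
);
CREATE TABLE messages (
    id              INTEGER PRIMARY KEY,
    conversation_id INTEGER NOT NULL REFERENCES conversations(id),
    user_id         TEXT NOT NULL,
    role            TEXT NOT NULL,
    content         TEXT NOT NULL,
    created_at      TEXT NOT NULL
);
-- the read-path index described above
CREATE INDEX idx_messages_lookup
    ON messages (user_id, conversation_id, created_at);
""")

conn.execute("INSERT INTO conversations VALUES (1, 'u42', 'Demo', '2025-01-01')")
for i in range(30):
    conn.execute(
        "INSERT INTO messages (conversation_id, user_id, role, content, created_at) "
        "VALUES (1, 'u42', 'user', ?, ?)",
        (f"msg {i}", f"2025-01-01T00:00:{i:02d}"),
    )

# the hot read path: most recent 20 messages, newest first
recent = conn.execute(
    "SELECT content FROM messages "
    "WHERE user_id = 'u42' AND conversation_id = 1 "
    "ORDER BY created_at DESC LIMIT 20"
).fetchall()
```

The composite index makes the read path an index-ordered scan, which is what lets a replica fan-out absorb the many-reads-per-write workload.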
Cross-Device Session Continuity
Most production implementations handle a single-session, single-device model well. The architecture breaks down when users switch devices mid-conversation, close a mobile app and reopen it, or access the same session from a desktop and phone simultaneously.
The root problem is that session state is typically tied to a connection, not to a user identity. When the connection drops and reconnects, the server either (a) has no record of the session, requiring the user to start over, or (b) has stale context that doesn't reflect what happened during the gap.
The solution is channel-based session architecture: state lives in a durable channel keyed by user identity, not by connection ID. Any device subscribing to the channel receives the same message history. Devices subscribe and unsubscribe as they connect and disconnect; the channel persists independently.
Specific patterns for cross-device:
- Store complete AI responses in channel history, not individual streaming tokens. When a new device reconnects, it replays from channel history efficiently.
- Track device presence. When no devices are connected, pause any running agent generation to avoid burning tokens on output nobody will read.
- Handle conflicting states with a canonical source of truth in the persistent tier — the channel is a view of the database, not a replacement for it.
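The channel shape above can be sketched as follows. This in-memory model (class and function names are hypothetical) shows the two properties that matter: history is keyed by user identity and survives device churn, and presence tracking makes "is anyone listening" a cheap check:

```python
from collections import defaultdict

class Channel:
    """Durable channel keyed by user identity (in-memory sketch).
    Devices come and go; the channel and its history persist."""

    def __init__(self):
        self.history = []    # complete responses, not raw streaming tokens
        self.devices = set() # currently connected device IDs

    def publish(self, message):
        self.history.append(message)
        # fan-out to each connected device would happen here

    @property
    def anyone_listening(self):
        return bool(self.devices)

channels = defaultdict(Channel)

def connect(user_id, device_id):
    ch = channels[user_id]
    ch.devices.add(device_id)
    return list(ch.history)  # replay history to the (re)connecting device

def disconnect(user_id, device_id):
    channels[user_id].devices.discard(device_id)

connect("u1", "phone")
channels["u1"].publish({"role": "assistant", "content": "done"})
disconnect("u1", "phone")
replayed = connect("u1", "laptop")  # a new device sees the same history
```

The `anyone_listening` check is where you would pause in-flight agent generation when every device has disconnected.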
Among AI platforms evaluated in 2025, 32 of 37 had no meaningful multi-device fan-out capability. It is a commonly skipped feature that becomes a significant support burden as mobile usage grows.
When to Add Which Layer
Not every chat feature needs all of this. A rough decision matrix:
In-memory session state is sufficient when: conversations are short (under 10 turns), sessions don't need to survive server restarts, you're prototyping or in early access.
Add Redis + persistent storage when: you hit your first deployment that drops user conversations, sessions need to survive pod cycling, or you're serving more than a few hundred concurrent users.
Add compaction when: average conversation length exceeds 20–30 turns, token costs from context assembly are measurable in your billing, or you observe quality degradation in long conversations.
Add vector retrieval when: you need non-linear memory (retrieve what's relevant, not what's recent), sessions span days or weeks with idle periods, or you're building a personal assistant that needs to recall facts from months ago.
Add workflow-per-conversation durability when: conversations run for hours, span external integrations (email sends, database writes), and require exactly-once semantics for in-progress steps. The complexity cost is significant; only worth it when conversation continuity is a core product guarantee.
The Architectural Seam Most Teams Miss
The compaction strategy and the data model are implementation details. The real architectural decision is where session state lives in your system and who owns the contract for loading and hydrating it.
Teams that bolt context management onto the request handler — assembling context inside the same function that calls the LLM API — end up with context logic scattered across features, duplicated compression code, and no clean way to test the assembly logic independently. Teams that extract a SessionContextLoader abstraction — a single component responsible for loading bounded context given a session ID — end up with testable, replaceable context strategies.
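A minimal sketch of that seam, assuming a structural interface plus one swappable strategy (the `SessionContextLoader` name comes from the text; everything else is illustrative):

```python
from typing import Protocol

class SessionContextLoader(Protocol):
    """The single component that owns the context-loading contract."""
    def load(self, session_id: str) -> list[dict]: ...

class WindowedLoader:
    """One swappable strategy: running summary + last-N window."""

    def __init__(self, store, window: int = 20):
        self.store = store    # any mapping-like persistent tier
        self.window = window

    def load(self, session_id: str) -> list[dict]:
        record = self.store[session_id]
        context = []
        if record.get("summary"):
            context.append({"role": "system", "content": record["summary"]})
        context.extend(record["messages"][-self.window:])
        return context

def handle_request(loader: SessionContextLoader, session_id, user_msg):
    # the request handler no longer assembles context itself
    context = loader.load(session_id)
    return context + [{"role": "user", "content": user_msg}]

store = {
    "s1": {
        "summary": "Earlier: setup questions.",
        "messages": [{"role": "user", "content": f"m{i}"} for i in range(30)],
    }
}
prompt = handle_request(WindowedLoader(store), "s1", "next question")
```

Swapping in a compaction-based or memory-formation strategy now means writing another class with the same `load` signature; the request handlers and their tests never change.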
The session store architecture is worth designing before you hit the scaling wall because retrofitting it means rewriting both the storage layer and every place that touches conversation history. Unlike most infrastructure debt, this one surfaces as a user-visible product bug — forgotten context — rather than a silent performance regression. By the time users are reporting it, the refactor is urgent.
The good news is that the patterns are well-established. Tiered storage with Redis hot cache and PostgreSQL or DynamoDB persistence handles the durability problem. Batch summarization or memory formation handles the token cost problem. ULID-keyed vertical partitioning handles the data model problem. None of these are novel; they are the distributed systems building blocks you already know, applied to a statefulness problem that happens to involve language models.
Sources

- https://github.com/aws-samples/managing-chat-history-and-context-at-scale-in-generative-ai-chatbots
- https://aws.amazon.com/blogs/database/amazon-dynamodb-data-models-for-generative-ai-chatbots/
- https://openai.com/index/scaling-postgresql/
- https://www.infoq.com/news/2026/02/openai-runs-chatgpt-postgres/
- https://render.com/articles/real-time-ai-chat-websockets-infrastructure
- https://mem0.ai/blog/llm-chat-history-summarization-guide-2025
- https://serokell.io/blog/design-patterns-for-long-term-memory-in-llm-powered-architectures
- https://dev.to/inboryn_99399f96579fcd705/state-management-patterns-for-long-running-ai-agents-redis-vs-statefulsets-vs-external-databases-39c5
- https://learn.microsoft.com/en-us/agent-framework/agents/conversations/compaction
- https://temporal.io/blog/building-a-persistent-conversational-ai-chatbot-with-temporal
- https://ably.com/blog/ai-session-continuity-cross-device-channel-based-architecture
- https://redis.io/blog/build-smarter-ai-agents-manage-short-term-and-long-term-memory-with-redis/
- https://www.pinecone.io/learn/series/langchain/langchain-conversational-memory/
