3 posts tagged with "session-management"

Stateful Multi-Turn Conversation Infrastructure: Beyond Passing the Full History

· 11 min read
Tian Pan
Software Engineer

Every demo of a conversational AI feature does the same thing: pass a list of messages to the model and print the response. The happy path works, looks great in a Jupyter notebook, and gets you a green light to ship. Then you get to production, and your p99 latency starts creeping up during peak hours. A month later, a customer complains that the assistant "forgot" everything from earlier in the session. Six weeks after that, your session store hits its memory ceiling during a product launch.

The fundamental problem is that "pass the full conversation history" is not a session management strategy. It is the absence of one.
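The anti-pattern is easy to recognize in code. A minimal sketch, with `call_model` as a hypothetical stand-in for any chat-completion API:

```python
# The naive pattern: append every turn and resend the entire history.
# `call_model` is a hypothetical stand-in for a real chat-completion call.
def call_model(messages: list[dict]) -> str:
    return f"(reply to {len(messages)} messages)"

history: list[dict] = []

def naive_turn(user_text: str) -> str:
    history.append({"role": "user", "content": user_text})
    # The request payload grows with every turn: per-turn cost is linear
    # in session length, so total session cost grows quadratically.
    reply = call_model(history)
    history.append({"role": "assistant", "content": reply})
    return reply
```

Nothing here decides what to keep, what to summarize, or what to evict, which is exactly the point: the strategy is missing, not merely simple.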

The Multi-Turn Session State Collapse Problem

· 10 min read
Tian Pan
Software Engineer

Your per-request error rates look clean. Latency is within SLO. The LLM judge is scoring outputs at 87%. And then a user files a support ticket: "I told the bot my account number three times. It just asked me again." A different user: "It agreed to a refund, then two turns later denied the policy existed."

Single-turn failures are visible. The request comes in, the model hallucinates or refuses, your eval catches it, you fix the prompt. The feedback loop is tight. Multi-turn failures work differently: the session starts fine, degrades gradually turn by turn, and your monitoring never fires because each individual response is technically coherent. The problem is the session as a whole — and almost no team instruments for that.
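Session-level instrumentation does not have to be sophisticated to catch what per-turn checks miss. One naive, hypothetical signal: how often the assistant re-asks a question it already asked earlier in the same session (the "account number three times" failure). Each individual question is perfectly coherent; only the session-level view reveals the problem.

```python
def repeated_question_rate(session: list[dict]) -> float:
    """Fraction of assistant questions that repeat an earlier question
    in the same session. A crude session-level signal: per-turn evals
    score each question as fine in isolation."""
    seen: set[str] = set()
    asked = repeats = 0
    for msg in session:
        if msg["role"] != "assistant":
            continue
        # Treat each "?"-terminated sentence as one question.
        for sentence in msg["content"].split("?")[:-1]:
            question = sentence.strip().lower()
            asked += 1
            if question in seen:
                repeats += 1
            seen.add(question)
    return repeats / asked if asked else 0.0
```

Exact-match comparison is deliberately crude; the point is that the metric is computed over the whole session, not over a single request.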

Research across major frontier models (Claude 3.7 Sonnet, GPT-4.1, Gemini 2.5 Pro) shows an average 39% performance drop when moving from single-turn to multi-turn conversations. That number hides the real story: only about 16 percentage points of the drop come from lost capability. The other 23 points are a reliability crisis — the gap between a model's best and worst performance on the same task doubles as conversation length grows. You're not just getting worse outputs; you're getting inconsistent ones.
Stateful Conversations at Database Scale: The Session Store Architecture Every Production Chat Feature Needs

· 10 min read
Tian Pan
Software Engineer

Most engineers shipping chat features discover their session architecture is wrong in production, not in design review. The demo ran fine: you tested with five messages, the conversation history fit in memory, and the LLM responded coherently. Then you launched, and somewhere between the first thousand concurrent sessions and the first deployment rollout, users started experiencing forgotten context, partial responses, or conversations that reset without warning. The in-memory pattern that makes chat features trivial to prototype is precisely what makes them fragile to operate.

This is not a subtle architectural mistake. Conversation state is fundamentally different from request state. Request state lives for milliseconds; conversation state must survive pod restarts, horizontal scaling, deployment cycles, and mobile network interruptions — for minutes, hours, or days. Building on the wrong abstraction creates reliability debt that compounds as conversation length grows and user load increases.
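The structural fix is to move conversation state out of the process entirely. A minimal sketch, using sqlite3 as a stand-in for whatever durable store you actually run (Redis, DynamoDB, Postgres); the `SessionStore` interface is illustrative, not a prescribed API:

```python
import json
import sqlite3
import time

class SessionStore:
    """Conversation state in a durable store: unlike an in-process dict,
    it survives pod restarts, rolling deploys, and rebalanced traffic."""

    def __init__(self, path: str):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS sessions"
            " (id TEXT PRIMARY KEY, messages TEXT, updated REAL)"
        )

    def append(self, session_id: str, role: str, content: str) -> None:
        messages = self.load(session_id)
        messages.append({"role": role, "content": content})
        # Upsert the full message list keyed by session id.
        self.db.execute(
            "INSERT INTO sessions (id, messages, updated) VALUES (?, ?, ?)"
            " ON CONFLICT(id) DO UPDATE SET messages = excluded.messages,"
            " updated = excluded.updated",
            (session_id, json.dumps(messages), time.time()),
        )
        self.db.commit()

    def load(self, session_id: str) -> list[dict]:
        row = self.db.execute(
            "SELECT messages FROM sessions WHERE id = ?", (session_id,)
        ).fetchone()
        return json.loads(row[0]) if row else []
```

Because any replica can `load` the session by id, a deploy or pod restart no longer erases the conversation — the failure mode the in-memory prototype hides until launch.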