Session Stitching: Why Your Conversation-ID Is a Lie
A user starts negotiating a contract with your agent on her desktop at 9 a.m. She gets a Slack ping, switches to her phone over lunch to ask one clarifying question, and reopens the desktop tab at 4 p.m. to revise the draft. To her, that was one task — three hours of working through one contract. To your system, that was three sessions on two devices, each with its own conversation-id, each with its own memory window, each presenting a fresh greeting and asking her to re-paste the draft she'd already discussed twice.
The bug is not in the model. The bug is that your platform encoded "session" — a transport-layer artifact about a single connection — as the unit of context, while your user encoded "task" — the contract — as the unit of context. Every framework on the market quietly conflates the two, and the gap between them is where half of agent UX disappears.
This is not a niche complaint. Once you start logging task-level traces, you find that a meaningful share of "new conversations" are actually continuations of unfinished tasks — the user gave up on stitching them by hand and started over. The product KPI you call "engagement" partly measures users paying the cost of your missing abstraction.
The Frameworks Hand You a Session-ID and Call It Memory
Open any agent SDK from the last two years and the persistence story rhymes. LangGraph asks you to pass a thread_id; the checkpointer saves graph state per thread, separate from every other thread, and resumes when you re-invoke with the same id. The OpenAI Agents SDK gives you a Session keyed by a session id (SQLiteSession("conversation_123")); same session, full history; new session, blank slate. The Claude Agent SDK persists a session to disk so you can return to it later. Google's ADK ships a Resume feature that picks up an interrupted workflow run.
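The shared behavior is easy to see framework-agnostically. A minimal sketch — not any specific SDK's API, just the pattern they rhyme on — where memory is keyed by session id, so a freshly minted id means a blank slate:

```python
# Minimal sketch of session-keyed persistence, the pattern the SDKs above share.
# Not any real framework's API: a new session id means an empty history.
from collections import defaultdict

class SessionStore:
    def __init__(self):
        self._histories: dict[str, list[str]] = defaultdict(list)

    def append(self, session_id: str, turn: str) -> None:
        self._histories[session_id].append(turn)

    def history(self, session_id: str) -> list[str]:
        # Same id: full history back. New id: empty list, full stop.
        return list(self._histories[session_id])

store = SessionStore()
store.append("desktop-9am", "user: here is the Acme contract draft ...")
store.append("desktop-9am", "agent: clause 4 needs an indemnity cap ...")

# The phone at lunch mints a fresh session id -> the morning is gone.
assert store.history("phone-noon") == []
assert len(store.history("desktop-9am")) == 2
```

Nothing in this store knows that "desktop-9am" and "phone-noon" belong to the same contract; that knowledge lives only in the user's head.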
These primitives are correct at the layer they target — they reliably persist state for a single logical run. But none of them defines what a "logical run" is from the user's point of view, and the default that ships in tutorials is: one session per WebSocket connect, one thread per browser tab, one resume per process. The platform decides where the boundary falls, and the boundary almost always lands in the wrong place.
When you trace a real user, the seams are not where the framework draws them. The user does not switch tasks because their TLS connection idled out. They switch tasks when they finish negotiating the contract and start onboarding a vendor. The transport-layer session-id and the user's mental task boundary are two different things that happen to coincide some of the time, and your product is staking continuity on a coincidence.
Task-ID Has To Be First-Class — And Orthogonal
The fix is not to make sessions live longer. Long sessions accumulate unrelated history, and if your model upgrade later trims context, you cannot decide what to keep, because you never recorded what belongs to what.
The fix is to introduce a task_id that is user-meaningful and orthogonal to session_id. Sessions still exist — they describe a connection, a device, a process lifetime — but they are no longer the unit of memory. A task is the unit of memory, and a session is just one slice of activity that happened to belong to a task.
Concretely:
- A task has a name the user understands ("Acme contract revision," not "conv-7e3f"). The user names it explicitly, or your agent proposes a name from the first turn and lets the user accept or rename.
- A task has a working set: the documents, decisions, open questions, and partial outputs that constitute its in-progress state. This working set is durable across sessions and is the thing your resumption summary is built from.
- A session has a task_id foreign key. When a user reconnects on a new device, the platform looks up open tasks for that user, surfaces them, and asks which one this session continues — or lets the user start a new task explicitly.
- The agent's continuation UX is "you were working on Acme contract revision — pick up where you left off?" with a one-paragraph summary of state. It is not a chronological scroll of "yesterday's chat" that forces the user to re-derive what they were doing.
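The shape above fits in two tables. A sketch with hypothetical table and column names, using SQLite for concreteness — tasks own the durable working set, sessions are slices that point at a task:

```python
import sqlite3

# Hypothetical schema for the task/session split described above.
# Tasks own the working set; sessions are just slices that reference a task.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tasks (
    task_id     TEXT PRIMARY KEY,
    user_id     TEXT NOT NULL,
    name        TEXT NOT NULL,   -- user-meaningful: 'Acme contract revision'
    working_set TEXT NOT NULL,   -- JSON: documents, decisions, open questions
    status      TEXT NOT NULL DEFAULT 'open'
);
CREATE TABLE sessions (
    session_id TEXT PRIMARY KEY,
    task_id    TEXT NOT NULL REFERENCES tasks(task_id),
    device     TEXT NOT NULL,
    started_at TEXT NOT NULL
);
""")
conn.execute("INSERT INTO tasks VALUES ('t1','u1','Acme contract revision','{}','open')")
conn.execute("INSERT INTO sessions VALUES ('s-desktop','t1','desktop','2025-01-06T09:00')")
conn.execute("INSERT INTO sessions VALUES ('s-phone',  't1','phone',  '2025-01-06T12:30')")

# The lookup the continuation UX needs: which open tasks can this user resume,
# and from how many sessions have they touched each one?
rows = conn.execute("""
    SELECT t.name, COUNT(s.session_id) AS sessions
    FROM tasks t JOIN sessions s USING (task_id)
    WHERE t.user_id = 'u1' AND t.status = 'open'
    GROUP BY t.task_id
""").fetchall()
assert rows == [("Acme contract revision", 2)]
```

The query at the end is the join the rest of this piece keeps returning to: without it there is no way to ask "what was this user in the middle of?"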
ChatGPT Projects is the closest mainstream pattern that gets this shape right: a project is a durable container, conversations live inside it, and the user can switch devices without losing where they are. It is not perfect, but it demonstrates that a task abstraction above sessions is feasible at consumer scale.
The point is not that everyone needs a Projects clone. The point is that your data model needs an explicit join between sessions and tasks, owned by your team, exposed to your eval suite, and visible in your telemetry. If you do not have that join, you do not have task continuity — you have hope.
Cross-Device Continuity Is an Authorization Problem in Disguise
The moment you commit to durable, cross-device tasks, the auth model you built for ephemeral sessions stops generalizing.
A short-lived session-bound credential is fine when the user, the device, and the conversation all live and die together. The contract is: this token authorizes this connection; lose the connection, lose the token. But a task that survives across devices needs a different contract: this token authorizes resuming this task from a different device, possibly without the original cookie, possibly after a reboot, possibly while the user is also actively connected from another device.
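The new contract can be made concrete as a task-scoped resumption credential, separate from the session cookie. A sketch with hypothetical claim names, using HMAC signing for brevity — a real deployment would use a vetted token format such as JWT and the cross-device guidance cited later:

```python
# Sketch of a task-resumption credential, distinct from the session cookie.
# Claim names are illustrative; use a vetted token format in production.
import base64, hashlib, hmac, json, time

SECRET = b"server-side-key"  # assumption: held server-side and rotated

def mint_resumption_token(user_id: str, task_id: str, ttl_s: int = 900) -> str:
    claims = {"sub": user_id, "task": task_id, "exp": time.time() + ttl_s}
    body = base64.urlsafe_b64encode(json.dumps(claims).encode()).decode()
    sig = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    return body + "." + sig

def resume(token: str, task_id: str) -> dict:
    body, sig = token.rsplit(".", 1)
    want = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, want):
        raise PermissionError("bad signature")
    claims = json.loads(base64.urlsafe_b64decode(body))
    if claims["exp"] < time.time():
        raise PermissionError("expired")
    if claims["task"] != task_id:
        raise PermissionError("token is scoped to a different task")
    return claims  # the new device's session now joins this task

tok = mint_resumption_token("u1", "t1")
assert resume(tok, "t1")["sub"] == "u1"
```

The key design point is the "task" claim: the credential authorizes resuming one task, not impersonating the whole account, which is what a stolen long-lived session cookie would allow.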
Recent work on cross-device flows has pushed this into a concrete protocol space. The IETF's Cross-Device Flows BCP draft formalizes session-transfer flows where a user authorizes the transfer on an Authorization Device and then consumes the session from a Consumption Device, with state preserved across the boundary. Device-Bound Session Credentials (DBSC) bind a session cookie to a device-held private key so it cannot be silently lifted. Both of these matter because they tell you that your cross-device resumption is going to negotiate with the same constraints that the auth-protocol world has been wrestling with — only now you are negotiating them on top of a durable task store full of in-progress agent state, including draft contracts and partially executed tool calls.
Two practical consequences:
- Tasks need an access policy distinct from session auth. Who can resume a task? Only the original user? A delegated teammate? On any device, or only on enrolled devices? You will be asked these questions; if you have not modeled them, the safe default is "deny," and your continuity disappears.
- Concurrent sessions on the same task become a real concern. If desktop and phone both have the task open, whose edits win, what does the agent do with two streams of input, and how do you reconcile divergent tool calls? The naive answer is "last writer wins"; the user-trust answer is closer to OT or CRDT semantics on the working set, which is more engineering than most teams plan for. Even acknowledging the conflict in the UX ("your phone is also editing this — do you want to take over?") beats letting the model produce two divergent revisions.
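The first consequence — a deny-by-default access policy on tasks — can be sketched in a few lines. Field and function names here are hypothetical, not any framework's API:

```python
# Deny-by-default access policy for task resumption. Hypothetical shape:
# unknown users are denied, and device enrollment is enforced when configured.
from dataclasses import dataclass, field

@dataclass
class TaskPolicy:
    owner: str
    delegates: set[str] = field(default_factory=set)        # teammates who may resume
    enrolled_devices: set[str] = field(default_factory=set)  # empty set = any device

def may_resume(policy: TaskPolicy, user: str, device: str) -> bool:
    if user != policy.owner and user not in policy.delegates:
        return False  # not the owner, not delegated: deny
    if policy.enrolled_devices and device not in policy.enrolled_devices:
        return False  # enrollment required and this device is not enrolled: deny
    return True

policy = TaskPolicy(owner="alice", enrolled_devices={"alice-desktop", "alice-phone"})
assert may_resume(policy, "alice", "alice-phone")
assert not may_resume(policy, "bob", "alice-phone")    # no delegation granted
assert not may_resume(policy, "alice", "hotel-kiosk")  # unenrolled device
```

Even this toy version answers the questions the text poses — who, and from which devices — which is what lets you choose something other than a blanket "deny" when the questions arrive.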
Evaluate Task-Continuity, Not Turn-Quality
Eval suites for agents have spent the last two years optimizing for turn-quality and single-session task completion. The metrics look right; the production failures still happen, because the failures live in the gaps between sessions, not within them.
Recent benchmarks are starting to expose this. MemoryArena uses curated multi-session tasks with interdependent subtasks, where information introduced in an early session imposes latent constraints on a later one — exactly the situation a real user creates by scattering context across devices. Reported success rates of state-of-the-art agent memory methods on these tasks are unflatteringly low, even when the same models perform well on single-session benchmarks. MemoryAgentBench (ICLR 2026) takes a similar "inject once, query multiple times" stance and surfaces failures in long-range understanding and conflict resolution that flat-history evals never catch.
The eval discipline a real product needs:
- Synthetic multi-session traces that include device switches, idle gaps of hours and days, and at least one piece of context introduced before the gap that must be recalled after it.
- A scoring function that grades the resumption summary the agent produces, not just the final answer. If the summary tells the user "you were working on the Acme contract" but silently drops the constraint about the indemnity clause, the agent has failed even if the next turn happens to be correct.
- Regression tests that fire on model upgrades and re-run the multi-session traces end-to-end. A model swap that improves single-turn quality by two points but reduces resumption-summary fidelity by ten is a regression you ship into production with no warning, because your eval suite never measured it.
- A failure taxonomy that distinguishes within-session reasoning errors from cross-session memory errors. The fixes are different — one points at the prompt, the other points at retrieval and summarization.
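The first two items can be sketched together: a multi-session eval case that plants facts before the gap, and a scorer that grades the resumption summary on whether those facts survived. Names and the string-match scoring are illustrative — a real grader would use an LLM judge or an entailment model:

```python
# Sketch of a cross-session eval: facts introduced before the idle gap must
# survive into the resumption summary. Structure and scoring are illustrative.
from dataclasses import dataclass

@dataclass
class MultiSessionCase:
    pre_gap_turns: list[str]   # session 1, e.g. on desktop
    gap_hours: float           # idle gap, possibly with a device switch
    required_facts: list[str]  # constraints the summary must preserve

def score_resumption_summary(case: MultiSessionCase, summary: str) -> float:
    # Fraction of pre-gap constraints retained (crude substring match here;
    # substitute an LLM judge or entailment check in a real suite).
    hits = sum(f.lower() in summary.lower() for f in case.required_facts)
    return hits / len(case.required_facts)

case = MultiSessionCase(
    pre_gap_turns=["user: cap indemnity at 2x fees", "user: net-45 payment terms"],
    gap_hours=7.0,
    required_facts=["indemnity", "net-45"],
)
good = "You were revising the Acme contract: indemnity capped at 2x fees, net-45 terms."
bad = "You were revising the Acme contract."
assert score_resumption_summary(case, good) == 1.0
assert score_resumption_summary(case, bad) == 0.0
```

Note what the `bad` summary illustrates: it is friendly, fluent, and a complete failure — exactly the case a final-answer-only eval never flags.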
If your evals only score single-session task completion, you are flying blind on the half of the user experience where the framework abandoned you.
The Org Failure Mode: A Workflow Product Built on a Chat Product
The most common org-level failure here is invisible. The product team thinks they shipped a chat product; the user thinks they bought a workflow product. The user lays a workflow on top of the chat surface — long-running tasks, multi-day arcs, cross-device handoff — and the team only learns about it from churn interviews after a model upgrade or a session-id change wipes a customer's working set.
A few signs that this is happening to you:
- Your support queue has a recurring shape of "I lost my conversation when I switched to my phone." Volume probably went up the last time anyone changed how sessions are minted.
- Power users send screenshots of long chats with a personal naming convention in the first message, because they have invented a task_id your platform refuses to provide.
- Your top quartile of users by session length has a near-flat retention curve and the bottom three quartiles do not — the workflow users figured out a workaround, the casual users gave up.
- Engineers periodically debate "should we just make sessions never expire?" The fact that the question keeps coming back is the symptom; the disease is that session is the wrong primitive.
The architectural realization is unglamorous. A session is an implementation detail your transport layer needs to keep track of connections. A task is the unit your user actually cares about. Conflating them ships a UX that throws away half the user's work the moment they switch tabs, and the framework you reached for will not flag this as a bug because the framework's contract ends at the session boundary.
Build the Stitch Before You Need It
The pragmatic order of operations:
- Add a task_id column to your persistence layer this quarter, even if no UI surfaces it yet. Backfill it from session_id one-to-one to start. The cost is small; the option value is the foundation for everything else.
- Instrument: log when a user starts a new session within N hours of an old one ending. You are already doing this for security; do it for product. The volume tells you how much task continuity you owe your users.
- Ship a minimal continuation UX — "resume this task?" with an agent-generated one-paragraph summary — behind a flag for power users. Watch what they rename, what they abandon, and what they correct in the summary. That feedback is your eval set.
- Add cross-device resumption only after you have an auth model for it. Do not retrofit it onto session cookies; you will regret it the first time a token leaks.
- Add cross-session evals to your release gate before you next swap models. The cost of writing them is bounded; the cost of a silent regression in resumption-summary quality is not.
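The instrumentation step is a few lines over the session log you already keep. A sketch with illustrative field names — it pairs each new session with the same user's previous session when the gap is under N hours, which is your raw count of continuity owed:

```python
# Sketch of the continuation signal: flag sessions starting within N hours of
# the same user's previous session ending. Field names are illustrative.
from datetime import datetime, timedelta

def likely_continuations(sessions: list[dict], n_hours: float = 12.0) -> list[tuple[str, str]]:
    pairs: list[tuple[str, str]] = []
    by_user: dict[str, list[dict]] = {}
    for s in sorted(sessions, key=lambda s: s["started"]):
        prev = by_user.setdefault(s["user"], [])
        if prev and s["started"] - prev[-1]["ended"] <= timedelta(hours=n_hours):
            pairs.append((prev[-1]["id"], s["id"]))  # old session -> probable continuation
        prev.append(s)
    return pairs

log = [
    {"id": "desktop-am", "user": "u1",
     "started": datetime(2025, 1, 6, 9), "ended": datetime(2025, 1, 6, 10)},
    {"id": "phone-noon", "user": "u1",
     "started": datetime(2025, 1, 6, 12, 30), "ended": datetime(2025, 1, 6, 12, 40)},
    {"id": "next-week", "user": "u1",
     "started": datetime(2025, 1, 13, 9), "ended": datetime(2025, 1, 13, 10)},
]
assert likely_continuations(log) == [("desktop-am", "phone-noon")]
```

The lunchtime phone session pairs with the morning desktop session; the session a week later does not. The ratio of paired to unpaired sessions is the number that makes the rest of this roadmap fundable.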
The teams that get this right will not advertise it. The UX will simply feel less hostile — the agent will know the user came back, will not ask for the contract draft a third time, and will not silently pretend the morning's conversation never happened. The teams that do not get this right will keep shipping features into the chat surface and wondering why retention does not move.
Either way, the user has already decided what the unit of work is. The only question is whether your platform admits it.
- https://docs.langchain.com/oss/python/langgraph/persistence
- https://openai.github.io/openai-agents-python/sessions/
- https://platform.claude.com/docs/en/agent-sdk/sessions
- https://help.openai.com/en/articles/10169521-using-projects-in-chatgpt
- https://openai.com/index/memory-and-new-controls-for-chatgpt/
- https://datatracker.ietf.org/doc/html/draft-ietf-oauth-cross-device-security-15
- https://w3c.github.io/webappsec-dbsc/
- https://arxiv.org/html/2602.16313
- https://arxiv.org/html/2507.05257v1
- https://google.github.io/adk-docs/runtime/resume/
- https://temporal.io/blog/building-a-persistent-conversational-ai-chatbot-with-temporal
- https://www.langchain.com/conceptual-guides/runtime-behind-production-deep-agents
