
The Session Boundary Problem: Where a Conversation Ends for Billing, Eval, and Memory

Tian Pan
Software Engineer

Three teams are looking at the same event stream, each with a column called session_id, and each with a different definition of what a session is. Billing inherited a 30-minute idle window from the auth library. Eval inherited "everything until the user says 'bye' or stops typing for 10 minutes" from a chatbot framework. Memory uses a thread ID that the UI generates whenever the user clicks "New chat" — which most users never do. Three columns, three semantics, one rolled-up dashboard, and three seemingly unrelated bugs that share a root cause.

This is the session boundary problem. It looks like an instrumentation nit, but it is actually a product question wearing infrastructure clothes: where does a conversation end? The honest answer is that there is no single answer — a session for billing is not the same object as a session for eval, which is not the same object as a session for memory — and a team that picks one default and lets the other two inherit it is shipping a billing dispute, an eval bias, and a memory leak, all from that one inherited default.

Three definitions hiding in one column

A "session" in production telemetry is overloaded the way "user" is overloaded — everyone thinks they know what it means, and everyone is using a slightly different definition than their neighbor. The three callers that matter most each want something different.

Billing wants a unit of cost that maps to a unit of value. For a per-conversation pricing tier (some support platforms charge $0.49–$0.99 per resolution), the boundary has to align with what the customer perceives as one interaction — otherwise a user who closes the tab to grab coffee and comes back fifteen minutes later either gets billed twice for one task, or stays inside a "session" that never ends and undercounts revenue. For a per-token pricing tier passed through to the customer, the boundary determines which inputs get charged against which budget, which agent owns the cost, and how the monthly invoice gets sliced.
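To see how much the boundary alone decides, here is a minimal sketch, assuming a toy Turn record and an idle-window sessionizer (both names are illustrative, not any real billing API): the same four turns, with a fifteen-minute coffee break in the middle, yield one billable session or two depending solely on the window.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    t: float      # minutes since the first message
    tokens: int

def billable_sessions(turns: list[Turn], idle_window: float) -> list[list[Turn]]:
    """Group turns into billing sessions, cutting on idle gaps."""
    sessions: list[list[Turn]] = []
    for turn in turns:
        if sessions and turn.t - sessions[-1][-1].t <= idle_window:
            sessions[-1].append(turn)   # gap within the window: same session
        else:
            sessions.append([turn])     # gap exceeded the window: new session
    return sessions

# One task, with a fifteen-minute coffee break between turns 2 and 3.
turns = [Turn(0, 400), Turn(2, 350), Turn(17, 500), Turn(19, 300)]

print(len(billable_sessions(turns, idle_window=30)))  # 1 -> charged once
print(len(billable_sessions(turns, idle_window=10)))  # 2 -> charged twice
```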

Eval wants a unit of task completion. The most informative multi-turn metric is conversation completeness — did the user's intent get fulfilled — and that requires the eval pipeline to know which turns belong to which intent. If a user spends the morning planning a trip, breaks for lunch, and comes back to ask about something unrelated, the eval pipeline that lumps both into one "session" will mark a successful trip-planning interaction as incomplete because the unrelated follow-up never reached resolution. Research from 2025 found that frontier models drop ~39% in performance from single-turn to multi-turn settings — measure that drop against a session boundary that doesn't match the actual task and the number is meaningless.
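A toy version of that bias, with a hypothetical GradedTurn record standing in for a real completeness judge: grading the lumped stream marks the morning's successful trip-planning as incomplete, while grading the intent segments separately recovers it.

```python
from dataclasses import dataclass

@dataclass
class GradedTurn:
    intent: str
    resolved: bool   # did this turn close out its intent?

turns = [
    GradedTurn("plan-trip", False),
    GradedTurn("plan-trip", True),     # trip planning reached resolution
    GradedTurn("unrelated-q", False),  # post-lunch question, never resolved
]

def complete(session: list[GradedTurn]) -> bool:
    # A session counts as complete only if its final intent was resolved.
    return session[-1].resolved

print(complete(turns))      # False: one lumped session hides the success
print(complete(turns[:2]))  # True:  the trip-planning segment on its own
print(complete(turns[2:]))  # False: only the unrelated follow-up fails
```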

Memory wants a unit of relevance. The framing here is what does the model recall when the user asks about "the project we discussed last week" — and the answer depends on whether memory is scoped per-conversation (the user has to be in the same thread), per-actor (the user across all threads), or per-topic (a clustering across actors and threads). All three are valid choices, but each requires a different boundary, and a system that forces all three down to one ID per request is either over-recalling (leaking unrelated context into prompts and inflating tokens) or under-recalling (the model "forgets" what the user expects it to know).
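A minimal sketch of the three scopes, assuming each memory row carries all three keys (the MemoryItem shape and recall helper are hypothetical, not any real library's API):

```python
from dataclasses import dataclass

@dataclass
class MemoryItem:
    thread_id: str   # per-conversation scope
    actor_id: str    # per-actor scope
    topic: str       # per-topic scope, assigned by some upstream clusterer
    text: str

def recall(store: list[MemoryItem], *, scope: str, key: str) -> list[str]:
    field = {"conversation": "thread_id", "actor": "actor_id", "topic": "topic"}[scope]
    return [m.text for m in store if getattr(m, field) == key]

store = [
    MemoryItem("t1", "u7", "trip-planning", "prefers aisle seats"),
    MemoryItem("t2", "u7", "trip-planning", "budget is $2k"),
    MemoryItem("t3", "u7", "contracts", "uses NY jurisdiction"),
]

recall(store, scope="conversation", key="t1")      # 1 item: same thread only
recall(store, scope="actor", key="u7")             # 3 items: risks over-recall
recall(store, scope="topic", key="trip-planning")  # 2 items: topic-bounded
```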

The same event stream has to support all three. The trap is assuming one ID can do all three jobs.
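One way to keep the three jobs separate at the schema level, sketched with hypothetical field names: emit the raw event once and attach one derived boundary column per consumer, each computed by that consumer's own sessionizer instead of inherited from a platform default.

```python
from dataclasses import dataclass

@dataclass
class TurnEvent:
    event_id: str
    actor_id: str
    ts: float
    # Derived columns, one per consumer. None of them is "the" session ID;
    # each is computed downstream by that consumer's own sessionizer.
    billing_session_id: str  # idle-window based, aligned with the pricing unit
    eval_session_id: str     # intent/task based, aligned with completion grading
    memory_scope_id: str     # conversation, actor, or topic key, per product choice
```

Because the columns are derived rather than stamped at emit time, a boundary definition can be revised and backfilled without touching the producers — which matters the first time a billing dispute forces a redefinition.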

The 30-minute timeout is a default, not a decision

Most production chatbots inherited their session boundary from a chat platform's idle timeout — typically 30 minutes, sometimes configurable, almost never explicitly chosen by the AI team. The default came from a different era, when "session" meant "the user is at their keyboard right now" and a 30-minute gap was a strong signal they had walked away. For an LLM-backed product, that signal does not carry the same meaning.

A user planning a multi-day trip with an AI travel agent might idle for an hour while comparing flights in another tab. A user pair-programming with a coding assistant might step away for a 90-minute meeting and return to the same task. A user drafting a contract might spread the work across three days. In each case, the user's mental model is "we are still working on the same thing" — and the system's mental model, anchored to a 30-minute idle counter, has already split that into two or three sessions.

The first symptom is usually billing: enterprise customers on long-form workflows complain that they are being charged for "new conversations" that they perceived as continuations, or — the inverse — that one billing period spans what they thought were three separate engagements. The second symptom is eval drift: a task-completion grader scores a session as incomplete because the user came back the next day to finish the request, and the grader only sees the abandoned half. The third symptom is memory bleed: the model recalls (or fails to recall) details across what the user thinks is a fresh start, and the grouping the recall mechanism is using does not match the user's expectation.

A "new conversation" button does not save you. Most users never press it. The platforms that ship one find that the button is used by under 5% of active users, mostly engineers and power users who already understand what it controls. The other 95% close the tab.

The user-perceived vs system-modeled gap

Before designing a fix, instrument the gap. The cheapest diagnostic is a side-by-side ledger: for each event stream, record the system-modeled session_id (whatever the platform default is) and the user-perceived session boundary (inferred from intent shifts, topic changes, or explicit "new chat" actions). Then compute the divergence.
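A minimal version of that divergence number, assuming both ID columns already exist per turn, with the system ID taken from platform telemetry and the perceived ID from intent labeling or explicit resets (boundary_divergence is an illustrative name, not a standard metric):

```python
def boundary_divergence(system_ids: list[str], perceived_ids: list[str]) -> float:
    """Fraction of adjacent turn pairs where the two groupings disagree
    about whether a session boundary falls between the two turns."""
    assert len(system_ids) == len(perceived_ids)
    pairs = zip(zip(system_ids, system_ids[1:]),
                zip(perceived_ids, perceived_ids[1:]))
    disagreements = sum(
        (s1 == s2) != (p1 == p2) for (s1, s2), (p1, p2) in pairs
    )
    return disagreements / max(len(system_ids) - 1, 1)

system    = ["s1", "s1", "s1", "s2", "s2"]   # platform's grouping
perceived = ["p1", "p1", "p2", "p2", "p2"]   # user-perceived grouping
print(boundary_divergence(system, perceived))  # 0.5: each misses the other's cut
```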

You will find three patterns. First, system-collapses-user-splits: the platform groups what the user perceives as multiple separate tasks into one long session. Common when the user is chatting in bursts and the idle counter never trips. Second, system-splits-user-collapses: the platform fragments what the user perceives as one task into multiple sessions. Common when the user steps away for longer than the idle window. Third, drift: the platform's grouping aligns at the start but diverges as the conversation evolves — the user mentally "starts a new topic" three times within one platform-session, and none of those subjective resets are visible in the data.
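The same ledger can label each disagreement with one of the first two pattern names; drift then shows up as a mix of both labels inside a single stream. Again a hypothetical helper, not a standard taxonomy:

```python
def classify_disagreements(system_ids: list[str], perceived_ids: list[str]) -> list[str]:
    labels = []
    for (s1, s2), (p1, p2) in zip(zip(system_ids, system_ids[1:]),
                                  zip(perceived_ids, perceived_ids[1:])):
        if s1 == s2 and p1 != p2:
            labels.append("system-collapses-user-splits")  # user reset, system didn't
        elif s1 != s2 and p1 == p2:
            labels.append("system-splits-user-collapses")  # idle window tripped mid-task
    return labels

print(classify_disagreements(["s1", "s1", "s2"], ["p1", "p2", "p2"]))
# ['system-collapses-user-splits', 'system-splits-user-collapses']
```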
