
The conversation_id Collision That Swapped Two Users' Contexts at the Gateway

Tian Pan
Software Engineer

A customer support ticket arrives that reads like a hallucination. The user attached a screenshot: a question they never asked, with their account name at the top, followed by a model response that references files they have never uploaded. The trace looks clean. The model did exactly what was asked of it. The problem is that the question came from a different tenant entirely, and your gateway routed two conversations to the same backend state because their conversation_id values collided.

You do the math on a napkin. UUID v4 has 122 bits of entropy. The birthday-bound probability of any collision in a 50-million-conversation corpus is on the order of 10^-22, many orders of magnitude south of one in fifty million. You ran the calculation a year ago when you designed the system. The math was correct. The math is still correct. What changed is that two of your backend tiers stopped generating IDs the same way, and the probability the math described was never the probability you were actually running on.
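The napkin calculation can be reproduced in a few lines. This is a minimal sketch that assumes every ID is an independent, uniform draw from a 122-bit space, which is exactly the assumption the rest of this post questions. The function name `birthday_collision_probability` is illustrative, not taken from any library.

```python
import math


def birthday_collision_probability(n_ids: int, entropy_bits: int) -> float:
    """Approximate P(at least one collision) among n_ids uniform random draws."""
    space = 2.0 ** entropy_bits
    pairs = n_ids * (n_ids - 1) / 2.0
    # Standard approximation: 1 - exp(-pairs / space).
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-pairs / space)


# 50 million conversations, 122 random bits per UUID v4.
p = birthday_collision_probability(50_000_000, 122)
print(f"collision probability: {p:.2e}")
```

The result is on the order of 10^-22, which is why the single-generator design looked safe on paper.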

This is the failure mode that ID schemes hide most aggressively: each tier is individually correct, the global system is wrong, and the gap is invisible until the day a user sees another user's data. The fix is not a better generator. It is a different way of thinking about where the ID contract lives.

The probability you computed was for the system you used to have

The birthday bound assumes a single generator drawing from a single distribution. The moment you have two backend tiers — each with its own generator, its own randomness source, its own version of the UUID library — you have two distributions, and the joint collision domain is no longer described by the math you ran on a single tier.

The drift modes that show up in production are almost always boring. One tier upgraded its UUID library from a version that called getrandom(2) to a version that lazily fell back to a userspace PRNG when the syscall returned EAGAIN under load. Another tier ran in a container whose base image seeded its CSPRNG once at boot, and every forked worker inherited that seed because the entropy pool finished warming up only after the application had started. A third tier "improved" performance by caching a request-scoped UUID generator that turned out to draw from a 32-bit space when it was supposed to draw from 122 bits.
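The inherited-seed failure is easy to reproduce in miniature. The sketch below is an illustration under stated assumptions, not the code of any real UUID library: `random.Random` stands in for a userspace PRNG, `boot_seed` is a made-up value, and `make_worker_generator` is a hypothetical helper. Two workers that start from the same seed mint the same "random" UUIDs in the same order.

```python
import random
import uuid


def make_worker_generator(seed: int):
    """Return an ID generator backed by a userspace PRNG with a fixed seed."""
    rng = random.Random(seed)  # stand-in for a PRNG seeded once at boot

    def next_id() -> str:
        # version=4 stamps the version and variant bits over the random value.
        return str(uuid.UUID(int=rng.getrandbits(128), version=4))

    return next_id


boot_seed = 12345  # hypothetical seed captured before the workers were forked
worker_a = make_worker_generator(boot_seed)
worker_b = make_worker_generator(boot_seed)

ids_a = [worker_a() for _ in range(5)]
ids_b = [worker_b() for _ in range(5)]

# Each worker looks fine on its own, yet the two streams are identical.
print(ids_a == ids_b)
```

Each worker passes a local uniqueness test, since its own five IDs are distinct. The collision only exists across workers, which is the point of the section.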

Every one of those changes was reviewed locally, passed its own tests, and shipped without incident on the tier that owned it. None of them was reviewed against the joint distribution. The day you discover that one-in-fifty-million has become one-in-fifty-thousand is the day the gateway routes two live conversations to the same backend record.
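To get a feel for how far a single degraded generator moves the numbers, the birthday approximation can be inverted to ask how many IDs it takes to reach a given collision probability. This is a rough sketch that assumes uniform draws from a space of the stated width. The 32-bit figure mirrors the cached-generator example above, and the function name is illustrative.

```python
import math


def ids_until_collision_probability(entropy_bits: int, target_probability: float) -> float:
    """Approximate number of uniform draws needed to reach the target collision probability."""
    space = 2.0 ** entropy_bits
    # Inverse of p = 1 - exp(-n^2 / (2 * space)), solved for n.
    return math.sqrt(2.0 * space * math.log(1.0 / (1.0 - target_probability)))


healthy = ids_until_collision_probability(122, 0.5)
degraded = ids_until_collision_probability(32, 0.5)

print(f"122-bit space: about {healthy:.2e} IDs for a 50% collision chance")
print(f"32-bit space:  about {degraded:,.0f} IDs for a 50% collision chance")
```

Under these assumptions a 32-bit generator reaches even odds of a collision after tens of thousands of IDs, while the 122-bit design would need on the order of 10^18.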

The lesson is not "use a better RNG." It is that the collision properties of an ID scheme are an emergent property of the composition of all generators that mint into that namespace, and composition is not a property any single tier can audit.

The gateway is the place where the composition is observed

Your gateway is the only component in the path that sees IDs from every tier on their way to backend state. It is also, almost universally, the component that does the least validation of those IDs. The default pattern is to hash the conversation_id to pick a backend pod, route the request, and let the backend resolve the ID against its store. If two tiers minted the same ID into different backend records, the gateway routes the request to whichever record happens to live on the pod the hash selects. The other record becomes invisible, and the user on the other end is served somebody else's conversation.

Treat the gateway as the enforcement point for the namespace contract. A conversation_id that resolves to two different backend records is a routing error, not a backend error, because it is the gateway that has the global view. The validation that catches this is not subtle. On every request, check whether the ID exists in more than one backing store; if it does, refuse to route and page the on-call. The check is cheap when collisions are rare, which is the operating regime you expect; the alert is loud when collisions are not rare, which is the regime you need to catch.
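The check described above might look something like the following. It is a minimal in-memory sketch: `BackingStore`, `DuplicateConversationError`, and `resolve_unique_store` are hypothetical names, and a real gateway would query its actual stores and wire the error into its paging system instead of raising a plain exception.

```python
from dataclasses import dataclass, field
from typing import Optional


class DuplicateConversationError(Exception):
    """Raised when one conversation_id resolves to more than one backend record."""


@dataclass
class BackingStore:
    name: str
    records: dict = field(default_factory=dict)  # conversation_id -> record

    def has(self, conversation_id: str) -> bool:
        return conversation_id in self.records


def resolve_unique_store(conversation_id: str, stores: list) -> Optional[BackingStore]:
    """Return the single store holding the ID, or refuse to route on duplicates."""
    holders = [store for store in stores if store.has(conversation_id)]
    if len(holders) > 1:
        names = ", ".join(store.name for store in holders)
        # The caller should refuse to route and page the on-call here.
        raise DuplicateConversationError(f"{conversation_id} found in: {names}")
    return holders[0] if holders else None


tier_a = BackingStore("tier-a", {"conv-1": "tenant-x", "conv-2": "tenant-x"})
tier_b = BackingStore("tier-b", {"conv-2": "tenant-y", "conv-3": "tenant-y"})

print(resolve_unique_store("conv-1", [tier_a, tier_b]).name)
```

Here `conv-1` routes normally, while `conv-2` exists in both stores and would raise instead of silently serving one tenant's record to the other.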

The objection that this adds a hop to every request is real, and the answer is that you can sample. At a collision rate of one in fifty thousand requests, a 1% sample expects its first hit after about five million requests, which is inside a minute at roughly 100,000 requests per second of steady-state traffic and proportionally longer at lower volumes. That is the latency you need to convert "users are reading other users' data" from a multi-hour support escalation into a paging-grade incident.
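How quickly a sample catches a collision depends on traffic volume, so it is worth running the arithmetic for your own numbers. The sketch below assumes collisions are independent per request and that sampling is uniformly random. The request rates are illustrative values, not measurements from any real system.

```python
def expected_seconds_to_first_detection(
    collision_rate: float, sample_rate: float, requests_per_second: float
) -> float:
    """Expected wait until a sampled request hits a colliding ID."""
    # Each request is both sampled and colliding with this probability.
    detections_per_second = collision_rate * sample_rate * requests_per_second
    return 1.0 / detections_per_second


collision_rate = 1 / 50_000  # one colliding request in fifty thousand
sample_rate = 0.01           # validate 1% of requests

for rps in (1_000, 10_000, 100_000):
    seconds = expected_seconds_to_first_detection(collision_rate, sample_rate, rps)
    print(f"{rps:>7} req/s -> about {seconds:,.0f} s to first detection")
```

If the expected wait at your traffic level is longer than you can tolerate, raising the sample rate shortens it proportionally.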

Per-tenant prefixes change the blast radius

Even if your generator is perfect, the collision blast radius is set by the namespace structure, not by the entropy of any one ID. A flat conversation_id namespace means that any collision is potentially cross-tenant; a tenant_id:conversation_id namespace means that a collision can only happen within a tenant, which converts a security incident into a (still bad, but containable) consistency bug.

The architectural move is to treat tenant isolation as a property of the ID's structure, not of the application logic that consumes the ID. If your IDs carry their tenant in their prefix, then the question "could this ID belong to another tenant" has a single answer at the namespace level, and the application code that handles them can be written without the constant background fear that tenant-checking is a discipline rather than a guarantee. Every place that handles an ID is a place where the tenant check could be forgotten; making the tenant a structural property of the ID removes most of those places.
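One way to make the tenant a structural property is to build and parse IDs through a small pair of helpers. This is a sketch under assumptions: the `tenant_id:conversation_id` shape comes from the text above, but the helper names, the choice of `:` as the only separator, and the use of `uuid.uuid4()` for the suffix are illustrative.

```python
import uuid

SEPARATOR = ":"


def make_conversation_id(tenant_id: str) -> str:
    """Mint an ID whose prefix is the owning tenant."""
    if not tenant_id or SEPARATOR in tenant_id:
        raise ValueError("tenant_id must be non-empty and must not contain ':'")
    return f"{tenant_id}{SEPARATOR}{uuid.uuid4()}"


def tenant_of(conversation_id: str) -> str:
    """Extract the tenant prefix, rejecting IDs that do not carry one."""
    tenant_id, separator, suffix = conversation_id.partition(SEPARATOR)
    if not tenant_id or not separator or not suffix:
        raise ValueError(f"malformed conversation_id: {conversation_id!r}")
    return tenant_id


def belongs_to(conversation_id: str, tenant_id: str) -> bool:
    """Answer the cross-tenant question from the ID's structure alone."""
    return tenant_of(conversation_id) == tenant_id


cid = make_conversation_id("acme")
print(cid, belongs_to(cid, "acme"), belongs_to(cid, "globex"))
```

Even if two tenants draw the same random suffix, their full IDs still differ, so the collision cannot cross the tenant boundary.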

The objection that this leaks tenant identifiers into URLs and logs is also real, and the answer is that for an agentic product the tenant boundary is already the most important fact about every operation. If you cannot tell from a request which tenant it belongs to, you cannot audit, you cannot rate-limit, you cannot bill, and you cannot investigate cross-tenant incidents. The tenant is going to be in your logs no matter what. Putting it in the ID is the cheapest way to make sure it is in the logs consistently.

A single allocation authority makes composition reasonable

The deeper fix for the multi-tier drift problem is to stop letting each tier mint IDs into a shared namespace. Run a single ID-allocation service that owns the namespace and exposes a generation API. Tiers call it. The tiers do not have UUID libraries. They have an HTTP client that asks the allocator for the next ID for a given tenant, gets back a string, and uses it. The allocator's RNG is the only RNG in the system. The allocator's library version is the only library version in the system. The allocator's audit is the only audit you need.
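The core of such an allocator can be sketched in-process. Everything here is an assumption made for illustration: the class name, the `tenant:suffix` format, the in-memory set of issued IDs (a real service would use a durable store with a uniqueness constraint), and the omission of the HTTP layer that the tiers would actually call.

```python
import secrets
import threading
from typing import Callable


class ConversationIdAllocator:
    """Single authority that mints IDs and refuses to issue the same one twice."""

    def __init__(
        self,
        suffix_source: Callable[[], str] = lambda: secrets.token_hex(16),
        max_attempts: int = 5,
    ) -> None:
        self._suffix_source = suffix_source  # the only RNG in the system
        self._max_attempts = max_attempts
        self._issued = set()                 # stand-in for a durable unique index
        self._lock = threading.Lock()

    def allocate(self, tenant_id: str) -> str:
        for _ in range(self._max_attempts):
            candidate = f"{tenant_id}:{self._suffix_source()}"
            with self._lock:
                if candidate not in self._issued:
                    self._issued.add(candidate)
                    return candidate
        # Repeated collisions are a signal that the generator has regressed.
        raise RuntimeError("suffix source keeps repeating; investigate the generator")


allocator = ConversationIdAllocator()
print(allocator.allocate("acme"))
```

Because the suffix source is injected, swapping the generator is a change to one constructor argument in one component, and a regressed generator surfaces as an error at allocation time instead of as a silent duplicate.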

The cost is one network hop on conversation creation, which is almost never on the hot path. The benefit is that your collision properties are now a property of one well-understood component rather than an emergent property of every tier's local choices. When you upgrade the allocator's generator from UUID v4 to UUID v7, you upgrade it everywhere, at once, with a single review. When you discover that a generator had a bad year, you have one place to investigate and one place to fix.

This is the same pattern your payments system already uses. Nobody on the payments team generates a transaction ID inside a checkout service. There is a transaction-allocation authority, it has its own redundancy and audit, and every service that needs a transaction ID asks it for one. The reason payments works this way is that the cross-transaction failure mode of two independently-correct generators is unacceptable, and the cost of a network hop on the rare creation path is trivial compared to the cost of an audit gap. Agentic products have the same cross-conversation failure mode and have not yet adopted the same discipline.

Conversation IDs are payment IDs in everything but name

The argument that ties the rest of this together is short. In an agentic product, the conversation_id is the load-bearing primary key of the user's relationship with the model. It indexes the user's memory. It indexes the user's billing. It indexes the audit trail. It is the key on which cross-tenant isolation depends. Every meaningful operation in the system either takes a conversation_id as input or produces one as output. The discipline that surrounds it should be at least equal to the discipline that surrounds a transaction ID.

The mismatch most teams have is that conversation IDs got their identifier scheme decided early, by a single engineer, in the first sprint, when there was one backend and the question of how IDs would compose across tiers was not a question yet. Three years later there are four tiers, the original engineer is on a different team, and the ID scheme is load-bearing for a product that was not the product when the scheme was designed. The rigor never caught up because nobody scheduled the catch-up.

What good looks like, concretely:

  • A single allocator owns the conversation namespace and is the only thing in production that mints IDs into it.
  • Every ID carries a tenant prefix that bounds the cross-tenant blast radius even if the random suffix collides.
  • The gateway samples or fully validates that incoming IDs resolve to exactly one backend record and pages on duplicates.
  • The ID scheme has a security review on the same cadence as the payments scheme, asking explicitly "what is the cross-tenant failure mode if any single component drifts."
  • The team that owns the allocator has a runbook for "the generator looks like it regressed," and the runbook has been practiced.

None of those are exotic. Three of them are imported wholesale from payments engineering. The reason they have not propagated into agentic stacks is that "conversation_id" sounds like a chat-history detail rather than a primary key, and the teams treating it like a detail are exactly the teams who will write the incident report.

The discipline transfer is the work

Treat the conversation_id as the primary key it actually is. Stop letting each tier mint into a shared namespace. Make the gateway validate the composition. Put the tenant in the ID so isolation is a structural property and not a developer-discipline property. Run the security review the payments team has been running for thirty years.

The math on UUID collisions is correct. It is correct for a system you are not running. The system you are running is a composition of generators, and its collision properties are a property of the composition, and the composition is what you should be auditing. The day you treat the conversation namespace with the rigor your payments team applies to transactions is the day this class of incident stops appearing in your weekly review under "investigating."
