
Your AI Chat Transcripts Are Evidence: Retention Design for LLM Products Under Legal Hold

· 11 min read
Tian Pan
Software Engineer

On May 13, 2025, a federal magistrate judge in the Southern District of New York signed a preservation order that replaced a consumer AI company's retention policy with a single word: forever. OpenAI was directed to preserve and segregate every output log across Free, Plus, Pro, and Team tiers — including conversations users had explicitly deleted, including conversations privacy law would otherwise require to be erased. The indefinite retention obligation stood until September 26 of that year: four and a half months of "delete" meaning "keep, in a segregated vault, for an opposing party to read later." By November, the same court had ordered 20 million of those de-identified transcripts produced to the New York Times and its co-plaintiffs as sampled discovery.

That order is the warning shot for every team building on top of LLMs. If your product stores chat, your retention policy is one plausible lawsuit away from being replaced by whatever the court thinks is reasonable. The engineering question is not whether this happens to you. It is whether your storage architecture can absorb it without turning your product into a liability engine for the legal department.

Email retention playbooks do not carry over cleanly. AI conversations contain more than what the user typed, and the "more" is where the discovery fights are starting.

What Counts as "the Conversation" Is the First Hard Question

In email, electronically stored information (ESI) has clean edges. Sender, recipient, subject, body, attachments, headers. A legal hold instructs the mail server to preserve those tuples. Done.

An AI conversation is a layered artifact. The user prompt is the smallest piece. Surrounding it: the system prompt (which version? authored by whom? changed mid-session?), the conversation history (did the agent summarize earlier turns into a compressed preamble?), retrieved chunks from RAG (whose documents, at what revision, under what license?), tool call arguments and return values (what did the model read, what actions did it take, with what side effects?), model metadata (model ID, parameter version, temperature, seed), and ephemeral context (cached reasoning traces, speculative branches the model explored before committing to a final response).

Courts are already asking for pieces of this you probably do not log. Federal Rule of Civil Procedure 34 requires ESI to be produced "as kept in the usual course of business" or "in a reasonably usable form," and courts routinely read that to include metadata. When a requesting party wants to reconstruct why your agent recommended the response it did, they will ask for the system prompt version, the retrieval set, the tool outputs, and the conversation history in the exact form the model saw them. If your production pipeline composes these on the fly and discards them after inference, you have a preservation problem before the hold letter arrives.

The working definition for a defensible architecture: anything the model saw during a turn that materially shaped the output is part of the record. That includes inputs you treat as infrastructure. System prompts are documents. Retrieved chunks are exhibits. Tool outputs are testimony from third parties. If you pretend they are not, a judge will correct you.

Retention Policy Design That Survives a Subpoena

The architecture that does not survive is the default one most teams ship: a single MongoDB collection or Postgres table called messages, one row per turn, with created_at and no retention policy at all, backed up to object storage with a seven-year lifecycle rule someone set and forgot. When a hold arrives, legal asks "what can we preserve, and what can we let expire?" and the answer is "everything or nothing" because the schema has no joints to cut along.

A retention-aware architecture separates four kinds of record by retention tier and mutability:

User-visible conversation, which the user can read, edit, and delete through the product UI, lives in the shortest-retention tier with a per-conversation TTL. The TTL is a product decision (thirty days? ninety? one year?) baked into the row, not a global default. Users who delete a conversation trigger a deletion task on a queue, and the queue is auditable.

System-generated context — system prompts, retrieval results, tool outputs, model metadata — lives in a parallel tier keyed by the same conversation ID, but with its own retention clock and exclusion rules. This is where the "ephemeral context" carve-outs live. If a RAG retrieval fetched twelve chunks and the model cited three, the other nine can have a shorter retention horizon documented in policy. The critical move is that the exclusion is written down, versioned, and evaluated before the hold arrives, not invented during discovery.

Immutable audit logs capture the fact that an action occurred, separate from what the action was. User A deleted conversation X at timestamp T. Tool Y was invoked with hashed argument Z and returned a 200 response. Retention here is long (SOC 2 expects one year minimum; many teams go seven for consistency with financial records), and the log is append-only and tamper-evident — signed, chained, or written to a WORM bucket. This is the log that proves your retention policy was executed, which is the evidence you will want if anyone later alleges spoliation.
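A minimal sketch of the tamper-evident property, assuming a hash chain over entries (a real deployment would also sign entries or write them to a WORM bucket, as described above):

```python
import hashlib
import json
import time

class AuditLog:
    """Append-only log where each entry's hash covers the previous entry's
    hash, so any in-place edit breaks the chain from that point forward."""

    GENESIS = "0" * 64

    def __init__(self):
        self.entries = []
        self._prev_hash = self.GENESIS

    def append(self, event: dict) -> dict:
        record = {"ts": time.time(), "event": event, "prev": self._prev_hash}
        payload = json.dumps(record, sort_keys=True).encode()
        record["hash"] = hashlib.sha256(payload).hexdigest()
        self._prev_hash = record["hash"]
        self.entries.append(record)
        return record

    def verify(self) -> bool:
        prev = self.GENESIS
        for rec in self.entries:
            body = {"ts": rec["ts"], "event": rec["event"], "prev": rec["prev"]}
            payload = json.dumps(body, sort_keys=True).encode()
            if rec["prev"] != prev or hashlib.sha256(payload).hexdigest() != rec["hash"]:
                return False
            prev = rec["hash"]
        return True

log = AuditLog()
log.append({"action": "conversation_deleted", "user": "A", "conversation": "X"})
log.append({"action": "tool_invoked", "tool": "Y", "args_hash": "Z", "status": 200})
```

After any entry is altered, verify() fails, which is exactly the property you want when proving your retention policy executed as written.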

Derived artifacts — summaries, embeddings, analytics rollups — often need separate retention rules because they can survive after the source is deleted and leak facts that were supposed to be erased. A conversation embedding that persists after the conversation is deleted is a ghost that keeps answering questions about the deceased.

This separation is what lets a legal hold say "freeze user-visible conversation for these 400 users, but let the ephemeral retrieval context age out on its normal schedule" — a sentence that is impossible to execute if everything lives in one table.
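A hedged sketch of what cutting along those joints looks like: each record carries a retention class, the sweep evaluates TTL per class, and a legal hold suspends the clock for matching conversations. The class names and durations here are illustrative product decisions, not recommendations.

```python
from datetime import datetime, timedelta, timezone

# Illustrative retention classes; the durations are product decisions.
RETENTION = {
    "user_conversation":   timedelta(days=90),
    "system_context":      timedelta(days=30),
    "ephemeral_retrieval": timedelta(hours=24),
    "audit_log":           timedelta(days=365 * 7),
}

def expired(record: dict, now: datetime, held_conversations: set) -> bool:
    """A record ages out only if its class's TTL has passed AND its
    conversation is not under legal hold; a hold suspends the clock."""
    if record["conversation_id"] in held_conversations:
        return False  # suspension should itself be written to the audit log
    return now - record["created_at"] > RETENTION[record["record_class"]]

now = datetime.now(timezone.utc)
records = [
    {"conversation_id": "c1", "record_class": "ephemeral_retrieval",
     "created_at": now - timedelta(days=2)},
    {"conversation_id": "c2", "record_class": "user_conversation",
     "created_at": now - timedelta(days=200)},
]
# c2 is on hold, so only c1's stale retrieval context is eligible for deletion.
to_delete = [r for r in records if expired(r, now, held_conversations={"c2"})]
```

The point of the sketch is the shape, not the numbers: because retention class lives on the record, the hold can freeze one tier for 400 users while every other tier ages out on schedule.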

Legal Hold and the Right to Erasure Run Opposite Clocks

Legal hold says: stop the clock. When litigation is reasonably anticipated, the duty to preserve attaches, and continuing to run your normal deletion schedule becomes spoliation. Rule 37(e) authorizes sanctions when a party fails to take reasonable steps to preserve ESI that should have been preserved, with harsher remedies — adverse inference, dismissal — if the court finds intent to deprive. The NYT case against OpenAI is the headline example, but the doctrine predates it by a decade.

GDPR Article 17 says: start the clock, and make it short. Users in the EU have the right to erasure, and the controller must comply without undue delay. Similar provisions exist in the CCPA, Brazil's LGPD, and a growing list of state and national laws.

These do not reconcile. A U.S. court can order preservation of data about an EU resident, and that order can conflict head-on with a deletion obligation the controller has under GDPR. The Southern District of New York's May 2025 order explicitly acknowledged that preservation might require ignoring privacy laws "around the world" and ordered preservation anyway. Compliance with the U.S. order arguably breaches GDPR Articles 5 and 17; compliance with GDPR risks U.S. spoliation sanctions. EU counsel and U.S. counsel give opposite instructions, and both are right under their own law.

The architecture that handles this is the legal-hold registry — a separate service, not a flag on the user row — that knows which users, conversations, accounts, or topics are on hold, why, by whom, and when the hold ends. When the deletion queue tries to process a user's erasure request, it queries the registry first. A match does not cancel the erasure; it routes it. The data is moved to a segregated hold-store (often encrypted with keys the application cannot access), deletion is logged as "suspended pending legal hold," and the user is notified of the suspension to the extent the hold allows.
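A minimal sketch of that routing decision, assuming an in-memory registry standing in for the separate service (names and callbacks here are illustrative):

```python
class HoldRegistry:
    """Separate from application tables: maps subjects to active holds."""

    def __init__(self):
        self._holds = {}  # subject_id -> hold metadata

    def add_hold(self, subject_id: str, matter: str, ordered_by: str):
        self._holds[subject_id] = {"matter": matter, "ordered_by": ordered_by}

    def active_hold(self, subject_id: str):
        return self._holds.get(subject_id)

def process_erasure(request, registry, delete_fn, segregate_fn, audit: list) -> str:
    """Erasure requests are routed, never silently dropped. A hold does not
    cancel the erasure; it diverts the data to a segregated hold-store."""
    user = request["user_id"]
    hold = registry.active_hold(user)
    if hold is None:
        delete_fn(user)
        audit.append({"action": "erased", "user": user})
        return "erased"
    segregate_fn(user)  # e.g., move to a store the application cannot decrypt
    audit.append({"action": "erasure_suspended", "user": user,
                  "matter": hold["matter"]})
    return "suspended_pending_hold"

registry = HoldRegistry()
registry.add_hold("user-42", matter="Example-v-Example", ordered_by="court order")
audit, deleted, segregated = [], [], []
r1 = process_erasure({"user_id": "user-7"}, registry, deleted.append,
                     segregated.append, audit)
r2 = process_erasure({"user_id": "user-42"}, registry, deleted.append,
                     segregated.append, audit)
```

Note that both branches write to the audit list: the suspension record is what you hand the regulator when they ask why the thirty-day clock did not run.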

The registry exists so that when GDPR enforcement asks "why did you not delete this user's data within thirty days?", the answer is "here is the signed hold order, here is the routing decision, here is the segregation attestation." That answer is defensible. "We forgot, there was a lawsuit" is not.

Cross-Tenant Contamination Makes Every Transcript a Joint Record

Multi-tenant AI products have a problem email never had: one user's transcript can contain another user's regulated data because the agent fetched it mid-turn. A customer support agent answering user A's question retrieves a knowledge-base article that was originally authored to describe a complaint from user B. An internal assistant pulls a Slack thread that contains PHI from a third party. A coding agent reads a repository that includes a vendor's proprietary API keys, and those tokens end up in the conversation log.

Under discovery, that transcript is one exhibit. Under privacy law, it is three or four different data subjects whose rights point in different directions. User A's deletion request cannot strip user B's data from the stored record without also destroying evidence of what user A was told. User B's access request may or may not reach into user A's conversation depending on how controller and processor roles are drawn.

Practical responses start at ingestion. Tag retrieved content with its original provenance — document ID, authoring tenant, classification — at the moment it enters the context, and propagate those tags through the stored conversation record. When a deletion request arrives, the tags tell you whether the affected spans can be redacted in place (replaced with a "[redacted per retention policy]" marker plus an audit entry) or whether the whole turn has to be held. Retaining provenance makes redaction tractable; discarding it means every deletion request is a full rebuild of the record.

The same tagging discipline lets you answer the regulator's version of the same question. When an EU data protection authority asks "which users' data is in this ChatGPT-style product?", you answer it against the provenance tags, not by grepping conversation bodies.

"We Just Store Everything in S3" Is a Policy, Just Not a Good One

Teams that have not made retention decisions have made one by default. Infinite retention is a policy; it is the one that maximizes every form of downside risk. Subpoena exposure scales linearly with retained volume — every day you keep a transcript is a day a plaintiff can subpoena it. Privacy violation risk compounds — the longer you hold regulated data, the more cross-jurisdictional exposures accumulate and the harder granular deletion becomes. Breach blast radius grows with every unarchived row — the answer to "how bad would a data breach be" is arithmetic.

The discipline that inverts this is to treat retention as a product requirement, not a compliance afterthought. Every new feature that writes to storage answers three questions in the design doc: what is the retention TTL, what is the audit retention, what is the deletion workflow. Reviewers push back on "TBD" the same way they push back on "no tests" or "no metrics." A feature that cannot answer these questions is not ready to ship.

The inverse discipline is equally important: do not let retention become a silent "short" either. Ephemeral retrieval context that the policy marks as "delete after 24 hours" still has to be preserved when a hold attaches, which means the 24-hour timer has to be suspendable, and the suspension has to be logged. A retention policy that cannot be paused is not a retention policy; it is a scheduled act of evidence destruction waiting for a hold to expose it.

None of this is theoretical anymore. The OpenAI preservation order, the 20-million-record production, and the growing split on whether LLM interactions are privileged work product or plain ESI make it clear that every AI product of meaningful size is going to field at least one of these demands. The products that absorb the demand without existential damage will be the ones whose storage architecture was designed with the demand in mind — tiered by record type, separated by mutability, bound by a per-conversation TTL that a hold registry can freeze and thaw. The products that did not do that work will spend a quarter doing it reactively under a preservation order, which is the most expensive way to do it.

The Takeaway

Chat is not ephemeral anymore. Every conversation your product saves is potentially discoverable, and the AI-specific pieces — system prompts, retrieval results, tool outputs, model metadata — are part of the record whether you are treating them that way or not. The architecture decisions that matter are boring ones: separate your storage tiers by retention class, build a real legal-hold registry before you need it, tag provenance at ingestion, make TTL a product requirement, and log the execution of your retention policy in an immutable audit trail. Do this while the cost is a sprint or two. If you wait until a court order forces the design, the same work will cost a quarter and come with an opposing counsel reading your customers' conversations while you build it.
