
Your AI Chat Transcripts Are Evidence: Retention Design for LLM Products Under Legal Hold

· 11 min read
Tian Pan
Software Engineer

On May 13, 2025, a federal magistrate judge in the Southern District of New York signed a preservation order that replaced a consumer AI company's retention policy with a single word: forever. OpenAI was directed to preserve and segregate every output log across Free, Plus, Pro, and Team tiers — including conversations users had explicitly deleted, including conversations privacy law would otherwise require to be erased. The indefinite retention obligation stood until September 26 of that year: four and a half months of "delete" meaning "keep, in a segregated vault, for an opposing party to read later." By November, the same court had ordered 20 million of those de-identified transcripts produced to the New York Times and its co-plaintiffs as sampled discovery.

That order is the warning shot for every team building on top of LLMs. If your product stores chat, your retention policy is one plausible lawsuit away from being replaced by whatever the court thinks is reasonable. The engineering question is not whether this happens to you. It is whether your storage architecture can absorb it without turning your product into a liability engine for the legal department.

Email retention playbooks do not carry over cleanly. AI conversations contain more than what the user typed, and the "more" is where the discovery fights are starting.

What Counts as "the Conversation" Is the First Hard Question

In email, electronically stored information (ESI) has clean edges. Sender, recipient, subject, body, attachments, headers. A legal hold instructs the mail server to preserve those tuples. Done.

An AI conversation is a layered artifact. The user prompt is the smallest piece. Surrounding it: the system prompt (which version? authored by whom? changed mid-session?), the conversation history (did the agent summarize earlier turns into a compressed preamble?), retrieved chunks from RAG (whose documents, at what revision, under what license?), tool call arguments and return values (what did the model read, what actions did it take, with what side effects?), model metadata (model ID, parameter version, temperature, seed), and ephemeral context (cached reasoning traces, speculative branches the model explored before committing to a final response).

Courts are already asking for pieces of this you probably do not log. Federal Rule of Civil Procedure 34 requires ESI to be produced "as kept in the usual course of business" or "in a reasonably usable form," and courts routinely read that to include metadata. When a requesting party wants to reconstruct why your agent recommended the response it did, they will ask for the system prompt version, the retrieval set, the tool outputs, and the conversation history in the exact form the model saw them. If your production pipeline composes these on the fly and discards them after inference, you have a preservation problem before the hold letter arrives.

The working definition for a defensible architecture: anything the model saw during a turn that materially shaped the output is part of the record. That includes inputs you treat as infrastructure. System prompts are documents. Retrieved chunks are exhibits. Tool outputs are testimony from third parties. If you pretend they are not, a judge will correct you.

Retention Policy Design That Survives a Subpoena

The architecture that does not survive is the default one most teams ship: a single MongoDB collection or Postgres table called messages, one row per turn, with created_at and no retention policy at all, backed up to object storage with a seven-year lifecycle rule someone set and forgot. When a hold arrives, legal asks "what can we preserve, and what can we let expire?" and the answer is "everything or nothing" because the schema has no joints to cut along.

A retention-aware architecture separates four kinds of record by retention tier and mutability:

User-visible conversation, which the user can read, edit, and delete through the product UI, lives in the shortest-retention tier with a per-conversation TTL. The TTL is a product decision (thirty days? ninety? one year?) baked into the row, not a global default. Users who delete a conversation trigger a deletion task on a queue, and the queue is auditable.

System-generated context — system prompts, retrieval results, tool outputs, model metadata — lives in a parallel tier keyed by the same conversation ID, but with its own retention clock and exclusion rules. This is where the "ephemeral context" carve-outs live. If a RAG retrieval fetched twelve chunks and the model cited three, the other nine can have a shorter retention horizon documented in policy. The critical move is that the exclusion is written down, versioned, and evaluated before the hold arrives, not invented during discovery.

Immutable audit logs capture the fact that an action occurred, separate from what the action was. User A deleted conversation X at timestamp T. Tool Y was invoked with hashed argument Z and returned a 200 response. Retention here is long (SOC 2 expects one year minimum; many teams go seven for consistency with financial records), and the log is append-only and tamper-evident — signed, chained, or written to a WORM bucket. This is the log that proves your retention policy was executed, which is the evidence you will want if anyone later alleges spoliation.
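A minimal sketch of the tamper-evident property, assuming a hash chain over entries (a real deployment would also sign entries or write them to a WORM bucket, as described above):

```python
import hashlib
import json
import time

class AuditLog:
    """Append-only log where each entry's hash covers the previous entry's
    hash, so any in-place edit breaks the chain from that point forward."""

    GENESIS = "0" * 64

    def __init__(self):
        self.entries = []
        self._prev_hash = self.GENESIS

    def append(self, event: dict) -> dict:
        record = {"ts": time.time(), "event": event, "prev": self._prev_hash}
        payload = json.dumps(record, sort_keys=True).encode()
        record["hash"] = hashlib.sha256(payload).hexdigest()
        self._prev_hash = record["hash"]
        self.entries.append(record)
        return record

    def verify(self) -> bool:
        prev = self.GENESIS
        for rec in self.entries:
            body = {"ts": rec["ts"], "event": rec["event"], "prev": rec["prev"]}
            payload = json.dumps(body, sort_keys=True).encode()
            if rec["prev"] != prev or hashlib.sha256(payload).hexdigest() != rec["hash"]:
                return False
            prev = rec["hash"]
        return True

log = AuditLog()
log.append({"action": "conversation_deleted", "user": "A", "conversation": "X"})
log.append({"action": "tool_invoked", "tool": "Y", "args_hash": "Z", "status": 200})
```

After any entry is altered, verify() fails, which is exactly the property you want when proving your retention policy executed as written.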

Derived artifacts — summaries, embeddings, analytics rollups — often need separate retention rules because they can survive after the source is deleted and leak facts that were supposed to be erased. A conversation embedding that persists after the conversation is deleted is a ghost that keeps answering questions about the deceased.

This separation is what lets a legal hold say "freeze user-visible conversation for these 400 users, but let the ephemeral retrieval context age out on its normal schedule" — a sentence that is impossible to execute if everything lives in one table.
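A hedged sketch of what cutting along those joints looks like: each record carries a retention class, the sweep evaluates TTL per class, and a legal hold suspends the clock for matching conversations. The class names and durations here are illustrative product decisions, not recommendations.

```python
from datetime import datetime, timedelta, timezone

# Illustrative retention classes; the durations are product decisions.
RETENTION = {
    "user_conversation":   timedelta(days=90),
    "system_context":      timedelta(days=30),
    "ephemeral_retrieval": timedelta(hours=24),
    "audit_log":           timedelta(days=365 * 7),
}

def expired(record: dict, now: datetime, held_conversations: set) -> bool:
    """A record ages out only if its class's TTL has passed AND its
    conversation is not under legal hold; a hold suspends the clock."""
    if record["conversation_id"] in held_conversations:
        return False  # suspension should itself be written to the audit log
    return now - record["created_at"] > RETENTION[record["record_class"]]

now = datetime.now(timezone.utc)
records = [
    {"conversation_id": "c1", "record_class": "ephemeral_retrieval",
     "created_at": now - timedelta(days=2)},
    {"conversation_id": "c2", "record_class": "user_conversation",
     "created_at": now - timedelta(days=200)},
]
# c2 is on hold, so only c1's stale retrieval context is eligible for deletion.
to_delete = [r for r in records if expired(r, now, held_conversations={"c2"})]
```

The point of the sketch is the shape, not the numbers: because retention class lives on the record, the hold can freeze one tier for 400 users while every other tier ages out on schedule.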

Legal Hold and the Right to Erasure Run Opposite Clocks

Legal hold says: stop the clock. When litigation is reasonably anticipated, the duty to preserve attaches, and continuing to run your normal deletion schedule becomes spoliation. Rule 37(e) authorizes sanctions when a party fails to take reasonable steps to preserve ESI that should have been preserved, with harsher remedies — adverse inference, dismissal — if the court finds intent to deprive. The NYT case against OpenAI is the headline example, but the doctrine predates it by a decade.

GDPR Article 17 says: start the clock, and make it short. Users in the EU have the right to erasure, and the controller must comply without undue delay. Similar provisions exist in the CCPA, Brazil's LGPD, and a growing list of state and national laws.

These do not reconcile. A U.S. court can order preservation of data about an EU resident, and that order can conflict head-on with a deletion obligation the controller has under GDPR. The Southern District of New York's May 2025 order explicitly acknowledged that preservation might require ignoring privacy laws "around the world" and ordered preservation anyway. Compliance with the U.S. order arguably breaches GDPR Articles 5 and 17; compliance with GDPR risks U.S. spoliation sanctions. EU counsel and U.S. counsel give opposite instructions, and both are right under their own law.

The architecture that handles this is the legal-hold registry — a separate service, not a flag on the user row — that knows which users, conversations, accounts, or topics are on hold, why, by whom, and when the hold ends. When the deletion queue tries to process a user's erasure request, it queries the registry first. A match does not cancel the erasure; it routes it. The data is moved to a segregated hold-store (often encrypted with keys the application cannot access), deletion is logged as "suspended pending legal hold," and the user is notified of the suspension to the extent the hold allows.
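A minimal sketch of that routing decision, assuming an in-memory registry standing in for the separate service (names and callbacks here are illustrative):

```python
class HoldRegistry:
    """Separate from application tables: maps subjects to active holds."""

    def __init__(self):
        self._holds = {}  # subject_id -> hold metadata

    def add_hold(self, subject_id: str, matter: str, ordered_by: str):
        self._holds[subject_id] = {"matter": matter, "ordered_by": ordered_by}

    def active_hold(self, subject_id: str):
        return self._holds.get(subject_id)

def process_erasure(request, registry, delete_fn, segregate_fn, audit: list) -> str:
    """Erasure requests are routed, never silently dropped. A hold does not
    cancel the erasure; it diverts the data to a segregated hold-store."""
    user = request["user_id"]
    hold = registry.active_hold(user)
    if hold is None:
        delete_fn(user)
        audit.append({"action": "erased", "user": user})
        return "erased"
    segregate_fn(user)  # e.g., move to a store the application cannot decrypt
    audit.append({"action": "erasure_suspended", "user": user,
                  "matter": hold["matter"]})
    return "suspended_pending_hold"

registry = HoldRegistry()
registry.add_hold("user-42", matter="Example-v-Example", ordered_by="court order")
audit, deleted, segregated = [], [], []
r1 = process_erasure({"user_id": "user-7"}, registry, deleted.append,
                     segregated.append, audit)
r2 = process_erasure({"user_id": "user-42"}, registry, deleted.append,
                     segregated.append, audit)
```

Note that both branches write to the audit list: the suspension record is what you hand the regulator when they ask why the thirty-day clock did not run.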

The registry exists so that when GDPR enforcement asks "why did you not delete this user's data within thirty days?", the answer is "here is the signed hold order, here is the routing decision, here is the segregation attestation." That answer is defensible. "We forgot, there was a lawsuit" is not.

Cross-Tenant Contamination Makes Every Transcript a Joint Record

Multi-tenant AI products have a problem email never had: one user's transcript can contain another user's regulated data because the agent fetched it mid-turn. A customer support agent answering user A's question retrieves a knowledge-base article that was originally authored to describe a complaint from user B. An internal assistant pulls a Slack thread that contains PHI from a third party. A coding agent reads a repository that includes a vendor's proprietary API keys, and those tokens end up in the conversation log.

Under discovery, that transcript is one exhibit. Under privacy law, it is three or four different data subjects whose rights point in different directions. User A's deletion request cannot strip user B's data from the stored record without also destroying evidence of what user A was told. User B's access request may or may not reach into user A's conversation depending on how controller and processor roles are drawn.

Practical responses start at ingestion. Tag retrieved content with its original provenance — document ID, authoring tenant, classification — at the moment it enters the context, and propagate those tags through the stored conversation record. When a deletion request arrives, the tags tell you whether the affected spans can be redacted in place (replaced with a "[redacted per retention policy]" marker plus an audit entry) or whether the whole turn has to be held. Retaining provenance makes redaction tractable; discarding it means every deletion request is a full rebuild of the record.

The same tagging discipline lets you answer the regulator's version of the same question. When an EU data protection authority asks "which users' data is in this ChatGPT-style product?", you answer it against the provenance tags, not by grepping conversation bodies.

"We Just Store Everything in S3" Is a Policy, Just Not a Good One

Teams that have not made retention decisions have made one by default. Infinite retention is a policy; it is the one that maximizes every form of downside risk. Subpoena exposure scales linearly with retained volume — every day you keep a transcript is a day a plaintiff can subpoena it. Privacy violation risk compounds — the longer you hold regulated data, the more cross-jurisdictional exposures accumulate and the harder granular deletion becomes. Breach blast radius grows with every unarchived row — the answer to "how bad would a data breach be" is arithmetic.

The discipline that inverts this is to treat retention as a product requirement, not a compliance afterthought. Every new feature that writes to storage answers three questions in the design doc: what is the retention TTL, what is the audit retention, what is the deletion workflow. Reviewers push back on "TBD" the same way they push back on "no tests" or "no metrics." A feature that cannot answer these questions is not ready to ship.

The inverse discipline is equally important: do not let retention become a silent "short" either. Ephemeral retrieval context that the policy marks as "delete after 24 hours" still has to be preserved when a hold attaches, which means the 24-hour timer has to be suspendable, and the suspension has to be logged. A retention policy that cannot be paused is not a retention policy; it is a scheduled act of evidence destruction waiting for a hold to expose it.

None of this is theoretical anymore. The OpenAI preservation order, the 20-million-record production, and the growing split on whether LLM interactions are privileged work product or plain ESI make it clear that every AI product of meaningful size is going to field at least one of these demands. The products that absorb the demand without existential damage will be the ones whose storage architecture was designed with the demand in mind — tiered by record type, separated by mutability, bound by a per-conversation TTL that a hold registry can freeze and thaw. The products that did not do that work will spend a quarter doing it reactively under a preservation order, which is the most expensive way to do it.

The Takeaway

Chat is not ephemeral anymore. Every conversation your product saves is potentially discoverable, and the AI-specific pieces — system prompts, retrieval results, tool outputs, model metadata — are part of the record whether you are treating them that way or not. The architecture decisions that matter are boring ones: separate your storage tiers by retention class, build a real legal-hold registry before you need it, tag provenance at ingestion, make TTL a product requirement, and log the execution of your retention policy in an immutable audit trail. Do this while the cost is a sprint or two. If you wait until a court order forces the design, the same work will cost a quarter and come with an opposing counsel reading your customers' conversations while you build it.
