
Chat History Is a Database. Stop Treating It Like Scrollback.

· 11 min read
Tian Pan
Software Engineer

The most common production complaint about agentic products is some version of "it forgot what we said." The complaint shows up at turn eight, or fifteen, or thirty — never at turn two — and the team's first instinct is always the same: bigger context window. Which is the wrong instinct, because the bug is not in the model. The bug is that the team is treating conversation history as scrollback in a terminal — append a line, render the tail, truncate when full — when what they actually built, without realizing it, is a read-heavy database with append-only writes, a hot working set, an eviction policy hiding inside their truncation rule, and a query pattern that depends on the kind of question being asked. Once you accept that, the entire shape of the problem changes.

The scrollback model is so seductive because the chat UI looks like a transcript. Messages flow downward, the user reads them top-to-bottom, and the natural way to feed the model is to splice the latest N turns into the prompt. The data structure feels free. There's no schema, no index, no query — just append, render, repeat. And for the first few turns, every architecture works. The model has the whole conversation in its context, the bill is small, and the demo is delightful.

Then production happens. A user has a long support session. Another user comes back the next day expecting continuity. A third user asks the agent to summarize a thread that's been going for two hours. Suddenly the agent contradicts itself, drops a constraint the user set six turns ago, or burns through the context budget on history alone and has nothing left for the actual answer. The team raises the truncation limit, costs spike, and somebody points out that doubling the context length roughly quadruples the self-attention compute. Even on providers where the bill scales linearly per token, the model's ability to use tokens in the middle of the context decays fast enough that more tokens do not buy more memory.

The cost shape nobody draws on the whiteboard

The first thing scrollback hides is what your tokens are actually buying. In a long conversation, the dominant line item on the bill is not the model's output. It is not even the system prompt. It is the conversation history, re-sent on every turn, retransmitted in full because the API is stateless and the client is splicing the same turns into the prompt over and over.

The cost is roughly quadratic in conversation length without prompt caching. By turn thirty, the earliest messages have been re-shipped to the provider thirty times. Caching pulls the constant down (Claude's prompt caching, OpenAI's automatic prompt cache), but it does not change the shape: the cache only helps while the prefix is exactly stable, and any tool call result, retrieved document, or timestamp injected into the prompt invalidates the cache from that point onward, so everything after it has to be processed again at full price. Teams that have not measured this tend to assume their token bill scales with output volume. It does not. It scales with the integral of conversation length, and history re-reads are usually two-thirds of the total.
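To make the shape concrete, here is a back-of-the-envelope sketch. The figures are illustrative assumptions, not measurements: 200 tokens per turn, and a cached-read price multiplier of 0.1 that stands in for "cached tokens are cheaper" without matching any provider's published pricing. The function name and parameters are hypothetical, and the model ignores cache-write premiums and output tokens.

```python
def cumulative_input_cost(turns: int, tokens_per_turn: int,
                          cached_read_multiplier: float = 1.0) -> float:
    """Input cost, in full-price-token units, of re-sending history every turn.

    Assumes a perfectly stable prefix: on turn t, the previous t-1 turns are
    billed at the cached rate and only the newest turn is billed at full price.
    A multiplier of 1.0 means no caching at all.
    """
    total = 0.0
    for t in range(1, turns + 1):
        cached = (t - 1) * tokens_per_turn   # history already seen by the provider
        fresh = tokens_per_turn              # the turn being added now
        total += cached * cached_read_multiplier + fresh
    return total


# No caching: doubling the conversation roughly quadruples the input bill.
print(cumulative_input_cost(30, 200))   # 93000.0
print(cumulative_input_cost(60, 200))   # 366000.0

# Idealized caching: a much smaller constant, but growth is still superlinear.
print(round(cumulative_input_cost(30, 200, 0.1)))
print(round(cumulative_input_cost(60, 200, 0.1)))
```

Under these assumptions the uncached bill grows by about 3.9x when the conversation doubles from 30 to 60 turns, and the idealized cached bill still grows by more than 3x. The discount changes the slope, not the curve.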

This matters because it inverts the optimization target. The cheapest token is the one you don't put in the prompt. Every architectural choice in the rest of this post is, fundamentally, a question of which slice of the past to load, when, and at what fidelity — and the brute-force answer of "all of it, every turn" is only viable when conversations are short.

What changes when you call it a database

Once you stop calling it a transcript and start calling it a database, the problem space rearranges itself around questions you already know how to answer. Databases have indexes, query plans, materialized views, eviction policies, and consistency models. Every one of those concepts has an analog in chat history, and every one of them is being implicitly handled — usually badly — in any system that "just appends and renders."

A turn index is the first thing you build. It does not have to be sophisticated; an integer per turn, a timestamp, the role, the intent label produced by a small classifier (question, correction, instruction, tool_result, chitchat), and a content hash is enough to start. With that index in place, you can write queries like "give me the last three turns where the user issued an instruction" without scanning the entire log. Most agents never write a query like that, because the data structure does not support it — they get the tail, in order, regardless of relevance.
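A minimal in-memory sketch of such an index might look like the following. The class and field names (TurnRecord, TurnIndex, last) are hypothetical, and the intent label is passed in by the caller because the classifier itself is out of scope here. A production version would presumably live in a real datastore.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class TurnRecord:
    turn_id: int        # monotonically increasing integer per turn
    timestamp: float
    role: str           # e.g. "user", "assistant", "tool"
    intent: str         # e.g. "question", "correction", "instruction"
    content_hash: str
    content: str


class TurnIndex:
    """Append-only log with a secondary index on (role, intent)."""

    def __init__(self) -> None:
        self._turns: List[TurnRecord] = []
        self._by_key: Dict[Tuple[str, str], List[int]] = {}

    def append(self, role: str, intent: str, content: str,
               timestamp: float) -> TurnRecord:
        rec = TurnRecord(
            turn_id=len(self._turns),
            timestamp=timestamp,
            role=role,
            intent=intent,
            content_hash=hashlib.sha256(content.encode("utf-8")).hexdigest(),
            content=content,
        )
        self._turns.append(rec)
        self._by_key.setdefault((role, intent), []).append(rec.turn_id)
        return rec

    def last(self, n: int, role: str, intent: str) -> List[TurnRecord]:
        """Return the most recent n turns matching role and intent, oldest first."""
        if n <= 0:
            return []
        ids = self._by_key.get((role, intent), [])
        return [self._turns[i] for i in ids[-n:]]


index = TurnIndex()
index.append("user", "instruction", "Act as a paralegal.", 1.0)
index.append("assistant", "chitchat", "Happy to help.", 2.0)
index.append("user", "question", "What does section 4 say?", 3.0)
index.append("user", "instruction", "Cite section numbers.", 4.0)

# "Give me the last three turns where the user issued an instruction."
for rec in index.last(3, role="user", intent="instruction"):
    print(rec.turn_id, rec.content)
```

The lookup touches only the matching turn ids rather than scanning the log, which is the whole point: relevance becomes a query, not a side effect of position.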

Intent-level summaries become a materialized view over the index. Instead of replaying every turn, you carry a running compressed representation: the user is shopping for a flight, they specified Tuesday departure, they ruled out red-eyes, they're price-sensitive but flexible on airline. That summary is a denormalization of the underlying log, refreshed asynchronously on a schedule the application controls. Like any materialized view, it can fall out of sync with the source — the eval discipline below is what keeps you honest about that drift.
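One way to sketch the materialized-view idea is a summary object that records how much of the log it has absorbed, so that staleness is measurable and the prompt can be assembled from the summary plus whatever turns it has not yet covered. Everything here is an assumption made for illustration: the SummaryView name, the covers_through watermark, and the summarize callable, which in a real system would presumably be a model call running off the request path.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class SummaryView:
    text: str = ""
    covers_through: int = 0   # number of log turns already folded into the summary

    def lag(self, log: Sequence[str]) -> int:
        """How many turns the view is behind its source log."""
        return len(log) - self.covers_through

    def refresh(self, log: Sequence[str],
                summarize: Callable[[str, Sequence[str]], str]) -> None:
        """Fold the not-yet-covered turns into the summary."""
        end = len(log)                       # snapshot before the slow call
        new_turns = log[self.covers_through:end]
        if not new_turns:
            return
        self.text = summarize(self.text, new_turns)
        self.covers_through = end

    def context(self, log: Sequence[str]) -> List[str]:
        """Summary plus the raw turns it does not cover yet."""
        head = [f"[summary] {self.text}"] if self.text else []
        return head + list(log[self.covers_through:])


def toy_summarize(previous: str, new_turns: Sequence[str]) -> str:
    # Stand-in for a model call: just concatenates, to keep the sketch runnable.
    parts = ([previous] if previous else []) + list(new_turns)
    return "; ".join(parts)


log = ["wants a flight", "Tuesday departure", "no red-eyes"]
view = SummaryView()
view.refresh(log, toy_summarize)
log.append("flexible on airline")     # arrives after the last refresh

print(view.lag(log))       # 1
print(view.context(log))
```

Because the context falls back to raw turns for anything past the watermark, a stale view costs tokens rather than correctness. Drift inside the summary text itself is a separate problem that this sketch does not detect.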

Eviction policy is where most teams discover their architecture is broken. Truncating the oldest turns is FIFO eviction, ordered purely by arrival, which is wrong: the load-bearing turn is often early in the conversation ("act as a paralegal reviewing a Delaware LLC operating agreement") and gets evicted first, while a fresh round of small-talk from turn 28 stays. A relevance-pinned eviction policy classifies turns as load-bearing (standing instructions the user issued, persistent constraints, named entities the user referenced) and protects them from eviction even as their position recedes. The rest of the conversation is fair game, with eviction biased toward the middle, where the model is least likely to attend anyway thanks to the well-documented lost-in-the-middle effect.
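A relevance-pinned policy along these lines could be sketched as follows. The Turn shape, the pinned flag (assumed to be set upstream by whatever classifies load-bearing turns), the keep_recent parameter, and the distance-from-midpoint ordering are all illustrative choices rather than a prescribed algorithm.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    turn_id: int
    tokens: int
    pinned: bool = False   # load-bearing: never evicted


def evict_to_budget(turns: List[Turn], budget: int,
                    keep_recent: int = 2) -> List[Turn]:
    """Drop unpinned turns, middle first, until the total fits the budget.

    Pinned turns and the last keep_recent turns are protected. If the
    protected turns alone exceed the budget, the result is still over budget.
    """
    total = sum(t.tokens for t in turns)
    position = {t.turn_id: i for i, t in enumerate(turns)}
    recent_ids = {t.turn_id for t in turns[-keep_recent:]} if keep_recent > 0 else set()

    candidates = [t for t in turns
                  if not t.pinned and t.turn_id not in recent_ids]
    # Bias eviction toward the middle of the conversation: closest to the
    # midpoint goes first. The sort is stable, so ties keep original order.
    midpoint = (len(turns) - 1) / 2
    candidates.sort(key=lambda t: abs(position[t.turn_id] - midpoint))

    evicted = set()
    for t in candidates:
        if total <= budget:
            break
        evicted.add(t.turn_id)
        total -= t.tokens

    return [t for t in turns if t.turn_id not in evicted]


history = [Turn(0, 100, pinned=True)] + [Turn(i, 100) for i in range(1, 7)]
kept = evict_to_budget(history, budget=400, keep_recent=2)
print([t.turn_id for t in kept])   # [0, 1, 5, 6]
```

In this run the pinned opening instruction and the two newest turns survive, and the three turns nearest the middle are dropped first. Plain oldest-first truncation with the same budget would have discarded turn 0.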
