Chat History Is a Database. Stop Treating It Like Scrollback.
The most common production complaint about agentic products is some version of "it forgot what we said." The complaint shows up at turn eight, or fifteen, or thirty — never at turn two — and the team's first instinct is always the same: bigger context window. Which is the wrong instinct, because the bug is not in the model. The bug is that the team is treating conversation history as scrollback in a terminal — append a line, render the tail, truncate when full — when what they actually built, without realizing it, is a read-heavy database with append-only writes, a hot working set, an eviction policy hiding inside their truncation rule, and a query pattern that depends on the kind of question being asked. Once you accept that, the entire shape of the problem changes.
The scrollback model is so seductive because the chat UI looks like a transcript. Messages flow downward, the user reads them top-to-bottom, and the natural way to feed the model is to splice the latest N turns into the prompt. The data structure feels free. There's no schema, no index, no query — just append, render, repeat. And for the first few turns, every architecture works. The model has the whole conversation in its context, the bill is small, and the demo is delightful.
Then production happens. A user has a long support session. Another user comes back the next day expecting continuity. A third user asks the agent to summarize a thread that's been going for two hours. Suddenly the agent contradicts itself, drops a constraint the user set six turns ago, or burns through the context budget on history alone and has nothing left for the actual answer. The team raises the truncation limit, costs spike, and somebody points out that doubling the context length roughly quadruples the attention compute. Even on providers where the bill scales linearly per token, the model's ability to use tokens in the middle of the context decays fast enough that more bytes do not buy more memory.
The cost shape nobody draws on the whiteboard
The first thing scrollback hides is what your tokens are actually buying. In a long conversation, the dominant line item on the bill is not the model's output. It is not even the system prompt. It is the conversation history, re-sent on every turn, retransmitted in full because the API is stateless and the client is splicing the same turns into the prompt over and over.
The cost is roughly quadratic in conversation length without prompt caching. By turn thirty, the early messages have been re-shipped to the provider thirty times. Caching pulls the constant down — Claude's prompt caching, OpenAI's automatic prompt cache — but it does not change the shape, because the cache only helps when the prefix is exactly stable, and any tool call result, any retrieved document, any timestamp injected into the prompt invalidates the suffix and forces the cache to re-warm. Teams that have not measured this tend to assume their token bill scales with output volume. It does not. It scales with the integral of conversation length, and history re-reads are usually two-thirds of the total.
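A back-of-envelope sketch makes the shape visible. The 200-tokens-per-turn figure is illustrative, and the model assumes no caching and full history re-sent on each turn:

```python
def total_input_tokens(turns: int, tokens_per_turn: int = 200) -> int:
    # On turn t the prompt carries all t turns so far, so the total input
    # across a T-turn conversation telescopes to k * T(T+1)/2: quadratic in T.
    return tokens_per_turn * turns * (turns + 1) // 2

for t in (10, 30, 100):
    print(t, total_input_tokens(t))
# 10  ->    11,000 tokens of history shipped
# 30  ->    93,000
# 100 -> 1,010,000
```

Tripling the conversation length from 10 to 30 turns costs roughly 8x the history tokens; that is the curve the truncation limit was quietly holding down.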
This matters because it inverts the optimization target. The cheapest token is the one you don't put in the prompt. Every architectural choice in the rest of this post is, fundamentally, a question of which slice of the past to load, when, and at what fidelity — and the brute-force answer of "all of it, every turn" is only viable when conversations are short.
What changes when you call it a database
Once you stop calling it a transcript and start calling it a database, the problem space rearranges itself around questions you already know how to answer. Databases have indexes, query plans, materialized views, eviction policies, and consistency models. Every one of those concepts has an analog in chat history, and every one of them is being implicitly handled — usually badly — in any system that "just appends and renders."
A turn index is the first thing you build. It does not have to be sophisticated; an integer per turn, a timestamp, the role, the intent label produced by a small classifier (question, correction, instruction, tool_result, chitchat), and a content hash is enough to start. With that index in place, you can write queries like "give me the last three turns where the user issued an instruction" without scanning the entire log. Most agents never write a query like that, because the data structure does not support it — they get the tail, in order, regardless of relevance.
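As a sketch of how little this takes, here is that index as a SQLite table. The column names and intent taxonomy are illustrative, not prescriptive:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE turns (
        turn_no      INTEGER PRIMARY KEY,  -- monotonically increasing
        ts           TEXT NOT NULL,        -- ISO-8601 timestamp
        role         TEXT NOT NULL,        -- 'user' | 'assistant' | 'tool'
        intent       TEXT NOT NULL,        -- question | correction | instruction | tool_result | chitchat
        content_hash TEXT NOT NULL,        -- dedupe / cache-invalidation key
        content      TEXT NOT NULL
    )
""")
db.execute("CREATE INDEX idx_role_intent ON turns (role, intent, turn_no)")

# "The last three turns where the user issued an instruction": an indexed
# lookup, not a scan of the whole log, and not just the tail.
rows = db.execute(
    "SELECT turn_no, content FROM turns"
    " WHERE role = 'user' AND intent = 'instruction'"
    " ORDER BY turn_no DESC LIMIT 3"
).fetchall()
```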
Intent-level summaries become a materialized view over the index. Instead of replaying every turn, you carry a running compressed representation: the user is shopping for a flight, they specified Tuesday departure, they ruled out red-eyes, they're price-sensitive but flexible on airline. That summary is a denormalization of the underlying log, refreshed asynchronously on a schedule the application controls. Like any materialized view, it can fall out of sync with the source — the eval discipline below is what keeps you honest about that drift.
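A minimal sketch of the refresh loop, with `llm_summarize` as a stand-in for whatever model call your stack actually makes:

```python
from dataclasses import dataclass

@dataclass
class IntentSummary:
    text: str = ""          # compressed state: goals, constraints, exclusions
    fresh_through: int = 0  # high-water mark: last turn_no folded in

def llm_summarize(prior: str, turns: list[str]) -> str:
    # Stand-in for a model call that merges new turns into the running summary.
    return (prior + " | " if prior else "") + " ".join(turns)

def refresh_summary(summary: IntentSummary, log: list[tuple[int, str]]) -> IntentSummary:
    """Fold turns newer than the view's high-water mark into the summary.
    Meant to run asynchronously, on a schedule the application controls."""
    pending = [(n, t) for n, t in log if n > summary.fresh_through]
    if not pending:
        return summary  # the view is already in sync with the log
    merged = llm_summarize(summary.text, [t for _, t in pending])
    return IntentSummary(merged, max(n for n, _ in pending))
```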
Eviction policy is where most teams discover their architecture is broken. Truncating the oldest turns is FIFO eviction by arrival order, which is wrong: the load-bearing turn is often early in the conversation ("act as a paralegal reviewing a Delaware LLC operating agreement") and gets evicted first, while a fresh round of small-talk from turn 28 stays. A relevance-pinned eviction policy classifies turns as load-bearing (system instructions the user issued, persistent constraints, named entities the user referenced) and protects them from eviction even as their position recedes. The rest of the conversation is fair game, with eviction biased toward the middle, where the model is least likely to attend anyway thanks to the well-documented lost-in-the-middle effect.
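Sketched in code, with a deliberately crude load-bearing heuristic (a real system would classify with a model, not a set lookup):

```python
LOAD_BEARING = {"instruction", "correction"}  # persistent constraints, role setup

def evict(turns: list[dict], budget_tokens: int) -> list[dict]:
    """turns: dicts with 'intent', 'tokens', 'turn_no', oldest first.
    Pinned turns survive regardless of age; the rest is evicted middle-first."""
    pinned = [t for t in turns if t["intent"] in LOAD_BEARING]
    rest = [t for t in turns if t["intent"] not in LOAD_BEARING]
    while rest and sum(t["tokens"] for t in pinned + rest) > budget_tokens:
        rest.pop(len(rest) // 2)  # drop the middle-most unpinned turn first
    # Note: the pinned set needs its own cap in a real system, or it becomes
    # the unbounded hot tier all over again.
    return sorted(pinned + rest, key=lambda t: t["turn_no"])
```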
A query plan per turn is the unlock that scrollback architectures cannot reach. Not every turn needs the same slice of history. A user asking "what did I just say?" needs the last two turns at full fidelity. A user asking "where did we land on the airline?" needs a targeted retrieval against the constraint subset of the index. A user starting a new sub-task needs almost no history at all — the system prompt and the most recent intent are sufficient. The query plan decides which slice to load, at what fidelity, before the prompt is assembled. That decision is the most leveraged knob in the whole system, and it is invisible to teams that are still thinking "tail of length N."
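The plan can be as plain as a dispatch table. The intent labels and slice parameters below are hypothetical; the point is only that the decision happens before the prompt is assembled:

```python
def plan_history(query_intent: str) -> dict:
    """Decide which slice of the past to load, at what fidelity,
    before prompt assembly."""
    if query_intent == "immediate_recall":   # "what did I just say?"
        return {"tail_turns": 2, "summary": False, "retrieve": None}
    if query_intent == "constraint_lookup":  # "where did we land on the airline?"
        return {"tail_turns": 1, "summary": True, "retrieve": "constraints"}
    if query_intent == "new_subtask":        # system prompt + latest intent suffice
        return {"tail_turns": 0, "summary": True, "retrieve": None}
    return {"tail_turns": 5, "summary": True, "retrieve": "semantic"}  # conservative default
```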
The hot/warm/cold tiering everyone reinvents
By the time teams have thought hard about query plans and eviction, they invariably arrive at a tiered storage model that looks the same across every framework that's serious about this problem. There is a hot tier — the in-context working set, kept small and recent and load-bearing. There is a warm tier — compressed summaries and intent extracts, retrievable cheaply, periodically refreshed. There is a cold tier — the full transcript and any extracted facts, indexed for similarity search and only paged in on demand.
Letta (the production evolution of the MemGPT line of work) names these core, recall, and archival. Mem0 and Zep ship variants of the same shape with different tradeoffs on latency and consolidation. The naming differs; the structure does not. The reason the structure converges is that it is the same structure as a CPU cache hierarchy or a database buffer pool: a small fast tier, a larger slower tier, and a cold backing store, with explicit paging operations between them. There's nothing novel about this in a database. The novelty is in noticing that you are building one.
The trap to avoid here is treating the tiers as a feature rather than a contract. The hot tier has a hard size budget set by token economics. The warm tier has a freshness budget set by how often summaries get re-derived. The cold tier has a recall budget set by retrieval latency. If those budgets are not enforced explicitly — if the hot tier can grow without bound because nobody wired the eviction signal, if the warm tier never refreshes because the consolidation job is best-effort — the tiers stop functioning as a hierarchy and become a flat, expensive log with extra steps.
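One way to make the contract explicit is to wire the budgets into the write path itself. A sketch, with illustrative thresholds and a stand-in summarizer:

```python
class TieredMemory:
    HOT_BUDGET_TOKENS = 4_000  # hard size budget: token economics
    WARM_MAX_STALENESS = 10    # freshness budget: turns between summary refreshes
    COLD_RECALL_MS = 150       # recall budget: latency ceiling, enforced at query time

    def __init__(self):
        self.hot: list[dict] = []   # in-context working set
        self.cold: list[dict] = []  # full transcript; stand-in for a real store
        self.warm_summary = ""      # compressed intent extract
        self.stale_turns = 0

    def append(self, turn: dict) -> None:
        self.cold.append(turn)      # cold tier is append-only, always written
        self.hot.append(turn)
        self.stale_turns += 1
        # Budgets are enforced on the write path, not left to a best-effort job.
        while sum(t["tokens"] for t in self.hot) > self.HOT_BUDGET_TOKENS:
            self.hot.pop(0)         # simplest policy; swap in relevance-pinned eviction here
        if self.stale_turns >= self.WARM_MAX_STALENESS:
            self.warm_summary = summarize(self.hot)  # consolidation is scheduled, not optional
            self.stale_turns = 0

def summarize(turns: list[dict]) -> str:
    return " / ".join(t["text"] for t in turns)  # stand-in for a model call
```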
The eval discipline catches the silent failures
The reason "the agent forgot what we said" is the most common production complaint, and not the most-fixed one, is that memory failures are silent. There is no exception, no log line, no obvious regression in any aggregate metric. The agent just gives a slightly worse answer because the load-bearing turn from earlier got paged out, or the summary lost a constraint in compression, or the retrieval grabbed the wrong slice of the past. Aggregate accuracy looks fine. The user gets quietly burned.
The eval discipline that catches this is long-horizon by construction. LOCOMO and similar benchmarks generate multi-session conversations that run to hundreds of turns, with questions that probe single-hop recall, multi-hop reasoning across turns, temporal ordering, and adversarial cases where an earlier statement was retracted. The point is not to score the model in the abstract; it is to score the memory system under realistic eviction pressure, where the agent has to retrieve a fact established at turn 37 in session 4 from a corpus where turn 37 has long since been paged out of the hot tier.
The crucial design detail is that the eval should run with eviction enabled, not with infinite context. Teams that benchmark with full history loaded into a million-token window are measuring the model's recall, not their system's. Their production agent is running with truncation, summarization, and retrieval, and the eval has to mirror that or it tells you nothing useful. The questions to track over time are: when the hot tier is full, does the eviction policy keep load-bearing turns? When the summary is stale, does the agent flag uncertainty or confabulate? When retrieval grabs the wrong slice, does any guardrail catch it before the answer ships? Most teams have no answer to these questions because they never run the eval that asks them.
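The harness shape is simple once the memory system exposes the same append path production uses. Everything named below (`agent_answer`, the probe tuples) is a placeholder:

```python
def run_memory_eval(memory, transcript, probes):
    """Replay, then probe. transcript: list of turn dicts, oldest first.
    probes: list of (question, expected_substring, planted_at_turn) tuples."""
    for turn in transcript:
        memory.append(turn)  # same eviction/summarization path as production

    results = []
    for question, expected, planted_at in probes:
        answer = agent_answer(memory, question)
        results.append({
            "question": question,
            "planted_at_turn": planted_at,  # e.g. turn 37, long since paged out of hot
            "correct": expected.lower() in answer.lower(),
        })
    return results

def agent_answer(memory, question: str) -> str:
    # Stand-in: assemble the prompt from the memory tiers and call the model.
    return ""
```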
Pick the database
Once chat history is something you can describe in database terms, the implementation question stops being existential and becomes ordinary. You're not building an agent memory system from scratch; you're picking a database that fits your workload. The workload is read-heavy with append-only writes, the access patterns mix point lookups (last turn, specific intent), range scans (recent N turns), and similarity search (semantically similar past content), and the size scales with active sessions times conversation length.
Redis is a reasonable hot-tier backing store: low-latency point reads, TTL-driven eviction, hash structures for the turn index, and the search modules now include vector indexes good enough for the warm-tier retrieval. SQLite is the answer when the workload is per-user and the agent runs on the edge or in a single-tenant container — the entire chat history is small enough to live in a file, full-text search is a built-in module, and the operational surface area is roughly zero. Postgres with pgvector is the answer when chat history needs to participate in transactions with other application state, or when teams already have Postgres operational expertise and don't want to add a system. A custom log-structured store is justified only when none of the above match the access pattern, which is rare.
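As one concrete instantiation, here is the per-user SQLite case: a turn table plus an FTS5 index over the cold tier. The schema and filename are illustrative:

```python
import sqlite3

db = sqlite3.connect("user_42_history.db")  # hypothetical per-user file
db.executescript("""
    CREATE TABLE IF NOT EXISTS turns (
        turn_no INTEGER PRIMARY KEY, role TEXT, intent TEXT, content TEXT
    );
    -- External-content FTS5 table: the index stores tokens, rows live in `turns`.
    CREATE VIRTUAL TABLE IF NOT EXISTS turns_fts
        USING fts5(content, content='turns', content_rowid='turn_no');
""")

def append_turn(turn_no: int, role: str, intent: str, content: str) -> None:
    db.execute("INSERT INTO turns VALUES (?, ?, ?, ?)", (turn_no, role, intent, content))
    db.execute("INSERT INTO turns_fts(rowid, content) VALUES (?, ?)", (turn_no, content))
    db.commit()

def search(query: str, limit: int = 5):
    # Cold-tier lookup: page in only the turns that match, not the whole log.
    return db.execute(
        "SELECT rowid, content FROM turns_fts WHERE turns_fts MATCH ?"
        " ORDER BY rank LIMIT ?", (query, limit)
    ).fetchall()
```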
The choice is not magical. It is the same choice you make for any other read-heavy workload, with the additional constraint that the hot working set has to fit in a token budget rather than a memory budget. The architectural realization that unlocks this whole class of choices is small but stubborn: session state is the data layer of an agentic product. It is not a side effect of the chat UI. It is not free. It has a cost shape, an access pattern, and a failure mode, and it deserves the same kind of care your team would give to any other piece of production storage. Once you treat it that way, "the agent forgot what we said" stops being a mystery and starts being a tractable engineering problem with known levers.
- https://snap-research.github.io/locomo/
- https://arxiv.org/abs/2402.17753
- https://arxiv.org/abs/2307.03172
- https://www.letta.com/blog/agent-memory
- https://docs.letta.com/concepts/memgpt/
- https://research.memgpt.ai/
- https://mem0.ai/blog/llm-chat-history-summarization-guide-2025
- https://redis.io/blog/build-smarter-ai-agents-manage-short-term-and-long-term-memory-with-redis/
- https://hermesos.cloud/blog/ai-agent-memory-systems
- https://platform.claude.com/docs/en/build-with-claude/context-windows
