23 posts tagged with "memory"

The Conversation Memory Pruning Heuristic That Erased the Context the Next Question Needed

· 9 min read
Tian Pan
Software Engineer

A user opens your long-session agent and says, in turn 3, "I'm vegetarian and on a tight budget." The conversation continues. Eleven turns later, the pruner runs. It counts tokens, finds turn 3 old and short, and drops it to keep the window inside budget. Turn 14 asks, "what should I cook tonight?" The model, looking at a window where the constraint no longer exists, recommends a $40 ribeye. The user reads this as the agent getting worse, opens the satisfaction survey, and rates the session a 2.

Nothing in your stack will report a memory failure. The token-budget dashboard will show the window staying healthily under the cap. The latency dashboard will be green. The eval suite — which scores single-turn answers against a held-out set — will report no regression. The only signal that the agent's competence dropped is a thumbs-down rating that your product team will attribute to "model variance." It will not be model variance. It will be a pruning heuristic doing exactly what it was tuned to do, on the wrong objective.
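One way to picture the alternative is a pruner that refuses to evict turns marked as standing constraints. The sketch below is only an illustration under stated assumptions: the `Turn` class, the `pinned` flag, the whitespace token count, and the sample history are all hypothetical, not the post's implementation.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    index: int
    text: str
    pinned: bool = False  # hypothetical flag: this turn states a standing constraint


def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: whitespace-separated words.
    return len(text.split())


def prune(turns, budget):
    """Drop the oldest unpinned turns until the window fits the budget."""
    kept = list(turns)
    total = sum(count_tokens(t.text) for t in kept)
    for turn in turns:  # oldest first
        if total <= budget:
            break
        if turn.pinned:
            continue  # never evict a standing constraint
        kept.remove(turn)
        total -= count_tokens(turn.text)
    return kept


history = [
    Turn(1, "hi there"),
    Turn(3, "I'm vegetarian and on a tight budget", pinned=True),
    Turn(9, "tell me about knife sharpening in a lot of detail please"),
    Turn(14, "what should I cook tonight?"),
]
window = prune(history, budget=14)
```

With this sample history the constraint from turn 3 stays in the window alongside turn 14. If pinned turns alone exceed the budget, the sketch returns an over-budget window rather than dropping them; a real system would need a policy for that case.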

The Conversation Summarization That Erased the Consent Flag the User Gave You

· 11 min read
Tian Pan
Software Engineer

At turn 3, your user clicked "do not retain my code." At turn 7, they toggled off "use my conversations to improve the model." At turn 12, they opted out of cross-session memory. At turn 40, your context budget runs out. The compaction pass folds turns 1–30 into a tidy 200-token summary that reads beautifully: it captures what the user asked, what your agent did, and what came of it. At turn 41, your agent, armed with that summary and the most recent ten turns, confidently writes the user's code into a memory store, even though the user declined code retention at turn 3 and opted out of cross-session memory at turn 12.

Your audit log now contains a consent event at t=3, a violating action at t=41, and between them a paragraph of prose that has no field for why the action was permitted. The summarizer was trained to compress conversations, not to forward control state. Nobody told it the consent toggle was load-bearing. Nobody could have, because consent wasn't in the conversation — it was in a structured field next to it, and the structured field didn't survive the trip through summarization.
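A minimal sketch of the idea that control state should travel next to the summary rather than through it. Everything here is an assumption for illustration: the `ConsentState` fields, the `compact` function, and the placeholder summary string that stands in for a real summarization call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConsentState:
    # Hypothetical flags mirroring the three opt-outs in the scenario.
    retain_code: bool = True
    train_on_conversations: bool = True
    cross_session_memory: bool = True


@dataclass
class CompactedContext:
    summary: str
    recent_turns: list
    consent: ConsentState  # copied through verbatim, never summarized


def compact(turns, consent, keep_last=10):
    """Fold older turns into a summary; pass consent through untouched."""
    older = turns[:-keep_last]
    summary = f"[summary of {len(older)} earlier turns]"  # stand-in for an LLM call
    return CompactedContext(summary, turns[-keep_last:], consent)


def may_store_code(ctx):
    # The write path checks structured state, not the prose summary.
    return ctx.consent.retain_code and ctx.consent.cross_session_memory


consent = ConsentState(retain_code=False, train_on_conversations=False,
                       cross_session_memory=False)
ctx = compact([f"turn {i}" for i in range(1, 41)], consent)
```

Here `may_store_code(ctx)` is false after compaction because the flags never passed through the summarizer in the first place.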

The Agent A/B Test Whose Variants Quietly Shared Long-Term Memory

· 11 min read
Tian Pan
Software Engineer

You ran the cleanest A/B test of your career. Traffic split was 50/50, the hash function looked fine, the metric pipeline did not lie, the holdout was preserved, and after three weeks the analysis converged on a clear winner: variant B improved task completion by four points, with a p-value the stats team had no objections to. You shipped it to 100%. Two weeks after the rollout, the topline metric you launched on had drifted back toward the baseline, and nobody could explain why.

Here is the part that took a while to see. Both variants were writing to and reading from the same long-term memory store. The variant A agent wrote a memory like "this customer prefers blunt summaries," and the next day, when the same user happened to be on variant B, the variant B agent loaded that memory into its prompt. The same thing happened in the other direction. The experiment was not comparing prompt A against prompt B. It was comparing "prompt A reading prompt-B-shaped memories" against "prompt B reading prompt-A-shaped memories." The result was an average over a contaminated joint distribution, and the launch was a regression to a different point on the same surface.
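One possible way to keep experiment arms from reading each other's memories is to key the store by variant as well as by user. The sketch below assumes a hypothetical in-memory `VariantScopedMemory` class; it is an illustration of the isolation idea, not a description of any particular framework.

```python
class VariantScopedMemory:
    """Long-term memory keyed by (user, experiment variant)."""

    def __init__(self):
        self._store = {}  # (user_id, variant) -> list of memory strings

    def write(self, user_id, variant, memory):
        self._store.setdefault((user_id, variant), []).append(memory)

    def read(self, user_id, variant):
        # Only memories written under the same variant are visible.
        return list(self._store.get((user_id, variant), []))


memory = VariantScopedMemory()
memory.write("user-1", "A", "this customer prefers blunt summaries")
seen_by_b = memory.read("user-1", "B")
seen_by_a = memory.read("user-1", "A")
```

The cost of this design is that each arm starts with a colder memory, which is itself a difference from production that the analysis would need to account for.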

The Agent That Memorized Your Bug: Why a Fix Is a Memory-Invalidation Event

· 9 min read
Tian Pan
Software Engineer

A few months ago, one of your downstream APIs returned a malformed timestamp — seconds where it should have been milliseconds, or a null where the schema promised a string. Your agent hit it, reasoned through the breakage, and worked out a fix: multiply by 1000, or fall back to a default, or retry with a different endpoint. It solved the immediate problem. Then it did something quietly consequential. It wrote the workaround down.

Maybe it saved a note to long-term memory: "The billing API returns timestamps in seconds; convert before use." Maybe the interaction got swept into a fine-tuning dataset, and the workaround became a learned reflex. Either way, the agent now carries a belief about the world. And last week, the API team shipped a fix. The timestamps are correct now. Nobody told the agent.
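To make "a fix is a memory-invalidation event" concrete, here is a rough sketch in which each note records which external system it is about and which version it was observed against. The `Memory` fields, the version strings, and the second sample note are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    note: str
    depends_on: str        # hypothetical ID of the external system the note is about
    observed_version: str  # version of that system when the note was written


def invalidate_on_fix(memories, dependency, fixed_version):
    """Drop notes about `dependency` observed against any other version."""
    return [
        m for m in memories
        if not (m.depends_on == dependency and m.observed_version != fixed_version)
    ]


memories = [
    Memory("The billing API returns timestamps in seconds; convert before use.",
           depends_on="billing-api", observed_version="v1"),
    Memory("User prefers ISO 8601 dates in reports.",
           depends_on="user-profile", observed_version="v1"),
]
remaining = invalidate_on_fix(memories, "billing-api", fixed_version="v2")
```

This only covers the note-in-memory case. A workaround that was absorbed through fine-tuning has no record to delete, so it would need a different remedy.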

Why Your Agent Needs a Read Replica: Read/Write Splitting for Agent Memory

· 10 min read
Tian Pan
Software Engineer

Most agent memory is one undifferentiated store. The loop reads from it to assemble context at the start of every step, and writes to it after every action — new observations, running summaries, scratchpad edits. Same store, same access path, no separation. It works fine in a demo and starts to rot the moment the agent runs long enough for the store to get large.

The reason it rots is familiar to anyone who has scaled a database. A single store that serves both reads and writes is a single-primary database with no replica, and it inherits every problem that topology has under load: writes contend with reads, a half-written record gets read mid-update, and there is no isolation between the volatile working set and the durable record. We solved this for databases decades ago by splitting reads from writes. Agent memory deserves the same treatment.

The fix is not a bigger vector index or a smarter embedding model. It is an architectural one — recognizing that "memory" is two different workloads wearing the same name, and giving each the storage discipline it actually needs.
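As a very small illustration of the split, the sketch below separates a volatile write side from an immutable read snapshot that is only replaced on an explicit publish step. The `SplitMemory` class and its method names are assumptions, and the example is single-process; it shows the shape of the separation, not a production replication scheme.

```python
class SplitMemory:
    def __init__(self):
        self._pending = []   # write side: volatile working set
        self._snapshot = ()  # read side: immutable, published snapshot

    def write(self, record):
        self._pending.append(record)

    def publish(self):
        # Swap in a new tuple so readers never observe a half-applied batch.
        self._snapshot = self._snapshot + tuple(self._pending)
        self._pending.clear()

    def read(self):
        return self._snapshot


mem = SplitMemory()
mem.write("observation: build failed")
before = mem.read()  # write is not yet visible to readers
mem.publish()
after = mem.read()   # visible only after the explicit publish
```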

The Agent That Remembers What You Took Back: Deletion as a First-Class Memory Operation

· 10 min read
Tian Pan
Software Engineer

In March, a user told your agent to stop recommending restaurants with outdoor seating — they had moved to an apartment with a baby and late nights were over. In September, the agent suggests a rooftop bar for their anniversary. The user is annoyed, and you are confused, because you watched the March correction land. It got written to memory. It is still there. The problem is that it is sitting next to the original preference, which is also still there, and retrieval surfaced the older one because it had a slightly better embedding match for "anniversary dinner."

This is the failure mode nobody designs for. Teams spend weeks on memory writes — extraction, summarization, embedding, namespacing — and treat deletes as a someday problem. Long-term memory makes adding a fact almost free, so facts accumulate. But a memory store is not a diary. A diary is allowed to contain things that used to be true. A memory store that an agent reads from to make decisions is not, because the agent cannot tell the difference between a fact and a fossil.
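One way to treat a correction as more than an append is to mark the older fact as superseded, so retrieval can never surface it. The sketch below is illustrative: the `Fact` class, the `superseded_by` field, and the sample fact texts are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Fact:
    id: int
    text: str
    superseded_by: Optional[int] = None  # ID of the fact that replaced this one


def supersede(facts, old_id, new_fact):
    """Record a correction: mark the old fact and append the new one."""
    for fact in facts:
        if fact.id == old_id:
            fact.superseded_by = new_fact.id
    facts.append(new_fact)


def retrievable(facts):
    # Only live facts are eligible for embedding search and prompt assembly.
    return [f for f in facts if f.superseded_by is None]


facts = [Fact(1, "likes restaurants with outdoor seating")]
supersede(facts, 1, Fact(2, "no restaurants with outdoor seating"))
live = retrievable(facts)
```

The old fact is kept for audit purposes but filtered before retrieval, so a better embedding match can no longer bring it back.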

Agent Memory Is a Cache With No Invalidation Policy

· 9 min read
Tian Pan
Software Engineer

Every agent framework now ships "long-term memory" as a headline feature, and every team adopts it as an unambiguous good. The agent remembers the user's preferences, prior decisions, project context, and the corrections it was given last week, so each session starts warmer than the last. The demo is irresistible: a user says "set up the project the way I like it" and the agent just does it. Nobody asks the obvious question, because the framing of the feature actively discourages it.

The question is: when does any of that stop being true?

A memory store is a cache. It holds facts about a world that does not hold still. The agent recorded "the user prefers Postgres" eight months ago, and the team has since migrated to a different database. The agent remembers "the user is on the growth team," and the user changed roles in March. The agent stored a tidy summarized conclusion from a conversation whose premises were corrected two messages later. And the memory layer surfaces all of it with exactly the same confident freshness as a fact written this morning. We have spent fifty years learning that a cache without an invalidation policy is a correctness bug. Then we built agent memory and shipped it without one.
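A time-to-live is the simplest invalidation policy a cache can have, and a sketch of it shows how little the memory layer currently does. The `CachedFact` fields, the 90-day TTL, and the second sample fact are hypothetical; TTL is one policy among several, not a claim about what the post recommends.

```python
from dataclasses import dataclass


@dataclass
class CachedFact:
    text: str
    written_at: float  # seconds since epoch
    max_age: float     # hypothetical per-fact TTL in seconds


def fresh_facts(facts, now):
    """Return only facts whose age is within their TTL."""
    return [f for f in facts if now - f.written_at <= f.max_age]


DAY = 86_400.0
now = 400 * DAY
facts = [
    # Roughly eight months old, past its 90-day TTL.
    CachedFact("the user prefers Postgres", written_at=now - 240 * DAY, max_age=90 * DAY),
    CachedFact("the user asked for terse answers", written_at=now - 1 * DAY, max_age=90 * DAY),
]
usable = fresh_facts(facts, now)
```

An expired fact does not have to be deleted. It could instead be surfaced with its age attached, or re-confirmed with the user before it is trusted again.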

Agent Memory Is a Compliance Surface: The Records-Management System You Didn't Sign Up to Build

· 12 min read
Tian Pan
Software Engineer

The first compliance escalation against your agent memory layer almost never arrives as a regulator's letter. It arrives as a Jira ticket from your enterprise sales engineer that says "the customer's privacy team is blocking the contract — they want to know what 'forget my user' actually means in your system, and they want a written answer by Friday." That ticket lands six to twelve months after the memory layer shipped, and the engineering team that built it discovers, in the time it takes to read the question, that they accidentally built a records-management system without any of the primitives a records-management system is supposed to have.

This is the structural problem with long-term memory in agentic products. The team building it optimizes for the things memory is sold to do — retrieval quality, latency, storage cost, the felt-personalization that makes the assistant feel like it knows the user. Nobody in the design review prices the parallel system being built at the same time: a per-user, per-tenant, multi-region data store with retention obligations, deletion semantics, audit export requirements, and a regulator's clock that starts the moment the first user's data lands in it. Memory is not a feature. It is the operational surface that every privacy regime, every enterprise procurement questionnaire, and every right-to-erasure request will eventually find.

Agent Memory Eviction: Why LRU Survives a Model Upgrade and Salience Doesn't

· 9 min read
Tian Pan
Software Engineer

The team that ships an agent with salience-weighted memory eviction has, without realizing it, signed up for a memory migration project at every model upgrade. The eviction policy looks like a quality lever (pick the smartest scoring approach, get the best recall), but it is secretly a versioning contract. When the scoring model changes, the agent's effective past changes too. None of the tooling that teams build around prompts and evals catches it, because the artifact that drifted is not a prompt or an eval. It is a sequence of decisions about what to forget, made months ago, by a model that no longer exists.

LRU and LFU don't have this problem. They are deterministic, model-independent, and survive upgrades cleanly. They also throw away information that a thoughtful judge would have kept. That is the tradeoff most teams accept once, on day one, when a demo recall metric is the only thing being measured — and it is the tradeoff that bites quarterly for the rest of the agent's lifetime.
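For reference, LRU eviction needs nothing but access order, which is why it behaves identically before and after a model upgrade. A minimal sketch using the standard library's `OrderedDict`; the `LRUMemory` class name and the sample keys are illustrative.

```python
from collections import OrderedDict


class LRUMemory:
    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used

    def keys(self):
        return list(self._items)


mem = LRUMemory(capacity=2)
mem.put("a", 1)
mem.put("b", 2)
mem.get("a")     # touch "a" so "b" becomes the eviction candidate
mem.put("c", 3)  # evicts "b"
```

No model is consulted at any point, so there is no scoring function whose replacement could silently change which memories survived.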

Cross-Channel Memory: When Your Agent Forgets the Email Thread

· 10 min read
Tian Pan
Software Engineer

A customer asks your assistant in Slack on Monday how to enable a feature, gets a clean answer, and goes about their day. On Friday they email asking to confirm what was decided, and the assistant — running off a different session store, with no idea Monday's chat ever happened — gives a contradictory recommendation. The customer doesn't file two tickets against two products. They file one ticket against your AI, and they're right to. To them there is one assistant. The fact that you wrote three of them, glued to three surface-specific session stores, is an implementation detail you weren't supposed to leak.

This is the cross-channel memory problem, and it sits at the intersection of two things teams underestimate: how aggressively users assume continuity, and how aggressively channel teams write their own session stores because it was the path of least resistance to ship. Recent industry data puts the gap in stark terms — only 13% of organizations successfully carry full conversation context across channels, and CSAT for fragmented multichannel support sits at 28% versus 67% for true omnichannel. The 39-point delta isn't a model quality gap. It's a memory architecture gap.

The Session Boundary Problem: Where a Conversation Ends for Billing, Eval, and Memory

· 11 min read
Tian Pan
Software Engineer

Three teams are looking at the same event stream, each with a column called session_id, and each with a different definition of what a session is. Billing inherited a 30-minute idle window from the auth library. Eval inherited "everything until the user says 'bye' or stops typing for 10 minutes" from a chatbot framework. Memory uses a thread ID that the UI generates whenever the user clicks "New chat," which most users never do. Three columns, three semantics, one rolled-up dashboard, three seemingly unrelated bugs that share a root cause.

This is the session boundary problem. It looks like an instrumentation nit, but it is actually a product question wearing infrastructure clothes: where does a conversation end? The honest answer is that there is no single answer — a session for billing is not the same object as a session for eval is not the same object as a session for memory — and a team that picks one default and lets the other two inherit it is shipping a billing dispute, an eval bias, and a memory leak with the same root cause.
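The disagreement is easy to reproduce: the same event stream yields different session counts under the 30-minute and 10-minute idle windows mentioned above. In this sketch the `sessionize` function and the event timestamps are hypothetical, and the "bye" rule from the eval definition is left out for brevity.

```python
def sessionize(timestamps, idle_gap):
    """Split sorted timestamps into sessions wherever the gap exceeds idle_gap."""
    sessions = []
    for ts in timestamps:
        if sessions and ts - sessions[-1][-1] <= idle_gap:
            sessions[-1].append(ts)  # still inside the current session
        else:
            sessions.append([ts])    # gap too large: start a new session
    return sessions


events = [0, 5, 20, 55, 60]  # hypothetical event times, in minutes
billing_sessions = sessionize(events, idle_gap=30)  # [[0, 5, 20], [55, 60]]
eval_sessions = sessionize(events, idle_gap=10)     # [[0, 5], [20], [55, 60]]
```

Two sessions for billing and three for eval, from identical events. A memory layer keyed on a thread ID would count one.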

The Summary Tax: When Compaction Eats More Tokens Than It Saves

· 10 min read
Tian Pan
Software Engineer

A long-running agent crosses its compaction threshold every twelve turns. Each pass costs an LLM call sized to the running window — first 8K tokens, then 14K, then 22K — because the span being summarized grows with every trigger. By turn sixty, the user has spent more tokens watching the agent re-summarize itself than they spent on the actual reasoning that mattered. The cost dashboard reads "user inference cost" as a single number, blissfully unaware that half of it paid for compression of context the user will never look at again.

This is the summary tax: a class of overhead that scales with conversation length, fires invisibly between user turns, and shows up as a single line item that conflates the work the user paid for with the bookkeeping the system did to manage itself. It is the closest thing modern agent architectures have to garbage-collection pause time — and most teams are running production with -verbose:gc turned off.
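The arithmetic behind the tax can be sketched directly. The three pass sizes come from the example above; the reasoning spend is a hypothetical figure chosen only to illustrate the "half of it" case, and `summary_tax_share` is an assumed helper name.

```python
def summary_tax_share(summary_call_tokens, reasoning_tokens):
    """Fraction of total tokens spent on compaction rather than reasoning."""
    summary_total = sum(summary_call_tokens)
    return summary_total / (summary_total + reasoning_tokens)


passes = [8_000, 14_000, 22_000]  # first three compaction calls from the example
reasoning = 44_000                # hypothetical reasoning spend, for illustration
share = summary_tax_share(passes, reasoning)
```

Reporting this share as its own line item, separate from user-visible inference, is what makes the overhead observable at all.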