Skip to main content

141 posts tagged with "rag"

View all tags

Hierarchical Memory Compaction: The Four Tiers Your Agent Memory Is Missing

· 11 min read
Tian Pan
Software Engineer

Most agent memory systems collapse a four-layer problem into two layers and then act surprised when the seams show. There is the conversation buffer that gets truncated when it overflows the context window, and there is the vector store of "long-term memory" that everything older than the buffer gets dumped into. That is not a memory architecture. That is a queue and a junk drawer.

The agent that re-asks a regular user the same onboarding question three Mondays in a row is not failing because the model is bad. It is failing because there is no place in the system that holds "things this user has told me across sessions" with a different lifetime than "things every user has ever told me about how the product works." Those are different memories. They have different access patterns, different privacy contracts, and different rules for when to forget. Conflating them is the architectural mistake — and it has a fix.

The Privacy Boundary No One Tests: Why 'Stateless' Tools Are the AI-Era IDOR

· 10 min read
Tian Pan
Software Engineer

A tool labeled "stateless" is a promise the runtime cannot keep. Behind the function signature sits a Redis cache, a vector index, an embedding store, a rate-limit table, a memoization layer, an LRU on the hot path — any one of which is a shared substrate where one user's data can land on another user's response. The function is stateless. The system is not. And in 2026, this is the most common privacy bug I see in agentic systems, because almost no one tests for it.

The shape of the bug is depressingly familiar to anyone who has worked on classic web apps. Insecure Direct Object Reference — IDOR — was the bread and butter of bug bounty for a decade: a request handler that accepts a record ID and returns the record without checking whether the caller is allowed to see it. The AI-era version is the same bug with a worse blast radius: a tool call that accepts a query and returns data without checking whether the caller's tenant owns that data. The query is in natural language. The cache key is a hash. The retrieval is approximate. None of those things absolve you of authorization, but each of them makes the bug harder to spot in code review.

Per-Vector Version Tags: The Missing Column Behind Every Embedding Migration

· 10 min read
Tian Pan
Software Engineer

A new embedding model lands. The benchmark numbers are 4% better. A staff engineer files the ticket: "Upgrade embeddings to v3." Two weeks later the index has been re-embedded, the alias has been swapped, and the team has shipped the change behind a feature flag. Six weeks later, support tickets pile up. Search results "feel off." A retro is scheduled. Nobody can explain what regressed because nothing crashed and every dashboard is green.

The problem is not the model swap. The problem is that the vector store has no idea which vectors came from which model. There is no column for it. There is no migration table tracking which records have been backfilled. There is no alembic_version row, no schema_migrations table, no pg_dump of the previous state. The team treated an embedding upgrade like a config flip, and the vector store had no schema-level concept that would have stopped them.

Embedding migrations need the same artifact that database migrations have relied on for two decades: a per-record version tag, written into every vector, queried on every read, and used as the gating criterion for cutover and rollback. It is the single column most teams forget to add, and adding it later costs more than adding it up front.

The Reranker Is the Silent Second Model Your RAG Eval Never Measures

· 10 min read
Tian Pan
Software Engineer

A typical RAG pipeline ships with two models, not one. The retriever pulls 50 to 100 candidates from the vector store, and a reranker — a cross-encoder, an LLM-as-judge prompt, or a hybrid — re-scores those candidates and hands the top 5 to the answer model. Your eval suite measures end-to-end answer quality. It measures retriever recall@k. It does not measure the reranker. So when the reranker quietly drifts, the dashboard renders "answer quality dropped 4 points" with no causal arrow, and the team spends three days debugging a prompt that is not the problem.

The reranker is the silent second model. It sits between the retriever and the generator, it has its own scoring distribution, its own prompt (if it's LLM-based) or its own weights (if it's a cross-encoder), and it can regress independently of every other component. Most teams never grade it in isolation. The eval suite they wrote treats the pipeline like one model with a long context window, when it's actually two models in series with an interface neither team owns.

Retrieval Sprawl: When 'Just Add RAG' Becomes the Architectural Diversion

· 11 min read
Tian Pan
Software Engineer

The pattern is so familiar it's invisible. The model hallucinates a fact, so the team adds a retrieval step. Three weeks later, the model picks the wrong tool from a growing inventory, so they add a retrieval step on the tool catalog. The model's answers feel too generic, so they add a retrieval step on past good answers. A quarter passes, and the system is now a pile of retrievers gluing together a prompt that, fundamentally, still has the original problem.

What changed isn't the failure rate — it's the failure mode's name. "Model wrong" became "retrieval missed," which sounds more tractable but isn't. The eval suite scores higher because the retrieved context is, by construction, in-distribution for the test set. Production tells a different story, but by then the architecture has three retrieval layers, each with its own embedding model, index refresh cadence, and on-call rotation, and nobody wants to be the engineer who proposes ripping them out.

This is retrieval sprawl. It's an architectural diversion: a way of moving a hard problem (prompt design, model capability, ambiguous specifications) into a more comfortable problem (information retrieval engineering) without actually solving anything.

Your Vector Store Has Hot Keys: Why ANN Indexes Lie About Production Cost

· 10 min read
Tian Pan
Software Engineer

The vector index your team picked was benchmarked on a workload that doesn't exist in production. Every public ANN benchmark — VIBE, ann-benchmarks, the comparison table on the database vendor's landing page — runs queries sampled uniformly from the corpus, so every neighbor lookup costs roughly the same and every shard sees roughly equal load. Real retrieval traffic does not look like that. It looks Zipfian: a small fraction of queries (today's news, the trending product, the recurring support intent, the few hundred questions a customer support team gets all day) hits a small fraction of embeddings a hundred times more often than the median. The benchmark says HNSW recall is 0.97 at 50ms p99. Production says one shard is melting and the rest are bored.

The mismatch is not a tuning problem. It's that vector retrieval inherits the access-skew profile of every other database workload, and the indexes the field has standardized on were not designed with that profile in mind. The cache layer your KV store gets for free — the OS page cache warming up the rows you read most often, the LRU on a hot key — does not exist for ANN, because the graph is walked in graph order, not access order. The hot embeddings stay cold in memory because the search algorithm's traversal pattern looks random to the page cache, and your "popular" cluster lives on a single shard whose CPU runs hot while the rest of the fleet idles.

Embedding Migrations Are the New Schema Migrations

· 12 min read
Tian Pan
Software Engineer

The first time most teams swap an embedding model in production, they treat it as a batch job. Re-run the embedder, build a new index, swap the alias, deploy. Latency stays normal. Error rates stay zero. Every query returns results. And retrieval quality silently regresses for weeks before anyone notices, because the symptom is "users complain the answers feel off," not a red dashboard.

This is not a deployment problem. It is a schema migration that the team has decided to run blind. The old embedding space and the new one are different reference frames; the cosine geometry that used to mean "these two paragraphs are about the same topic" no longer means that with the same numerical confidence. Documents and queries that used to cluster together drift apart non-uniformly. Re-rankers trained on the old distribution start firing on examples that no longer match what they learned. The eval suite that scores green on pointwise relevance misses all of it, because no individual document moved very far while the entire graph rotated.

Treat the swap like a database migration and almost everything that goes wrong becomes preventable. Treat it like a batch job and the regressions arrive on a schedule that nobody owns.

The Knowledge Cutoff Is a UX Surface, Not a Footnote

· 12 min read
Tian Pan
Software Engineer

The model has a knowledge cutoff. The user does not know what it is. The product, in almost every case, does not tell them. And on the day the user asks a question whose right answer changed three months ago, the assistant gives a confidently-stated wrong one — not because the model failed, but because the product never gave it a way to flag the gap. The trust contract between your users and your assistant is implicit, asymmetric, and silently broken every time the world moves and your UX pretends it didn't.

The dominant pattern is to treat the cutoff as a footnote: a line of disclosure copy buried in a help center, a /about page no one reads, a one-time tooltip dismissed in week one. That framing is a bug. Knowledge cutoff is not a property of the model the way "context length" is. It is a UX surface — instrumented, designed, and evolved — and treating it as anything less ships a product that confabulates around its own ignorance in a register the user cannot audit.

Knowledge Graph Staleness Has a Different SLA Than Vector Staleness

· 10 min read
Tian Pan
Software Engineer

The vector index is wrong by approximately ten percent and nobody panics. The knowledge graph is wrong by one missing edge and somebody ships a wrong answer to a regulator. The two failure modes look identical from the data engineering org chart — both are "the index is stale" — and they sit behind the same change-data-capture pipeline with the same lag tolerance. The pipeline was sized for the vector workload because that was the louder consumer. The graph silently inherited those defaults, and the silence is the bug.

Vector retrieval and graph retrieval fail differently under staleness, and treating them as the same kind of lag problem is how you end up with a system that scores well on RAG benchmarks and is silently wrong on multi-hop queries — the silently-wrong case being, of course, the one users notice last. The fix is not faster pipelines. The fix is recognizing that "stale" means two different things, designing freshness tiers per edge class, and building the eval that catches the difference before a regulator does.

Long-Context vs RAG in 2026: Why It Is a Per-Feature Decision, Not an Architecture Religion

· 13 min read
Tian Pan
Software Engineer

The economics of long-context vs RAG have flipped twice in two years, and the team that picked an architecture in either of those windows is now paying the wrong tax everywhere. In 2024 the trend line said stuff everything in the context window because the windows kept growing and the per-token price kept falling, so retrieval pipelines were dismissed as legacy plumbing. In 2025 the consensus reversed: context rot research showed that the effective recall on million-token prompts collapsed in the middle of the window, latency on full-window calls turned into a UX problem, and the bills came back loud, so retrieval was rehabilitated. By 2026 the right answer is neither slogan. It is a per-feature decision, made at design time with a four-axis trade-off written down, because picking one architecture for the whole product is the cheap way to be wrong on every feature at once.

The mental model that keeps biting teams is treating long-context vs RAG as a roadmap commitment instead of a per-surface choice. You read one influential blog, you pick a side, you hire engineers who specialize in that side, you write a platform doc that codifies it, and now every new feature gets the same architecture regardless of whether it fits. The features that need fresh data live with stale context. The features that need scalable corpora pay for retrieval infrastructure they will never use. The features that need citation provenance ship without it. None of these are bugs. They are the predictable cost of treating a feature-level decision as a product-level one.

The Indexing Policy Committee Nobody Convened: RAG Corpus Governance Beyond the One-Time Migration

· 9 min read
Tian Pan
Software Engineer

Two years ago, a team pointed their retrieval index at the wiki, the Zendesk export, and a snapshot of the public docs. Last week, that same index returned a deprecated runbook that told an SRE to restart a service that no longer exists. The runbook had been deprecated for eighteen months. Nobody owned its retirement, so nobody retired it. The agent confidently cited it. The model wasn't wrong; the corpus was.

This is the failure mode that doesn't show up in retrieval evals: the corpus is treated as a one-time engineering decision when it's actually an ongoing governance problem. The team that scoped the initial ingestion is long gone. The legal review that should have flagged the customer-confidential PDFs never happened, because nobody told legal there was a pipeline. The "freshness strategy" is a Slack message from someone who left in Q3. The retrieval index has become a shared inbox for every document anyone ever scraped, and the bar for inclusion has drifted to "whatever was easy to ingest."

The RAG Read-After-Write Race: When Your Vector Index Cites a Document That No Longer Exists

· 10 min read
Tian Pan
Software Engineer

A user asks your assistant a question at 14:32:07. Your retriever fires at 14:32:08 and pulls back five chunks from the policy handbook. The model thinks for a few seconds, drafts a response, and at 14:32:12 streams back an answer that confidently cites section 4.3 — the section that an admin deleted at 14:32:10 because it was wrong. The user reads an authoritative quotation from a document that no longer exists, complete with a clickable link that returns 404.

Nothing in your stack errored. The retriever returned a valid hit. The model produced fluent, grounded prose. The citation pointed at a real chunk ID that was real when the retrieval happened. And yet the answer is, by every reasonable definition, a hallucination — not because the model made something up, but because the world changed underneath the pipeline between the moment it looked and the moment it spoke.

This is the RAG read-after-write race, and most production pipelines have no defense against it.