141 posts tagged with "rag"

The Feature Store Pattern for LLM Applications: Stop Retrieving What You Could Precompute

· 10 min read
Tian Pan
Software Engineer

Most teams building LLM applications eventually converge on the same ad-hoc architecture: a scatter of cron jobs computing user summaries, a vector database queried fresh on every request, a Redis cache added when latency got embarrassing, and three different codebases that all define "user preference" slightly differently. Only later, usually after a production incident, do they recognize what they built: a feature store — a bad one, assembled accidentally.

The feature store is one of the most battle-tested patterns in traditional ML infrastructure. Applied deliberately to LLM context assembly, it eliminates the latency, cost, and consistency problems that plague most retrieval pipelines. This post explains how.

The Multilingual RAG Retrieval Gap: Why Cross-Lingual Queries Silently Fail Your Vector Search

· 11 min read
Tian Pan
Software Engineer

A team builds a RAG system. English retrieval hits 94% recall. They ship. Three months later, support tickets from French and German users pile up — the chatbot keeps returning irrelevant results or nothing at all. The engineers look at their monitoring dashboard. Overall recall: 91%. Nothing looks broken.

The corpus is English. The embedding model is English-only. The users are not. Every French query gets embedded into a vector space that was never designed to share coordinates with the English documents it's searching against. The cosine similarities aren't bad — they're geometrically meaningless. And because aggregate metrics aggregate, the problem is invisible until users complain loudly enough.

This is the multilingual RAG retrieval gap, and it's one of the most common silent failure modes in production AI systems serving non-English audiences.
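To make the "aggregate metrics hide it" point concrete, here is a minimal sketch that breaks recall down by query language. The evaluation log, its 90/10 language split, and the function names are hypothetical and chosen only for illustration; they are not taken from the post.

```python
from collections import defaultdict


def overall_recall(results):
    """Fraction of queries whose relevant document was retrieved."""
    results = list(results)
    return sum(int(hit) for _, hit in results) / len(results)


def recall_by_language(results):
    """Same metric, split by the language of the query.

    `results` is an iterable of (language, hit) pairs, where `hit` is True
    when the relevant document appeared in the retrieved set.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for language, hit in results:
        totals[language] += 1
        hits[language] += int(hit)
    return {language: hits[language] / totals[language] for language in totals}


# Hypothetical evaluation log: 90 English queries and 10 French queries.
eval_log = (
    [("en", True)] * 87 + [("en", False)] * 3
    + [("fr", True)] * 2 + [("fr", False)] * 8
)

aggregate = overall_recall(eval_log)        # 0.89: looks healthy
per_language = recall_by_language(eval_log)  # "fr" is 0.2: clearly broken
```

With a traffic mix like this one, the aggregate stays high while the minority language is failing, so alerting on the per-language numbers is what surfaces the gap.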

When RAG Makes Your AI Worse: The Creativity-Grounding Tradeoff

· 8 min read
Tian Pan
Software Engineer

A team at a product company built a brainstorming assistant for their marketing department. They added RAG over their document corpus — campaign briefs, brand guidelines, competitor analyses — figuring the richer context would produce better ideas. Usage dropped within three weeks. The qualitative feedback: outputs felt "too safe," "too predictable," "like it just remixed our existing stuff." They removed retrieval from the brainstorming feature. Ideas improved. Engagement recovered.

This pattern repeats more often than practitioners admit. Retrieval-augmented generation has become the default architecture for grounding LLM outputs in facts, and for factual tasks it earns that default. But for generative tasks — ideation, creative writing, novel solution generation — adding a retrieval layer can silently cap the ceiling of what your model produces. Not because retrieval is broken, but because it's working exactly as designed.

Reranking Is the Real Work: Why Your Retrieval System's Bottleneck Is Never the Index

· 10 min read
Tian Pan
Software Engineer

Teams building RAG systems almost universally hit the same wall: they spend a week tuning their HNSW index parameters, add product quantization, push recall@100 from 0.81 to 0.87 — and then watch LLM output quality barely budge. The assumption baked into months of effort is that a better index equals better answers. It doesn't. The bottleneck was never the index.

The actual chokepoint is the ranking step between your candidate set and your context window. What you put into the LLM determines what comes out, and the job of ranking is to ensure that the most genuinely relevant documents, not just the most semantically similar ones, make it through. That distinction matters more than any HNSW configuration you'll ever tune.
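A rough sketch of where that ranking step sits: the index returns candidates ordered by similarity, and a separate relevance scorer decides which of them reach the context window. The `keyword_overlap` scorer below is a deliberately crude, hypothetical stand-in for whatever reranker a team actually uses (for example, a cross-encoder), and the candidate documents and similarity scores are invented.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[Tuple[str, float]],
    relevance: Callable[[str, str], float],
    top_n: int,
) -> List[str]:
    """Reorder index candidates by a relevance score, ignoring index similarity."""
    scored = [(relevance(query, doc), doc) for doc, _similarity in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _score, doc in scored[:top_n]]


def keyword_overlap(query: str, doc: str) -> float:
    """Toy relevance signal: share of query words that appear in the document."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words) if query_words else 0.0


# Hypothetical candidates as (document, index similarity), best similarity first.
candidates = [
    ("How to change your account avatar", 0.83),
    ("Admin account password reset steps", 0.79),
    ("Billing account overview", 0.75),
]

context = rerank("reset password for admin account", candidates, keyword_overlap, top_n=2)
```

The document that ranked second by similarity moves to the front once relevance to the actual question is scored, which is the distinction between "similar" and "relevant" that the paragraph draws.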

Tool Output Schema Design: How Your Tool Responses Shape Agent Reasoning

· 9 min read
Tian Pan
Software Engineer

Most teams designing LLM agents spend considerable effort on tool selection and system prompt wording. Almost none of them think carefully about what their tools return. That's a mistake with compounding consequences — because the shape of a tool response determines how well the agent can reason about it, how much context window it consumes, and how often it hallucinates an interpretation the tool never intended.

Tool output schema design is infrastructure, not plumbing. Get it wrong and your agent will fail in ways that look like reasoning problems when they're actually schema problems.

Vector DB Sharding: Why HNSW Breaks at Partition Boundaries and What to Do About It

· 9 min read
Tian Pan
Software Engineer

Most vector database tutorials show you how to insert a million embeddings and run a query. What they don't show you is what happens six months later, when your corpus has grown past what a single node can hold, and you're trying to shard the HNSW index your entire retrieval pipeline depends on. The answer, which vendors leave out of the marketing copy, is that HNSW graphs resist partitioning in ways that cause silent recall degradation — and the operational patterns needed to recover that quality add real complexity.

This post covers the technical reasons HNSW sharding breaks down, what recall loss looks like in practice, and the operational patterns teams use to maintain retrieval accuracy when they've outgrown a single node.
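One commonly used pattern for querying a sharded index is scatter-gather with over-fetching: ask every shard for more than k results, then merge globally. The sketch below assumes that pattern purely for illustration; it is not necessarily one of the patterns the post describes, and `shard_search` uses exact search as a stand-in for a per-shard HNSW query.

```python
import heapq
import math


def shard_search(shard, query, k):
    """Stand-in for a per-shard ANN query: exact search over (doc_id, vector) pairs."""
    scored = [(math.dist(query, vector), doc_id) for doc_id, vector in shard]
    return heapq.nsmallest(k, scored)


def scatter_gather(shards, query, k, overfetch=2):
    """Query every shard for k * overfetch results, then keep the global top k."""
    partials = []
    for shard in shards:
        partials.extend(shard_search(shard, query, k * overfetch))
    return [doc_id for _distance, doc_id in heapq.nsmallest(k, partials)]


# Hypothetical two-shard corpus of 2-D vectors.
shards = [
    [("a", (0.0, 0.0)), ("b", (5.0, 5.0))],
    [("c", (0.1, 0.0)), ("d", (0.2, 0.0))],
]

nearest = scatter_gather(shards, query=(0.0, 0.0), k=2)
```

Over-fetching costs extra work per shard, but it gives the merge step room to recover neighbors that an approximate per-shard search would otherwise rank just outside its own top k.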

Knowledge Age Routing: Matching Queries to the Right Temporal Layer in Production AI

· 9 min read
Tian Pan
Software Engineer

Here is a scenario that surfaces in production more often than anyone likes to admit. A user asks your AI assistant what the current interest rate policy is. Your RAG system fetches a highly relevant Federal Reserve document (it scores 0.91 on semantic similarity), and the model confidently returns an answer. The answer is six months out of date. The RAG index was last refreshed in October. The parametric knowledge is older still. A live API call would have returned the correct current figure in 400 milliseconds, but nobody wired up the routing logic to ask: how old is this question's answer allowed to be?

That failure is not a retrieval failure. It is a temporal routing failure. The system had access to correct information somewhere in its stack. It just sent the query to the wrong layer.
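The routing question ("how old is this answer allowed to be?") can be sketched as a lookup against the age of each knowledge layer. The layer names, their ages, and the cheapest-first ordering below are assumptions made for illustration, not values from the post.

```python
from datetime import timedelta

# Hypothetical layers, ordered cheapest first, with how stale each one may be.
LAYERS = [
    ("parametric", timedelta(days=540)),
    ("rag_index", timedelta(days=180)),
    ("live_api", timedelta(seconds=0)),
]


def route(max_staleness: timedelta) -> str:
    """Return the cheapest layer whose data is fresh enough for the query."""
    for name, age in LAYERS:
        if age <= max_staleness:
            return name
    # Nothing is fresh enough on paper, so fall back to the freshest layer.
    return LAYERS[-1][0]


interest_rate_query = route(timedelta(days=1))    # "live_api"
policy_history_query = route(timedelta(days=365))  # "rag_index"
```

In this sketch the staleness budget is passed in directly; in a real system it would have to be inferred from the query, which is the harder part of the problem.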

Permission-Aware Retrieval: Why Access Control in Enterprise RAG Must Live in the Vector Layer

· 9 min read
Tian Pan
Software Engineer

Here is a failure mode that shows up in nearly every enterprise RAG deployment: an employee asks the internal AI assistant a question about compensation policy. The system returns correct, specific information — pulled from an HR document the employee was never supposed to see. No one gets fired for it immediately because no one is watching the retrieval layer. But the confidential document was indexed, the user's query hit it semantically, and the model faithfully reported what it found.

The mistake isn't unusual. It's the default outcome when teams apply public-web RAG patterns to private organizational knowledge without adapting the architecture. Web RAG has no access control layer because public web content has none. Enterprise data does — and that constraint changes the entire system design.
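A minimal sketch of what access control in the retrieval layer can look like: each indexed chunk carries the groups allowed to read it, and the filter runs before similarity ranking, so a forbidden chunk can never be handed to the model. The `Chunk` type, the group names, and the in-memory index are hypothetical; a production vector database would apply the equivalent filter inside its own query.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence


@dataclass(frozen=True)
class Chunk:
    text: str
    vector: Sequence[float]
    allowed_groups: FrozenSet[str]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def search(index: List[Chunk], query_vector, user_groups: FrozenSet[str], k: int):
    """Filter by permission first, then rank only what the user may see."""
    visible = [chunk for chunk in index if chunk.allowed_groups & user_groups]
    visible.sort(key=lambda chunk: dot(chunk.vector, query_vector), reverse=True)
    return visible[:k]


# Hypothetical index: the HR-only chunk is the closest match to the query.
index = [
    Chunk("Executive compensation bands", (1.0, 0.0), frozenset({"hr"})),
    Chunk("Public holiday calendar", (0.6, 0.4), frozenset({"hr", "staff"})),
]

staff_results = search(index, (1.0, 0.0), frozenset({"staff"}), k=5)
hr_results = search(index, (1.0, 0.0), frozenset({"hr"}), k=5)
```

Filtering after generation, or relying on the prompt to withhold content, leaves the confidential text in the context window; filtering before ranking does not.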

The Summarization Validity Problem: How to Know Your AI Compressed Away What Mattered

· 10 min read
Tian Pan
Software Engineer

Summarization fails silently. Your system doesn't crash, logs don't flag an error, and the generated text looks coherent—but somewhere in the compression, the one fact that mattered for the downstream task got dropped. The RAG pipeline returns a confident answer. The multi-hop reasoner reaches a conclusion. The customer service agent gives advice. All of it grounded in a summary that no longer contains the original constraint, exception, or data point the answer depended on.

This is the summarization validity problem: the gap between a summary that is consistent with its source and a summary that preserves what the downstream task needs. Most teams don't instrument for it. They ship pipelines that validate that summaries exist, not that they are complete.
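One way to instrument for completeness rather than mere existence is to list the facts the downstream task depends on and check that each one survived compression. The sketch below uses case-insensitive substring matching, which is a crude assumption made for illustration; the refund policy text and the required-facts list are invented.

```python
def missing_facts(summary: str, required_facts):
    """Return the required facts that no longer appear in the summary."""
    lowered = summary.lower()
    return [fact for fact in required_facts if fact.lower() not in lowered]


# Hypothetical source and summary: the exception was compressed away.
source = "Refunds are available within 30 days, except for clearance items."
summary = "Refunds are available within 30 days."
required = ["30 days", "clearance items"]

dropped = missing_facts(summary, required)
```

The summary here is fully consistent with its source and would pass a faithfulness check, yet it has lost the exception the answer depends on, which is the gap the paragraph describes.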

The Preprocessing Bottleneck That Kills AI Pipeline Throughput

· 10 min read
Tian Pan
Software Engineer

A team builds a RAG-backed feature, measures end-to-end latency, finds it unacceptably slow, and immediately starts optimizing the model call. They try a smaller model, batch requests, tune temperature and token limits. After two sprints of work, latency drops by 15%. The feature is still too slow. What they never measured: the 600ms they're spending chunking text and generating embeddings before the LLM ever receives a prompt.

This pattern is common enough that it has a name in distributed systems: optimizing the wrong component. In AI pipelines, the LLM call is visible and easy to measure. Everything before it is invisible until you explicitly instrument it — and that's exactly where throughput dies.
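Making the pre-LLM stages visible can be as simple as wrapping each one in a timer. The sketch below uses only the standard library; the stage names and the placeholder chunking, embedding, and model steps are hypothetical stand-ins for a real pipeline.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(stage: str, sink: dict):
    """Accumulate wall-clock seconds spent in a pipeline stage into `sink`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        sink[stage] = sink.get(stage, 0.0) + time.perf_counter() - start


def run_pipeline(document: str, sink: dict) -> str:
    with timed("chunk", sink):
        chunks = [document[i:i + 20] for i in range(0, len(document), 20)]
    with timed("embed", sink):
        vectors = [[float(len(chunk))] for chunk in chunks]  # placeholder embedding
    with timed("llm_call", sink):
        answer = f"{len(vectors)} chunks processed"  # placeholder model call
    return answer


timings = {}
answer = run_pipeline("x" * 50, timings)
slowest_stage = max(timings, key=timings.get)
```

Once every stage reports its own duration, the optimization target is whichever stage dominates the total, not whichever one is easiest to measure.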

The 'What Changed' Query Is the RAG Question Your Index Can't Answer

· 10 min read
Tian Pan
Software Engineer

A user asks your assistant, "what changed about our refund policy this quarter?" The system returns a confident, well-formatted summary of the current refund policy. The user nods, closes the chat, and acts on information that has nothing to do with the question they asked. Nothing in your eval suite caught this. Nothing in your faithfulness metric flagged it. The retrieval looked perfect: it returned highly relevant chunks. The synthesis looked perfect: it cited every chunk it used. The only problem is that the question was about change, and your index has no concept of change.

This is the failure mode that vector-similarity retrieval cannot fix by tuning. Two versions of the same document have nearly identical embeddings. That is what good embeddings do: they collapse semantically equivalent text into the same neighborhood. So when you ask "what changed," the retriever returns one of the versions, the LLM summarizes that version, and the answer silently implies that nothing changed. The user cannot tell. Your eval set probably cannot tell either, because your eval set is built around "what is X" questions, not "what's different about X now."
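Answering a change question needs both versions side by side rather than one retrieved chunk. As a minimal sketch, assuming the system keeps prior versions of a document (the policy text below is invented), a plain line diff from the standard library already surfaces what similarity search hides:

```python
import difflib


def changed_lines(old: str, new: str):
    """Return only the removed ("- ") and added ("+ ") lines between two versions."""
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Hypothetical policy versions that would embed almost identically.
last_quarter = "Refunds within 30 days.\nOriginal receipt required."
this_quarter = "Refunds within 14 days.\nOriginal receipt required."

changes = changed_lines(last_quarter, this_quarter)
```

The diff output, not either version on its own, is what belongs in the context window when the question is about what is different.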

Your Embedding Model Choice Sets the Ceiling Your LLM Can't Raise

· 11 min read
Tian Pan
Software Engineer

A team I was advising had spent two months swapping LLMs in their RAG pipeline. Claude, GPT, Gemini, then back again. Each swap shaved a few percentage points off hallucination rate but never moved the needle on the metric that mattered: their support agents still couldn't find the right knowledge base article more than 60% of the time. They were tuning the wrong layer. The retriever was returning irrelevant chunks, and no amount of LLM cleverness can answer a question from documents the retriever never surfaced.

The embedding model is the part of a RAG system that decides what the LLM is even allowed to see. It draws the geometry of your corpus — which documents land near which queries in vector space. Once that geometry is wrong, the LLM is just a confident narrator of bad context. Swapping it for a smarter one usually makes the answers more articulate, not more correct.