171 posts tagged with "rag"

Pipeline Attribution in Compound AI Systems: Finding the Weakest Link Before It Finds You

April 20, 2026 · 10 min read

Tian Pan

Software Engineer

Your retrieval precision went up. Your reranker scores improved. Your generator faithfulness metrics look better than last quarter. And yet your users are complaining that the system is getting worse.

This is one of the more disorienting failure modes in production AI engineering, and it happens more often than teams expect. When you build a compound AI system — one where retrieval feeds a reranker, which feeds a generator, which feeds a validator — you inherit a fundamental attribution problem. End-to-end quality is the only metric that actually matters, but it's the hardest one to act on. You can't fix "the system is worse." You need to fix a specific component. And in a four-stage pipeline, that turns out to be genuinely hard.

RAG Knowledge Base Freshness: The Staleness Problem Teams Solve Last

April 20, 2026 · 11 min read

Tian Pan

Software Engineer

Most RAG teams spend months tuning chunk sizes, experimenting with embedding models, and debating hybrid search configurations. Then they ship to production, declare success, and move on. Six months later, users start complaining that the system gives wrong answers — and the team discovers that the index they so carefully built has quietly rotted.

Index freshness is the problem that gets solved last, usually after a customer incident rather than before. Unlike retrieval quality failures that show up immediately in evals, staleness degrades silently: latency stays flat, retrieval appears functional, and standard RAG metrics like context recall and faithfulness score well, right up until the moment your system confidently returns the old version of a policy that was updated months ago.
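One way to make this kind of silent staleness visible is to compare when each document was indexed against when its source last changed. The following is a minimal sketch under stated assumptions: the `IndexedDoc` record and its `indexed_at` and `source_updated_at` fields are hypothetical names chosen for illustration, not something the post specifies.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass
class IndexedDoc:
    # Hypothetical record shape; the field names are illustrative assumptions.
    doc_id: str
    indexed_at: datetime         # when this version was embedded and indexed
    source_updated_at: datetime  # last-modified time reported by the source


def find_stale(docs: List[IndexedDoc]) -> List[str]:
    """Return IDs of documents whose source changed after they were indexed."""
    return [d.doc_id for d in docs if d.source_updated_at > d.indexed_at]


docs = [
    IndexedDoc(
        "refund-policy",
        indexed_at=datetime(2025, 10, 1, tzinfo=timezone.utc),
        source_updated_at=datetime(2026, 1, 15, tzinfo=timezone.utc),
    ),
    IndexedDoc(
        "onboarding-guide",
        indexed_at=datetime(2026, 2, 1, tzinfo=timezone.utc),
        source_updated_at=datetime(2025, 12, 1, tzinfo=timezone.utc),
    ),
]
stale_ids = find_stale(docs)
print(stale_ids)  # only the refund policy changed after it was indexed
```

A check like this could run on a schedule and feed a dashboard metric, so that staleness shows up before a customer reports it.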

RAG Position Bias: Why Chunk Order Changes Your Answers

April 20, 2026 · 8 min read

Tian Pan

Software Engineer

You've spent weeks tuning your embedding model. Your retrieval precision looks solid. Chunk size, overlap, metadata filters — all dialed in. And yet users keep reporting that the system "ignores" information it clearly has access to. The relevant passage is in the top-5 retrieved results every time. The model just doesn't seem to use it.

The culprit is often position bias: a systematic tendency for language models to over-rely on information at the beginning and end of their context window, while dramatically under-attending to content in the middle. In controlled experiments, moving a relevant passage from position 1 to position 10 in a 20-document context produces accuracy drops of 30–40 percentage points. Your retriever found the right content. The ordering killed it.
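The excerpt does not prescribe a fix here. One commonly used mitigation, shown only as an illustration, is to reorder retrieved chunks so the highest-ranked ones sit at the start and end of the context and the lowest-ranked ones fall in the middle. The function name `reorder_for_edges` is a hypothetical helper, and the sketch assumes the input is already sorted from most to least relevant.

```python
from typing import List


def reorder_for_edges(chunks_by_relevance: List[str]) -> List[str]:
    """Place the highest-ranked chunks at the start and end of the context,
    pushing the lowest-ranked ones toward the middle."""
    front: List[str] = []
    back: List[str] = []
    for i, chunk in enumerate(chunks_by_relevance):
        # Even ranks fill the front in order; odd ranks fill the back.
        (front if i % 2 == 0 else back).append(chunk)
    # Reverse the back half so the second-best chunk ends up last.
    return front + back[::-1]


ranked = ["c1", "c2", "c3", "c4", "c5"]  # c1 is the most relevant
ordered = reorder_for_edges(ranked)
print(ordered)  # ['c1', 'c3', 'c5', 'c4', 'c2']
```

Whether this helps depends on the model and the context length, so it is worth measuring against your own evals rather than assuming.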

Testing the Retrieval-Generation Seam: The Integration Test Gap in RAG Systems

April 20, 2026 · 11 min read

Tian Pan

Software Engineer

Your retriever returns the right documents 94% of the time. Your LLM correctly answers questions given good context 96% of the time. Ship it. What could go wrong?

Multiply those numbers: 0.94 × 0.96 ≈ 0.90. Assuming the two stages fail independently, you've lost roughly 10% of your queries before accounting for any edge cases, prompt formatting issues, token truncation, or the distractor documents your retriever surfaces alongside the correct ones. But the deeper problem isn't the arithmetic: your unit tests will never catch this. The retriever passes its tests in isolation. The generator passes its tests in isolation. The thing that fails is the composition, and most teams have no tests for that.

This is the retrieval-generation seam: the interface between what your retriever hands off and what your generator can actually use. It's the most under-tested boundary in production RAG systems, and it's where most failures originate.
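The compounding arithmetic above can be sketched in a few lines. This assumes stage failures are independent, which is a simplification; the helper name `end_to_end_success` is hypothetical.

```python
import math
from typing import List


def end_to_end_success(stage_rates: List[float]) -> float:
    """Estimate pipeline success as the product of per-stage success rates.

    Assumes each stage fails independently of the others.
    """
    return math.prod(stage_rates)


rate = end_to_end_success([0.94, 0.96])
print(f"{rate:.4f}")  # 0.9024
```

The same function makes it easy to see how quickly the estimate drops as more stages are added to the pipeline.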

The Reranker Gap: Why Most RAG Pipelines Skip the Most Important Layer

April 20, 2026 · 8 min read

Tian Pan

Software Engineer

Most RAG pipelines have an invisible accuracy ceiling, and the engineers who built them don't know it's there. You tune your chunking strategy, upgrade your embedding model, swap vector databases — and the system still returns plausible but subtly wrong documents for a stubborn class of queries. The retrieval looks reasonable. The LLM sounds confident. But downstream accuracy has quietly plateaued at a level that no amount of prompt engineering will break through.

The gap almost always traces to the same missing piece: a reranker. Specifically, the absence of a cross-encoder in a second retrieval stage. It's the layer that's technically optional, practically expensive to skip, and systematically omitted from the canonical "embed, index, query" tutorials that most RAG pipelines are built from.

Temporal Context Injection: Making LLMs Actually Know What Day It Is

April 20, 2026 · 11 min read

Tian Pan

Software Engineer

Your LLM-powered feature shipped. Users are asking it questions that involve time — "what's the latest policy?" "summarize what happened this week" "is this information current?" — and it answers confidently, fluently, and incorrectly.

The model doesn't know what day it is. It never did. The chat interface you're used to made that easy to forget, because those interfaces quietly inject the current date behind the scenes. Your API integration doesn't. You're shipping a system that reasons about time without knowing where it is in time — and that's a bug class that will show up in production before you think to look for it.
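A minimal sketch of the injection step, assuming you control the system prompt sent with each API call. The helper name `with_current_date` and the exact wording of the date line are illustrative assumptions, not a prescribed format.

```python
from datetime import datetime, timezone
from typing import Optional


def with_current_date(system_prompt: str, now: Optional[datetime] = None) -> str:
    """Prepend an explicit UTC date line so the model has a temporal anchor."""
    now = now or datetime.now(timezone.utc)
    date_line = "Current date (UTC): " + now.date().isoformat()
    return date_line + "\n\n" + system_prompt


# Passing `now` explicitly keeps the function deterministic in tests.
prompt = with_current_date(
    "You answer questions about company policy.",
    now=datetime(2026, 4, 20, tzinfo=timezone.utc),
)
print(prompt)
```

Computing the date at request time, rather than baking it into a stored prompt template, avoids the date itself going stale.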

Tool Output Compression: The Injection Decision That Shapes Context Quality

April 20, 2026 · 10 min read

Tian Pan

Software Engineer

Your agent calls a database tool. The query returns 8,000 tokens of raw JSON — nested objects, null fields, pagination metadata, and a timestamp on every row. Your agent needs three fields from that response. You just paid for 7,900 tokens of noise, and you injected all of them into context where they'll compete for attention against the actual task.

This is the tool output injection problem, and it's the most underrated architectural decision in agent design. Most teams discover it the hard way: the demo works, production degrades, and nobody can explain why the model started hedging answers it used to answer confidently.
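One simple form of compression is to project the tool response down to the fields the agent needs before it enters context. The sketch below is illustrative only: the response shape, the field names, and the `project_rows` helper are all assumptions made for the example.

```python
import json
from typing import Any, Dict, List


def project_rows(rows: List[Dict[str, Any]], keep: List[str]) -> List[Dict[str, Any]]:
    """Keep only the requested fields, dropping nulls and everything else."""
    return [
        {k: row[k] for k in keep if row.get(k) is not None}
        for row in rows
    ]


# Hypothetical raw tool response with nested objects, nulls, and pagination.
raw = {
    "rows": [
        {
            "id": 7,
            "name": "Ada",
            "status": "active",
            "notes": None,
            "updated_at": "2026-04-01T12:00:00Z",
            "meta": {"shard": 3},
        },
    ],
    "pagination": {"next_cursor": None, "page_size": 100},
}

compact = project_rows(raw["rows"], keep=["id", "name", "status"])
# Compact separators trim whitespace that would otherwise cost tokens.
tool_message = json.dumps(compact, separators=(",", ":"))
print(tool_message)  # [{"id":7,"name":"Ada","status":"active"}]
```

The design choice is where the field list comes from: a static allowlist per tool is the simplest option, while letting the agent request fields is more flexible but adds a round trip.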

Upstream Data Quality Is Your AI Agent's Real Bottleneck

April 20, 2026 · 9 min read

Tian Pan

Software Engineer

A team spent three months tuning prompts for their knowledge agent. They tried GPT-4, then Claude, then a fine-tuned model. They rewrote the system prompt six times. They hired a prompt engineer. The agent kept hallucinating — confidently, fluently, and wrong. The actual problem turned out to be a Confluence export from 2023 sitting in the vector store alongside a Slack archive full of contradictory, casual half-opinions about the same topics. The model was doing exactly what it was supposed to do: synthesizing the information it was given. The information was garbage.

Over 60% of AI project failures in production trace to data quality, context problems, or governance failures — not model limitations. Yet when agents misbehave, the first instinct is almost always to touch the prompt. The second instinct is to switch models. The third might be to add a reranker. The upstream database that feeds the whole pipeline rarely makes the troubleshooting list until months of work have been wasted.

When Your AI Feature Ages Out: Knowledge Cutoffs and Temporal Grounding in Production

April 19, 2026 · 10 min read

Tian Pan

Software Engineer

Your AI feature shipped in Q3. Evals looked good. Users were happy. Six months later, satisfaction scores have dropped 18 points, but your dashboards still show 99.9% uptime and sub-200ms latency. Nothing looks broken. Nothing is broken — in the traditional sense. The model is responding. The infrastructure is healthy. The feature is just quietly wrong.

This is what temporal decay looks like in production AI systems. It doesn't announce itself with errors. It accumulates as a gap between what the model knows and what the world has become — and by the time your support queue reflects it, the damage has been running for months.

Compound AI Systems: When Your Pipeline Is Smarter Than Any Single Model

April 19, 2026 · 9 min read

Tian Pan

Software Engineer

There is a persistent assumption in AI engineering that the path to better outputs is a better model. Bigger context window, fresher training data, higher benchmark scores. In practice, the teams shipping the most capable AI products are usually doing something different: they are assembling pipelines where multiple specialized components — a retriever, a reranker, a classifier, a code interpreter, and one or more language models — cooperate to handle a task that no single model could do reliably on its own.

This architectural pattern has a name — compound AI systems — and it is now the dominant paradigm for production AI. Understanding how to build these systems correctly, and where they fail when you don't, is one of the most important skills in applied AI engineering today.

Corpus Architecture for RAG: The Indexing Decisions That Determine Quality Before Retrieval Starts

April 19, 2026 · 12 min read

Tian Pan

Software Engineer

When a RAG system returns the wrong answer, the post-mortem almost always focuses on the same suspects: the retrieval query, the similarity threshold, the reranker, the prompt. Teams spend days tuning these components while the actual cause sits untouched in the indexing pipeline. The failure happened weeks ago when someone decided on a chunk size.

Most RAG quality problems are architectural, not operational. They stem from decisions made at index time that silently shape what the LLM will ever be allowed to see. By the time a user complains, the retrieval system is doing exactly what it was designed to do — it's just that the design was wrong.

Cross-Encoder Reranking in Practice: What Cosine Similarity Misses

April 19, 2026 · 10 min read

Tian Pan

Software Engineer

Your RAG pipeline retrieves the top 10 documents and your LLM still gives a wrong answer. You increase the retrieval count to 50. Still wrong. The frustrating part: the correct document was in your vector store the whole time—it was just ranked 23rd. This is not a recall problem. It's a ranking problem, and cosine similarity is the culprit.

Vector search does a decent job of finding semantically adjacent content. But "semantically adjacent" and "most useful for this specific query" are not the same thing. Cosine similarity measures the angle between two vectors in embedding space, and that angle only captures a coarse notion of topical proximity. What it cannot capture is the fine-grained interaction between the specific words in your query and the specific words in a document—the difference between "how to prevent buffer overflows" and "buffer overflow exploit techniques" is subtle at the vector level but critical for your retrieval system.
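For reference, the angle-based score described above is the dot product of two vectors divided by the product of their lengths. The sketch below uses toy three-dimensional vectors as stand-ins for real embeddings, which typically have hundreds of dimensions.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings" chosen by hand for illustration.
query = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]  # same angle, different magnitude
orthogonal = [0.0, 1.0, 0.0]      # no shared direction

sim_same = cosine_similarity(query, same_direction)
sim_orth = cosine_similarity(query, orthogonal)
print(round(sim_same, 6), sim_orth)
```

Because the score collapses each text into a single direction, two documents with opposite intent but shared vocabulary can land at nearly the same angle from the query.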
