141 posts tagged with "rag"

The Retrieval Emptiness Problem: Why Your RAG Refuses to Say 'I Don't Know'

· 10 min read
Tian Pan
Software Engineer

Ask a production RAG system a question your corpus cannot answer and watch what happens. It rarely says "I don't have that information." Instead, it retrieves the five highest-ranked chunks — which, having nothing better to match, are the five least-bad chunks of unrelated content — and hands them to the model with a prompt that reads something like "answer the user's question using the context below." The model, trained to be helpful and now holding text that sort of resembles the topic, produces a confident answer. The answer is wrong in a way that's architecturally invisible: the retrieval succeeded, the generation succeeded, every span was grounded in a retrieved document, and the user walked away misled.

This is the retrieval emptiness problem. It isn't a bug in any single layer. It's the emergent behavior of a pipeline that treats "top-k" as a contract and never asks whether the top-k is any good. Research published at ICLR 2025 on "sufficient context" quantified the effect: with no retrieved context at all, Gemma answers incorrectly on factual QA about 10% of the time. When it receives insufficient context (retrieved documents that don't actually contain the answer), that rate jumps to 66%. Adding retrieved documents that cannot answer the query makes the model more confidently wrong, not less.
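
One way to stop treating top-k as a contract is to let retrieval return nothing. The sketch below is a minimal illustration, not the method from the paper: `Chunk`, `select_context`, `build_prompt`, and the `0.75` cutoff are all hypothetical, and a raw similarity threshold is only a crude proxy for whether the context is actually sufficient.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    text: str
    score: float  # retriever similarity score, higher is better (assumed scale 0..1)


def select_context(chunks: list[Chunk], min_score: float = 0.75, k: int = 5) -> list[Chunk]:
    """Return up to k chunks that clear min_score. The result may be empty."""
    ranked = sorted(chunks, key=lambda c: c.score, reverse=True)
    return [c for c in ranked[:k] if c.score >= min_score]


def build_prompt(question: str, chunks: list[Chunk]) -> Optional[str]:
    """Return a grounded prompt, or None to signal that the system should abstain."""
    context = select_context(chunks)
    if not context:
        # Nothing cleared the bar: the caller should answer "I don't have that information".
        return None
    joined = "\n\n".join(c.text for c in context)
    return f"Answer using only the context below.\n\n{joined}\n\nQuestion: {question}"
```

The design point is the `None` branch: the pipeline gets an explicit "retrieval was empty" signal instead of five least-bad chunks.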

Document Injection: The Prompt Injection Vector Inside Every RAG Pipeline

· 10 min read
Tian Pan
Software Engineer

Most RAG security discussions focus on the generation layer — jailbreaks, system prompt leakage, output filtering. Practitioners spend weeks tuning guardrails on the model side while overlooking the ingestion pipeline that feeds it. The uncomfortable reality: every document your pipeline ingests is a potential instruction surface. A single PDF can override your system prompt, exfiltrate user data, or manipulate decisions without your logging infrastructure seeing anything unusual.

This isn't theoretical. Microsoft 365 Copilot, Slack AI, and commercial HR screening tools have all been exploited through this vector in the past two years. The same attack pattern appeared in 18 academic papers on arXiv, where researchers embedded hidden prompts to bias AI peer review systems in their favor.

Stale Retrieval: The Data Quality Problem Your RAG Pipeline Is Hiding

· 10 min read
Tian Pan
Software Engineer

Your RAG system is lying to you about the past. When a user asks about current pricing, active security policies, or a feature that shipped last quarter, the retrieval pipeline returns the most semantically similar document in the index — not the most recent one. An 18-month-old pricing page and this morning's update look identical to cosine similarity. Nothing in the standard RAG stack has any concept of whether the retrieved document is still true.

This is stale retrieval, and it fails differently than hallucination. The model isn't inventing anything. It accurately summarizes real content that once existed. Standard evaluation metrics — faithfulness, groundedness, context precision — all pass. The system is confidently correct about a fact that stopped being correct months ago.
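
To make the gap concrete, here is a small sketch of one possible mitigation: decaying the similarity score by document age. This is an assumption for illustration, not a technique named in the post. The function name, the `updated_at` timestamp, and the 90-day half-life are hypothetical, and a real system would likely want per-document-type freshness rules.

```python
from datetime import datetime, timedelta, timezone


def recency_weighted_score(
    similarity: float,
    updated_at: datetime,
    now: datetime,
    half_life_days: float = 90.0,
) -> float:
    """Halve a chunk's effective score for every half_life_days of age."""
    age_days = max((now - updated_at).total_seconds() / 86400.0, 0.0)
    decay = 0.5 ** (age_days / half_life_days)
    return similarity * decay


# An 18-month-old page with slightly higher similarity vs. a page updated today.
now = datetime(2025, 6, 1, tzinfo=timezone.utc)
old_page = recency_weighted_score(0.82, now - timedelta(days=540), now)
new_page = recency_weighted_score(0.80, now, now)
```

With plain cosine similarity the old page wins; with the decay applied, `new_page` ranks above `old_page`.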

Corpus Curation at Scale: Why Your RAG Quality Ceiling Is Your Document Quality Floor

· 10 min read
Tian Pan
Software Engineer

There's a belief embedded in most RAG architectures that goes something like this: if retrieval returns the right chunks, the LLM will produce correct answers. Teams invest heavily in embedding model selection, hybrid retrieval strategies, and reranking pipelines. Then, three months after deploying to production, answer quality quietly degrades — not because the model changed, not because query patterns shifted dramatically, but because the underlying corpus rotted.

Enterprise RAG implementations fail at a roughly 40% rate, and the failure mode that practitioners underestimate most isn't hallucination or poor retrieval recall. It's document quality. In one reported case, introducing document quality scoring improved search accuracy from 62% to 89%, with no changes to the embedding model or retrieval algorithm. The corpus was the variable. The corpus was always the variable.

Data Provenance for AI Systems: Why Tracking Answer Origins Is Now an Engineering Requirement

· 10 min read
Tian Pan
Software Engineer

A production LLM answers a user's question incorrectly. A support ticket arrives. You pull the logs. They show the prompt, the completion, and the latency — but nothing about which documents the retrieval system surfaced, which chunks landed in the context window, or which passage the model leaned on most heavily when it synthesized the answer. You're left doing archaeology: re-running the query against a corpus that has since been updated, hoping the same results come back, wondering if the bug is in retrieval, in chunking, in the document itself, or in the model's reasoning.

This is the data provenance gap, and most AI teams don't notice it until they're already in it.
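
A rough sketch of what closing the gap could look like: logging, per request, which chunks retrieval surfaced and which document version each came from. The record shape and every field name here (`request_id`, `doc_version`, `chunk_id`) are hypothetical, chosen only to illustrate the idea.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RetrievedChunk:
    doc_id: str
    doc_version: str  # lets you replay against the corpus as it was, not as it is now
    chunk_id: str
    score: float


@dataclass
class ProvenanceRecord:
    request_id: str
    query: str
    chunks: list[RetrievedChunk] = field(default_factory=list)

    def to_log_line(self) -> str:
        """Serialize to one JSON line to store alongside the prompt and completion."""
        return json.dumps(asdict(self), sort_keys=True)
```

With a line like this next to each completion, the support-ticket investigation starts from the exact chunks the model saw rather than from a re-run against an updated corpus.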

The Three Clocks Problem: Why Your AI System Is Living in Three Different Timelines

· 9 min read
Tian Pan
Software Engineer

Your AI system is confidently answering questions about a world that no longer exists. Not because the model is broken, not because retrieval failed, but because three independent clocks are ticking at different rates inside every production AI application — and nobody synchronized them.

This is the three clocks problem: wall clock, model clock, and data clock each operate on their own timeline. When they diverge, you get a system that's technically functioning but substantively wrong in ways that no error log will ever catch.

Database-Native AI: When Your Postgres Learns to Embed

· 7 min read
Tian Pan
Software Engineer

Most RAG architectures look the same: your application reads from Postgres, ships the text to an embedding API, writes vectors to Pinecone or Weaviate, and queries both systems at read time. You maintain two data stores, two consistency models, two backup strategies, and a synchronization pipeline that is always one edge case away from letting your vector index drift weeks behind your source of truth.

What if the database just did it all? That is no longer a hypothetical. PostgreSQL extensions like pgvector, pgai, and pgvectorscale — along with managed offerings like AlloyDB AI — are collapsing the entire embedding-and-retrieval stack into the database itself. The result is not just fewer moving parts. It is a fundamentally different operational model where your vectors are always transactionally consistent with the data they represent.

Knowledge Graphs Are Back: Why RAG Teams Are Adding Structure to Their Retrieval

· 8 min read
Tian Pan
Software Engineer

Your RAG pipeline answers single-fact questions beautifully. Ask it "What is our refund policy?" and it nails it every time. But ask "Which customers on the enterprise plan filed support tickets about the billing API within 30 days of their contract renewal?" and it falls apart. The answer exists in your data — scattered across three different document types, connected by relationships that cosine similarity cannot see.

This is the multi-hop reasoning problem, and it's the reason a growing number of production RAG teams are grafting knowledge graphs onto their vector retrieval pipelines. Not because graphs are trendy again, but because they've hit a concrete accuracy ceiling that no amount of chunk-size tuning or reranking can fix.

Deep Research Agents: Why Most Implementations Loop Forever or Stop Too Early

· 10 min read
Tian Pan
Software Engineer

Standard LLMs without iterative retrieval score below 10% on multi-step web research benchmarks. Deep research agents — systems that search, read, synthesize, and re-query in a loop — score above 50%. That five-fold improvement explains why every serious AI product team is building one. What it doesn't explain is why most of those implementations either run up a $15 bill chasing irrelevant tangents or declare victory after two shallow searches.

The core problem isn't building the loop. It's knowing when the loop should stop. And that turns out to be a surprisingly deep systems design challenge that touches convergence detection, cost economics, source reliability, and multi-agent coordination.
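
As a minimal sketch of the stopping decision, assuming the agent can report the cost of each step and how many new facts it found: the loop below stops on budget, on a step cap, or when several consecutive steps add nothing new, and a minimum step count guards against declaring victory after two shallow searches. All names and default thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoopState:
    steps: int = 0
    spent_usd: float = 0.0
    low_novelty_streak: int = 0


def record_step(state: LoopState, cost_usd: float, new_facts: int) -> None:
    """Update the state after one search-read-synthesize iteration."""
    state.steps += 1
    state.spent_usd += cost_usd
    # A step that surfaces no new facts extends the streak; any new fact resets it.
    state.low_novelty_streak = state.low_novelty_streak + 1 if new_facts == 0 else 0


def stop_reason(
    state: LoopState,
    budget_usd: float = 2.0,
    max_steps: int = 12,
    min_steps: int = 3,
    patience: int = 2,
) -> Optional[str]:
    """Return why the loop should stop, or None to keep going."""
    if state.spent_usd >= budget_usd:
        return "budget"
    if state.steps >= max_steps:
        return "max_steps"
    if state.steps >= min_steps and state.low_novelty_streak >= patience:
        return "converged"
    return None
```

Counting "new facts" is the hard part in practice. This sketch only shows where that signal plugs in.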

Your Embedding Pipeline Is Critical Infrastructure — Treat It Like Your Primary Database

· 9 min read
Tian Pan
Software Engineer

Most teams treat embedding generation as a one-time ETL job: run a script, populate a vector database, move on. This works fine in a demo. In production, it is a slow-motion disaster. Your vector index is not a static artifact — it is a continuously running pipeline with its own failure modes, staleness guarantees, and operational runbook. And unlike your primary database, when it breaks, nothing throws an exception. Your system keeps returning results. They are just quietly, confidently wrong.

If you are running a retrieval-augmented generation (RAG) system, a semantic search feature, or any product that depends on embeddings, your vector index deserves the same rigor you give your PostgreSQL cluster. Here is why most teams get this wrong, and what production-grade embedding infrastructure actually looks like.
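
One small piece of that rigor can be sketched as a drift check: compare a hash of each source row against the hash recorded when it was last embedded. This is an illustrative assumption about how such a check might work, and `find_stale_embeddings` and its input shapes are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of the text that was (or should be) embedded."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_stale_embeddings(
    source_rows: dict[str, str],     # row id -> current source text
    indexed_hashes: dict[str, str],  # row id -> hash stored at embedding time
) -> dict[str, list[str]]:
    """Classify rows whose vectors no longer match the source of truth."""
    changed = [
        row_id
        for row_id, text in source_rows.items()
        if row_id in indexed_hashes and indexed_hashes[row_id] != content_hash(text)
    ]
    missing = [row_id for row_id in source_rows if row_id not in indexed_hashes]
    orphaned = [row_id for row_id in indexed_hashes if row_id not in source_rows]
    return {
        "changed": sorted(changed),    # source edited after embedding
        "missing": sorted(missing),    # never embedded
        "orphaned": sorted(orphaned),  # vector exists, source row deleted
    }
```

Run on a schedule, a non-empty result is the exception that the vector index itself never throws.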

GraphRAG in Production: When Vector Search Fails at Multi-Hop Reasoning

· 9 min read
Tian Pan
Software Engineer

Your RAG pipeline returns confident, well-formatted answers. The embeddings are tuned, the chunk size is optimized, and retrieval scores look great. Then a user asks "Which suppliers affected by the port strike also have contracts expiring this quarter?" and the system returns irrelevant fragments about port logistics and contract management — separately, never connecting them. This is the multi-hop reasoning gap, and it's where vector search quietly fails.

The failure isn't a tuning problem — it's architectural. Vector similarity finds documents that look like the query but cannot traverse relationships between entities scattered across different documents. GraphRAG — retrieval augmented generation backed by knowledge graphs — addresses this by making entity relationships first-class retrieval objects. But shipping it to production is harder than the demos suggest.
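
To show what "first-class retrieval objects" buys you, here is a toy in-memory sketch of the port-strike question as a two-hop traversal. It is not a GraphRAG implementation. `TinyGraph`, the relation names (`affected_by`, `has_contract`, `expires_in`), and the sample entities are all made up for illustration.

```python
from collections import defaultdict


class TinyGraph:
    """Minimal triple store indexed in both directions."""

    def __init__(self) -> None:
        self._by_subject = defaultdict(set)  # (subject, relation) -> objects
        self._by_object = defaultdict(set)   # (relation, object) -> subjects

    def add(self, subject: str, relation: str, obj: str) -> None:
        self._by_subject[(subject, relation)].add(obj)
        self._by_object[(relation, obj)].add(subject)

    def objects(self, subject: str, relation: str) -> set[str]:
        return set(self._by_subject.get((subject, relation), ()))

    def subjects(self, relation: str, obj: str) -> set[str]:
        return set(self._by_object.get((relation, obj), ()))


def suppliers_at_risk(graph: TinyGraph, event: str, quarter: str) -> set[str]:
    """Suppliers affected by `event` that hold a contract expiring in `quarter`."""
    affected = graph.subjects("affected_by", event)
    expiring_contracts = graph.subjects("expires_in", quarter)
    # Hop 1: supplier -> contracts. Hop 2: contract -> expiry quarter.
    return {s for s in affected if graph.objects(s, "has_contract") & expiring_contracts}
```

The join between "affected by the strike" and "contract expiring" happens on entity identity, which is exactly the connection that similarity over isolated chunks cannot make.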

Hybrid Search in Production: Why BM25 Still Wins on the Queries That Matter

· 11 min read
Tian Pan
Software Engineer

BM25 was published in 1994. The math is simple enough to fit on a whiteboard. Yet in production retrieval benchmarks in 2025, it still outperforms multi-billion-parameter dense embedding models on a meaningful slice of real-world queries. Teams that discover this after deploying pure vector search tend to discover it the worst possible way: through hallucination complaints they can't reproduce in evaluation, because their eval set was built from queries that already worked.

This is the retrieval equivalent of sampling bias. Dense retrieval fails on a specific and predictable query shape. The failure is silent — the LLM still produces fluent, confident-sounding answers from whatever fragments it retrieved. No error log fires. No latency spike. Just quietly wrong answers for users querying product SKUs, error codes, API names, or anything that is lexically specific rather than semantically general.

The fix is hybrid search. But "hybrid search" is underspecified as an engineering decision. This post covers what the failure modes actually look like, how to fuse retrieval signals correctly, where the reranking layer goes, and — most critically — how to find the query types your current pipeline is silently failing on before users find them for you.
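
As one concrete example of fusing retrieval signals, the sketch below uses reciprocal rank fusion (RRF), a common rank-based method. The excerpt does not say which fusion method the post recommends, so treat RRF here as a stand-in. The function name is hypothetical, and `k = 60` is the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document ids; each list is ordered best-first.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so raw BM25 and cosine scores never need to share a scale.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["sku-123", "doc-a", "doc-b"]   # lexical match on an exact SKU
dense_hits = ["doc-a", "doc-c", "sku-123"]  # semantic neighbors
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Because RRF works only on ranks, a document that either retriever places near the top stays competitive in the fused list, which is the property that keeps lexically specific hits from being drowned out.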