
141 posts tagged with "rag"


Multimodal RAG in Production: When You Need to Search Images, Audio, and Text Together

· 12 min read
Tian Pan
Software Engineer

Most teams add multimodal RAG to their roadmap after realizing that a meaningful chunk of their corpus — product screenshots, recorded demos, architecture diagrams, support call recordings — is invisible to their text-only retrieval system. What surprises them in production is not the embedding model selection or the vector database choice. It's the gap between modalities: the same semantic concept encoded as an image and as a sentence lands in completely different regions of the vector space, and the search engine has no idea they're related.

This post covers the technical mechanics of multimodal embedding alignment, the cross-modal reranking strategies that actually work at scale, the cost and latency profile relative to text-only RAG, and the failure modes that are specific to multimodal retrieval.

Fine-tuning vs. RAG for Knowledge Injection: The Decision Engineers Consistently Get Wrong

· 10 min read
Tian Pan
Software Engineer

A fintech team spent three months fine-tuning a model on their internal compliance documentation — thousands of regulatory PDFs, policy updates, and procedural guides. The results were mediocre. The model still hallucinated specific rule numbers. It forgot recent policy changes. And the one metric that actually mattered (whether advisors trusted its answers enough to stop double-checking) barely moved. Two weeks later, a different team built a RAG pipeline over the same document corpus. Advisors started trusting it within a week.

The fine-tuning team hadn't made a technical mistake. They'd made a definitional one: they were solving a knowledge retrieval problem with a behavior modification tool.

Graph Memory for LLM Agents: The Relational Blind Spots That Flat Vectors Miss

· 10 min read
Tian Pan
Software Engineer

A customer service agent knows that the user prefers morning delivery. It also knows the user's primary address is in Seattle. What it cannot figure out is that the Seattle address is a work address used only on weekdays, and the morning delivery window does not apply there on Mondays because of a building restriction the user mentioned three months ago. Each fact is retrievable in isolation. The relationship between them is not.

This is the failure mode that bites production agents working from flat vector stores. Each piece of information exists as an embedding floating in high-dimensional space. Similarity search retrieves facts that match a query. It does not recover the structural connections between facts — the edges that give them meaning in combination.

Most agent memory architectures are built around vector databases because they are fast, simple to set up, and work well for the majority of retrieval tasks. The failure cases are subtle enough that they often survive into production before anyone notices the pattern.

Why the Chunking Problem Isn't Solved: How Naive RAG Pipelines Hallucinate on Long Documents

· 9 min read
Tian Pan
Software Engineer

Most RAG tutorials treat chunking as a footnote: split your documents into 512-token chunks, embed them, store them in a vector database, and move on to the interesting parts. This works well enough on toy examples — Wikipedia articles, clean markdown docs, short PDFs. It falls apart in production.

A recent study deploying RAG for clinical decision support found that a fixed-size chunking baseline produced fully accurate responses on only 13% of 30 clinical questions. An adaptive chunking approach on the same corpus reached 50% (p=0.001). The documents were the same. The LLM was the same. Only the chunking changed. That gap is not a tuning problem or a prompt engineering problem. It is a structural failure in how most teams split documents.
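To make the structural failure concrete, here is a minimal sketch of fixed-size chunking. It is not taken from the study: the policy text is invented, whitespace-separated words stand in for model tokens, and the chunk size is shrunk from 512 to 10 so the effect shows up on two sentences.

```python
def fixed_size_chunks(tokens, size):
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


# Invented example text. Words stand in for tokens (an assumption); a real
# pipeline would count tokens with the embedding model's tokenizer.
text = (
    "Refunds are allowed within 30 days of purchase. "
    "For enterprise contracts the refund window is 90 days from invoice."
)
tokens = text.split()
chunks = [" ".join(c) for c in fixed_size_chunks(tokens, size=10)]

for n, chunk in enumerate(chunks):
    print(n, repr(chunk))
```

The boundary falls between "For enterprise" and "contracts", so the second chunk states a 90-day window without the condition it applies to, and the first chunk carries a dangling qualifier. Either chunk can be retrieved on its own and read as a complete rule.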

RAG's Dirty Secret: Your Retrieval Succeeds but Your Answers Are Still Wrong

· 9 min read
Tian Pan
Software Engineer

Most teams building RAG systems think they have two failure modes: retrieval fails to find the relevant document, or the LLM hallucinates despite having it. The first is measured obsessively — recall@K, MRR, NDCG. The second is treated as the model's problem. Neither framing is complete.

There's a third failure mode that sits between them: retrieval succeeds (the relevant document ranks in the top-K), but the retrieved context doesn't actually contain enough information to answer the question correctly. The model gets confident, generates a plausible answer, and gets it wrong. Research on frontier models including GPT-4o, Gemini 1.5 Pro, and Claude 3.5 shows this happens at rates above 50% on multi-step queries — and most production systems have no instrumentation to detect it.
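A small sketch of why the usual retrieval metrics miss this case. The metric functions follow the standard definitions of recall@K and reciprocal rank; the `context_sufficient` field is a hypothetical human or LLM-judge label, and the query records are invented for illustration.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Invented evaluation records. `context_sufficient` is a hypothetical label
# answering: "does the retrieved text contain enough to answer the question?"
queries = [
    {"ranked": ["d3", "d1", "d7"], "relevant": ["d1"], "context_sufficient": True},
    {"ranked": ["d2", "d9", "d4"], "relevant": ["d2"], "context_sufficient": False},
]

n = len(queries)
mean_recall = sum(recall_at_k(q["ranked"], q["relevant"], 3) for q in queries) / n
mrr = sum(reciprocal_rank(q["ranked"], q["relevant"]) for q in queries) / n
sufficient_rate = sum(q["context_sufficient"] for q in queries) / n

print(f"recall@3={mean_recall:.2f}  MRR={mrr:.2f}  sufficient={sufficient_rate:.2f}")
```

Both queries count as retrieval successes, yet only one of them hands the model enough context to answer. Tracking a sufficiency rate next to the ranking metrics is one way to surface the gap.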

The RAG Freshness Problem: How Stale Embeddings Silently Wreck Retrieval Quality

· 12 min read
Tian Pan
Software Engineer

Your RAG system launched three months ago with impressive retrieval accuracy. Today, it's confidently wrong about a third of what users ask — and nothing in your monitoring caught the change. No errors logged. No latency spikes. The semantic similarity scores look healthy. But the documents being retrieved are outdated, and the model answers with full confidence because the retrieved context looks authoritative.

This is the RAG freshness problem: semantic similarity does not care about time. An embedding of a deprecated API reference scores just as high as a current one. A policy document from last quarter retrieves ahead of the updated version. The system doesn't know and can't tell. Most teams discover their index is weeks or months stale only after a user complaint — and by then, users have already quietly stopped trusting it.
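The time-blindness is easy to reproduce in a toy setting. In the sketch below the vectors, document IDs, dates, and the `find_stale` helper are all invented for illustration; the point is only that cosine similarity never reads the timestamp, so any freshness signal has to come from metadata checked separately.

```python
import math
from datetime import date


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Two versions of the same page with near-identical wording. The toy vectors
# are made identical here to keep the example simple.
docs = [
    {"id": "api-v1", "vector": [0.9, 0.1, 0.4], "updated": date(2023, 1, 10)},
    {"id": "api-v2", "vector": [0.9, 0.1, 0.4], "updated": date(2024, 6, 2)},
]
query = [1.0, 0.0, 0.5]

scores = {d["id"]: cosine(query, d["vector"]) for d in docs}
print(scores)  # the deprecated and current pages score the same


def find_stale(docs, today, max_age_days):
    """Hypothetical freshness check: IDs of documents older than the cutoff."""
    return [d["id"] for d in docs if (today - d["updated"]).days > max_age_days]


print(find_stale(docs, today=date(2024, 7, 1), max_age_days=180))
```

A check like this can run as a scheduled audit over the index or as a filter at query time; which one fits depends on how quickly the underlying documents change.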

The Context Stuffing Antipattern: Why More Context Makes LLMs Worse

· 9 min read
Tian Pan
Software Engineer

When 1M-token context windows shipped, many teams took it as permission to stop thinking about context design. The reasoning was intuitive: if the model can see everything, just give it everything. Dump the document. Pass the full conversation history. Forward every tool output to the next agent call. Let the model sort it out.

This is the context stuffing antipattern, and it produces a characteristic failure mode: systems that work fine in early demos, then hit a reliability ceiling in production that no amount of prompt tweaking seems to fix. Accuracy degrades on questions that should be straightforward. Answers become hedged and non-committal. Agents start hallucinating joins between documents that aren't related. The model "saw" all the right information — it just couldn't find it.

Embedding Models in Production: Selection, Versioning, and the Index Drift Problem

· 10 min read
Tian Pan
Software Engineer

Your RAG answered correctly yesterday. Today it contradicts itself. Nothing obvious changed — except your embedding provider quietly shipped a model update and your index is now a Frankenstein of mixed vector spaces.

Embedding models are the unsexy foundation of every retrieval-augmented system, and they fail in ways that are uniquely hard to diagnose. Unlike a prompt change or a model parameter tweak, embedding model problems surface slowly, as silent quality degradation that your evals don't catch until users start complaining. This post covers three things: how to pick the right embedding model for your domain (MTEB scores mislead more than they help), what actually happens when you upgrade a model, and the versioning patterns that let you swap models without rebuilding from scratch.
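One way to catch a mixed index early is to store the embedding model version next to every vector and refuse to query when versions disagree. This is a sketch under assumptions, not the versioning pattern the post goes on to describe: `StoredVector`, `MixedIndexError`, `check_index_consistency`, and the version strings are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredVector:
    doc_id: str
    model_version: str  # version of the embedding model that produced `values`
    values: tuple


class MixedIndexError(ValueError):
    """Raised when stored vectors do not all match the query model version."""


def check_index_consistency(vectors, query_model_version):
    """Fail loudly if any vector came from a different embedding model."""
    versions = {v.model_version for v in vectors}
    mismatched = sorted(versions - {query_model_version})
    if mismatched:
        raise MixedIndexError(
            f"index contains vectors from {mismatched}, "
            f"but queries are embedded with {query_model_version!r}"
        )


index = [
    StoredVector("doc-1", "embed-v1", (0.12, 0.80)),
    StoredVector("doc-2", "embed-v2", (0.33, 0.41)),  # written after an upgrade
]

try:
    check_index_consistency(index, query_model_version="embed-v2")
except MixedIndexError as err:
    print("refusing to query:", err)
```

Vectors from different models are not comparable even when their dimensions match, so a hard failure here is usually preferable to silently ranking across two vector spaces.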

GraphRAG in Production: When Vector Search Hits Its Ceiling

· 9 min read
Tian Pan
Software Engineer

Your vector search looks great on benchmarks. Users are still frustrated.

The failure mode is subtle: a user asks "Which of our suppliers have been involved in incidents that affected customers in the same region as the Martinez account?" Your embeddings retrieve the incident records. They retrieve the supplier contracts. They retrieve the customer accounts. But they retrieve them as disconnected documents, and the LLM has to figure out the relationships in context — relationships that span three hops across your entity graph. At five or more entities per query, accuracy without relational structure drops toward zero. With it, performance stays stable.

This is the ceiling that knowledge-graph-augmented retrieval (GraphRAG) is built to address. It is not a drop-in replacement for vector search. It is a different system with a different cost structure, different failure modes, and a different class of queries where it wins decisively.
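The supplier question can be sketched as an explicit traversal over stored relationships. Everything below is invented for illustration: the account names, regions, incident IDs, suppliers, and the dictionary-based edge lists stand in for whatever graph store a real system would use.

```python
# Hypothetical edges: account -> region, incident -> affected accounts,
# incident -> suppliers involved.
account_region = {"Martinez": "west", "Okafor": "west", "Lindqvist": "north"}
incident_affected = {
    "INC-1": ["Okafor"],
    "INC-2": ["Lindqvist"],
    "INC-3": ["Martinez", "Okafor"],
}
incident_suppliers = {
    "INC-1": ["Acme Freight"],
    "INC-2": ["Borealis Parts"],
    "INC-3": ["Cobalt Logistics"],
}


def suppliers_for_region_of(account):
    """Suppliers tied to incidents affecting customers in `account`'s region."""
    region = account_region[account]                                    # hop 1
    customers = {c for c, r in account_region.items() if r == region}   # hop 2
    suppliers = set()
    for incident, affected in incident_affected.items():                # hop 3
        if customers & set(affected):
            suppliers.update(incident_suppliers.get(incident, []))
    return sorted(suppliers)


print(suppliers_for_region_of("Martinez"))
```

Each hop follows a stored edge rather than a similarity score, which is why the answer does not depend on the LLM reconstructing the joins from disconnected documents in its context.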

Where Production LLM Pipelines Leak User Data: PII, Residency, and the Compliance Patterns That Hold Up

· 12 min read
Tian Pan
Software Engineer

Most teams building LLM applications treat privacy as a model problem. They worry about what the model knows — its training data, its memorization — while leaving gaping holes in the pipeline around it. The embarrassing truth is that the vast majority of data leaks in production LLM systems don't come from the model at all. They come from the RAG chunks you index without redacting, the prompt logs you write to disk verbatim, the system prompts that contain database credentials, and the retrieval step that a poisoned document can hijack to exfiltrate everything in your knowledge base.

Gartner predicted that at least 30% of generative AI projects would be abandoned after proof of concept by the end of 2025, with inadequate risk controls among the causes. Most of those failures weren't the model hallucinating; they were privacy and compliance failures in systems engineers thought were under control.

Long-Context Models vs. RAG: When the 1M-Token Window Is the Wrong Tool

· 9 min read
Tian Pan
Software Engineer

When Gemini 1.5 Pro launched with a 1M-token context window, a wave of engineers declared RAG dead. The argument seemed airtight: why build a retrieval pipeline with chunkers, embeddings, vector databases, and re-rankers when you can just dump your entire knowledge base into the prompt and let the model figure it out?

That argument collapses under production load. Gemini 1.5 Pro achieves 99.7% recall on the "needle in a haystack" benchmark — a single fact hidden in a document. On realistic multi-fact retrieval, average recall hovers around 60%. That 40% miss rate isn't a benchmarking artifact; it's facts your system silently fails to surface to users. And the latency for a 1M-token request runs 30–60x slower than a RAG pipeline at roughly 1,250x the per-query cost.

Long-context models are a powerful tool. They're just not the right tool for most production retrieval workloads.

The Production Retrieval Stack: Why Pure Vector Search Fails and What to Do Instead

· 12 min read
Tian Pan
Software Engineer

Most RAG systems are deployed with a vector database, a few thousand embeddings, and the assumption that semantic similarity is close enough to correctness. It is not. That gap between "semantically similar" and "actually correct" is why 73% of RAG systems fail in production, and almost all of those failures happen at the retrieval stage — before the LLM ever generates a word.

The standard playbook of "embed your documents, query with cosine similarity, pass top-k to the LLM" works in demos because demo queries are designed to work. Production queries are not. Users search for product IDs, invoice numbers, regulation codes, competitor names spelled wrong, and multi-constraint questions that a single embedding vector cannot geometrically satisfy. Dense vector search is not wrong — it is incomplete. Building a retrieval stack that actually works in production requires understanding why, and layering in the components that compensate.
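One common way to layer a lexical component on top of dense search is reciprocal rank fusion (RRF), which merges ranked lists using only rank positions. The excerpt does not say which fusion method the post uses, so treat this as an illustrative stand-in; the document IDs and the two input rankings are invented, and `k=60` is the conventional default for RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of doc IDs; each doc scores sum(1 / (k + rank))."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Invented results for a query containing an invoice number. Dense search
# ranks topically similar pages first; keyword search finds the exact ID.
dense_ranking = ["faq-billing", "invoice-guide", "inv-10482"]
keyword_ranking = ["inv-10482", "invoice-guide"]

fused = reciprocal_rank_fusion([dense_ranking, keyword_ranking])
print(fused)
```

Because RRF ignores raw scores, it sidesteps the problem of cosine similarities and keyword scores living on different scales. The exact-match document rises to the top here only because the keyword list ranks it first, which is the compensation that dense search alone cannot provide.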