61 posts tagged with "rag"

GraphRAG in Production: When Vector Search Hits Its Ceiling

· 9 min read
Tian Pan
Software Engineer

Your vector search looks great on benchmarks. Users are still frustrated.

The failure mode is subtle: a user asks "Which of our suppliers have been involved in incidents that affected customers in the same region as the Martinez account?" Your embeddings retrieve the incident records. They retrieve the supplier contracts. They retrieve the customer accounts. But they retrieve them as disconnected documents, and the LLM has to figure out the relationships in context — relationships that span three hops across your entity graph. Once a query touches five or more entities, accuracy without relational structure drops toward zero; with that structure in place, it stays stable.

This is the ceiling that knowledge graph augmented retrieval — GraphRAG — is built to address. It is not a drop-in replacement for vector search. It is a different system with a different cost structure, different failure modes, and a different class of queries where it wins decisively.
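To make the multi-hop point concrete, here is a minimal sketch — toy triples, hypothetical entity names — of how the supplier question above becomes a plain breadth-first traversal once relationships are first-class rather than inferred in context:

```python
from collections import deque

# Toy entity graph as (subject, relation, object) triples; names hypothetical.
EDGES = [
    ("acct:Martinez", "located_in", "region:SW"),
    ("cust:Acme", "located_in", "region:SW"),
    ("incident:I-17", "affected", "cust:Acme"),
    ("supplier:S-3", "involved_in", "incident:I-17"),
]

def neighbors(node):
    """Yield nodes one relationship away, in either direction."""
    for src, _rel, dst in EDGES:
        if src == node:
            yield dst
        elif dst == node:
            yield src

def entities_within(start, max_hops):
    """Breadth-first search out to `max_hops` relationship hops from `start`,
    returning each reachable entity with its hop distance."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue
        for other in neighbors(node):
            if other not in seen:
                seen[other] = seen[node] + 1
                queue.append(other)
    return seen

# Suppliers connected to the Martinez account via region -> customer -> incident:
hops = entities_within("acct:Martinez", max_hops=4)
suppliers = [n for n in hops if n.startswith("supplier:")]
```

In production the graph lives in a graph database and the retrieved subgraph is serialized into the prompt; the sketch only shows why the traversal is trivial when the edges exist and near-impossible when the LLM must reconstruct them from disconnected chunks.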

Where Production LLM Pipelines Leak User Data: PII, Residency, and the Compliance Patterns That Hold Up

· 12 min read
Tian Pan
Software Engineer

Most teams building LLM applications treat privacy as a model problem. They worry about what the model knows — its training data, its memorization — while leaving gaping holes in the pipeline around it. The embarrassing truth is that the vast majority of data leaks in production LLM systems don't come from the model at all. They come from the RAG chunks you index without redacting, the prompt logs you write to disk verbatim, the system prompts that contain database credentials, and the retrieval step that a poisoned document can hijack to exfiltrate everything in your knowledge base.

Gartner estimates that 30% of generative AI projects were abandoned by end of 2025 due to inadequate risk controls. Most of those failures weren't the model hallucinating — they were privacy and compliance failures in systems engineers thought were under control.
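One of the cheapest mitigations for the first leak — unredacted RAG chunks — is to scrub PII before anything is embedded or logged. A minimal sketch with illustrative regexes only; real systems typically layer an NER-based detector (e.g. Microsoft Presidio) on top:

```python
import re

# Illustrative patterns; a production detector needs far broader coverage.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def redact(chunk: str) -> str:
    """Replace PII matches with typed placeholders before the chunk
    is embedded, indexed, or written to a prompt log."""
    for label, pattern in PII_PATTERNS.items():
        chunk = pattern.sub(f"[{label}]", chunk)
    return chunk
```

Typed placeholders (rather than deletion) keep the chunk's meaning intact for retrieval while ensuring the raw identifier never reaches the index or the logs.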

Long-Context Models vs. RAG: When the 1M-Token Window Is the Wrong Tool

· 9 min read
Tian Pan
Software Engineer

When Gemini 1.5 Pro launched with a 1M-token context window, a wave of engineers declared RAG dead. The argument seemed airtight: why build a retrieval pipeline with chunkers, embeddings, vector databases, and re-rankers when you can just dump your entire knowledge base into the prompt and let the model figure it out?

That argument collapses under production load. Gemini 1.5 Pro achieves 99.7% recall on the "needle in a haystack" benchmark — a single fact hidden in a document. On realistic multi-fact retrieval, average recall hovers around 60%. That 40% miss rate isn't a benchmarking artifact; it's facts your system silently fails to surface to users. And a 1M-token request runs 30–60x slower than a RAG pipeline, at roughly 1,250x the per-query cost.
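The cost multiple follows from prompt size alone, since APIs bill input tokens linearly. Back-of-envelope arithmetic, assuming (for illustration) a RAG prompt of roughly 800 tokens of query plus retrieved context:

```python
# Per-query input tokens; illustrative numbers, not a vendor quote.
long_context_tokens = 1_000_000   # the entire knowledge base in the prompt
rag_prompt_tokens = 800           # query + top-k retrieved chunks (assumed)

cost_ratio = long_context_tokens / rag_prompt_tokens  # paid on every query
```

You pay that ratio on every single request, whether or not the extra 999,200 tokens were relevant to the question.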

Long-context models are a powerful tool. They're just not the right tool for most production retrieval workloads.

The Production Retrieval Stack: Why Pure Vector Search Fails and What to Do Instead

· 12 min read
Tian Pan
Software Engineer

Most RAG systems are deployed with a vector database, a few thousand embeddings, and the assumption that semantic similarity is close enough to correctness. It is not. That gap between "semantically similar" and "actually correct" is why 73% of RAG systems fail in production, and almost all of those failures happen at the retrieval stage — before the LLM ever generates a word.

The standard playbook of "embed your documents, query with cosine similarity, pass top-k to the LLM" works in demos because demo queries are designed to work. Production queries are not. Users search for product IDs, invoice numbers, regulation codes, competitor names spelled wrong, and multi-constraint questions that a single embedding vector cannot geometrically satisfy. Dense vector search is not wrong — it is incomplete. Building a retrieval stack that actually works in production requires understanding why, and layering in the components that compensate.
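One of those compensating layers is lexical search (which catches product IDs and exact error codes) fused with dense retrieval. A common fusion method is reciprocal rank fusion (RRF); a minimal sketch with hypothetical document IDs:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked result lists (e.g. one from BM25, one from dense
    vectors) by summing 1/(k + rank) per document. k=60 is the
    conventional constant from the original RRF formulation."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# The exact-match hit that only lexical search finds (a product ID)
# still surfaces near the top of the fused list.
bm25_hits = ["doc:SKU-4411", "doc:faq", "doc:returns"]
dense_hits = ["doc:faq", "doc:shipping", "doc:returns"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

RRF needs no score calibration between the two systems — only ranks — which is exactly why it holds up when BM25 scores and cosine similarities live on incomparable scales.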

Temporal Reasoning Failures in Production AI Systems

· 10 min read
Tian Pan
Software Engineer

An agent that confidently recommends products that have been out of stock for six months. A customer service bot that tells a user there's no record of the order they placed 20 minutes ago. A coding assistant that generates working code against a library API deprecated two years ago. These aren't hallucinations in the traditional sense — the model is recalling something that was once accurate. That's a different failure mode entirely, and most teams aren't equipped to detect or defend against it.

The distinction matters because the mitigations are fundamentally different. You can't prompt-engineer your way out of staleness. You can't fine-tune your way out of it either — fine-tuning on stale knowledge makes the problem worse, not better, because the model expresses outdated information with greater authority. And as models become more fluent and confident in their delivery, their confidently-wrong stale answers become harder, not easier, for users to catch.
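The mitigations that do work live in the retrieval layer: stamp every piece of evidence with an as-of time and enforce a per-domain freshness budget before it ever reaches the model. A minimal sketch (field names are hypothetical):

```python
from datetime import datetime, timedelta, timezone

def filter_fresh(docs, max_age_days, now=None):
    """Drop retrieved evidence older than a per-domain freshness budget:
    inventory facts might get hours, API docs might get months."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [d for d in docs if d["as_of"] >= cutoff]

# Deterministic example with a pinned clock.
now = datetime(2025, 1, 1, tzinfo=timezone.utc)
docs = [
    {"id": "restock-note", "as_of": now - timedelta(days=10)},
    {"id": "old-catalog", "as_of": now - timedelta(days=200)},
]
fresh = filter_fresh(docs, max_age_days=30, now=now)
```

The hard part isn't the filter — it's deciding the budget per domain and making sure ingestion actually records the as-of time rather than the crawl time.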

Agentic RAG: When Your Retrieval Pipeline Needs a Brain

· 10 min read
Tian Pan
Software Engineer

Ninety percent of agentic RAG projects failed in production in 2024. Not because the technology was broken, but because engineers wired up vector search, a prompt, and an LLM, called it a retrieval pipeline, and shipped — without accounting for the compounding failure costs at every layer between query and answer.

Classic RAG is a deterministic function: embed query → vector search → stuff context → generate. It runs once, in one direction, with no feedback loop. That works when queries are clean single-hop lookups against a well-chunked corpus. It fails spectacularly when a user asks "compare the liability clauses across these five contracts," or "summarize what's changed in our infra config since the Q3 incident," or any question that requires synthesizing evidence across documents before forming an answer.
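The difference is easiest to see in code. A toy sketch — keyword overlap standing in for retrieval, and a hypothetical `planner` standing in for the LLM call that decomposes the question — contrasting the one-shot pipeline with a loop that aggregates evidence across sub-queries before any answer is formed:

```python
def retrieve(query, corpus):
    """Toy retriever: return docs sharing any word with the query."""
    words = set(query.lower().split())
    return [doc for doc in corpus if words & set(doc.lower().split())]

def classic_rag(query, corpus):
    """Embed -> search -> stuff -> generate: one pass, one direction."""
    return retrieve(query, corpus)  # context for a single generate() call

def agentic_rag(question, corpus, planner):
    """Decompose the question, retrieve per sub-query, and aggregate
    deduplicated evidence before any answer is formed."""
    evidence = []
    for sub_query in planner(question):
        for doc in retrieve(sub_query, corpus):
            if doc not in evidence:
                evidence.append(doc)
    return evidence

corpus = [
    "contract A: liability clause caps damages",
    "contract B: liability clause excludes negligence",
    "infra config change after the Q3 incident",
]
# Hypothetical decomposition of a comparison question into sub-queries.
planner = lambda q: ["contract A liability", "contract B liability"]
evidence = agentic_rag("compare the liability clauses", corpus, planner)
```

Everything interesting in real agentic RAG hides inside `planner` and inside a stopping criterion this sketch omits — but the shape is the point: a loop with intermediate state, not a straight line.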

Building a Generative AI Platform: Architecture, Trade-offs, and the Components That Actually Matter

· 12 min read
Tian Pan
Software Engineer

Most teams treating their GenAI stack as a model integration project eventually discover they've actually built—or need to build—a platform. The model is the easy part. The hard part is everything around it: routing queries to the right model, retrieving context reliably, filtering unsafe outputs, caching redundant calls, tracing what went wrong in a chain of five LLM calls, and keeping costs from tripling month-over-month as usage scales.

This article is about that platform layer. Not the model weights, not the prompts—the surrounding infrastructure that separates a working proof of concept from something you'd trust to serve a million users.
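As a taste of that layer, here is a minimal sketch of one component — an exact-match cache in front of the model client (`call_model` is a hypothetical stand-in for your real client); semantic caching and TTLs would come next:

```python
import hashlib
import json

class LLMCache:
    """Exact-match cache keyed on (model, prompt, sampling params).
    Only safe for deterministic calls, e.g. temperature=0."""

    def __init__(self, call_model):
        self.call_model = call_model  # hypothetical model-client function
        self.store = {}
        self.hits = 0

    def complete(self, model, prompt, **params):
        # Stable key: identical requests hash identically.
        key = hashlib.sha256(
            json.dumps([model, prompt, params], sort_keys=True).encode()
        ).hexdigest()
        if key in self.store:
            self.hits += 1
            return self.store[key]
        result = self.call_model(model, prompt, **params)
        self.store[key] = result
        return result
```

Even this naive version pays for itself on repeated system-prompt-heavy calls; the design question the article digs into is when to graduate to semantic matching and how to invalidate.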

Context Engineering: Why What You Feed the LLM Matters More Than How You Ask

· 11 min read
Tian Pan
Software Engineer

Most LLM quality problems aren't prompt problems. They're context problems.

You spend hours crafting the perfect system prompt. You add XML tags, chain-of-thought instructions, and careful persona definitions. You test it on a handful of inputs and it looks great. Then you ship it, and two weeks later you're staring at a ticket where the agent confidently told a user the wrong account balance — because it retrieved the previous user's transaction history. The model understood the instructions perfectly. It just had the wrong inputs.

This is the core distinction between prompt engineering and context engineering. Prompt engineering asks: "How should I phrase this?" Context engineering asks: "What does the model need to know right now, and how do I make sure it gets exactly that?" One is copywriting. The other is systems architecture.
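The wrong-balance failure above has a systems fix, not a prompt fix: scope retrieval by user before similarity ranking ever runs. A minimal sketch over a toy in-memory index (field names are hypothetical; a real vector store would do this with a metadata filter):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve_for_user(query_vec, index, user_id, top_k=3):
    """Filter by user BEFORE similarity ranking, so another user's
    records can never enter the context window, no matter how
    semantically similar they are."""
    candidates = [e for e in index if e["user_id"] == user_id]
    candidates.sort(key=lambda e: dot(query_vec, e["vec"]), reverse=True)
    return candidates[:top_k]

index = [
    {"user_id": "u1", "vec": [1.0, 0.0], "text": "u1: balance $120"},
    {"user_id": "u2", "vec": [1.0, 0.0], "text": "u2: balance $9,400"},
]
results = retrieve_for_user([1.0, 0.0], index, "u1")
```

The order matters: filter-then-rank is a hard guarantee, while rank-then-filter merely makes the leak less likely.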

A Year of Building with LLMs: What the Field Has Actually Learned

· 9 min read
Tian Pan
Software Engineer

Most teams building with LLMs today are repeating mistakes that others made a year ago. The most expensive one is mistaking the model for the product.

After a year of LLM-powered systems shipping into production — codegen tools, document processors, customer-facing assistants, internal knowledge systems — practitioners have accumulated a body of hard-won knowledge that's very different from what the hype cycle suggests. The lessons aren't about which foundation model to choose or whether RAG beats fine-tuning. They're about the unglamorous work of building reliable systems: how to evaluate output, how to structure workflows, when to invest in infrastructure versus when to keep iterating on prompts, and how to think about differentiation.

This is a synthesis of what that field experience actually shows.

Beyond RAG: Hybrid Search, Agentic Retrieval, and the Database Design Decisions That Actually Matter

· 8 min read
Tian Pan
Software Engineer

Most teams ship RAG and call it a retrieval strategy. They chunk documents, embed them, store the vectors, and run nearest-neighbor search at query time. It works well enough in demos. In production, users start reporting that the system can't find an article they know exists, misses error codes verbatim in the docs, or returns semantically similar but factually wrong passages.

The problem isn't RAG. The problem is treating retrieval as a one-dimensional problem when it's always been multi-dimensional.

Hard-Won Lessons from Shipping LLM Systems to Production

· 7 min read
Tian Pan
Software Engineer

Most engineers building with LLMs share a common arc: a working demo in two days, production chaos six weeks later. The technology behaves differently under real load, with real users, against real data. The lessons that emerge aren't philosophical—they're operational.

After watching teams across companies ship (and sometimes abandon) LLM-powered products, a handful of patterns appear again and again. These aren't edge cases. They're the default experience.

Seven Patterns for Building LLM Systems That Actually Work in Production

· 10 min read
Tian Pan
Software Engineer

The demo always works. Prompt the model with a curated example, get a clean output, ship the screenshot to the stakeholder deck. Six weeks later, the system is in front of real users, and none of the demo examples appear in production traffic.

This is the gap every LLM product team eventually crosses: the jump from "it works on my inputs" to "it works on inputs I didn't anticipate." The patterns that close that gap aren't about model selection or prompt cleverness — they're about system design. Seven patterns account for most of what separates functional prototypes from reliable production systems.