Skip to main content

108 posts tagged with "rag"

View all tags

The Attribution Gap: How to Trace a User Complaint Back to a Specific Model Decision

· 12 min read
Tian Pan
Software Engineer

A support ticket arrives: "Your AI gave me completely wrong advice about my insurance policy." You check the logs. You find a timestamp and a user ID. The actual model response is there, printed verbatim. But you have no idea which prompt version produced it, which context chunks were retrieved, whether a tool was called mid-chain, or which of the three model versions you've deployed in the past month actually handled that request. You can read the output. You cannot explain it.

This is the attribution gap — and it's the operational problem most AI teams hit six to eighteen months after they first ship a model-backed feature. The failure isn't in the model or the prompt; it's in the observability infrastructure. Traditional logging captures request-response pairs. LLM pipelines are not request-response pairs. They're decision trees: context retrieval, prompt assembly, optional tool calls, model inference, post-processing, conditional branching. When something goes wrong, you need the full tree, not just the leaf.

Amortizing Context: Persistent Agent Memory vs. Long-Context Windows

· 9 min read
Tian Pan
Software Engineer

When 1 million-token context windows became commercially available, a lot of teams quietly decided they'd solved agent memory. Why build a retrieval system, manage a vector database, or design an eviction policy when you can just dump everything in and let the model sort it out? The answer comes back in your infrastructure bill. At 10,000 daily interactions with a 100k-token knowledge base, the brute-force in-context approach costs roughly $5,000/day. A retrieval-augmented memory system handling the same load costs around $333/day — a 15x gap that compounds as your user base grows.

The real problem isn't just cost. It's that longer contexts produce measurably worse answers. Research consistently shows that models lose track of information positioned in the middle of very long inputs, accuracy drops predictably when relevant evidence is buried among irrelevant chunks, and latency climbs in ways that make interactive agents feel broken. The "stuff everything in" approach doesn't just waste money — it trades accuracy for the illusion of simplicity.

Cache Invalidation for AI: Why Every Cache Layer Gets Harder When the Answer Can Change

· 10 min read
Tian Pan
Software Engineer

Phil Karlton's famous quip — "There are only two hard things in Computer Science: cache invalidation and naming things" — was coined before language models entered production. Add AI to the stack and cache invalidation doesn't just get harder; it gets harder at every layer simultaneously, for fundamentally different reasons at each one.

Traditional caches store deterministic outputs: the database row, the rendered HTML, the computed price. When the source changes, you invalidate the key, and the next request fetches fresh data. The contract is simple because the answer is a fact.

AI caches store something different: responses to queries where the "correct" answer depends on context, recency, model behavior, and the source documents the model was given. Stale here doesn't mean outdated — it means semantically wrong in ways your monitoring won't catch until a user notices.

Chunking Strategy Is the Hidden Load-Bearing Decision in Your RAG Pipeline

· 10 min read
Tian Pan
Software Engineer

Most RAG quality conversations focus on the wrong things. Teams debate embedding model selection, tweak retrieval top-K, and experiment with prompt templates — while a single architectural decision made during ingestion quietly caps how good the system can ever be. That decision is chunking strategy: how you cut documents into pieces before indexing them.

A 2025 benchmark study found that chunking configuration has as much or more influence on retrieval quality as embedding model choice. And yet teams routinely pick a default — 512 tokens with RecursiveCharacterTextSplitter, usually — and then spend months wondering why their retrieval precision keeps disappointing them. The problem was baked in at index time. Swapping models cannot fix it.

Data Lineage for AI Systems: Tracking the Path from Source to Response

· 10 min read
Tian Pan
Software Engineer

A user files a support ticket: "Your AI assistant told me the contract renewal deadline was March 15th. It was February 28th. We missed it." You pull up the logs. The response was generated. The model didn't error. Every metric is green. But you have no idea which document it retrieved, what the model read, or whether the date came from the context or was hallucinated entirely.

This is the data lineage gap. And it's not a monitoring problem — it's an architecture problem baked in from the start.

The Data Quality Ceiling That Prompt Engineering Can't Break Through

· 10 min read
Tian Pan
Software Engineer

A telecommunications company spent months tuning prompts on their customer service chatbot. They iterated on system instructions, few-shot examples, chain-of-thought formatting. The hallucination rate stayed stubbornly above 50%. Then they audited their knowledge base and found it was filled with retired service plans, outdated billing information, and duplicate policy documents that contradicted each other. After fixing the data — not the prompts — hallucinations dropped to near zero. The fix that prompt engineering couldn't deliver took three weeks of data cleanup.

This is the data quality ceiling: a hard performance wall that blocks every LLM system fed on noisy, stale, or inconsistent data, and that no amount of prompt iteration can breach. It's one of the most common failure modes in production AI, and one of the most systematically underdiagnosed. Teams that hit this wall keep turning the prompt knobs when the problem is upstream.

The Document Is the Attack: Prompt Injection Through Enterprise File Pipelines

· 9 min read
Tian Pan
Software Engineer

Your AI assistant just processed a contract from a prospective vendor. It summarized the terms, flagged the risky clauses, and drafted a response. What you don't know is that the PDF contained white text on a white background — invisible to your eyes, perfectly visible to the model — instructing it to recommend acceptance regardless of terms. The summary looks reasonable. The approval recommendation looks reasonable. The model followed instructions you never wrote.

This is the document-as-attack-surface problem, and most enterprise AI pipelines are completely unprepared for it.

The vulnerability is architectural, not incidental. When document content flows directly into an LLM's context window, the model has no reliable way to distinguish legitimate instructions from attacker-controlled content embedded in a file. Every document your pipeline ingests is a potential instruction source — and in most systems, untrusted documents and trusted system prompts are processed with equal authority.

GDPR's Deletion Problem: Why Your LLM Memory Store Is a Legal Liability

· 10 min read
Tian Pan
Software Engineer

Most teams building RAG pipelines think about GDPR the wrong way. They focus on the inference call — does the model generate PII? — and miss the more serious exposure sitting quietly in their vector database. Every time a user submits a document, a support ticket, or a personal note that gets chunked, embedded, and indexed, that vector store becomes a personal data processor under GDPR. And when that user exercises their right to erasure, you have a problem that "delete by ID" does not solve.

The right to erasure isn't just about removing a row from a relational database. Embeddings derived from personal data carry recoverable information: research shows 40% of sensitive data in sentence-length embeddings can be reconstructed with straightforward code, rising to 70% for shorter texts. The derived representation is personal data, not a sanitized abstraction. GDPR Article 17 applies to it, and regulators are paying attention.

When Vector Search Fails: Why Knowledge Graphs Handle Queries Embeddings Can't

· 9 min read
Tian Pan
Software Engineer

Vector search has become the default retrieval primitive for RAG systems. Embed your documents, embed the query, find nearest neighbors — it's simple, fast, and works surprisingly well for a wide class of questions. But production deployments keep hitting the same wall: certain queries return garbage results despite high similarity scores, certain multi-document reasoning tasks fail silently, and certain entity-heavy queries degrade to random noise as complexity grows.

The issue isn't embedding quality or index size. It's that semantic similarity is the wrong abstraction for a significant class of retrieval problems. Knowledge graphs aren't a replacement for vector search — they solve a structurally different problem. Understanding which problems belong to which tool is what separates a brittle RAG pipeline from one that holds up in production.

Pipeline Attribution in Compound AI Systems: Finding the Weakest Link Before It Finds You

· 10 min read
Tian Pan
Software Engineer

Your retrieval precision went up. Your reranker scores improved. Your generator faithfulness metrics look better than last quarter. And yet your users are complaining that the system is getting worse.

This is one of the more disorienting failure modes in production AI engineering, and it happens more often than teams expect. When you build a compound AI system — one where retrieval feeds a reranker, which feeds a generator, which feeds a validator — you inherit a fundamental attribution problem. End-to-end quality is the only metric that actually matters, but it's the hardest one to act on. You can't fix "the system is worse." You need to fix a specific component. And in a four-stage pipeline, that turns out to be genuinely hard.

RAG Knowledge Base Freshness: The Staleness Problem Teams Solve Last

· 11 min read
Tian Pan
Software Engineer

Most RAG teams spend months tuning chunk sizes, experimenting with embedding models, and debating hybrid search configurations. Then they ship to production, declare success, and move on. Six months later, users start complaining that the system gives wrong answers — and the team discovers that the index they so carefully built has quietly rotted.

Index freshness is the problem that gets solved last, usually after a customer incident rather than before. Unlike retrieval quality failures that show up immediately in evals, staleness degrades silently: latency stays flat, retrieval appears functional, and standard RAG metrics like context recall and faithfulness score well — right up until the moment your system confidently returns a policy that was updated months ago.

RAG Position Bias: Why Chunk Order Changes Your Answers

· 8 min read
Tian Pan
Software Engineer

You've spent weeks tuning your embedding model. Your retrieval precision looks solid. Chunk size, overlap, metadata filters — all dialed in. And yet users keep reporting that the system "ignores" information it clearly has access to. The relevant passage is in the top-5 retrieved results every time. The model just doesn't seem to use it.

The culprit is often position bias: a systematic tendency for language models to over-rely on information at the beginning and end of their context window, while dramatically under-attending to content in the middle. In controlled experiments, moving a relevant passage from position 1 to position 10 in a 20-document context produces accuracy drops of 30–40 percentage points. Your retriever found the right content. The ordering killed it.