91 posts tagged with "rag"

The Embedding API Hidden Tax: Why Vector Spend Quietly Eclipses Generation

· 12 min read
Tian Pan
Software Engineer

A team I talked to last quarter had a moment of quiet panic when their finance partner flagged the AI bill. They had assumed, like most teams do, that the expensive line item would be generation — the GPT-class calls behind chat, summarization, and agent reasoning. It wasn't. Their monthly embedding spend had silently crossed generation in January, doubled it by March, and was on track to triple it by mid-year. Nobody had modeled it because per-token pricing on embedding models looks like rounding error: two cents per million tokens for small, thirteen cents for large. At that rate, who budgets for it?

The answer is: anyone whose product survives past prototype and starts indexing things at scale. Semantic search over a growing corpus, duplicate detection, classification, clustering, reindexing when you swap models — every one of these workloads burns embedding tokens by the billion, not by the million. And unlike generation, which is gated by user requests, embedding throughput is only gated by what you decide to index. That decision rarely gets a cost review.

This post is about the specific mechanics of how embedding spend escalates, the architectural levers that bend the curve, and the breakeven math for moving off a hosted API onto something you run yourself.

Embedding Model Rotation Is a Database Migration, Not a Deploy

· 11 min read
Tian Pan
Software Engineer

Somewhere in a staging channel, an engineer writes "bumping the embedder to v3, new model scored +4 on MTEB, merging after the smoke test." Two days later support tickets start trickling in about search results that feel "weirdly off." A week later retrieval precision is down fourteen points, cosine scores have collapsed from 0.85 into the 0.65 range, and nobody can explain why — because the deploy looked identical to the last five model bumps. It wasn't a deploy. It was a database migration wearing a deploy's costume.

Embedding model rotation is the most misfiled change type in AI infrastructure. It lands in your system through the same channels as a prompt tweak or a generation-model pin update — a config file, a PR, a CI check — so it gets the governance of a config change. But under the hood, a new embedder does not produce a better version of your old vectors. It produces vectors that live in a different coordinate system entirely, where cosine similarity across the two manifolds is a category error. The correct mental model is not "rev the dependency." It is "swap the primary key encoding on a fifty-million-row table while serving reads."
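One way to make the "different coordinate system" point enforceable is to store the embedder's identity next to every vector and refuse cross-model comparisons outright. The following is a minimal sketch under that assumption. `StoredVector`, `model_id`, and the labels used with it are hypothetical names for illustration, not part of any particular vector database.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredVector:
    model_id: str              # hypothetical label for the embedder that produced this vector
    values: tuple[float, ...]


def cosine(a: StoredVector, b: StoredVector) -> float:
    """Cosine similarity that refuses to compare vectors from different embedders."""
    if a.model_id != b.model_id:
        raise ValueError(
            f"cannot compare {a.model_id!r} with {b.model_id!r}: different embedding spaces"
        )
    if len(a.values) != len(b.values):
        raise ValueError("dimension mismatch")
    dot = sum(x * y for x, y in zip(a.values, b.values))
    norm_a = math.sqrt(sum(x * x for x in a.values))
    norm_b = math.sqrt(sum(y * y for y in b.values))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

With a guard like this, a query embedded by the new model against an index built by the old one fails loudly instead of returning plausible-looking but meaningless scores.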

The Model Bill Is 30% of Your Inference Cost

· 8 min read
Tian Pan
Software Engineer

A finance lead at a mid-sized AI company told me last quarter they had "optimized their LLM spend" by switching their agent backbone from Sonnet to Haiku. The token bill dropped 22%. The total inference cost per resolved ticket went down 4%. When we pulled the full decomposition, the model line item was roughly a third of the per-request cost. Retrieval, reranking, observability, retry amplification, and the human-in-the-loop review queue ate the rest — and none of those got cheaper when they swapped models.

This is the most common accounting error I see in AI teams right now. Token cost is the line item on the invoice you pay every month, so it becomes the number everyone optimizes. But for any non-trivial production system (RAG, agents, anything with tool use or evaluation gates), model inference is often only 30 to 50% of the real unit economics. The rest sits in places your engineering dashboard doesn't surface and your finance team doesn't categorize as "AI spend."

No Results Is Not Absence: Why Agents Treat Retrieval Failure as Proof

· 10 min read
Tian Pan
Software Engineer

The most dangerous sentence in an agent transcript is not a hallucination. It is five calm words: "I could not find it." The agent sounds epistemically humble. It sounds like due diligence. It sounds, to any downstream reader or caller, exactly like a fact. And yet the statement carries no information about whether the thing exists. It only carries information about what happened when a specific tool, invoked with a specific query, consulted a specific index that the agent happened to have access to at that moment.

Between those two readings lies a production incident waiting to happen. A support agent tells a customer "we have no record of your order" because a replication lag delayed the write to the read replica by ninety seconds. A coding agent declares "there are no tests for this module" because it searched a directory that did not contain the test folder. A compliance agent replies "no prior violations on file" because the audit index had not ingested last week's report. In each case the agent's output is grammatically a negation, but epistemically it is a shrug that has been re-typed as a claim.
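The distinction between "absent" and "not found by this search" can be carried in the tool's return type rather than left to the agent's phrasing. The sketch below assumes the tool can report how far its index lags the source of truth and whether the search covered every place the item could live. `SearchReport`, `Outcome`, and `classify` are hypothetical names, and the lag threshold is an illustrative parameter.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    FOUND = "found"
    CONFIRMED_ABSENT = "confirmed_absent"
    UNKNOWN = "unknown"  # the search came back empty, but that proves nothing


@dataclass
class SearchReport:
    hits: int
    index_lag_seconds: float  # how far the searched index trails the source of truth
    scope_complete: bool      # did the search cover everywhere the item could live?


def classify(report: SearchReport, max_lag_seconds: float = 0.0) -> Outcome:
    """Only call a result 'absent' when the search was both fresh and complete."""
    if report.hits > 0:
        return Outcome.FOUND
    if report.scope_complete and report.index_lag_seconds <= max_lag_seconds:
        return Outcome.CONFIRMED_ABSENT
    return Outcome.UNKNOWN
```

Under this scheme the replica that is ninety seconds behind yields `UNKNOWN`, which the agent can surface as "I could not confirm" rather than "we have no record."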

Popularity Bias in Vector Retrieval: Why the Same Five Chunks Dominate Every Query

· 10 min read
Tian Pan
Software Engineer

Pull a week of retrieval logs from any mature RAG system and sort chunks by how often they were returned. The shape is almost always the same: a small cluster of chunks appears in thousands of queries while the vast majority of your corpus shows up a handful of times or never at all. The system isn't broken. It's doing exactly what its index was built to do — and that is the problem.

This is popularity bias in vector retrieval, and it gets worse as your corpus grows. A few chunks become gravity wells that win retrieval across queries that have little to do with each other, while your long tail quietly disappears below the top-k cutoff. Your RAG system starts feeling "generic" — users ask specific questions and get answers that sound like they were written for someone else. By the time product complains, the distribution has already been lopsided for weeks.
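The log analysis described above can be sketched in a few lines. This assumes the retrieval log is available as a list of queries, each holding the chunk IDs that were returned. The function names and the log shape are illustrative assumptions, not a specific system's format.

```python
from collections import Counter
from typing import Iterable, Sequence


def top_k_share(retrieval_log: Iterable[Sequence[str]], k: int = 5) -> float:
    """Fraction of all returned chunk slots taken by the k most-returned chunks."""
    counts = Counter(chunk_id for returned in retrieval_log for chunk_id in returned)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for _, n in counts.most_common(k)) / total


def never_returned(corpus_ids: Iterable[str], retrieval_log: Iterable[Sequence[str]]) -> set[str]:
    """Chunk IDs in the corpus that did not appear in any logged result."""
    seen = {chunk_id for returned in retrieval_log for chunk_id in returned}
    return set(corpus_ids) - seen
```

Tracking both numbers over time gives an early signal: a rising top-k share together with a growing never-returned set suggests the long tail is slipping below the cutoff.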

Your RAG Chunker Is a Database Schema Nobody Code-Reviewed

· 11 min read
Tian Pan
Software Engineer

The first time a retrieval quality regression lands in your on-call channel, the debugging path almost always leads somewhere surprising. Not the embedding model. Not the reranker. Not the prompt. The culprit is a one-line change to the chunker — a tokenizer swap, a boundary rule tweak, a stride adjustment — that someone merged into a preprocessing notebook three sprints ago. The fix touched zero lines of production code. It rebuilt the index overnight. And now accuracy is down four points across every tenant.

The chunker is a database schema. Every field you extract, every boundary you draw, every stride you pick defines the shape of the rows that land in your vector index. Change any of them and you have altered the schema of an index that other parts of your system — retrieval logic, reranker features, evaluation harnesses, downstream prompts — depend on as if it were stable. But because the chunker usually lives in a notebook or a small Python module that nobody labels as "infrastructure," these changes ship with the rigor of a config tweak and the blast radius of an ALTER TABLE.
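If the chunker is a schema, one lightweight way to treat it like one is to fingerprint its configuration and store that fingerprint with the index, so a silent change shows up as a mismatch. This is a sketch under that assumption. The config keys shown in the usage below are hypothetical examples, not a real library's settings.

```python
import hashlib
import json


def chunker_fingerprint(config: dict) -> str:
    """Stable hash of a chunker configuration, independent of key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def index_matches_chunker(index_fingerprint: str, current_config: dict) -> bool:
    """True when the index was built with exactly the current chunker settings."""
    return index_fingerprint == chunker_fingerprint(current_config)
```

For example, an index built with `{"tokenizer": "example-tokenizer", "size": 512, "stride": 384}` stops matching as soon as anyone changes `stride`, which turns a notebook edit into something a CI check or a deploy gate can catch.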

Why Your RAG Citations Are Lying: Post-Hoc Rationalization in Source Attribution

· 10 min read
Tian Pan
Software Engineer

Show a user an AI answer with a link at the end of each sentence, and the needle on their trust meter swings halfway across the dial before they have read a single cited passage. That is the whole marketing pitch of enterprise RAG: "grounded," "sourced," "verifiable." It is also the most-shipped, least-tested claim in AI engineering. Recent benchmarks find that between 50% and 90% of LLM responses are not fully supported — and sometimes contradicted — by the sources they cite. On adversarial evaluation sets, up to 57% of citations from state-of-the-art models are unfaithful: the model never actually used the document it is pointing at. The citation was attached after the fact, to rationalize an answer the model had already decided to give.

This is not a retrieval bug. You can have perfect retrieval and still get lying citations, because the failure is architectural. The generator writes prose first and stitches links on second. The links look like evidence. They are decoration.

The Attribution Gap: How to Trace a User Complaint Back to a Specific Model Decision

· 12 min read
Tian Pan
Software Engineer

A support ticket arrives: "Your AI gave me completely wrong advice about my insurance policy." You check the logs. You find a timestamp and a user ID. The actual model response is there, printed verbatim. But you have no idea which prompt version produced it, which context chunks were retrieved, whether a tool was called mid-chain, or which of the three model versions you've deployed in the past month actually handled that request. You can read the output. You cannot explain it.

This is the attribution gap — and it's the operational problem most AI teams hit six to eighteen months after they first ship a model-backed feature. The failure isn't in the model or the prompt; it's in the observability infrastructure. Traditional logging captures request-response pairs. LLM pipelines are not request-response pairs. They're decision trees: context retrieval, prompt assembly, optional tool calls, model inference, post-processing, conditional branching. When something goes wrong, you need the full tree, not just the leaf.
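As a rough illustration of logging the tree rather than the leaf, the record below captures the fields named in the scenario above: prompt version, model version, retrieved chunks, and tool calls, all keyed by request. `DecisionTrace` and `TraceStore` are hypothetical names, and an in-memory dict stands in for whatever storage a real system would use.

```python
from dataclasses import dataclass, field


@dataclass
class DecisionTrace:
    request_id: str
    prompt_version: str
    model_version: str
    retrieved_chunk_ids: list[str] = field(default_factory=list)
    tool_calls: list[str] = field(default_factory=list)
    output: str = ""


class TraceStore:
    """In-memory stand-in for a trace backend, keyed by request ID."""

    def __init__(self) -> None:
        self._by_request: dict[str, DecisionTrace] = {}

    def record(self, trace: DecisionTrace) -> None:
        self._by_request[trace.request_id] = trace

    def explain(self, request_id: str) -> DecisionTrace:
        # Raises KeyError if the request was never traced.
        return self._by_request[request_id]
```

Given a request ID from the support ticket, `explain` returns everything needed to answer "which prompt, which model, which chunks" in one lookup.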

Amortizing Context: Persistent Agent Memory vs. Long-Context Windows

· 9 min read
Tian Pan
Software Engineer

When 1-million-token context windows became commercially available, a lot of teams quietly decided they'd solved agent memory. Why build a retrieval system, manage a vector database, or design an eviction policy when you can just dump everything in and let the model sort it out? The answer comes back in your infrastructure bill. At 10,000 daily interactions with a 100k-token knowledge base, the brute-force in-context approach costs roughly $5,000/day. A retrieval-augmented memory system handling the same load costs around $333/day, a 15x gap that compounds as your user base grows.
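The figures above can be unpacked with simple arithmetic. The sketch below assumes cost is linear in input tokens at a single flat price, which ignores output tokens, caching discounts, and the retrieval system's own infrastructure. The per-token price and the per-interaction context size are values implied by the quoted totals under that assumption, not numbers stated in the post.

```python
interactions_per_day = 10_000
knowledge_base_tokens = 100_000
brute_force_cost_per_day = 5_000.0  # USD, as quoted
retrieval_cost_per_day = 333.0      # USD, as quoted

# Brute force sends the whole knowledge base with every interaction.
brute_force_tokens = interactions_per_day * knowledge_base_tokens

# Flat input price implied by the brute-force figure (an inference, not a quoted price).
implied_price_per_million = brute_force_cost_per_day / (brute_force_tokens / 1_000_000)

# Context size per interaction implied by the retrieval figure at that same price.
implied_retrieval_tokens = (
    retrieval_cost_per_day / implied_price_per_million * 1_000_000 / interactions_per_day
)

cost_ratio = brute_force_cost_per_day / retrieval_cost_per_day

print(f"implied price: ${implied_price_per_million:.2f} per million input tokens")
print(f"implied retrieval context: {implied_retrieval_tokens:,.0f} tokens per interaction")
print(f"cost ratio: {cost_ratio:.1f}x")
```

Read this way, the 15x gap is the ratio between sending the full 100k-token knowledge base and sending only a few thousand retrieved tokens on each call.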

The real problem isn't just cost. It's that longer contexts produce measurably worse answers. Research consistently shows that models lose track of information positioned in the middle of very long inputs, accuracy drops predictably when relevant evidence is buried among irrelevant chunks, and latency climbs in ways that make interactive agents feel broken. The "stuff everything in" approach doesn't just waste money — it trades accuracy for the illusion of simplicity.

Cache Invalidation for AI: Why Every Cache Layer Gets Harder When the Answer Can Change

· 10 min read
Tian Pan
Software Engineer

Phil Karlton's famous quip — "There are only two hard things in Computer Science: cache invalidation and naming things" — was coined before language models entered production. Add AI to the stack and cache invalidation doesn't just get harder; it gets harder at every layer simultaneously, for fundamentally different reasons at each one.

Traditional caches store deterministic outputs: the database row, the rendered HTML, the computed price. When the source changes, you invalidate the key, and the next request fetches fresh data. The contract is simple because the answer is a fact.

AI caches store something different: responses to queries where the "correct" answer depends on context, recency, model behavior, and the source documents the model was given. Stale here doesn't mean outdated — it means semantically wrong in ways your monitoring won't catch until a user notices.
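One partial mitigation is to fold everything the answer depends on into the cache key, so that a change to the model, the prompt, or a source document yields a miss instead of a stale hit. The sketch below assumes exact-match caching and that the source documents behind an answer are known at lookup time. `cache_key` and its parameters are hypothetical. It does not address recency or semantically similar queries.

```python
import hashlib


def cache_key(query: str, model_id: str, prompt_version: str, source_docs: dict[str, str]) -> str:
    """Key that changes when the query, model, prompt, or any source document changes."""
    h = hashlib.sha256()
    for part in (query, model_id, prompt_version):
        h.update(part.encode("utf-8"))
        h.update(b"\x00")  # separator so adjacent fields cannot run together
    for doc_id in sorted(source_docs):  # sorted so dict order does not matter
        h.update(doc_id.encode("utf-8"))
        h.update(b"\x00")
        h.update(hashlib.sha256(source_docs[doc_id].encode("utf-8")).digest())
    return h.hexdigest()
```

The trade-off is a lower hit rate: every document edit or model bump invalidates the affected entries, which is the intended behavior when the alternative is serving an answer grounded in text that no longer exists.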

Chunking Strategy Is the Hidden Load-Bearing Decision in Your RAG Pipeline

· 10 min read
Tian Pan
Software Engineer

Most RAG quality conversations focus on the wrong things. Teams debate embedding model selection, tweak retrieval top-K, and experiment with prompt templates — while a single architectural decision made during ingestion quietly caps how good the system can ever be. That decision is chunking strategy: how you cut documents into pieces before indexing them.

A 2025 benchmark study found that chunking configuration has at least as much influence on retrieval quality as embedding model choice. And yet teams routinely pick a default (usually 512 tokens with RecursiveCharacterTextSplitter) and then spend months wondering why their retrieval precision keeps disappointing them. The problem was baked in at index time. Swapping models cannot fix it.
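To show why this is an index-time decision, here is a minimal fixed-size chunker with overlap. It is a deliberately simple stand-in that operates on a pre-tokenized list, not a reimplementation of RecursiveCharacterTextSplitter, and `chunk_tokens` is a hypothetical helper.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def chunk_tokens(tokens: Sequence[T], size: int, overlap: int) -> list[list[T]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: list[list[T]] = []
    start = 0
    while start < len(tokens):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the input
        start += step
    return chunks
```

Changing `size` or `overlap` changes every chunk boundary, and therefore every vector in the index, which is why no downstream model swap can recover information that a bad boundary split apart.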

Data Lineage for AI Systems: Tracking the Path from Source to Response

· 10 min read
Tian Pan
Software Engineer

A user files a support ticket: "Your AI assistant told me the contract renewal deadline was March 15th. It was February 28th. We missed it." You pull up the logs. The response was generated. The model didn't error. Every metric is green. But you have no idea which document it retrieved, what the model read, or whether the date came from the context or was hallucinated entirely.

This is the data lineage gap. And it's not a monitoring problem — it's an architecture problem baked in from the start.