Skip to main content

168 posts tagged with "evaluation"

View all tags

The Retrieval Corpus Whose Jargon Your Embeddings Model Never Saw in Training

· 9 min read
Tian Pan
Software Engineer

A retrieval team ships an off-the-shelf embedding model against their product catalogue. The eval set — a few hundred queries scraped from the search logs of the last month — comes back at recall@10 of 0.91. They promote to production. Three weeks in, support starts forwarding tickets: a user searched for the actual SKU of a part and got back five plausible-looking but wrong parts. Another user searched for the internal codename of a feature and got the marketing name of an unrelated feature. The eval set never caught it because the eval set was drawn from queries the system already handled — queries about common terms. The long tail of jargon, where the business actually lives, was never sampled.

The model didn't fail. The model did exactly what it was trained to do, against a vocabulary distribution that did not include the corpus the team handed it. The team treated the embedding as a domain-neutral primitive — a function from text to vector — when it was actually a contract about which vocabulary it could resolve, signed with someone else's training corpus.

The Self-Correction Loop That Shared Its Verifier's Blind Spot

· 10 min read
Tian Pan
Software Engineer

The screenshot that gets passed around in agent post-mortems looks the same every time. A long trace. A single task. Twelve iterations. The agent generated a draft, evaluated it, found a minor flaw, generated a revision, evaluated it, found a slightly different minor flaw, generated another revision. The score the verifier returned hovered between 0.78 and 0.84 the entire time. It never crossed the threshold. The agent never escalated. The job timed out three hours later at a token bill that would have paid for a quarter of a senior engineer's day.

The team called this a "self-correction" problem because that is what the architecture diagram labeled it. The actual failure was structural. The verifier was the generator wearing a different prompt. The convergence criterion was the model's own opinion. The retry budget was implicit, capped by the agent timeout rather than by anything the agent itself reasoned about. None of those three failures look like bugs in isolation, which is why teams ship them.

The Synthetic Training Examples Whose Input Distribution Did Not Match What Your Users Actually Typed

· 9 min read
Tian Pan
Software Engineer

A team fine-tunes a customer-support model on 80,000 synthetic examples. The teacher prompt was tasteful: "Generate realistic customer questions about returns, refunds, and shipping." The teacher complied. It produced clean, full-sentence, well-spelled queries with one intent per message, polite framing, and a consistent register. The offline eval against the held-out synthetic split lands at 94%. The team ships.

The production slice underperforms by twenty points. The team spends a sprint debating whether the model is "bad at customer support." It isn't. The model is fine at customer support. It is bad at the language a stressed customer actually types at 11pm on a phone keyboard: "hi i returnd the thing last week but where's my refund also do u ship to canada now." The model never saw an input shaped like that during training, because the teacher was busy generating the queries the teacher imagined, not the queries the users send.

The Vector Index That Was Sharded by Ingestion Date

· 9 min read
Tian Pan
Software Engineer

There is a specific kind of recall lie that hides inside time-partitioned vector indexes, and the people who built the offline eval are usually the last to find it. The dashboard says recall@10 is 0.94. The retriever is shipping the right snippet 94% of the time. The product team is shipping more retrieval-grounded features on the back of that number. And then the support tickets arrive: "the assistant cited a guide that does not match the answer," "the assistant linked to last week's version of the policy," "the assistant could not find a document I uploaded two months ago." None of those tickets contradict the 0.94. They are evidence that the 0.94 is measuring the wrong thing.

The mechanism is simple and easy to miss. The vector index is sharded by ingestion date because that is the easiest way to keep write throughput high, retire old data, and keep the hot working set in fast memory. The offline test set is generated nightly from production logs, which means the queries are drawn from the same recent window that the freshest shard happens to hold. Recall is measured against ground truth that lives one or two shards deep. The retriever performs beautifully on those queries because, in production, those queries are the ones the routing layer keeps inside the same shard.

The Chunk Boundary That Bisected the Sentence Your Answer Depended On

· 9 min read
Tian Pan
Software Engineer

Your RAG pipeline chunks documents into 512-token spans with 50-token overlap. It is a clean industry default. Somewhere in your corpus there is a sentence — "Refunds are processed within five business days unless the order originated from the EU region, in which case the regulatory window is fourteen days" — that landed across a chunk boundary. Chunk N contains the first half. Chunk N+1 contains the second.

A user asks "how long do EU refunds take." Retrieval scores chunk N highest because the query embedding aligns with "EU region" in the first fragment. Chunk N+1, which contains the only actual answer, ranks too low to be retrieved alongside. The agent answers "five business days" with a confident citation to chunk N. The customer is in Frankfurt. The answer is wrong. The pipeline behaved exactly as designed.

This is the failure mode that does not show up in your chunk-quality eval. The chunks are well-formed. The corpus is well-formed. The embedding model is well-formed. The boundaries between chunks — the lines you drew through your own documents — are where the answer lives.

The Trace Replay Your New Model Cannot Trust

· 12 min read
Tian Pan
Software Engineer

The standard playbook for an LLM upgrade has the comforting shape of a unit test. Capture last week's production traces against the incumbent model. Replay them against the candidate. Diff the outputs. If the disagreement rate is below some threshold — say 3% — ship it. The diff is small, the dashboard is green, the migration looks safe. A week later, the on-call channel fills with reports that the new model is forgetting context across turns, calling tools with arguments that no longer parse, and confidently citing documents that have been deleted from the corpus.

The replay didn't lie, exactly. It measured a real thing. It just measured behavior in a context the production model never actually saw, and the green number is a confidence interval over a distribution that doesn't exist anywhere except in the replay harness.

Why Your Agent Works in Dev and Panics in Prod

· 10 min read
Tian Pan
Software Engineer

The agent demo always works. Three customers in the table, one matching record, twelve documents in the vector index, an empty calendar with infinite open slots. The agent picks the right row, retrieves the right document, books the right meeting. Ship it.

Then production hands the same agent ten million customers with three "John Smith"s in the same city, a filter that returns four thousand rows because the agent confidently wrote status != 'closed' when it meant status = 'active', a vector query that returns seven plausible documents the agent has never had to choose between, and a calendar where every slot is a negotiation. The capability that looked correct in dev is qualitatively different in prod — not slightly worse, not flakier, but solving a different problem the dev environment never made it solve.

This is the gap that "it worked locally" hides. For deterministic code, that phrase is already a lie about edge cases. For agents, it is a stronger lie, because the agent's behavior is a function of input distribution, and the input distribution shifts from "trivial" to "ambiguous" the moment you cross the prod boundary.

The Agent That Learned to Hedge Its Way to a Higher Eval Score

· 9 min read
Tian Pan
Software Engineer

The eval score climbed 12% over three months. Customer-satisfaction held flat, then drifted down half a point. The team kept shipping prompt variants. The dashboard kept rewarding them. Then somebody pulled the highest-scoring conversations from the last week and read them like a customer would, and the agent's voice had quietly mutated into something nobody on the team had asked for: every answer now opened with "I'm not entirely certain, but a reasonable interpretation would be," every recommendation hedged behind "there are several perspectives here," and questions with one correct answer were being delivered as multiple-choice essays.

The score was not lying. It was measuring exactly what the rubric told it to measure. The agent had learned, slowly and faithfully, that the surest way to win the judge was to sound calibrated — and calibration, as the rubric had operationalized it, looked indistinguishable from hedging on questions whose users needed an unambiguous answer.

The Redaction Layer Your Agent Cannot Reason Through

· 9 min read
Tian Pan
Software Engineer

A privacy review approves your redaction layer. Names, emails, account numbers, phone numbers — all scrubbed before the prompt reaches the model. Your single-turn classifier still hits 94% accuracy. Six weeks later your multi-step agent starts giving confidently wrong answers to questions like "is the email Sarah used to log in the same as the one on her billing record?" and nobody can reproduce it in dev.

The redaction layer did exactly what infosec asked it to do. It also quietly destroyed the property your agent's reasoning depended on: that two mentions of the same entity in different turns refer to the same thing. The agent isn't hallucinating. It's reading a transcript where Sarah has become three different people and the "same" email address has become two distinct placeholders.

The Judge That Agreed With Itself Across A and B

· 10 min read
Tian Pan
Software Engineer

You run an A/B test on two prompt variants. Your LLM judge — same vendor as your candidate model, because it was the easy default — consistently prefers variant A by a margin large enough to call statistically meaningful. You ship A. A week later your satisfaction metric is down, your refund queue is up, and nobody can explain it. Somebody finally re-runs the eval with a judge from a different model family. The preference flips.

The judge was not measuring quality. The judge was measuring how much the candidate sounded like the judge.

Your Eval Set Only Has Problems You Already Solved

· 9 min read
Tian Pan
Software Engineer

Your eval score went from 0.81 to 0.87 over the last quarter. The team shipped a router, swapped in a stronger model on the hard intents, tuned the system prompt, and added forty new test cases harvested from "tickets that took more than a day to close." The dashboard says you got better. NPS is flat. Active users are down two percent.

There is a clean story that explains both numbers, and you don't want to hear it. Your eval set only contains problems you already solved. The queries that failed so badly the user never filed a ticket, never came back, and never showed up in any log you grep — those are not in your suite. They are not in anyone's suite. A rising eval score is consistent with getting better at the things you can see, and it is also consistent with getting better at the things you can see while staying exactly as bad at the things you cannot.