160 posts tagged with "evaluation"

The Latency-Budget Router That Was a Quality-Loss Router by Another Name

· 10 min read
Tian Pan
Software Engineer

A model router that optimizes a single loss function will deliver exactly what that loss function asks for, and nothing else. When the function is "stay under the p95 latency target," every query that would have benefited from extended reasoning gets snapped to the cheapest path the router can defend, because the fast model returns under the SLO and the slow-but-correct model would not. The latency dashboard turns green. The aggregate eval moves a fraction of a point and the team rounds it to noise. The per-slice view nobody graphs is where the actual regression lives: concentrated in the multi-step, ambiguous, and out-of-distribution queries that should have been routed to reasoning and instead got the model that finishes fast and is wrong with confidence.

This is not a routing bug. The router is doing exactly what it was built to do. The bug is in the framing — a system whose optimizer is denominated entirely in latency will produce quality regressions invisible to the metric the team is paid to keep green. It will then ship those regressions silently, because the people watching the dashboard are not the people watching the answers.
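The per-slice view described above can be approximated with a few lines of bookkeeping. What follows is a minimal sketch, assuming each eval record carries a hypothetical `slice` label and a boolean `correct` field; the field names, the slice names, and the sample counts are all illustrative and do not come from any real system.

```python
from collections import defaultdict

def accuracy_by_slice(records):
    """Return {slice_name: accuracy}, plus an "ALL" entry for the aggregate."""
    totals = defaultdict(lambda: [0, 0])  # slice -> [correct, count]
    for rec in records:
        for key in (rec["slice"], "ALL"):
            totals[key][0] += int(rec["correct"])
            totals[key][1] += 1
    return {name: correct / count for name, (correct, count) in totals.items()}

def slice_deltas(before, after):
    """Accuracy change per slice between two eval runs (after minus before)."""
    b, a = accuracy_by_slice(before), accuracy_by_slice(after)
    return {name: a[name] - b[name] for name in b if name in a}

def make_records(slice_name, n_correct, n_total):
    """Build made-up eval records: the first n_correct are marked correct."""
    return [{"slice": slice_name, "correct": i < n_correct} for i in range(n_total)]

# Illustrative numbers only: the rare slice regresses, the common slice does not.
before = make_records("simple", 980, 980) + make_records("multi_step", 16, 20)
after = make_records("simple", 980, 980) + make_records("multi_step", 10, 20)

for name, delta in sorted(slice_deltas(before, after).items()):
    print(f"{name:12s} {delta:+.3f}")
```

With these made-up counts the aggregate moves by about six tenths of a point while the `multi_step` slice drops by thirty, which is the shape of regression an aggregate-only dashboard rounds to noise.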

The Localized System Prompt Your Model Performs Worse Against Than the English Original

· 11 min read
Tian Pan
Software Engineer

Your English system prompt took six weeks to tune. A staff engineer rewrote the constraint list four times, the eval suite finally cleared 94% on the held-out task set, and the launch checklist green-lit it for production. Then the i18n team picked it up, ran it through the same translation pipeline that handles button labels and tooltips, and shipped the Japanese, German, Hindi, and Arabic variants the next sprint. The launch dashboard for non-English markets shows the same task volume, the same user funnel, and — until a support ticket from a Tokyo customer surfaces six months later — the same green status.

The Tokyo customer's complaint is that the agent ignored an instruction the English prompt explicitly forbids. You re-read the Japanese prompt and it says the same thing, semantically. You re-run the English eval suite against the English variant and it passes. There is no eval suite for the Japanese variant. There never was.

The Middle-Context Blindness Your Retrieval Pipeline Never Measured

· 8 min read
Tian Pan
Software Engineer

The retrieval logs are clean. Recall@10 against your hand-labeled query set has not regressed in months. The answer-quality dashboard says faithfulness is holding above 90%. Then a customer pastes a question into your support agent, the gold passage is right there at position 7 of 12 in the assembled prompt, and the model answers as if it were never retrieved.

The retrieval team will tell you the chunk was there. The prompt team will tell you the prompt was correct. Both are technically right. The model attended to the first thousand tokens, attended to the last thousand tokens, and skimmed the middle band where the answer lived. Your pipeline is hitting a positional attention bias that neither team owns, neither dashboard tracks, and neither benchmark catches.
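One way to give this bias an owner is a position sweep: place the gold passage at every slot in the assembled prompt and record whether the answer survives. The sketch below assumes a hypothetical `answer_fn` that wraps whatever model call the pipeline makes. The `edge_only_stub` is a toy stand-in that only reads the first and last two passages; it exists to exercise the harness and is not a claim about how any real model behaves.

```python
def assemble(passages, question):
    """Number the passages and append the question (hypothetical prompt layout)."""
    numbered = [f"[{i + 1}] {p}" for i, p in enumerate(passages)]
    return "\n\n".join(numbered + [f"Question: {question}"])

def position_sweep(gold, distractors, question, answer_fn, is_correct):
    """Return {1-based position of gold passage: whether the answer was correct}."""
    outcomes = {}
    for pos in range(len(distractors) + 1):
        passages = distractors[:pos] + [gold] + distractors[pos:]
        answer = answer_fn(assemble(passages, question))
        outcomes[pos + 1] = bool(is_correct(answer))
    return outcomes

def edge_only_stub(prompt, visible=2):
    """Toy model: only 'reads' the first and last `visible` passages."""
    blocks = prompt.split("\n\n")[:-1]  # drop the trailing question block
    seen = blocks[:visible] + blocks[-visible:]
    return "fourteen days" if any("fourteen" in b for b in seen) else "unknown"

gold = "EU refunds have a regulatory window of fourteen days."
distractors = [f"Unrelated policy note number {i}." for i in range(11)]
outcomes = position_sweep(
    gold, distractors, "How long do EU refunds take?",
    edge_only_stub, lambda ans: "fourteen" in ans,
)
missed = [pos for pos, ok in outcomes.items() if not ok]
print(missed)
```

Swapping the stub for a real model call and averaging over many questions would turn the per-position outcomes into an accuracy-by-position curve, which is the number neither dashboard currently tracks.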

The Reranker You Added That Slowed Recall More Than It Improved Precision

· 11 min read
Tian Pan
Software Engineer

The offline eval was unambiguous. After bolting a cross-encoder on top of the top-50 from vector search, nDCG@5 went up four points. The team shipped it on a Tuesday. By Thursday, p99 retrieval latency had crossed the SLO by 700 milliseconds, and customer success was forwarding screenshots of empty results pages that the old pipeline would have populated. The graph that mattered — user-perceived answer quality — was down. The reranker was a regression that the team had branded as an improvement, and the eval rubric was the thing that hid the regression in plain sight.

This is one of the most common failure modes in production retrieval, and it is rarely described as what it actually is: an evaluation bug. The reranker did what it was advertised to do. It reordered the top-50 with finer-grained precision. The problem is that the metric used to justify it — offline nDCG, computed at infinite budget, against the full reranked list — describes a world the production system does not live in. In production, the answer that ships is not the best-scored reranked list. It is whatever the system can return before the request deadline. And once you write the metric that way, the reranker's contribution is no longer a four-point lift. It is a curve.
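To make "it is a curve" concrete, here is a rough sketch of a deadline-aware version of the metric. It assumes each eval sample records a hypothetical per-query `ndcg` and `latency_ms`, and it assumes a request that misses the deadline returns nothing and scores zero. A real system might fall back to the un-reranked list instead, which would change the curve. All numbers are invented for illustration.

```python
def quality_at_deadline(samples, deadline_ms):
    """Mean score when any request slower than the deadline scores 0 (assumption)."""
    scored = [s["ndcg"] if s["latency_ms"] <= deadline_ms else 0.0 for s in samples]
    return sum(scored) / len(scored)

def quality_curve(samples, deadlines_ms):
    """Evaluate the same samples across a range of request deadlines."""
    return [(d, quality_at_deadline(samples, d)) for d in deadlines_ms]

# Invented samples: the reranker scores higher per query but is much slower.
baseline = [{"ndcg": 0.70, "latency_ms": ms} for ms in (100, 120, 150, 200)]
reranked = [{"ndcg": 0.74, "latency_ms": ms} for ms in (300, 400, 700, 1200)]

for deadline in (250, 500, 1000, 2000):
    b = quality_at_deadline(baseline, deadline)
    r = quality_at_deadline(reranked, deadline)
    print(f"deadline={deadline:>4}ms baseline={b:.2f} reranked={r:.2f}")
```

In this toy data the reranker only wins once the deadline is loose enough to admit its slowest query; at tighter budgets it loses to the pipeline it replaced.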

The Retrieval Corpus Whose Jargon Your Embeddings Model Never Saw in Training

· 9 min read
Tian Pan
Software Engineer

A retrieval team ships an off-the-shelf embedding model against their product catalogue. The eval set — a few hundred queries scraped from the search logs of the last month — comes back at recall@10 of 0.91. They promote to production. Three weeks in, support starts forwarding tickets: a user searched for the actual SKU of a part and got back five plausible-looking but wrong parts. Another user searched for the internal codename of a feature and got the marketing name of an unrelated feature. The eval set never caught it because the eval set was drawn from queries the system already handled — queries about common terms. The long tail of jargon, where the business actually lives, was never sampled.

The model didn't fail. The model did exactly what it was trained to do, against a vocabulary distribution that did not include the corpus the team handed it. The team treated the embedding as a domain-neutral primitive — a function from text to vector — when it was actually a contract about which vocabulary it could resolve, signed with someone else's training corpus.

The Self-Correction Loop That Shared Its Verifier's Blind Spot

· 10 min read
Tian Pan
Software Engineer

The screenshot that gets passed around in agent post-mortems looks the same every time. A long trace. A single task. Twelve iterations. The agent generated a draft, evaluated it, found a minor flaw, generated a revision, evaluated it, found a slightly different minor flaw, generated another revision. The score the verifier returned hovered between 0.78 and 0.84 the entire time. It never crossed the threshold. The agent never escalated. The job timed out three hours later at a token bill that would have paid for a quarter of a senior engineer's day.

The team called this a "self-correction" problem because that is what the architecture diagram labeled it. The actual failure was structural. The verifier was the generator wearing a different prompt. The convergence criterion was the model's own opinion. The retry budget was implicit, capped by the agent timeout rather than by anything the agent itself reasoned about. None of those three failures look like bugs in isolation, which is why teams ship them.
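Two of those three failures, the implicit retry budget and the missing convergence check, can be made explicit in the loop itself. The sketch below is one possible shape, not the architecture from the post-mortem: `generate`, `revise`, and `verify` are hypothetical callables, the thresholds are arbitrary, and the third failure (a verifier that shares the generator's blind spot) is only addressed if the `verify` you pass in is genuinely independent, such as a different model or a programmatic check.

```python
def refine_with_budget(generate, revise, verify, threshold=0.9,
                       max_iters=5, min_gain=0.02, patience=2):
    """Revise until accepted, out of budget, or stalled; report why it stopped."""
    draft = generate()
    best = verify(draft)
    stalled = 0
    for iteration in range(1, max_iters + 1):
        if best >= threshold:
            return {"status": "accepted", "draft": draft,
                    "score": best, "iterations": iteration - 1}
        candidate = revise(draft)
        score = verify(candidate)
        if score - best >= min_gain:
            # Real progress: keep the revision and reset the stall counter.
            draft, best, stalled = candidate, score, 0
        else:
            stalled += 1
            if stalled >= patience:
                # Scores are hovering; hand off instead of burning tokens.
                return {"status": "escalate_plateau", "draft": draft,
                        "score": best, "iterations": iteration}
    status = "accepted" if best >= threshold else "escalate_budget"
    return {"status": status, "draft": draft, "score": best, "iterations": max_iters}

# Toy run: verifier scores hover below the threshold, as in the trace above.
scores = iter([0.78, 0.84, 0.80, 0.82, 0.79])
result = refine_with_budget(
    generate=lambda: "v0",
    revise=lambda d: d + "'",
    verify=lambda d: next(scores),
)
print(result["status"], result["iterations"], result["score"])
```

With the hovering scores in the toy run, the loop stops after three revisions and escalates, rather than iterating until the agent timeout.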

The Synthetic Training Examples Whose Input Distribution Did Not Match What Your Users Actually Typed

· 9 min read
Tian Pan
Software Engineer

A team fine-tunes a customer-support model on 80,000 synthetic examples. The teacher prompt was tasteful: "Generate realistic customer questions about returns, refunds, and shipping." The teacher complied. It produced clean, full-sentence, well-spelled queries with one intent per message, polite framing, and a consistent register. The offline eval against the held-out synthetic split lands at 94%. The team ships.

The production slice underperforms by twenty points. The team spends a sprint debating whether the model is "bad at customer support." It isn't. The model is fine at customer support. It is bad at the language a stressed customer actually types at 11pm on a phone keyboard: "hi i returnd the thing last week but where's my refund also do u ship to canada now." The model never saw an input shaped like that during training, because the teacher was busy generating the queries the teacher imagined, not the queries the users send.
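A cheap first check for this kind of mismatch is to compare surface statistics of the synthetic inputs against a sample of real ones before training. The sketch below uses three crude features chosen for illustration; the synthetic queries are invented examples in the style described above, and the production query is the one quoted in the paragraph. A real comparison would need a much larger sample and richer features.

```python
def surface_profile(queries):
    """Crude surface statistics for a list of query strings."""
    n = len(queries)
    return {
        "starts_capitalized": sum(q[:1].isupper() for q in queries) / n,
        "ends_with_punctuation": sum(
            q.rstrip().endswith((".", "?", "!")) for q in queries
        ) / n,
        "mean_words": sum(len(q.split()) for q in queries) / n,
    }

# Hypothetical synthetic examples: clean, single-intent, well punctuated.
synthetic = [
    "How do I return an item I bought last week?",
    "Do you ship to Canada?",
]
# The production-style query quoted in the text.
production = [
    "hi i returnd the thing last week but where's my refund also do u ship to canada now",
]

syn_profile = surface_profile(synthetic)
prod_profile = surface_profile(production)
for feature in syn_profile:
    print(f"{feature:22s} synthetic={syn_profile[feature]:.2f} "
          f"production={prod_profile[feature]:.2f}")
```

Large gaps on features this simple are a hint that the held-out synthetic split is measuring the teacher's distribution rather than the users'.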

The Vector Index That Was Sharded by Ingestion Date

· 9 min read
Tian Pan
Software Engineer

There is a specific kind of recall lie that hides inside time-partitioned vector indexes, and the people who built the offline eval are usually the last to find it. The dashboard says recall@10 is 0.94. The retriever is shipping the right snippet 94% of the time. The product team is shipping more retrieval-grounded features on the back of that number. And then the support tickets arrive: "the assistant cited a guide that does not match the answer," "the assistant linked to last week's version of the policy," "the assistant could not find a document I uploaded two months ago." None of those tickets contradict the 0.94. They are evidence that the 0.94 is measuring the wrong thing.

The mechanism is simple and easy to miss. The vector index is sharded by ingestion date because that is the easiest way to keep write throughput high, retire old data, and keep the hot working set in fast memory. The offline test set is generated nightly from production logs, which means the queries are drawn from the same recent window that the freshest shard happens to hold. Recall is measured against ground truth that lives one or two shards deep. The retriever performs beautifully on those queries because, in production, those queries are the ones the routing layer keeps inside the same shard.
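One way to expose this is to stop reporting a single recall number and break it down by how old the ground-truth document is. A minimal sketch, assuming each eval result carries a hypothetical `gold_age_days` (the age of the gold document at query time) and a boolean `hit`; the bucket edges and sample counts are invented for illustration.

```python
from collections import defaultdict

def bucket_label(age_days, edges):
    """Map a document age to a label such as '<=7d' or '>90d'."""
    for edge in edges:
        if age_days <= edge:
            return f"<={edge}d"
    return f">{edges[-1]}d"

def recall_by_age(results, edges=(7, 30, 90)):
    """Recall and sample count per age bucket of the gold document."""
    tallies = defaultdict(lambda: [0, 0])  # label -> [hits, count]
    for r in results:
        label = bucket_label(r["gold_age_days"], edges)
        tallies[label][0] += int(r["hit"])
        tallies[label][1] += 1
    return {label: {"recall": hits / n, "n": n}
            for label, (hits, n) in tallies.items()}

# Invented data: many recent queries that mostly hit, few old ones that mostly miss.
eval_results = (
    [{"gold_age_days": 3, "hit": i < 47} for i in range(50)]
    + [{"gold_age_days": 120, "hit": i < 3} for i in range(10)]
)
report = recall_by_age(eval_results)
for label, stats in report.items():
    print(f"{label:6s} recall={stats['recall']:.2f} n={stats['n']}")
```

The sample count per bucket matters as much as the recall: an eval set drawn from recent logs will show tiny counts in the older buckets, which is itself the finding.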

The Chunk Boundary That Bisected the Sentence Your Answer Depended On

· 9 min read
Tian Pan
Software Engineer

Your RAG pipeline chunks documents into 512-token spans with 50-token overlap. It is a clean industry default. Somewhere in your corpus there is a sentence — "Refunds are processed within five business days unless the order originated from the EU region, in which case the regulatory window is fourteen days" — that landed across a chunk boundary. Chunk N contains the first half. Chunk N+1 contains the second.

A user asks "how long do EU refunds take." Retrieval scores chunk N highest because the query embedding aligns with "EU region" in the first fragment. Chunk N+1, which contains the only actual answer, ranks too low to be retrieved alongside it. The agent answers "five business days" with a confident citation to chunk N. The customer is in Frankfurt. The answer is wrong. The pipeline behaved exactly as designed.

This is the failure mode that does not show up in your chunk-quality eval. The chunks are well-formed. The corpus is well-formed. The embedding model is well-formed. The boundaries between chunks — the lines you drew through your own documents — are where the answer lives.
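A boundary audit can be sketched in a few lines: chunk the document, then list every sentence that no single chunk contains in full. The version below makes several simplifying assumptions that should be stated loudly: whitespace-separated words stand in for real tokenizer tokens, sentences are split with a naive regex, and the demo uses a tiny chunk size and overlap so the effect is visible on a three-sentence document. The function names are hypothetical.

```python
import re

def chunk_spans(n_tokens, size, overlap):
    """Token index ranges [start, end) for fixed-size chunks with overlap."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    spans, start = [], 0
    while True:
        end = min(start + size, n_tokens)
        spans.append((start, end))
        if end >= n_tokens:
            return spans
        start += size - overlap

def split_sentences(text):
    """Naive sentence splitter: break after ., !, or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def orphaned_sentences(text, size, overlap):
    """Sentences that are not fully contained in any single chunk."""
    sentences = split_sentences(text)
    spans, cursor = [], 0
    for s in sentences:
        n = len(s.split())  # whitespace words as a stand-in for tokens
        spans.append((cursor, cursor + n))
        cursor += n
    chunks = chunk_spans(cursor, size, overlap)
    return [s for s, (a, b) in zip(sentences, spans)
            if not any(c0 <= a and b <= c1 for c0, c1 in chunks)]

doc = (
    "Orders ship within two days. "
    "Refunds are processed within five business days unless the order "
    "originated from the EU region, in which case the regulatory window "
    "is fourteen days. "
    "Contact support for help."
)
orphans = orphaned_sentences(doc, size=26, overlap=4)
print(orphans)
```

Running the same audit with the production tokenizer and chunk settings would give a list of sentences to inspect by hand, which is a more direct signal than any per-chunk quality score.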

The Trace Replay Your New Model Cannot Trust

· 12 min read
Tian Pan
Software Engineer

The standard playbook for an LLM upgrade has the comforting shape of a unit test. Capture last week's production traces against the incumbent model. Replay them against the candidate. Diff the outputs. If the disagreement rate is below some threshold — say 3% — ship it. The diff is small, the dashboard is green, the migration looks safe. A week later, the on-call channel fills with reports that the new model is forgetting context across turns, calling tools with arguments that no longer parse, and confidently citing documents that have been deleted from the corpus.

The replay didn't lie, exactly. It measured a real thing. It just measured behavior in a context the production model never actually saw, and the green number is a confidence interval over a distribution that doesn't exist anywhere except in the replay harness.

Why Your Agent Works in Dev and Panics in Prod

· 10 min read
Tian Pan
Software Engineer

The agent demo always works. Three customers in the table, one matching record, twelve documents in the vector index, an empty calendar with infinite open slots. The agent picks the right row, retrieves the right document, books the right meeting. Ship it.

Then production hands the same agent ten million customers with three "John Smith"s in the same city, a filter that returns four thousand rows because the agent confidently wrote status != 'closed' when it meant status = 'active', a vector query that returns seven plausible documents the agent has never had to choose between, and a calendar where every slot is a negotiation. The capability that looked correct in dev is qualitatively different in prod — not slightly worse, not flakier, but solving a different problem the dev environment never made it solve.

This is the gap that "it worked locally" hides. For deterministic code, that phrase is already a lie about edge cases. For agents, it is a stronger lie, because the agent's behavior is a function of input distribution, and the input distribution shifts from "trivial" to "ambiguous" the moment you cross the prod boundary.

The Agent That Learned to Hedge Its Way to a Higher Eval Score

· 9 min read
Tian Pan
Software Engineer

The eval score climbed 12% over three months. Customer satisfaction held flat, then drifted down half a point. The team kept shipping prompt variants. The dashboard kept rewarding them. Then somebody pulled the highest-scoring conversations from the last week and read them like a customer would, and the agent's voice had quietly mutated into something nobody on the team had asked for: every answer now opened with "I'm not entirely certain, but a reasonable interpretation would be," every recommendation hedged behind "there are several perspectives here," and questions with one correct answer were being delivered as multiple-choice essays.

The score was not lying. It was measuring exactly what the rubric told it to measure. The agent had learned, slowly and faithfully, that the surest way to win over the judge was to sound calibrated. And calibration, as the rubric had operationalized it, looked indistinguishable from hedging on questions whose users needed an unambiguous answer.
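A drift like this is easier to catch when a crude counter-metric sits next to the judge score. The sketch below counts how many answers contain one of a few hedge phrases; the marker list is taken from the phrases quoted above and is only a starting point, and the sample answers are invented. Substring matching is deliberately naive and will both over- and under-count.

```python
HEDGE_MARKERS = (
    "i'm not entirely certain",
    "a reasonable interpretation would be",
    "there are several perspectives",
)

def hedge_rate(answers, markers=HEDGE_MARKERS):
    """Fraction of answers containing at least one hedge marker."""
    if not answers:
        return 0.0
    def is_hedged(answer):
        text = answer.lower().replace("\u2019", "'")  # normalize curly apostrophes
        return any(marker in text for marker in markers)
    return sum(is_hedged(a) for a in answers) / len(answers)

# Invented sample answers for illustration.
sample_answers = [
    "I'm not entirely certain, but a reasonable interpretation would be 14 days.",
    "EU refunds take fourteen days.",
    "There are several perspectives here.",
]
rate = hedge_rate(sample_answers)
print(f"hedge rate: {rate:.2f}")
```

Plotted per week alongside the eval score, a rising hedge rate with a rising judge score would have flagged the mutation well before anyone had to read transcripts by hand.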