Skip to main content

143 posts tagged with "evals"

View all tags

The Citation Index Your Chunker Shifted by One When It Started Prefixing Line Numbers

· 11 min read
Tian Pan
Software Engineer

The chunker started prepending [line N] to every chunk. The eval went green. Every citation the model produced after that day pointed to the paragraph one position before the actual evidence, on every document, in the regulated industry the product serves. The team did not find out from the eval. The team found out from an auditor who looked at the cited sentence, read it, and pointed out that it contradicted the claim it was supposed to support.

This is the kind of regression that survives a code review, a manual QA pass on three sample documents, and a feature-flag rollout. None of those checks were wrong in isolation. They were all asking the same question — does a citation appear where one is expected — and none of them were asking the question the auditor asked, which is whether the citation points at the sentence the claim came from. The gap between those two questions is where the off-by-one lived for as long as it lived.

What makes this failure mode worth a separate write-up is not the bug itself. Off-by-one errors are old news. The interesting part is that the failure was produced by two systems that continued to agree on the structure of an integer while silently disagreeing about what the integer meant.

The Eval Set Your Prompt Engineers Turned Into Production Few-Shots

· 11 min read
Tian Pan
Software Engineer

The eval dashboard had been climbing for three sprints. Quality up six points on the hard slice, up nine on the regression slice, up twelve on the slice the support team had hand-curated from last quarter's worst tickets. The team shipped a model promotion off the back of it. Two days later, a customer asked a question that looked nothing like anything in the eval set, and the answer was worse than what they had been getting six months ago.

The forensic was quick once someone thought to run it. The prompt engineers had been working out of the same repo as the eval team. They had found the curated examples — the painstaking ones, the ones where someone had argued for an hour about the correct phrasing of the ideal answer — and over a few sprints they had copy-pasted the strongest of them as few-shot demonstrations into the production system prompt. The dashboard kept going up because the model was being graded on inputs it had seen verbatim at inference time. Nobody flagged it. Nobody owned the boundary between "the examples we measure quality against" and "the examples we ship in the prompt." Both teams were doing exactly the job they had been hired to do.

The Fallback Model Whose System Prompt Was Tuned for Someone Else

· 10 min read
Tian Pan
Software Engineer

Your reliability dashboard says 99.95%. Your support inbox says something else. Twice a week, for ten or twenty minutes at a time, a thin sliver of users gets a version of your product that talks like a different company. The refusals read funny. A structured field that always rendered as a tidy two-column card now shows up as a paragraph with bullet points smashed inside it. Tone shifts from "calm expert" to "eager assistant." Nobody opens a ticket — they just close the tab and try again later.

Your provider went down. The failover worked. Latency stayed under SLO. The error budget did not move. And the experience your users got during that window was not the one you ship.

The mental model most teams carry into multi-provider architecture is that the system prompt is portable — a contract negotiated with the abstract idea of "a capable model," readable by anyone who speaks the LLM dialect. That model is wrong. A system prompt is a tuned artifact. It is tuned against a specific model's preferences, refusal grammar, formatting habits, and instruction-following biases. When the failover engages, you are not handing the same contract to a comparable counterparty. You are handing a contract written in your primary's idiom to a model that reads a different idiom and signs it anyway.

The Few-Shot Example Your Model Treated as Binding Precedent

· 10 min read
Tian Pan
Software Engineer

A user submits a question. Your model produces an answer that is confidently wrong in a very specific way: the format is perfect, the reasoning is well-structured, and a particular qualifier — one that does not apply to this question at all — appears in exactly the place a similar qualifier appeared in example three of your system prompt. Not a hallucination. Not a prompt injection. The model did precisely what the examples taught it to do, on a question those examples were never meant to cover.

This is the failure mode that few-shot prompting actively encourages and that most eval suites are structurally blind to. Your examples are not neutral demonstrations of "what good looks like." They are case law. The model selects the closest match by surface tokens and applies the precedent — including its constraints — to whatever case is in front of it.

The Model Identifier Your Provider Re-Pointed to a Finetune for One Tenant and to Base for Everyone Else

· 11 min read
Tian Pan
Software Engineer

A customer support team escalates: "Your assistant used to handle refund-eligibility questions correctly. Last week it started getting them wrong." The on-call engineer pulls a transcript, replays the exact prompt against the same model identifier in a dev account, gets the correct answer, and closes the ticket as "cannot reproduce." Two weeks later the same complaint shows up from a different customer. The engineer replays again, in the same dev account, and gets the correct answer again. The team starts blaming a prompt change nobody made.

The model identifier in the request never changed. The string in the response field matched the string in the request field. The eval suite stayed green for six weeks. The model serving production traffic was a different set of weights from the model serving the eval suite, and had been for the entire life of the account — except for the last six weeks, when it became the same set of weights and the team noticed only because a customer noticed first.

The Prompt Engineer Who Quietly Became Your Only Eval Set Reader

· 8 min read
Tian Pan
Software Engineer

The eval set is a file. It is also, secretly, a theory of what the AI feature is for. The two are not the same thing, and the team that confuses them has built a quality gate whose calibration depends on a single human's working memory. When that human leaves, the file stays and the theory walks out the door.

This is the failure mode you don't see in the org chart. You scoped a prompt engineering role. You hired someone good. They shipped the v1 prompts, looked at the thin benchmark, and rewrote it into something rich — a taxonomy of failure modes, weights per category, a labeling rubric that disambiguates edge cases. The eval set became the contract for "is this model good enough to ship." Six quarters later you discover that the contract is unreadable by anyone except the person who wrote it.

The Prompt Version Your Team Treated as Independent of the Model Version

· 9 min read
Tian Pan
Software Engineer

Your incident timeline says "no deploys in the last 72 hours." Your prompt registry agrees: prompt v37 has been frozen for three weeks. Your eval harness ran clean on Tuesday night. But on Wednesday morning, the structured-output failure rate on one of your agents tripled, the retry budget on another doubled, and a third started cheerfully ignoring an instruction it had been honoring for a month. Nothing changed. Except something did change, and it changed in the place neither side of the org was watching: the model.

The prompt registry knows about prompt versions. The model gateway knows about model versions. Almost nobody, in practice, tracks the pair. And prompt v37 isn't a free-standing artifact — it is, whether your tooling admits it or not, a contract negotiated against one specific model. When the platform team rolls the claude-sonnet-latest alias forward by a single point release, the contract on the other side has been quietly amended, and your incident timeline reads "no deploys" because the deploy happened on someone else's infrastructure under a name that didn't move.

The Silent Personalization Layer Your Customers Could Not Reproduce

· 11 min read
Tian Pan
Software Engineer

A platform team ships a quality improvement. An inference-time layer reads the user's recent interactions and silently nudges the response style: more formal here, more terse there, more technical when the history suggests an engineer is asking. The A/B test shows an aggregate satisfaction lift of a couple of points. The launch post goes out under the heading "smarter responses, no API changes required." Nobody flips a flag in the API. Nobody updates the docs. Nothing in the response payload indicates which persona the model just adopted.

Six weeks later an enterprise customer files a support ticket that says, "your model is worse than you advertised." Their internal eval suite — running the same prompts your team published benchmarks against — scores eight points lower. Your team's first move is to verify prompt parity. Prompts match exactly. Decoding parameters match. The model version string matches. The divergence traces to the personalization layer, which infers a "thin-history default persona" for the customer's freshly-provisioned test account and a richer one for the long-lived user accounts your benchmarks were measured against. The conversation about whether the personalization is a feature or a bug stops being a product decision and becomes a contract negotiation.

The Summarizer That Paraphrased Away the User's Literal Question

· 8 min read
Tian Pan
Software Engineer

A user asks: "Does this qualify as a 'transfer' under article 28?" Forty turns later, the model gives an answer to a different question. The transcript shows the model answered the question it was given. The user is reading a complaint that reads like a hallucination. Both are right. The model never saw the user's question — it saw your summarizer's polite translation of it: "user asked about article 28 applicability."

The word "transfer" was the question. The summarizer threw it away because the summarizer's loss function was tuned to preserve facts, not wording, and the rubric never learned the difference between paraphrasing the topic and paraphrasing the constraint. Topic was preserved. Constraint became fog.

This failure mode is structural, not anecdotal. Any application that compresses long conversations with a model-generated summary has a second model in the critical path — one whose quality contract is usually treated as a token-budget knob rather than as a piece of product logic. That asymmetry is where the bug lives.

The Annotation Queue Your Humans Quietly Stopped Reading

· 10 min read
Tian Pan
Software Engineer

Your eval pipeline emits 800 traces per week for human review. Your annotators have about ninety minutes a week budgeted for it. They open the queue, grade the first three, mark a few more as "skip," and close the tab. The leaderboard you stare at on Monday morning is now a survey of which traces happened to land near the top of the list, not a measurement of system quality.

This is not a labeling problem. It is a throughput problem dressed up as a quality problem, and it is one of the quietest ways an evaluation program degrades. The traces still flow. The dashboards still render. The number still moves. What you do not see is that the denominator of your "human-graded eval score" silently shrank to a handful of items chosen by an ordering function nobody designed on purpose.

The Conversation Memory Pruning Heuristic That Erased the Context the Next Question Needed

· 9 min read
Tian Pan
Software Engineer

A user opens your long-session agent and says, in turn 3, "I'm vegetarian and on a tight budget." The conversation continues. Eleven turns later, the pruner runs. It counts tokens, finds turn 3 old and short, and drops it to keep the window inside budget. Turn 14 asks, "what should I cook tonight?" The model, looking at a window where the constraint no longer exists, recommends a $40 ribeye. The user reads this as the agent getting worse, opens the satisfaction survey, and rates the session a 2.

Nothing in your stack will report a memory failure. The token-budget dashboard will show the window staying healthily under the cap. The latency dashboard will be green. The eval suite — which scores single-turn answers against a held-out set — will report no regression. The only signal that the agent's competence dropped is a thumbs-down rating that your product team will attribute to "model variance." It will not be model variance. It will be a pruning heuristic doing exactly what it was tuned to do, on the wrong objective.

The Distillation That Lost a Capability Your Eval Suite Never Measured

· 9 min read
Tian Pan
Software Engineer

A team shrinks a 200B teacher into a 7B student because the eval suite — fifty thousand examples covering everything the product launched with — shows the student trailing the teacher by less than two points and inference cost dropping by an order of magnitude. The migration ships. The cost graph drops. The customer-satisfaction graph holds. Three weeks later, support starts seeing a class of failures the team cannot reproduce in eval.

The student no longer recognizes a corner-case input format the teacher had silently handled. It no longer recovers from a particular ambiguous instruction the teacher had reliably disambiguated. It no longer produces the rare-but-load-bearing "ask a clarifying question instead of guessing" behavior — because the eval set was scrubbed of ambiguous prompts on the grounds that they were "bad data."

The eval said the distillation was faithful. The eval was wrong about what faithfulness means.