
92 posts tagged with "evaluation"


Vendor Benchmarks Are Your Ceiling, Not Your Forecast

· 10 min read
Tian Pan
Software Engineer

The model release announcement lands on Tuesday morning. The blog post leads with a chart: HumanEval up four points, SWE-bench Verified up six, MATH up three, the agent harness du jour up a number that would have been a research paper a year ago. By Tuesday afternoon there is a Slack thread inside your company with screenshots of the chart and a question shaped like a decision: "Should we cut over?" The thread treats the benchmark delta as a forecast — as if those numbers describe what the new model will do for your product, on your prompts, in your tool harness, against your eval rubric. They do not. The vendor's number is the upper bound on what you might see. Your realized lift is somewhere between zero and roughly half of that headline, and you cannot know which without running an eval the vendor did not run.

This is not a complaint about benchmark validity. The benchmarks are real. They are run against real eval suites. The vendor is not lying. The problem is that the vendor's harness is an idealized environment that strips away every variable a production deployment introduces, and a number generated under those conditions is structurally incapable of predicting behavior under yours. Treating it as a prediction is a category error — and it leads to procurement decisions, capacity-planning commitments, and rollout schedules that are calibrated against a fiction.
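A minimal sketch of the eval the vendor did not run, under assumptions about your own harness: score the incumbent and the candidate on your own gold set and report the realized delta, not the headline delta. `call_model` and the gold-set shape are placeholders, not any provider's API.

```python
def call_model(model: str, prompt: str) -> str:
    """Hypothetical provider call; swap in your real client."""
    raise NotImplementedError

def realized_lift(incumbent: str, candidate: str, gold_set: list[dict]) -> float:
    """gold_set items: {"prompt": str, "grader": callable(str) -> bool},
    built from your traces and your rubric, not the vendor's."""
    def score(model: str) -> float:
        return sum(case["grader"](call_model(model, case["prompt"]))
                   for case in gold_set) / len(gold_set)
    # The vendor's +6 is the ceiling; this number is the forecast.
    return score(candidate) - score(incumbent)
```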

The AI Engineer Interview Is Broken: Stop Testing Implementation, Start Probing Eval-Design

· 10 min read
Tian Pan
Software Engineer

A team I worked with last quarter rejected three candidates in a row from their AI engineer pipeline. All three failed the coding screen — the kind of problem where you implement a sliding-window deduplicator under a 35-minute timer. The team then hired the candidate who passed it. Four months later that engineer was the one who shipped the feature where the eval scored 92% in CI and the support queue lit up the day after launch. The eval was measuring exact-match against a curated test set. Production users phrased their queries differently. Nobody on the hiring panel had asked the candidate how they would have caught that gap.

That's the shape of the bug. The interview pipeline was screening for the skill that mattered least to the job and was blind to the skill that mattered most. The team did not have a "judgment" round. They had a coding round, a system-design round, and a behavioral round, and they were running the same loop they had run in 2021 — the one calibrated for engineers who were going to write deterministic code against stable libraries.

Calibrated Abstention: The Capability Every Layer of Your LLM Stack Punishes

· 11 min read
Tian Pan
Software Engineer

There is a capability your model could have that would, on the days it mattered, be worth more than any other behavioral upgrade you could ship: the ability to say "I don't have a reliable answer to this" and mean it. Not the keyword-matched safety refusal. Not the hedging tic the model picked up from RLHF on controversial topics. The real thing — a calibrated abstention that fires when, and only when, the model's internal evidence does not support a confident response.

You will never get it by accident. Every default in the LLM stack pushes the other way.
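One way to approximate the behavior, sketched under the assumption that you can draw several samples per query: use self-consistency as a confidence proxy and calibrate the abstention threshold on a held-out set instead of hard-coding it. This is not the "real thing" described above, just a starting point that is measurable.

```python
from collections import Counter

ABSTAIN = "I don't have a reliable answer to this."

def answer_or_abstain(samples: list[str], threshold: float) -> str:
    """samples: k independent, normalized completions for the same query.
    Abstain when the modal answer's agreement rate falls below the threshold."""
    top_answer, count = Counter(samples).most_common(1)[0]
    return top_answer if count / len(samples) >= threshold else ABSTAIN

def calibrate_threshold(dev_cases: list[dict], target_precision: float = 0.95) -> float:
    """dev_cases: {"samples": [...], "correct": str}. Pick the lowest threshold
    whose answered subset still meets the target precision on held-out data."""
    for threshold in [x / 20 for x in range(10, 21)]:  # 0.50 .. 1.00
        answered = [c for c in dev_cases
                    if answer_or_abstain(c["samples"], threshold) != ABSTAIN]
        if not answered:
            continue
        precision = sum(answer_or_abstain(c["samples"], threshold) == c["correct"]
                        for c in answered) / len(answered)
        if precision >= target_precision:
            return threshold
    return 1.0  # abstain unless the samples are unanimous
```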

The Eval-Rig Latency Lie: Why Your p95 Doubles in Production

· 10 min read
Tian Pan
Software Engineer

The eval team puts a number on the deck: "p95 latency is 1.2s." The launch ships. A week later, oncall posts a graph: production p95 is 4.8s and climbing through the dinner-time peak. Engineers spend the next five days arguing about whether something regressed, instrumenting model versions, opening tickets with the provider — and eventually discover that nothing changed except where the number was measured. The eval rig was reporting the latency of a quiet machine running serial calls against a warm cache. Production is a different system. The p95 was never wrong; it was answering a different question.

This is the eval-rig latency lie. It is not about bad benchmarks — most teams use reasonable tools and report the numbers honestly. It is about the gap between "the latency of the model" and "the latency a user experiences," and the fact that the rig you build for development almost always measures the first while implying the second. Once you internalize this, latency SLOs derived from a benchmark stop looking like product commitments and start looking like claims about a private testing environment that nobody else can reproduce.
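A sketch of the measurement gap rather than of any particular rig: the same p95 computation, run once the way an eval rig runs it (serial calls, quiet machine) and once with production-shaped concurrency. `call_model` is a stand-in; against a simulated backend both numbers look alike, and the gap only appears against a real service that queues under load.

```python
import asyncio, random, time

async def call_model(prompt: str) -> str:
    """Stand-in for your real async client call."""
    await asyncio.sleep(random.uniform(0.4, 2.5))  # placeholder latency
    return "ok"

def p95(samples: list[float]) -> float:
    ordered = sorted(samples)
    return ordered[int(0.95 * (len(ordered) - 1))]

async def eval_rig_number(prompts: list[str]) -> float:
    """Serial calls, warm path: the number that goes on the deck."""
    latencies = []
    for p in prompts:
        t0 = time.perf_counter()
        await call_model(p)
        latencies.append(time.perf_counter() - t0)
    return p95(latencies)

async def production_shaped_number(prompts: list[str], concurrency: int = 50) -> float:
    """Same prompts, issued with production-like concurrency against the real endpoint."""
    sem = asyncio.Semaphore(concurrency)
    async def timed(p: str) -> float:
        async with sem:
            t0 = time.perf_counter()
            await call_model(p)
            return time.perf_counter() - t0
    return p95(list(await asyncio.gather(*(timed(p) for p in prompts))))
```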

Your Eval Rubric Is the Real Product Spec — and No PM Signed Off on It

· 11 min read
Tian Pan
Software Engineer

A product manager writes a paragraph: "The assistant should be helpful, accurate, and concise, and should never make the customer feel rushed." An engineer reads that paragraph, opens a YAML file, and writes 47 weighted criteria so the LLM-as-judge can produce a number on every trace. Six months later, that YAML file is the actual specification of the product. Every release is gated on it. Every regression alert fires on it. Every "this is shipping quality" decision routes through it. The PM has never read it.

This is the most common form of unintentional product ownership transfer in AI engineering today. The rubric is not a measurement of the spec — it is the spec, in the same way that a compiler is not a description of your language but the operational truth of it. And like compilers, rubrics have implementation details that silently determine semantics. Which failure mode gets a 0 versus a 0.5? Which criterion is weighted 0.3 versus 0.05? Which behavior is absent from the rubric and therefore goes uncounted entirely? Each of these is a product decision. None of them lived in the original brief.
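A sketch of what that YAML ends up encoding, rendered as Python for legibility. Every criterion, weight, and threshold below is invented for illustration; the point is that each one is a product decision living in a config file no PM has read.

```python
# Hypothetical rubric: every number here is a product decision.
RUBRIC = [
    {"id": "factually_accurate",   "weight": 0.30},
    {"id": "answers_the_question", "weight": 0.25},
    {"id": "no_rushed_tone",       "weight": 0.10},  # "never make the customer feel rushed"
    {"id": "concise",              "weight": 0.05},  # why 0.05 and not 0.3? who decided?
    # ... the other 43 criteria nobody outside engineering has read
]
SHIP_THRESHOLD = 0.85  # the operational definition of "shipping quality"

def aggregate(judge_verdicts: dict[str, float]) -> float:
    """judge_verdicts: criterion id -> score in [0, 1] from the LLM judge.
    A behavior missing from RUBRIC contributes nothing, i.e. goes uncounted."""
    total_weight = sum(c["weight"] for c in RUBRIC)
    return sum(c["weight"] * judge_verdicts.get(c["id"], 0.0) for c in RUBRIC) / total_weight
```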

Few-Shot Rot: Why Yesterday's Examples Hurt Today's Model

· 10 min read
Tian Pan
Software Engineer

A team I worked with had a JSON-extraction prompt with eleven hand-tuned few-shot examples. On the previous model, those examples lifted exact-match accuracy by six points. After the model upgrade, the same eleven examples dragged accuracy down by two. Nobody changed the prompt. Nobody changed the eval set. The examples simply stopped working — and worse, started actively misdirecting.

That regression is not a bug in the new model. It is a rot pattern in the prompt itself, and it shows up every time a team migrates between model versions while treating the prompt as a fixed asset. Few-shot examples are not part of the prompt. They are part of the model-prompt pair. Migrating one without re-evaluating the other produces a regression that no eval suite tied to a single model version will catch.
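A sketch of the migration check that catches this, assuming you already have an eval set and a scoring harness: treat the few-shot block as a variable of the (model, prompt) pair and re-select it on every model change instead of carrying it forward.

```python
def run_eval(model: str, few_shot_block: str, eval_set: list[dict]) -> float:
    """Hypothetical: exact-match accuracy of this model + few-shot block on your eval set."""
    raise NotImplementedError

# Placeholder candidate blocks; the strings would be your actual example text.
CANDIDATE_BLOCKS = {
    "legacy_11_examples": "...",   # the hand-tuned block that helped the old model
    "zero_shot": "",               # no examples at all
    "fresh_examples": "...",       # re-mined from recent production traces
}

def pick_few_shot_block(new_model: str, eval_set: list[dict]) -> str:
    scores = {name: run_eval(new_model, block, eval_set)
              for name, block in CANDIDATE_BLOCKS.items()}
    # On the old model the legacy block won by six points; on the new one it may
    # lose to zero-shot by two. Measure on every migration, don't assume.
    return CANDIDATE_BLOCKS[max(scores, key=scores.get)]
```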

Your LLM Judge Has a Length Bias, a Position Bias, and a Format Bias — and Nobody Is Auditing Yours

· 11 min read
Tian Pan
Software Engineer

A team I worked with last quarter watched their LLM-as-judge score climb from 78% to 91% over six weeks of prompt iteration. They shipped. Users hated it. The new prompt produced longer, more formatted, more confident-sounding answers — and the judge loved every one of them. The team had not built a smarter prompt. They had reverse-engineered their judge's biases.

This is the failure mode nobody on the team is auditing. LLM-as-judge has well-documented systematic biases: longer answers score higher regardless of quality, the first option in pairwise comparisons wins more often than chance, and outputs that look like the judge's own training distribution outscore outputs that do not. If you wired up an LLM judge twelve months ago and have never re-validated it against humans, your scores are not a quality signal — they are a measurement of how well your prompt has learned to game its own evaluator.

The depressing part is that the audit methodology to catch this is straightforward, the calibration discipline that prevents it is cheap, and almost no team runs either.
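For concreteness, a sketch of the two cheapest audits, assuming a pairwise judge you can call twice per case: swap the presentation order and count verdicts that stick to a position, and check how strongly scores track output length. The helper names are placeholders.

```python
from statistics import correlation  # Python 3.10+

def judge_pairwise(prompt: str, answer_a: str, answer_b: str) -> str:
    """Hypothetical LLM-as-judge call; returns 'A' or 'B'."""
    raise NotImplementedError

def position_bias_rate(cases: list[dict]) -> float:
    """cases: {"prompt", "answer_a", "answer_b"}. An unbiased judge gives mirrored
    verdicts when the order flips; the same letter both times means it is voting
    for a position, not an answer."""
    biased = 0
    for c in cases:
        first = judge_pairwise(c["prompt"], c["answer_a"], c["answer_b"])
        second = judge_pairwise(c["prompt"], c["answer_b"], c["answer_a"])
        biased += (first == second)
    return biased / len(cases)

def length_score_correlation(scored_traces: list[dict]) -> float:
    """scored_traces: {"output": str, "judge_score": float}. A strong positive
    correlation is a cue to re-validate the judge against humans."""
    lengths = [len(t["output"]) for t in scored_traces]
    scores = [t["judge_score"] for t in scored_traces]
    return correlation(lengths, scores)
```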

Your Model Router Was Trained on Your Eval Set, Not Your Traffic

· 10 min read
Tian Pan
Software Engineer

A team I talked to last quarter shipped a model router that scored 96% routing accuracy on their offline benchmark and cut average inference cost by 58%. Three weeks in, support tickets started clustering around a specific user segment — enterprise admins running scripted bulk queries through their API. The cheap path was sending those users garbage answers. The router was working exactly as designed. The design was wrong.

That story is the rule, not the exception. The "send small-model what you can, save big-model for what you must" architecture is one of the most reliable cost levers in production LLM systems, with documented savings between 45% and 85% on standard benchmarks. But the savings number that gets quoted on every routing demo assumes a benchmark distribution. Production traffic doesn't have that shape, and the gap between the two is where quality regressions live — concentrated in segments your offline eval was never designed to surface.
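A sketch of the eval shape that would have caught the admin segment: instead of one aggregate routing accuracy on a benchmark, score the cheap-path answers per traffic segment on traces sampled from production. Field names and the grader are assumptions.

```python
from collections import defaultdict

def quality(trace: dict) -> float:
    """Hypothetical grader for the answer the routed model actually produced."""
    raise NotImplementedError

def cheap_path_quality_by_segment(traces: list[dict]) -> dict[str, float]:
    """traces sampled from production, each assumed to carry {"segment", "routed_to", ...}.
    Aggregate accuracy hides the segment where the cheap path collapses."""
    by_segment = defaultdict(list)
    for t in traces:
        if t["routed_to"] == "small_model":
            by_segment[t["segment"]].append(quality(t))
    return {seg: sum(scores) / len(scores) for seg, scores in by_segment.items()}

# A 96% aggregate can coexist with a much lower number for one segment, e.g.
# "enterprise_admin_bulk_api"; that per-segment number is the one the offline
# benchmark never computed.
```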

Multimodal Eval Drift: Why Your Image and Audio Paths Regress While Text Stays Green

· 11 min read
Tian Pan
Software Engineer

The dashboard says quality is up two points this release. The text-eval suite ran clean. Your model provider shipped a new checkpoint that beats the prior one on every public benchmark you track. You roll forward. A week later the support team flags a quiet but persistent uptick in tickets about uploaded screenshots — users say the model is "reading the wrong numbers from the chart" or "missing a row in the table." Audio transcription complaints follow a few days later, mostly from non-American English speakers. None of it shows up in your eval pipeline. The release looks healthy. It isn't.

This is multimodal eval drift, and almost every team that bolted vision and audio onto a text-first stack is shipping it. The eval discipline that worked for text — gold sets, LLM-as-judge, drift dashboards, an aggregate score that gates the release — extends to multimodal in name only. The failure rates per modality are not commensurable, the rubrics that catch text errors don't catch image errors, and the labeling pipeline that produced your text gold set is calibrated to a workload that ships every six months, not to a multimodal regression that arrives with every checkpoint update.

The right mental model is that multimodality is not a flag on the same model — it is a different product surface with a different failure distribution, and the eval discipline that ignored that distinction is shipping silent regressions every model release.
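A sketch of what "different product surface" implies for release gating, assuming traces are tagged by the modality they exercised: separate gold sets, separate thresholds, and a release that blocks if any modality regresses even while the blended number goes up. The thresholds here are invented.

```python
# Hypothetical per-modality bars; each modality has its own gold set.
GATES = {"text": 0.90, "image": 0.80, "audio": 0.75}

def gate_release(scores_by_modality: dict[str, float]) -> bool:
    """scores_by_modality: modality -> score on that modality's gold set.
    A two-point lift on text cannot buy back a regression on image or audio."""
    failures = {m: s for m, s in scores_by_modality.items() if s < GATES[m]}
    if failures:
        print(f"release blocked: {failures}")
        return False
    return True

# gate_release({"text": 0.93, "image": 0.71, "audio": 0.78}) blocks the release,
# even though a single blended score would have read "up two points".
```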

Reachability Analysis for Agent Action Spaces: Eval Coverage for the Branches You Never Tested

· 12 min read
Tian Pan
Software Engineer

The first time anyone on your team learned that the agent could call revoke_api_key was the morning a well-meaning user typed "this token feels old, can you rotate it for me?" The tool had been registered six months earlier as part of a batch import from the auth team's MCP server. It had passed schema validation, appeared in the catalog enumeration, and then sat. No eval ever invoked it. No production trace ever touched it. Then one prompt, one planner decision, and the incident channel learned the tool existed.

This is the failure mode that hides inside every agent with a non-trivial tool catalog. Forty registered functions and a planner that can compose them produce a reachable graph of plans whose long tail you have never observed. The assumption that "we tested the common paths" papers over the fact that the dangerous branch is, almost by definition, the one you never saw.
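A first-pass sketch of the coverage question, well short of full reachability analysis: diff the registered catalog against the tools your eval traces and production traces have ever invoked, and triage the never-exercised set by blast radius. The trace fields are assumptions.

```python
def tool_coverage_report(catalog: set[str],
                         eval_traces: list[dict],
                         prod_traces: list[dict]) -> dict[str, set[str]]:
    """Each trace is assumed to carry {"tools_called": [...]}."""
    seen_in_eval = {t for trace in eval_traces for t in trace["tools_called"]}
    seen_in_prod = {t for trace in prod_traces for t in trace["tools_called"]}
    return {
        "never_evaluated": catalog - seen_in_eval,
        "never_observed_anywhere": catalog - seen_in_eval - seen_in_prod,
        "reached_in_prod_but_not_eval": seen_in_prod - seen_in_eval,
    }

# revoke_api_key sitting in "never_observed_anywhere" for six months is exactly
# the branch the incident channel eventually discovers for you.
```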

Your Shadow Eval Set Is a Compliance Time-Bomb

· 10 min read
Tian Pan
Software Engineer

The most dangerous data store in your AI stack is the one nobody designed. It started with a Slack message during a sprint: "Real users are the only thing that catches real bugs — let's tap a percentage of production traffic into the eval pipeline so we can replay it nightly." Six engineers thumbs-upped the message. Nine months later, the bucket holds 4.3 million traces, an eval job pages the on-call when failure rates rise, and the failure cases are emailed verbatim to a Slack channel where forty people can read them. The traces include email addresses, internal company names, partial credit-card digits, employee phone numbers, and customer support transcripts where users explained why they were upset.

Nobody mapped the data flow. No DPIA covered it. The privacy review last quarter looked at the model vendor's API; it didn't look at your eval job. And then a data-subject deletion request arrives, and the team discovers that "delete this user's data everywhere" is a sentence that no longer maps to anything they can actually do.
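One small piece of the remediation, sketched under assumptions about your trace store: index every stored trace by a stable subject key at ingest time, so "delete this user's data everywhere" maps to a query instead of an archaeology project. This does not replace the DPIA or a scrubbing pass; it only makes deletion mechanically possible. The `store` interface is hypothetical.

```python
import hashlib, json, time

def ingest_trace(raw_trace: dict, user_id: str, store) -> None:
    """store: any keyed blob store (hypothetical interface: put/list/delete)."""
    subject_key = hashlib.sha256(user_id.encode()).hexdigest()
    record = {
        "subject": subject_key,      # the handle a deletion request resolves to
        "ingested_at": time.time(),
        "trace": raw_trace,          # still needs a PII-scrubbing pass before reuse
    }
    store.put(f"eval-traces/{subject_key}/{record['ingested_at']}", json.dumps(record))

def delete_subject(user_id: str, store) -> int:
    """Honor a data-subject deletion request against the shadow eval set."""
    subject_key = hashlib.sha256(user_id.encode()).hexdigest()
    keys = store.list(prefix=f"eval-traces/{subject_key}/")
    for key in keys:
        store.delete(key)
    return len(keys)
```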

Synthetic Users for Multi-Turn Agent Eval: When Your Test Fixture Has To Push Back

· 9 min read
Tian Pan
Software Engineer

Single-turn evals are great at one thing: ranking models on tasks where the user types once and walks away. They are useless for the failure modes you actually ship with. The agent that loses track of the user's goal by turn three. The agent that capitulates under polite repetition ("are you sure? could you check again?") and reverses a correct answer. The agent that asks the same clarifying question on turn four that it already asked on turn two, because it can't read its own history. None of these show up in a benchmark where the conversation ends after one exchange.

You can run real-user eval, but it costs hundreds of hours of human review per release and surfaces problems three weeks after they shipped. Or you can build LLM-driven synthetic users — bots with personas, goals, patience, and abandonment thresholds — and run thousands of conversations against a candidate agent every night. This is the approach behind τ-bench, AgentChangeBench, and most production-grade conversational eval setups in 2025–2026. It works, until it doesn't, and the ways it stops working tell you more about your eval pipeline than they do about the synthetic user.
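A minimal sketch of the loop those synthetic users run, with persona, goal, patience, and an abandonment threshold. `call_llm` plays the user and `agent_respond` is the candidate agent; both are placeholders for your own stack, and the prompt wording is illustrative.

```python
from dataclasses import dataclass

def call_llm(system: str, history: list[dict]) -> str:
    """Hypothetical: the model that plays the user."""
    raise NotImplementedError

def agent_respond(history: list[dict]) -> str:
    """Hypothetical: the candidate agent under evaluation."""
    raise NotImplementedError

@dataclass
class SyntheticUser:
    persona: str          # e.g. "terse enterprise admin, mildly annoyed"
    goal: str             # e.g. "rotate an expiring API token"
    patience: int = 8     # max turns before abandoning
    pushback: str = "are you sure? could you check again?"

def run_conversation(user: SyntheticUser) -> dict:
    history: list[dict] = []
    for turn in range(user.patience):
        user_msg = call_llm(
            system=(f"You are {user.persona}. Your goal: {user.goal}. "
                    f"If the agent seems right, push back once with: '{user.pushback}'. "
                    f"Say DONE when your goal is met, ABANDON if you give up."),
            history=history,
        )
        if user_msg.strip() in ("DONE", "ABANDON"):
            return {"outcome": user_msg.strip(), "turns": turn}
        history.append({"role": "user", "content": user_msg})
        history.append({"role": "assistant", "content": agent_respond(history)})
    return {"outcome": "ABANDON", "turns": user.patience}
```

Run thousands of these nightly across personas and goals, and the capitulation-under-pushback and repeated-clarifying-question failures show up as counts in a report instead of anecdotes in a support queue.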