168 posts tagged with "evaluation"

Prompt Linting Is the Missing Layer Between Eval and Production

April 28, 2026 · 11 min read

Software Engineer

The incident report read like a unit-test horror story. A prompt edit removed a five-line safety clause as part of a "preamble cleanup." Every eval in the suite passed. Every judge score held within tolerance. Two weeks later, a customer-facing assistant produced a response that should have been refused, the kind that triggers a Trust & Safety page at 11pm. The post-mortem traced the regression to a single deletion in a PR that nobody had flagged because the suite that was supposed to catch regressions had no opinion on whether the safety clause was present — it only had opinions on whether the model behaved well in the cases the suite remembered to ask about.

This is the gap between behavioral evals and structural correctness. Evals measure what the model produces; they do not measure what the prompt is. And prompts, like code, have a structural layer that exists independently of behavior — sections that must be present, references that must resolve, variables that must interpolate, length budgets that must hold, deprecated identifiers that must not appear. When that structural layer breaks, the behavior often stays green for a while, until the right edge case in production surfaces the failure as an incident.

Prompt Position Is Policy: The Silent Merge Conflict When Three Teams Co-Own a System Prompt

April 28, 2026 · 11 min read

Tian Pan

Software Engineer

The diff in your prompt repo says three lines changed. The behavioral diff in production says everything changed. The safety team moved a refusal rule from line 14 to line 87 to "group it with related guardrails," the product team didn't notice because the wording was identical, and a week later the eval suite is showing a 9-point drop on adversarial inputs. Nobody edited the rule. Somebody moved it. In a 2,400-token system prompt with primacy bias on guardrails and recency bias on instruction-following, moving a rule is a behavioral change as load-bearing as rewriting it — and your tooling surfaces neither.

This is the merge-conflict pattern that AI teams discover at the end of a regression review, not the beginning of one. The system prompt grew past 2K tokens sometime in late 2025. The safety team owns the top, the product team owns the middle, the agent team owns the bottom, and three months of "small edits" have silently rearranged everyone else's intent because the line-based diff tool that worked fine for code can't tell you that an instruction crossed a section boundary. The bug isn't in any single edit. The bug is that position is now policy, and you have no policy on position.

The Reranker Is the Silent Second Model Your RAG Eval Never Measures

April 28, 2026 · 10 min read

Tian Pan

Software Engineer

A typical RAG pipeline ships with two models, not one. The retriever pulls 50 to 100 candidates from the vector store, and a reranker — a cross-encoder, an LLM-as-judge prompt, or a hybrid — re-scores those candidates and hands the top 5 to the answer model. Your eval suite measures end-to-end answer quality. It measures retriever recall@k. It does not measure the reranker. So when the reranker quietly drifts, the dashboard renders "answer quality dropped 4 points" with no causal arrow, and the team spends three days debugging a prompt that is not the problem.

The reranker is the silent second model. It sits between the retriever and the generator, it has its own scoring distribution, its own prompt (if it's LLM-based) or its own weights (if it's a cross-encoder), and it can regress independently of every other component. Most teams never grade it in isolation. The eval suite they wrote treats the pipeline like one model with a long context window, when it's actually two models in series with an interface neither team owns.

Translation Is Not Localization: The Cultural-Calibration Debt Your Multilingual AI Just Defaulted On

April 28, 2026 · 12 min read

Tian Pan

Software Engineer

A multilingual launch that ships English prompts translated into N languages, with an English eval set translated into the same N languages, has not shipped a multilingual product. It has shipped one product N times, and made all the failure modes invisible to its own dashboards. The system is fluent and culturally off-key, and the metric the team optimized — translation quality — is the wrong axis to measure what users are reacting to.

The visible defect on launch day is small. A Japanese user receives a reply that is grammatically correct and conspicuously curt. An Indonesian user notices the assistant is cheerfully direct in a register that reads as rude. A Korean user gets advice framed around individual choice when the prompt was about a family decision. None of these are translation bugs. They are cultural-register bugs that translation cannot fix and translated evals cannot detect.

Vendor Benchmarks Are Your Ceiling, Not Your Forecast

April 28, 2026 · 10 min read

Tian Pan

Software Engineer

The model release announcement lands on Tuesday morning. The blog post leads with a chart: HumanEval up four points, SWE-bench Verified up six, MATH up three, the agent harness du jour up a number that would have been a research paper a year ago. By Tuesday afternoon there is a Slack thread inside your company with screenshots of the chart and a question shaped like a decision: "Should we cut over?" The thread treats the benchmark delta as a forecast — as if those numbers describe what the new model will do for your product, on your prompts, in your tool harness, against your eval rubric. They do not. The vendor's number is the upper bound on what you might see. Your realized lift is somewhere between zero and roughly half of that headline, and you cannot know which without running an eval the vendor did not run.

This is not a complaint about benchmark validity. The benchmarks are real. They are run against real eval suites. The vendor is not lying. The problem is that the vendor's harness is an idealized environment that strips away every variable a production deployment introduces, and a number generated under those conditions is structurally incapable of predicting behavior under yours. Treating it as a prediction is a category error — and it leads to procurement decisions, capacity-planning commitments, and rollout schedules that are calibrated against a fiction.

The AI Engineer Interview Is Broken: Stop Testing Implementation, Start Probing Eval-Design

April 27, 2026 · 10 min read

Tian Pan

Software Engineer

A team I worked with last quarter rejected three candidates in a row from their AI engineer pipeline. All three failed the coding screen — the kind of problem where you implement a sliding-window deduplicator under a 35-minute timer. The team then hired the candidate who passed it. Four months later that engineer was the one who shipped the feature where the eval scored 92% in CI and the support queue lit up the day after launch. The eval was measuring exact-match against a curated test set. Production users phrased their queries differently. Nobody on the hiring panel had asked the candidate how they would have caught that gap.

That's the shape of the bug. The interview pipeline was screening for the skill that mattered least to the job and was blind to the skill that mattered most. The team did not have a "judgment" round. They had a coding round, a system-design round, and a behavioral round, and they were running the same loop they had run in 2021 — the one calibrated for engineers who were going to write deterministic code against stable libraries.

Calibrated Abstention: The Capability Every Layer of Your LLM Stack Punishes

April 27, 2026 · 11 min read

Tian Pan

Software Engineer

There is a capability your model could have that would, on the days it mattered, be worth more than any other behavioral upgrade you could ship: the ability to say "I don't have a reliable answer to this" and mean it. Not the keyword-matched safety refusal. Not the hedging tic the model picked up from RLHF on controversial topics. The real thing — a calibrated abstention that fires when, and only when, the model's internal evidence does not support a confident response.

You will never get it by accident. Every default in the LLM stack pushes the other way.

The Eval-Rig Latency Lie: Why Your p95 Doubles in Production

April 27, 2026 · 10 min read

Tian Pan

Software Engineer

The eval team puts a number on the deck: "p95 latency is 1.2s." The launch ships. A week later, oncall posts a graph: production p95 is 4.8s and climbing through the dinner-time peak. Engineers spend the next five days arguing about whether something regressed, instrumenting model versions, opening tickets with the provider — and eventually discover that nothing changed except where the number was measured. The eval rig was reporting the latency of a quiet machine running serial calls against a warm cache. Production is a different system. The p95 was never wrong; it was answering a different question.

This is the eval-rig latency lie. It is not about bad benchmarks — most teams use reasonable tools and report the numbers honestly. It is about the gap between "the latency of the model" and "the latency a user experiences," and the fact that the rig you build for development almost always measures the first while implying the second. Once you internalize this, latency SLOs derived from a benchmark stop looking like product commitments and start looking like claims about a private testing environment that nobody else can reproduce.

Your Eval Rubric Is the Real Product Spec — and No PM Signed Off on It

April 27, 2026 · 11 min read

Tian Pan

Software Engineer

A product manager writes a paragraph: "The assistant should be helpful, accurate, and concise, and should never make the customer feel rushed." An engineer reads that paragraph, opens a YAML file, and writes 47 weighted criteria so the LLM-as-judge can produce a number on every trace. Six months later, that YAML file is the actual specification of the product. Every release is gated on it. Every regression alert fires on it. Every "this is shipping quality" decision routes through it. The PM has never read it.

This is the most common form of unintentional product ownership transfer in AI engineering today. The rubric is not a measurement of the spec — it is the spec, in the same way that a compiler is not a description of your language but the operational truth of it. And like compilers, rubrics have implementation details that silently determine semantics. Which failure mode gets a 0 versus a 0.5? Which criteria is weighted 0.3 versus 0.05? Which behavior is absent from the rubric and therefore goes uncounted entirely? Each of these is a product decision. None of them lived in the original brief.

Few-Shot Rot: Why Yesterday's Examples Hurt Today's Model

April 27, 2026 · 10 min read

Tian Pan

Software Engineer

A team I worked with had a JSON-extraction prompt with eleven hand-tuned few-shot examples. On the previous model, those examples lifted exact-match accuracy by six points. After the model upgrade, the same eleven examples dragged accuracy down by two. Nobody changed the prompt. Nobody changed the eval set. The examples simply stopped working — and worse, started actively misdirecting.

That regression is not a bug in the new model. It is a rot pattern in the prompt itself, and it shows up every time a team migrates between model versions while treating the prompt as a fixed asset. Few-shot examples are not part of the prompt. They are part of the model-prompt pair. Migrating one without re-evaluating the other produces a regression that no eval suite tied to a single model version will catch.

Your LLM Judge Has a Length Bias, a Position Bias, and a Format Bias — and Nobody Is Auditing Yours

April 27, 2026 · 11 min read

Tian Pan

Software Engineer

A team I worked with last quarter watched their LLM-as-judge score climb from 78% to 91% over six weeks of prompt iteration. They shipped. Users hated it. The new prompt produced longer, more formatted, more confident-sounding answers — and the judge loved every one of them. The team had not built a smarter prompt. They had reverse-engineered their judge's biases.

This is the failure mode nobody on the team is auditing. LLM-as-judge has well-documented systematic biases: longer answers score higher regardless of quality, the first option in pairwise comparisons wins more often than chance, and outputs that look like the judge's own training distribution outscore outputs that do not. If you wired up an LLM judge twelve months ago and have never re-validated it against humans, your scores are not a quality signal — they are a measurement of how well your prompt has learned to game its own evaluator.

The depressing part is that the audit methodology to catch this is straightforward, the calibration discipline that prevents it is cheap, and almost no team runs either.

Your Model Router Was Trained on Your Eval Set, Not Your Traffic

April 27, 2026 · 10 min read

Tian Pan

Software Engineer

A team I talked to last quarter shipped a model router that scored 96% routing accuracy on their offline benchmark and cut average inference cost by 58%. Three weeks in, support tickets started clustering around a specific user segment — enterprise admins running scripted bulk queries through their API. The cheap path was sending those users garbage answers. The router was working exactly as designed. The design was wrong.

That story is the rule, not the exception. The "send small-model what you can, save big-model for what you must" architecture is one of the most reliable cost levers in production LLM systems, with documented savings between 45% and 85% on standard benchmarks. But the savings number that gets quoted on every routing demo assumes a benchmark distribution. Production traffic doesn't have that shape, and the gap between the two is where quality regressions live — concentrated in segments your offline eval was never designed to surface.

About Tian Pan