Skip to main content

134 posts tagged with "evals"

View all tags

Eval Differential as Branch Protection: Ship Score Diffs, Not Score Floors

· 10 min read
Tian Pan
Software Engineer

A team I worked with had a clean-looking eval gate: every prompt PR had to score above 0.85 on the golden set or the merge button stayed grey. They were proud of it. Six weeks in, average quality had quietly drifted from 0.93 to 0.87 — every PR cleared the bar, every PR landed, and no individual change owned the regression because none of them broke the rule. The bar was set against a snapshot of last quarter's quality, not against last week's.

That's the failure mode of an absolute-threshold eval gate: a PR that drops the score from 0.92 to 0.86 ships green, while a PR that lifts the score from 0.80 to 0.84 fails the same gate. The team learns "ship if it clears the bar" — a quality story. The signal you actually want is "ship if this change is non-regressive on the slices that matter" — a regression-detector story.

Coverage tools figured this out a decade ago. They report the diff against the parent commit and they break it down per file. Eval gates haven't caught up.

Eval Sets Have Seasons: Why Quality Drops on the First Monday of Tax Season

· 12 min read
Tian Pan
Software Engineer

The dashboard fired its first regression alert on a Monday morning in late January. Quality score on the support assistant dropped three points overnight. No prompt change shipped over the weekend. No model swap. The eval suite — a hand-curated 800-row gold set that the team had built six months earlier — was unchanged. Somebody opened an incident.

Two days of bisecting later, the answer was uninteresting and structural. It was the first business Monday after the IRS opened tax filing for the year. Half the inbound queries had shifted from "where is my paycheck deposit" to "how do I report a 1099-K from a payment app." The eval set, sampled in summer, had nothing to say about a 1099-K. The model wasn't worse. The customer was different. The gate was calibrated against a customer who no longer existed.

This pattern repeats every quarter in every product that has a seasonal user — fintech in tax season, sales tools at end-of-quarter, education at back-to-school, e-commerce in returns season, travel at booking season, healthcare at enrollment season. The eval-set-as-fixed-asset is a comfortable abstraction, and it is wrong on a calendar that nobody updates.

The LLM SDK Upgrade Tax: Why a Patch Bump Is a Model Rollout in Disguise

· 10 min read
Tian Pan
Software Engineer

A team I worked with last quarter shipped a regression to production at 2:14 a.m. on a Tuesday. The on-call alert fired because the JSON parser downstream of their summarization agent was rejecting one in twenty responses with a trailing-comma error. The model hadn't changed. The prompt hadn't changed. The eval suite had passed at 96.4% the night before, comfortably above the 95% gate. What had changed was a single line in package.json: the model provider's SDK had moved from 4.6.2 to 4.6.3. Patch bump. Auto-merged by the dependency bot. The release notes said "internal cleanups."

The "internal cleanup" was a tightened JSON-mode parser that now stripped a forgiving fallback path, which had been quietly fixing a recurring trailing-comma quirk in the model's tool-call output. The model's behavior was unchanged. The SDK's interpretation of that behavior was not. The team's eval suite never saw the regression because the eval suite ran against a different SDK version than the one the dependency bot had just promoted.

This is the LLM SDK upgrade tax, and it is one of the quietest, most expensive failure modes in production AI today. The SDK is not a passive transport. It is an active participant in your prompt's behavior, and the team that upgrades it without an eval is doing a model rollout in disguise.

The 12-Month AI Feature Cliff: Why Your Production Models Decay on a Calendar Nobody Marked

· 11 min read
Tian Pan
Software Engineer

A feature ships at 92% pass rate. The launch deck celebrates it. Twelve months later the same feature is at 78% — no incident report, no failed deploy, no single change to point at, just a slow erosion that nobody owned watching for. The team blames "hallucinations" or "user behavior shift," picks a junior engineer to investigate, and sets a quarterly OKR to "improve quality." The OKR misses. The feature ships an apologetic dialog telling users the AI sometimes makes mistakes. Six months after that, it's deprecated and replaced with a new version that ships at 91% pass rate, and the cycle starts again.

This isn't bad luck. It's the second clock that AI features run on, the one that nobody marks on the release calendar at launch. Conventional software has feature decay too — dependency drift, codebase rot, the slow accumulation of half-applied refactors — but those decay on a clock the engineering org already understands and budgets for. AI features have all of that, plus a parallel set of decay sources that conventional amortization assumptions don't model: model deprecations, vendor weight rotations, distribution shift in user inputs, prompt patches that compound, judge calibration drift, and the quiet aging of an eval set that no longer represents what production traffic looks like.

The architectural realization that has to land — before the next AI feature ships, not after — is that AI features have a non-zero baseline maintenance cost. The feature isn't done when it launches. It's enrolled in a maintenance schedule it can't escape, and the team that didn't budget for that schedule is going to discover it the hard way.

The Two-PM Problem: When Prompt Ownership and Product Ownership Drift Apart

· 11 min read
Tian Pan
Software Engineer

A support ticket lands on Tuesday morning: a customer was given a confidently wrong answer about their refund window. Engineering pulls the trace and finds the model picked the wrong intent. The product PM looks at the dashboard and sees the new "express refund" affordance — shipped last sprint — surfaced an intent the prompt was never tuned to handle. The platform PM points at the eval suite, which is green. Both are technically right. The customer is still wrong.

This is the two-PM problem, and most AI teams have it without naming it. The product PM owns the user-facing surface — intents, success metrics, the support escalation path. The platform or ML PM owns the prompt, the model choice, the eval suite, and the cost ceiling. The roadmaps are coordinated at the quarterly-planning level and drift at the weekly-shipping level, because the two PMs are optimizing for different metrics on different dashboards with different change-control processes.

The interesting failure mode isn't that the two PMs disagree. It's that they ship correctly relative to their own scope and still produce a regression nobody owns.

The Agent Finished Into an Empty Room: Stale-Context Delivery for Async Background Tasks

· 10 min read
Tian Pan
Software Engineer

A background agent that takes ninety seconds to finish a task is operating on a snapshot of the world from ninety seconds ago. By the time it returns, the user may have navigated to a different view, started a new conversation, archived the original request, or closed the tab entirely. Most agent frameworks ship the result anyway, mutate state to reflect it, and treat the round trip as a success. It is not a success. It is the agent finishing into an empty room.

The failure mode is uglier than dropping the result. A dropped result is a missed delivery — annoying but recoverable. An applied stale result is an answer to a question the user is no longer asking, written against state that no longer matches, often overwriting the work the user moved on to. The user notices that something they did not ask for has happened, cannot reconstruct why, and loses trust in the system in a way that a simple timeout never would.

The fix is not faster agents. It is a delivery-time relevance gate that treats the moment of return as a fresh decision, not the foregone conclusion of the moment of dispatch.

Why Deprecating an AI Feature Is Harder Than You Think: Users Built Trust Scaffolding You Can't See

· 10 min read
Tian Pan
Software Engineer

When OpenAI tried to pull GPT-4o from ChatGPT in August 2025, the backlash was strong enough — organized hashtags, paying users threatening to cancel, public reversal within days — that the company restored it as a default option and promised "substantial notice" before any future removals. The replacement was, by every benchmark the team cared about, better. None of that mattered. Users had spent months learning the model's quirks, calibrating their judgment to its failure modes, and integrating its specific phrasing into workflows the team had never instrumented. Replacing it with "the better version" reset that calibration to zero.

This is the failure mode that the standard deprecation playbook does not cover. Sunsetting a regular SaaS feature — announce, migrate, dark-launch the removal, retire — assumes the user contract is the API surface. For AI features, the contract is the observed behavior of the model: phrasings, tendencies, failure modes, the specific way it handles ambiguity. Users build scaffolding on top of that behavior, and most of the scaffolding lives in their heads, on their laptops, and in downstream systems your team never touches.

Contract Tests for LLM Tool Surfaces: When the Vendor Changes a Field and Your Agent Silently Adapts

· 11 min read
Tian Pan
Software Engineer

A vendor flipped "items" to "results" in a tool response last Tuesday. The agent didn't crash. It re-planned around the new shape, returned a confident-looking answer that was missing two-thirds of the rows, and the on-call engineer found out three days later when a customer asked why their export was short. No exception fired. No alert tripped. The eval suite, which runs against a frozen fixture from before the vendor change, was green the whole time.

This is the failure mode that contract testing was invented to catch in microservices a decade ago, and the one that almost no agent stack has any equivalent for today. HTTP services have Pact, schemathesis, and oasdiff to sit between consumer and provider and refuse to let breaking changes ship. The tools you hand to your agent — REST endpoints, internal RPCs, vendor SDKs, MCP servers — have nothing comparable. The model absorbs the change, adapts gracefully, and produces a degraded answer that looks correct.

Determinism Budgets: Treat Randomness as a Per-Surface Allocation, Not a Global Knob

· 11 min read
Tian Pan
Software Engineer

The temperature debate is the most religious argument in AI engineering, and one of the least productive. Two camps form on every team: the determinists who want temperature pinned at zero everywhere because they cannot debug a flaky system, and the creatives who want it cranked up because the outputs feel more "alive." Both are wrong, because both are answering the question at the wrong level. Temperature is not a global setting. It is a budget — and like any budget, it should be allocated, not declared.

The productive frame is simple: every model call in your system has a purpose, and randomness either earns its keep at that surface or it does not. A planner deciding which tool to call next has nothing to gain from variation; an off-by-one tool selection is a debugging nightmare and there is no creative upside. A response-synthesis surface that summarizes a search result for ten thousand users gets robotic in a hurry if every user sees the same phrasing — and the SEO team will eventually flag the boilerplate. A brainstorming surface where the model proposes alternatives for a human to pick from is worse at temperature 0; the diversity is the feature.

If you cannot articulate what randomness is for at a given call site, you should not be paying for it.

Eval-Author Monoculture: Why Your Benchmark Becomes a Self-Portrait

· 11 min read
Tian Pan
Software Engineer

Green CI is not the statement "this prompt works." Green CI is the statement "the engineer who wrote the evals could not think of how this prompt should break." Those are very different claims, and the gap between them is where your production incidents live. An eval suite is not a measurement of your model — it is a frozen portrait of whoever wrote it. Their dialect, their domain knowledge, their seniority, their pet failure modes, the model they happened to be using when they wrote the test cases. Everything that engineer would not think to test is, by construction, untested. And worse: they will keep extending the suite from the same vantage point, so the blind spot does not shrink as the suite grows. It calcifies.

This is the eval-author monoculture problem, and it is the most under-discussed reliability risk in AI engineering today. Teams obsess over judge bias, position bias, verbosity bias, leakage, and contamination — but the upstream bias is the bias of the human who decided what the test cases should be in the first place. Every other source of eval error gets amplified by it. If your suite was written by one person, you have a benchmark with a personality, and that personality is the silent ceiling on what your CI can ever catch.

The Eval Harness, Not the Prompt, Is Your Real Provider Lock-In

· 10 min read
Tian Pan
Software Engineer

Every "we'll just swap providers if needed" plan in the deck has a budget line for prompt rewrites. None of them has a line for the eval suite. That is the bug. The prompts are the visible coupling — the part you wrote, the part you can grep for, the part a junior engineer can rewrite in an afternoon. The eval harness is the invisible coupling, and it is the one that will eat a quarter of your roadmap when you actually try to migrate.

The pattern shows up the moment leverage matters. Your contract is up. A competitor releases a model that benchmarks better on your domain. Pricing on output tokens shifts under you. You go to run the candidate model through your eval suite to make the call, and within a day you discover that you cannot trust any score the harness produces, because the harness itself was written against the incumbent. You are not comparing models. You are comparing one model against a measurement instrument that was calibrated to the other one.

The Eval-Set-as-Simulator Drift: When Offline Scores Improve and Production Gets Worse

· 11 min read
Tian Pan
Software Engineer

The most expensive failure mode in an LLM product is not a bad release. It is six consecutive good releases — by every internal scoreboard — while user trust quietly bleeds out. The offline eval score climbs every Friday demo. The CSAT line in the weekly business review goes flat, then dips, then nobody knows when it started dipping because nobody was triangulating the two charts. By the time a postmortem names it, the team has spent two quarters tuning a prompt against a dataset that stopped resembling reality somewhere around month three.

This is the eval-set-as-simulator drift, and it is the cleanest example I know of an old machine-learning lesson being rediscovered at full retail price by a generation of LLM teams who skipped the reading list. An eval suite is not a fixture. It is a simulator, and a simulator that is never re-calibrated against the system it claims to predict will eventually predict a different system.