Skip to main content

143 posts tagged with "evals"

View all tags

The 12-Month AI Feature Cliff: Why Your Production Models Decay on a Calendar Nobody Marked

· 11 min read
Tian Pan
Software Engineer

A feature ships at 92% pass rate. The launch deck celebrates it. Twelve months later the same feature is at 78% — no incident report, no failed deploy, no single change to point at, just a slow erosion that nobody owned watching for. The team blames "hallucinations" or "user behavior shift," picks a junior engineer to investigate, and sets a quarterly OKR to "improve quality." The OKR misses. The feature ships an apologetic dialog telling users the AI sometimes makes mistakes. Six months after that, it's deprecated and replaced with a new version that ships at 91% pass rate, and the cycle starts again.

This isn't bad luck. It's the second clock that AI features run on, the one that nobody marks on the release calendar at launch. Conventional software has feature decay too — dependency drift, codebase rot, the slow accumulation of half-applied refactors — but those decay on a clock the engineering org already understands and budgets for. AI features have all of that, plus a parallel set of decay sources that conventional amortization assumptions don't model: model deprecations, vendor weight rotations, distribution shift in user inputs, prompt patches that compound, judge calibration drift, and the quiet aging of an eval set that no longer represents what production traffic looks like.

The architectural realization that has to land — before the next AI feature ships, not after — is that AI features have a non-zero baseline maintenance cost. The feature isn't done when it launches. It's enrolled in a maintenance schedule it can't escape, and the team that didn't budget for that schedule is going to discover it the hard way.

The Two-PM Problem: When Prompt Ownership and Product Ownership Drift Apart

· 11 min read
Tian Pan
Software Engineer

A support ticket lands on Tuesday morning: a customer was given a confidently wrong answer about their refund window. Engineering pulls the trace and finds the model picked the wrong intent. The product PM looks at the dashboard and sees the new "express refund" affordance — shipped last sprint — surfaced an intent the prompt was never tuned to handle. The platform PM points at the eval suite, which is green. Both are technically right. The customer is still wrong.

This is the two-PM problem, and most AI teams have it without naming it. The product PM owns the user-facing surface — intents, success metrics, the support escalation path. The platform or ML PM owns the prompt, the model choice, the eval suite, and the cost ceiling. The roadmaps are coordinated at the quarterly-planning level and drift at the weekly-shipping level, because the two PMs are optimizing for different metrics on different dashboards with different change-control processes.

The interesting failure mode isn't that the two PMs disagree. It's that they ship correctly relative to their own scope and still produce a regression nobody owns.

The Agent Finished Into an Empty Room: Stale-Context Delivery for Async Background Tasks

· 10 min read
Tian Pan
Software Engineer

A background agent that takes ninety seconds to finish a task is operating on a snapshot of the world from ninety seconds ago. By the time it returns, the user may have navigated to a different view, started a new conversation, archived the original request, or closed the tab entirely. Most agent frameworks ship the result anyway, mutate state to reflect it, and treat the round trip as a success. It is not a success. It is the agent finishing into an empty room.

The failure mode is uglier than dropping the result. A dropped result is a missed delivery — annoying but recoverable. An applied stale result is an answer to a question the user is no longer asking, written against state that no longer matches, often overwriting the work the user moved on to. The user notices that something they did not ask for has happened, cannot reconstruct why, and loses trust in the system in a way that a simple timeout never would.

The fix is not faster agents. It is a delivery-time relevance gate that treats the moment of return as a fresh decision, not the foregone conclusion of the moment of dispatch.

Why Deprecating an AI Feature Is Harder Than You Think: Users Built Trust Scaffolding You Can't See

· 10 min read
Tian Pan
Software Engineer

When OpenAI tried to pull GPT-4o from ChatGPT in August 2025, the backlash was strong enough — organized hashtags, paying users threatening to cancel, public reversal within days — that the company restored it as a default option and promised "substantial notice" before any future removals. The replacement was, by every benchmark the team cared about, better. None of that mattered. Users had spent months learning the model's quirks, calibrating their judgment to its failure modes, and integrating its specific phrasing into workflows the team had never instrumented. Replacing it with "the better version" reset that calibration to zero.

This is the failure mode that the standard deprecation playbook does not cover. Sunsetting a regular SaaS feature — announce, migrate, dark-launch the removal, retire — assumes the user contract is the API surface. For AI features, the contract is the observed behavior of the model: phrasings, tendencies, failure modes, the specific way it handles ambiguity. Users build scaffolding on top of that behavior, and most of the scaffolding lives in their heads, on their laptops, and in downstream systems your team never touches.

Contract Tests for LLM Tool Surfaces: When the Vendor Changes a Field and Your Agent Silently Adapts

· 11 min read
Tian Pan
Software Engineer

A vendor flipped "items" to "results" in a tool response last Tuesday. The agent didn't crash. It re-planned around the new shape, returned a confident-looking answer that was missing two-thirds of the rows, and the on-call engineer found out three days later when a customer asked why their export was short. No exception fired. No alert tripped. The eval suite, which runs against a frozen fixture from before the vendor change, was green the whole time.

This is the failure mode that contract testing was invented to catch in microservices a decade ago, and the one that almost no agent stack has any equivalent for today. HTTP services have Pact, schemathesis, and oasdiff to sit between consumer and provider and refuse to let breaking changes ship. The tools you hand to your agent — REST endpoints, internal RPCs, vendor SDKs, MCP servers — have nothing comparable. The model absorbs the change, adapts gracefully, and produces a degraded answer that looks correct.

Determinism Budgets: Treat Randomness as a Per-Surface Allocation, Not a Global Knob

· 11 min read
Tian Pan
Software Engineer

The temperature debate is the most religious argument in AI engineering, and one of the least productive. Two camps form on every team: the determinists who want temperature pinned at zero everywhere because they cannot debug a flaky system, and the creatives who want it cranked up because the outputs feel more "alive." Both are wrong, because both are answering the question at the wrong level. Temperature is not a global setting. It is a budget — and like any budget, it should be allocated, not declared.

The productive frame is simple: every model call in your system has a purpose, and randomness either earns its keep at that surface or it does not. A planner deciding which tool to call next has nothing to gain from variation; an off-by-one tool selection is a debugging nightmare and there is no creative upside. A response-synthesis surface that summarizes a search result for ten thousand users gets robotic in a hurry if every user sees the same phrasing — and the SEO team will eventually flag the boilerplate. A brainstorming surface where the model proposes alternatives for a human to pick from is worse at temperature 0; the diversity is the feature.

If you cannot articulate what randomness is for at a given call site, you should not be paying for it.

Eval-Author Monoculture: Why Your Benchmark Becomes a Self-Portrait

· 11 min read
Tian Pan
Software Engineer

Green CI is not the statement "this prompt works." Green CI is the statement "the engineer who wrote the evals could not think of how this prompt should break." Those are very different claims, and the gap between them is where your production incidents live. An eval suite is not a measurement of your model — it is a frozen portrait of whoever wrote it. Their dialect, their domain knowledge, their seniority, their pet failure modes, the model they happened to be using when they wrote the test cases. Everything that engineer would not think to test is, by construction, untested. And worse: they will keep extending the suite from the same vantage point, so the blind spot does not shrink as the suite grows. It calcifies.

This is the eval-author monoculture problem, and it is the most under-discussed reliability risk in AI engineering today. Teams obsess over judge bias, position bias, verbosity bias, leakage, and contamination — but the upstream bias is the bias of the human who decided what the test cases should be in the first place. Every other source of eval error gets amplified by it. If your suite was written by one person, you have a benchmark with a personality, and that personality is the silent ceiling on what your CI can ever catch.

The Eval Harness, Not the Prompt, Is Your Real Provider Lock-In

· 10 min read
Tian Pan
Software Engineer

Every "we'll just swap providers if needed" plan in the deck has a budget line for prompt rewrites. None of them has a line for the eval suite. That is the bug. The prompts are the visible coupling — the part you wrote, the part you can grep for, the part a junior engineer can rewrite in an afternoon. The eval harness is the invisible coupling, and it is the one that will eat a quarter of your roadmap when you actually try to migrate.

The pattern shows up the moment leverage matters. Your contract is up. A competitor releases a model that benchmarks better on your domain. Pricing on output tokens shifts under you. You go to run the candidate model through your eval suite to make the call, and within a day you discover that you cannot trust any score the harness produces, because the harness itself was written against the incumbent. You are not comparing models. You are comparing one model against a measurement instrument that was calibrated to the other one.

The Eval-Set-as-Simulator Drift: When Offline Scores Improve and Production Gets Worse

· 11 min read
Tian Pan
Software Engineer

The most expensive failure mode in an LLM product is not a bad release. It is six consecutive good releases — by every internal scoreboard — while user trust quietly bleeds out. The offline eval score climbs every Friday demo. The CSAT line in the weekly business review goes flat, then dips, then nobody knows when it started dipping because nobody was triangulating the two charts. By the time a postmortem names it, the team has spent two quarters tuning a prompt against a dataset that stopped resembling reality somewhere around month three.

This is the eval-set-as-simulator drift, and it is the cleanest example I know of an old machine-learning lesson being rediscovered at full retail price by a generation of LLM teams who skipped the reading list. An eval suite is not a fixture. It is a simulator, and a simulator that is never re-calibrated against the system it claims to predict will eventually predict a different system.

When Your Evals Disagree: A Signal Hierarchy for the Week the Numbers Contradict Each Other

· 12 min read
Tian Pan
Software Engineer

It's Tuesday morning, the week after a prompt change shipped to half your traffic. You open four dashboards. The held-out golden set scored by the LLM judge says +8%. The human-rater panel that samples production weekly says no change. The A/B test on downstream conversion says −2%. The thumbs-up rate is flat. Four signals, four verdicts, and a standup in fifteen minutes where someone is going to ask whether you ship the prompt or roll it back.

The temptation is to pick the number that confirms what you already wanted to do — and the team will, because nobody on the call has a written rule for which signal wins. The disagreement isn't a measurement bug. It's the predictable output of a system that bolted four evaluators together without a hierarchy, and the cost of not having one is that every release week becomes a debate about whose number to trust.

The Inverted Agent: When the User Is the Planner and the Model Is the Step-Executor

· 12 min read
Tian Pan
Software Engineer

Most agent products today implement a simple bargain: the model decides what to do, the user clicks "approve." This is the right shape for low-stakes consumer chat — booking a restaurant, summarizing an inbox, drafting a casual reply. It is catastrophically wrong for legal drafting, financial advisory, medical triage, and incident response, where the user holds the accountability the model never can, and where the cost of the wrong plan dwarfs the cost of any individual step.

The inverted agent flips the polarity. The user composes the plan as a sequence of named, reorderable steps. The model executes each step on demand — with full context, with tool access, with reasoning — but never decides what step comes next. The model can suggest, but suggestions are advisory, not autonomous. This is not a worse autonomous agent; it is a different product, with a strictly worse cost-and-latency profile and a strictly better trust profile, aimed at users who would otherwise decline to adopt the autonomous version at all.

The mistake teams keep making is treating "autonomy" as a default to push toward. It is a UX axis you choose per-surface. Get the polarity wrong and you ship a feature your highest-stakes users will quietly refuse to touch.

The Eval Pickle: When Your LLM Judge Gets Smarter Than the Model It Grades

· 9 min read
Tian Pan
Software Engineer

A regression alert fires on Monday morning. Faithfulness on your held-out eval set dropped from 0.86 to 0.78 over the weekend. Nobody shipped a new model. Nobody touched the prompt. Nobody changed the retrieval index. The on-call engineer spends three hours digging before noticing the only thing that changed was the judge model — the auto-evaluator quietly rolled forward to a newer snapshot that catches subtle hedging the old one waved through. Same answers. Same model. Worse score. Real number, fake regression.

This is the eval pickle: as your LLM-as-judge gets sharper, your scores on a frozen system slide down, and the dashboard that's supposed to detect regressions starts manufacturing them. The team that doesn't notice spends quarters chasing "quality drift" that lives entirely in the ruler.