113 posts tagged with "evals"

Your Prompts Ship Like Cowboys: Why Code Review Discipline Doesn't Extend to AI Artifacts

· 11 min read
Tian Pan
Software Engineer

Walk through any mature engineering team's PR queue and you will see the same thing: a four-line bug fix attracts three rounds of comments about naming, error handling, and missing test coverage, while a forty-line edit to the system prompt sails through with a single "LGTM, ship it." The author shrugs because the diff looks like documentation. The reviewer shrugs because they have no mental model of what "good" looks like inside that block of English. The result is a prompt change with the blast radius of a feature launch, reviewed at the bar of a typo fix.

This is the quiet quality crisis of every team building with LLMs in production. The codebase has decades of accumulated discipline — linters, type checks, code owners, test gates, deploy windows. The artifacts that actually steer the model — the system prompt, the eval rubric, the tool description, the few-shot exemplars — sit in the same repo and ship through a review process that was designed for English prose. So prompt regressions, eval-rubric drift, and tool-schema breakages land at a quality bar the team would never accept for code.

Why Your Bias Eval Passes in CI and Fails in Deployment

· 10 min read
Tian Pan
Software Engineer

The fairness audit was a green checkmark in the release pipeline. The compliance team signed it off in March. The support tickets started landing in October — a cohort of users in a country the model had never been graded on, getting answers a fraction as useful as everyone else. Nothing about the model had changed. The audit had never been wrong about the model. It had been wrong about the world.

This is the failure mode that no one wants to name out loud: a static bias eval is a snapshot of fairness in a stream that has already drifted. The eval was not lying when it ran. It was telling you a true thing about a distribution that no longer existed. By the time the support team has enough tickets to file a pattern, the model has been unfair to that cohort for two quarters and the audit is a year stale.

The Eval Bottleneck: Your Eval Engineer Is Now the Roadmap

· 11 min read
Tian Pan
Software Engineer

The constraint on your AI roadmap isn't GPU capacity, model availability, or prompt-engineering taste. It's the calendar of one or two engineers who actually know how to build an eval that catches a regression. Every PM with a feature is in their queue. Every model upgrade is in their queue. Every cohort drift, every prompt revision, every "is this judge still calibrated" question lands in the same inbox. And the engineer in question said "no, this isn't ready" three times this quarter, got overruled twice, watched the regression compound in production, and is now updating their LinkedIn.

This is the eval bottleneck, and most orgs don't see it until it bites. Through 2025 the visible scaling story was AI engineers — hire AI engineers, ship AI features, iterate on prompts, swap models. By Q1 2026 the throughput problem moved one layer down. The team that doubled its AI headcount discovered that adding more feature engineers didn't make features ship faster, because every feature still needed an eval, and the eval engineer was the same person.

Eval Differential as Branch Protection: Ship Score Diffs, Not Score Floors

· 10 min read
Tian Pan
Software Engineer

A team I worked with had a clean-looking eval gate: every prompt PR had to score above 0.85 on the golden set or the merge button stayed grey. They were proud of it. Six weeks in, average quality had quietly drifted from 0.93 to 0.87 — every PR cleared the bar, every PR landed, and no individual change owned the regression because none of them broke the rule. The bar was set against a snapshot of last quarter's quality, not against last week's.

That's the failure mode of an absolute-threshold eval gate: a PR that drops the score from 0.92 to 0.86 ships green, while a PR that lifts the score from 0.80 to 0.84 fails the same gate. The team learns "ship if it clears the bar" — a quality story. The signal you actually want is "ship if this change is non-regressive on the slices that matter" — a regression-detector story.

Coverage tools figured this out a decade ago. They report the diff against the parent commit and they break it down per file. Eval gates haven't caught up.
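
To make the contrast with a score floor concrete, here is a minimal sketch of a differential gate. It assumes per-slice scores are already computed for the PR and for its parent commit; the names `diff_gate`, `passes`, the slice labels, and the 0.01 tolerance are hypothetical choices for illustration, not part of any particular eval tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SliceVerdict:
    slice_name: str
    base: float      # score on the parent commit
    head: float      # score on the PR
    delta: float     # head - base
    regressed: bool


def diff_gate(base_scores, head_scores, tolerance=0.01):
    """Compare per-slice scores of a PR against its parent commit.

    A slice regresses when its score drops by more than `tolerance`
    (an assumed noise allowance), regardless of any absolute floor.
    """
    missing = set(base_scores) - set(head_scores)
    if missing:
        raise ValueError(f"PR run is missing slices: {sorted(missing)}")
    verdicts = []
    for name in sorted(base_scores):
        base = base_scores[name]
        head = head_scores[name]
        delta = head - base
        verdicts.append(SliceVerdict(name, base, head, delta, delta < -tolerance))
    return verdicts


def passes(verdicts):
    """The gate is green only if no slice regressed."""
    return not any(v.regressed for v in verdicts)


# The two PRs from the text: the first clears a 0.85 floor but regresses,
# the second fails the floor but improves on its parent.
dropping_pr = diff_gate({"golden": 0.92}, {"golden": 0.86})
lifting_pr = diff_gate({"golden": 0.80}, {"golden": 0.84})
```

Under these assumptions `passes(dropping_pr)` is false and `passes(lifting_pr)` is true, which is the opposite of what the absolute 0.85 gate decides for the same two changes.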

Eval Sets Have Seasons: Why Quality Drops on the First Monday of Tax Season

· 12 min read
Tian Pan
Software Engineer

The dashboard fired its first regression alert on a Monday morning in late January. Quality score on the support assistant dropped three points overnight. No prompt change shipped over the weekend. No model swap. The eval suite — a hand-curated 800-row gold set that the team had built six months earlier — was unchanged. Somebody opened an incident.

Two days of bisecting later, the answer was uninteresting and structural. It was the first business Monday after the IRS opened tax filing for the year. Half the inbound queries had shifted from "where is my paycheck deposit" to "how do I report a 1099-K from a payment app." The eval set, sampled in summer, had nothing to say about a 1099-K. The model wasn't worse. The customer was different. The gate was calibrated against a customer who no longer existed.

This pattern repeats every quarter in every product that has a seasonal user — fintech in tax season, sales tools at end-of-quarter, education at back-to-school, e-commerce in returns season, travel at booking season, healthcare at enrollment season. The eval-set-as-fixed-asset is a comfortable abstraction, and it is wrong on a calendar that nobody updates.
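
One way to notice the calendar moving is to compare what the eval set covers against what production is currently asking. The sketch below assumes both the gold set and a recent traffic sample carry an intent label from some upstream classifier; the function name, the intent names, and both thresholds are hypothetical.

```python
from collections import Counter


def coverage_gaps(eval_intents, prod_intents,
                  min_prod_share=0.05, min_eval_share=0.01):
    """Return intents that are common in production but near-absent in the eval set.

    The result maps intent -> (production share, eval-set share).
    Both thresholds are illustrative assumptions, not recommended values.
    """
    eval_counts = Counter(eval_intents)
    prod_counts = Counter(prod_intents)
    eval_total = sum(eval_counts.values()) or 1
    prod_total = sum(prod_counts.values()) or 1

    gaps = {}
    for intent, count in prod_counts.items():
        prod_share = count / prod_total
        eval_share = eval_counts.get(intent, 0) / eval_total
        if prod_share >= min_prod_share and eval_share < min_eval_share:
            gaps[intent] = (prod_share, eval_share)
    return gaps


# A gold set sampled in summer versus a sample from the first Monday of tax season.
summer_gold_set = ["paycheck_deposit"] * 800
january_traffic = ["paycheck_deposit"] * 50 + ["report_1099k"] * 50
gaps = coverage_gaps(summer_gold_set, january_traffic)
```

Here `gaps` flags `report_1099k` as half of sampled traffic with no eval rows at all. Run on a schedule, a check like this would turn "the customer is different" from a two-day bisect into a routine alert.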

The LLM SDK Upgrade Tax: Why a Patch Bump Is a Model Rollout in Disguise

· 10 min read
Tian Pan
Software Engineer

A team I worked with last quarter shipped a regression to production at 2:14 a.m. on a Tuesday. The on-call alert fired because the JSON parser downstream of their summarization agent was rejecting one in twenty responses with a trailing-comma error. The model hadn't changed. The prompt hadn't changed. The eval suite had passed at 96.4% the night before, comfortably above the 95% gate. What had changed was a single line in package.json: the model provider's SDK had moved from 4.6.2 to 4.6.3. Patch bump. Auto-merged by the dependency bot. The release notes said "internal cleanups."

The "internal cleanup" was a tightened JSON-mode parser that dropped a forgiving fallback path, one that had been quietly fixing a recurring trailing-comma quirk in the model's tool-call output. The model's behavior was unchanged. The SDK's interpretation of that behavior was not. The team's eval suite never saw the regression because the eval suite ran against a different SDK version than the one the dependency bot had just promoted.

This is the LLM SDK upgrade tax, and it is one of the quietest, most expensive failure modes in production AI today. The SDK is not a passive transport. It is an active participant in your prompt's behavior, and the team that upgrades it without an eval is doing a model rollout in disguise.
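
A minimal sketch of one guard against this, assuming a Python deployment where each eval run records the package versions it ran against (the anecdote's `package.json` would need the equivalent check against its own lockfile). The package name `provider-sdk` and both function names are hypothetical; `importlib.metadata` is the standard-library way to read installed versions.

```python
from importlib import metadata


def installed_versions(packages):
    """Read the installed version of each package, or None if it is not installed."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def version_mismatches(evaluated, deployed):
    """List every package whose deployed version differs from the evaluated one.

    `evaluated` is the manifest saved with the last passing eval run;
    `deployed` is what the release artifact will actually import.
    """
    mismatches = []
    for name, deployed_version in sorted(deployed.items()):
        evaluated_version = evaluated.get(name)
        if evaluated_version != deployed_version:
            mismatches.append(
                f"{name}: evaluated {evaluated_version}, deploying {deployed_version}"
            )
    return mismatches


# The patch bump from the text, with a hypothetical package name.
blocked = version_mismatches({"provider-sdk": "4.6.2"}, {"provider-sdk": "4.6.3"})
```

A non-empty `blocked` list would mean the eval result does not cover what is about to ship, so the release should wait for a fresh eval run rather than inherit last night's 96.4%.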

The 12-Month AI Feature Cliff: Why Your Production Models Decay on a Calendar Nobody Marked

· 11 min read
Tian Pan
Software Engineer

A feature ships at 92% pass rate. The launch deck celebrates it. Twelve months later the same feature is at 78% — no incident report, no failed deploy, no single change to point at, just a slow erosion that nobody owned watching for. The team blames "hallucinations" or "user behavior shift," picks a junior engineer to investigate, and sets a quarterly OKR to "improve quality." The OKR misses. The feature ships an apologetic dialog telling users the AI sometimes makes mistakes. Six months after that, it's deprecated and replaced with a new version that ships at 91% pass rate, and the cycle starts again.

This isn't bad luck. It's the second clock that AI features run on, the one that nobody marks on the release calendar at launch. Conventional software has feature decay too — dependency drift, codebase rot, the slow accumulation of half-applied refactors — but those decay on a clock the engineering org already understands and budgets for. AI features have all of that, plus a parallel set of decay sources that conventional amortization assumptions don't model: model deprecations, vendor weight rotations, distribution shift in user inputs, prompt patches that compound, judge calibration drift, and the quiet aging of an eval set that no longer represents what production traffic looks like.

The architectural realization that has to land — before the next AI feature ships, not after — is that AI features have a non-zero baseline maintenance cost. The feature isn't done when it launches. It's enrolled in a maintenance schedule it can't escape, and the team that didn't budget for that schedule is going to discover it the hard way.

The Two-PM Problem: When Prompt Ownership and Product Ownership Drift Apart

· 11 min read
Tian Pan
Software Engineer

A support ticket lands on Tuesday morning: a customer was given a confidently wrong answer about their refund window. Engineering pulls the trace and finds the model picked the wrong intent. The product PM looks at the dashboard and sees the new "express refund" affordance — shipped last sprint — surfaced an intent the prompt was never tuned to handle. The platform PM points at the eval suite, which is green. Both are technically right. The customer is still wrong.

This is the two-PM problem, and most AI teams have it without naming it. The product PM owns the user-facing surface — intents, success metrics, the support escalation path. The platform or ML PM owns the prompt, the model choice, the eval suite, and the cost ceiling. The roadmaps are coordinated at the quarterly-planning level and drift at the weekly-shipping level, because the two PMs are optimizing for different metrics on different dashboards with different change-control processes.

The interesting failure mode isn't that the two PMs disagree. It's that they ship correctly relative to their own scope and still produce a regression nobody owns.

The Agent Finished Into an Empty Room: Stale-Context Delivery for Async Background Tasks

· 10 min read
Tian Pan
Software Engineer

A background agent that takes ninety seconds to finish a task is operating on a snapshot of the world from ninety seconds ago. By the time it returns, the user may have navigated to a different view, started a new conversation, archived the original request, or closed the tab entirely. Most agent frameworks ship the result anyway, mutate state to reflect it, and treat the round trip as a success. It is not a success. It is the agent finishing into an empty room.

The failure mode is uglier than dropping the result. A dropped result is a missed delivery — annoying but recoverable. An applied stale result is an answer to a question the user is no longer asking, written against state that no longer matches, often overwriting the work the user moved on to. The user notices that something they did not ask for has happened, cannot reconstruct why, and loses trust in the system in a way that a simple timeout never would.

The fix is not faster agents. It is a delivery-time relevance gate that treats the moment of return as a fresh decision, not the foregone conclusion of the moment of dispatch.
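
As a rough sketch of such a gate: capture a small snapshot of the user's context at dispatch, capture it again at return, and decide what to do with the result based on the difference. The fields, the three outcomes, and the policy itself are assumptions made for illustration; a real product would pick its own signals.

```python
from dataclasses import dataclass
from enum import Enum


class Delivery(Enum):
    APPLY = "apply"    # mutate state with the result
    NOTIFY = "notify"  # offer the result without mutating anything
    DROP = "drop"      # discard the result


@dataclass(frozen=True)
class ContextSnapshot:
    conversation_id: str
    state_version: int            # bumped whenever the user edits the target state
    request_archived: bool = False


def decide_delivery(dispatched, current):
    """Treat the moment of return as a fresh decision, not a foregone conclusion."""
    if current.request_archived:
        # The user explicitly withdrew the request.
        return Delivery.DROP
    if current.conversation_id != dispatched.conversation_id:
        # The user moved on to something else; do not write into their new work.
        return Delivery.NOTIFY
    if current.state_version != dispatched.state_version:
        # The result was computed against state that has since changed.
        return Delivery.NOTIFY
    return Delivery.APPLY


at_dispatch = ContextSnapshot(conversation_id="c1", state_version=7)
ninety_seconds_later = ContextSnapshot(conversation_id="c1", state_version=8)
decision = decide_delivery(at_dispatch, ninety_seconds_later)
```

In this example the user edited the state while the agent was working, so `decision` is `Delivery.NOTIFY`: the result is offered, not silently applied over newer work.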

Why Deprecating an AI Feature Is Harder Than You Think: Users Built Trust Scaffolding You Can't See

· 10 min read
Tian Pan
Software Engineer

When OpenAI tried to pull GPT-4o from ChatGPT in August 2025, the backlash was strong enough — organized hashtags, paying users threatening to cancel, public reversal within days — that the company restored it as a default option and promised "substantial notice" before any future removals. The replacement was, by every benchmark the team cared about, better. None of that mattered. Users had spent months learning the model's quirks, calibrating their judgment to its failure modes, and integrating its specific phrasing into workflows the team had never instrumented. Replacing it with "the better version" reset that calibration to zero.

This is the failure mode that the standard deprecation playbook does not cover. Sunsetting a regular SaaS feature — announce, migrate, dark-launch the removal, retire — assumes the user contract is the API surface. For AI features, the contract is the observed behavior of the model: phrasings, tendencies, failure modes, the specific way it handles ambiguity. Users build scaffolding on top of that behavior, and most of the scaffolding lives in their heads, on their laptops, and in downstream systems your team never touches.

Contract Tests for LLM Tool Surfaces: When the Vendor Changes a Field and Your Agent Silently Adapts

· 11 min read
Tian Pan
Software Engineer

A vendor flipped "items" to "results" in a tool response last Tuesday. The agent didn't crash. It re-planned around the new shape, returned a confident-looking answer that was missing two-thirds of the rows, and the on-call engineer found out three days later when a customer asked why their export was short. No exception fired. No alert tripped. The eval suite, which runs against a frozen fixture from before the vendor change, was green the whole time.

This is the failure mode that contract testing was invented to catch in microservices a decade ago, and almost no agent stack has an equivalent for it today. HTTP services have Pact, Schemathesis, and oasdiff to sit between consumer and provider and refuse to let breaking changes ship. The tools you hand to your agent (REST endpoints, internal RPCs, vendor SDKs, MCP servers) have nothing comparable. The model absorbs the change, adapts gracefully, and produces a degraded answer that looks correct.
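
For a sense of how small the missing check can be, here is a minimal sketch that validates a live tool response against a recorded shape before the agent ever sees it. It is a hand-rolled stand-in, not Pact or any real contract-testing library; the function name, the contract format, and the example fields other than `items` and `results` are hypothetical.

```python
def contract_violations(response, contract):
    """Compare a tool response against an expected {field: type} contract.

    Returns a list of human-readable violations; an empty list means the
    response still has the shape the agent's prompt and fixtures assume.
    """
    violations = []
    for field, expected_type in contract.items():
        if field not in response:
            violations.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            actual = type(response[field]).__name__
            violations.append(
                f"wrong type for {field}: expected {expected_type.__name__}, got {actual}"
            )
    for field in sorted(response.keys() - contract.keys()):
        violations.append(f"unexpected field: {field}")
    return violations


# Shape recorded when the frozen fixture was captured (assumed for illustration).
EXPORT_TOOL_CONTRACT = {"items": list, "total": int}

# The vendor's response after the rename described in the text.
renamed_response = {"results": [{"id": 1}, {"id": 2}, {"id": 3}], "total": 3}
violations = contract_violations(renamed_response, EXPORT_TOOL_CONTRACT)
```

Here `violations` reports both the missing `items` field and the unexpected `results` field. Failing loudly at this point, against live responses and not only the frozen fixture, is what keeps the model from quietly re-planning around the new shape.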

Determinism Budgets: Treat Randomness as a Per-Surface Allocation, Not a Global Knob

· 11 min read
Tian Pan
Software Engineer

The temperature debate is the most religious argument in AI engineering, and one of the least productive. Two camps form on every team: the determinists who want temperature pinned at zero everywhere because they cannot debug a flaky system, and the creatives who want it cranked up because the outputs feel more "alive." Both are wrong, because both are answering the question at the wrong level. Temperature is not a global setting. It is a budget — and like any budget, it should be allocated, not declared.

The productive frame is simple: every model call in your system has a purpose, and randomness either earns its keep at that surface or it does not. A planner deciding which tool to call next has nothing to gain from variation; an off-by-one tool selection is a debugging nightmare and there is no creative upside. A response-synthesis surface that summarizes a search result for ten thousand users gets robotic in a hurry if every user sees the same phrasing — and the SEO team will eventually flag the boilerplate. A brainstorming surface where the model proposes alternatives for a human to pick from is worse at temperature 0; the diversity is the feature.

If you cannot articulate what randomness is for at a given call site, you should not be paying for it.
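
One way to read the budget idea in code is a per-surface table where every non-zero temperature has to carry a written rationale, and any surface nobody has argued for gets none. This is a minimal sketch: the surface names follow the examples above, while the specific values 0.4 and 1.0, the `RandomnessBudget` type, and `temperature_for` are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RandomnessBudget:
    temperature: float
    rationale: str  # what the randomness is for at this call site

    def __post_init__(self):
        # Randomness without a stated purpose is not allowed into the table.
        if self.temperature > 0 and not self.rationale.strip():
            raise ValueError("non-zero temperature requires a rationale")


# Illustrative allocations; the numbers are assumptions, not recommendations.
BUDGETS = {
    "planner": RandomnessBudget(0.0, ""),
    "response_synthesis": RandomnessBudget(
        0.4, "vary phrasing so users do not all see identical boilerplate"
    ),
    "brainstorm": RandomnessBudget(
        1.0, "diversity of alternatives is the feature"
    ),
}


def temperature_for(surface):
    """Return the allocated temperature, defaulting to 0.0 for unlisted surfaces."""
    budget = BUDGETS.get(surface)
    return budget.temperature if budget is not None else 0.0
```

Each model call would then pass `temperature_for("planner")` or similar instead of reading a global constant, so raising the temperature anywhere means adding a line to the table and defending it in review.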