113 posts tagged with "evals"

The AI Feature Sunset Playbook Nobody Writes

· 13 min read
Tian Pan
Software Engineer

Every AI org has a graveyard. Not of services — those get a runbook, a deprecation banner, a 30-day migration window, and a slot on the platform team's quarterly roadmap. The graveyard is of features: the smart-summary beta that never graduated, the auto-categorizer that two enterprise customers actually built workflows around, the agentic flow that demoed beautifully and shipped behind a flag that nobody flipped off. The endpoint is easy to deprecate. The four other things attached to it — the prompt, the judge, the regression set, and the incident memory — are what actually take a quarter, and nobody on the team has written the playbook because nobody has been promoted for retiring something.

This is the gap. Most of the public discourse on "model deprecation" is about vendor-side retirements: GPT-4o leaves on a date, Assistants API beta sunsets on August 26, DALL-E 3 retires on May 12, and your platform team has a notification period to migrate. That problem has playbooks because vendors publish dates, because the migration is forced, and because the work fits in a sprint. The internal version — when you decide a feature you built didn't graduate, and you have to actually take it out — has none of those forcing functions. The deprecation date is whatever you say it is. The migration path is whatever you build. And the artifacts you have to retire are not a single endpoint but a tangled stack of model-adjacent assets that your monitoring barely knows exist.

The Composability Tax: Why Adding Tools Makes Your Planner Worse

· 9 min read
Tian Pan
Software Engineer

The team starts with five tools and a planner that hits the right one 95% of the time on production traffic. Eighteen months later they have fifty-one, the planner is sitting at 26%, and the simple cases the original five handled cleanly — book a meeting, look up a customer, file a ticket — now sometimes route to the wrong tool because there are three plausible-sounding lookalikes in the catalog. Nobody decided to make the planner worse. Every tool addition was individually defensible. The cumulative bill is the composability tax, and it is paid by every product whose tool catalog grows without a retirement discipline.

The tax is a curve, not a cliff. The Berkeley Function Calling Leaderboard measured it directly: on calendar scheduling, accuracy fell from 43% with four tools to 2% with fifty-one across multiple domains. On customer-support style tasks, GPT-4o dropped from 58% (single domain, nine tools) to 26% (seven domains, fifty-one tools). Llama-3.3-70B went from 21% to 0% over the same expansion. The shape repeats across models and task types: every additional tool moves the planner down the curve, and the marginal damage gets worse as the catalog gets larger because new entries are increasingly indistinguishable from incumbents.

The Demo Account Eval Set Your Sales Team Is Running Without You

· 10 min read
Tian Pan
Software Engineer

The most expensive eval set in your company isn't in your repo. It's in a slide deck a sales engineer assembled six months ago, plus three demo accounts named after your top-five logos, plus a half-remembered script that says "click here, ask the agent to summarize last quarter, watch the magic happen." It runs once or twice a week, in front of prospects worth six or seven figures. Nobody on the AI team has ever scored a run.

Then you ship a model migration on a Tuesday. On Thursday at 4 PM, the sales engineer pings the on-call channel: the summary output now starts with "Certainly! Here is a summary…" instead of jumping into the bullet points, the numbers are spelled out instead of digits, and the prospect — a Fortune 500 CFO who scheduled this meeting four weeks ago — just asked whether the product is always this chatty. The release notes called it a 1.2-percentage-point eval lift.

When Marketing Reads Your Eval Cases: The Cross-Functional Visibility Problem

· 11 min read
Tian Pan
Software Engineer

The eval set is the most-read artifact your AI team produces, and you almost certainly don't know who's reading it. The repo is private, the CI job is internal, the file is one directory above prompts/ — and yet a sales engineer scraped six cases for a demo last quarter, a marketing analyst pulled three failure cases into a "look how robust our system is" deck, customer success cited eval pass-rates verbatim in a renewal call, and product treats the file as the hidden spec the AI team won't share. The case files are read by more people than the code that generated them, and nobody on the AI team has noticed.

This isn't a permissions failure. The eval set is on the same Git server as the rest of the codebase, with the same access controls as every other engineering artifact. The problem is that the AI team is the only group that treats the eval set as code. Everyone else treats it as documentation, as marketing material, as a product spec, or as a customer complaint log — and each of those readings extracts a different slice of the same file, packages it for a different audience, and ships it somewhere the AI team isn't watching.

The First 90 Days for an AI Engineer: An Onboarding Playbook That Survives the Six-Week Doc Rot

· 12 min read
Tian Pan
Software Engineer

The new hire opens the onboarding doc. It points at a service architecture diagram from eleven months ago, a Confluence page titled "Our LLM Stack" last edited in October, and a Notion table of "model providers we use." Nothing in any of these documents tells them which prompt was tuned against which failure mode, which eval cases were added after which incident, which judge was recalibrated when the model bumped from 4.5 to 4.6, or why the system prompt for the support agent has a strange three-line preamble nobody wants to touch. Two weeks in, they ship a "small prompt cleanup" PR that removes the preamble. The eval suite passes. Production accuracy drops four points within a day.

The standard new-hire onboarding playbook — read the architecture doc, set up your laptop, do your first PR by week two — was built for engineers who join services. AI engineers join a different artifact. The thing they're going to be editing isn't a 5,000-line Go service that some staff engineer wrote; it's a 30-line prompt that survived eleven incidents and seventeen eval-driven rewrites, and the meaning of those thirty lines lives in the heads of two people on the team. Your onboarding doc cannot capture that, and trying to write a longer doc is the wrong fix.

Free-Tier Traffic Is Your Real Eval Set

· 10 min read
Tian Pan
Software Engineer

The team optimizing the model against paid-cohort traces is grading itself on the easy distribution. Paying users have a workflow. They self-selected into the product because something about it justified pulling out a credit card, which means by the time they're in the eval set, they've already learned which prompts work, which features deliver, and which corners not to wander into. Free-tier users do none of that. They're anonymous, exploratory, often adversarial, often non-native English speakers stress-testing a product in their second language, and they exercise the long tail of failure modes the eval set was never built to cover.

This is the asymmetry that quietly eats the conversion funnel of every freemium AI product. The team grades the model against a curated sample drawn disproportionately from paid traces. The free-tier weird traces — the ones with no template, the ones where someone is genuinely trying to figure out what the product does — never get labeled, never get a regression test, and never inform the next prompt edit. The model gets better against the paid distribution and slowly worse against the distribution that decides whether free users ever upgrade.

Locale-Stratified Evals: How to Catch Non-English Regressions Your English Test Set Can't See

· 12 min read
Tian Pan
Software Engineer

Your aggregate eval score is up 1.2 points after the last prompt change. Your CSAT on French queries dropped four points the same week. Both numbers are correct. They disagree because the eval set is 88% English and 6% Spanish, and the remaining 6% is a long tail of locales, none of which has enough cases to move the rollup. The French regression is in your data; it is just sitting below the noise floor of your top-line metric.

This is the most common shape of locale drift I see in production AI systems: not a sudden collapse, not a translated-string bug, but a steady performance gap that the rollup hides and the support queue eventually surfaces. By the time someone in the Paris office forwards a screenshot, you have shipped two more prompt changes on top of the regression and the bisect costs three engineering days.
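A minimal sketch of how a rollup can rise while one locale falls, assuming each eval result is a hypothetical `(locale, passed)` pair. The case counts below are invented for illustration (88% English, 6% Spanish, a small French slice) and are not taken from any real eval run.

```python
from collections import defaultdict

def pass_rate_by_locale(results):
    """Return {locale: (pass_rate, n_cases)} from (locale, passed) pairs."""
    totals = defaultdict(lambda: [0, 0])  # locale -> [n_passed, n_total]
    for locale, passed in results:
        totals[locale][0] += int(passed)
        totals[locale][1] += 1
    return {loc: (p / n, n) for loc, (p, n) in totals.items()}

def aggregate_pass_rate(results):
    """Top-line pass rate over all cases, ignoring locale."""
    results = list(results)
    return sum(int(passed) for _, passed in results) / len(results)

def locale_regressions(before, after, min_drop=0.05):
    """Locales whose pass rate fell by at least min_drop between two runs."""
    b, a = pass_rate_by_locale(before), pass_rate_by_locale(after)
    return {
        loc: (b[loc][0], a[loc][0])
        for loc in b
        if loc in a and b[loc][0] - a[loc][0] >= min_drop
    }

def make_run(spec):
    """Build synthetic results from {locale: (n_passed, n_total)}."""
    run = []
    for locale, (n_pass, n_total) in spec.items():
        run += [(locale, True)] * n_pass + [(locale, False)] * (n_total - n_pass)
    return run

# Illustrative data only: 1,000 cases per run.
before = make_run({"en": (700, 880), "es": (48, 60), "fr": (16, 20), "other": (30, 40)})
after = make_run({"en": (715, 880), "es": (48, 60), "fr": (12, 20), "other": (30, 40)})

print(aggregate_pass_rate(before), aggregate_pass_rate(after))
print(locale_regressions(before, after))
```

In this toy data the aggregate improves while French is the only locale flagged. With only 20 French cases the per-locale estimate is itself noisy, so a gate like `min_drop` is only meaningful once each stratum has enough cases to support it.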

Model Migration Bills You Twice: The Eval Re-Anchoring Tax Nobody Prices

· 10 min read
Tian Pan
Software Engineer

Every model upgrade gets sold to the team as a swap: a one-line config change, a measurable win on latency or cost or quality, and a few days of prompt re-tuning to absorb the new model's quirks. The procurement deck shows per-token deltas, the engineering ticket lists the rollout phases, and the FP&A team books the quarterly savings. Then the eval scores come in and nobody recognizes them. Quality is flat where it should have moved. Two judges that used to agree are now diverging by ten points. The snapshot suite is red, but the diffs look like rewordings. Somebody in standup asks the question that should have been on the migration plan from day one: what is the model actually being scored against?

This is the second bill — the eval re-anchoring tax — and it is reliably larger than the first. The human-annotated reference scores were anchored to the previous model's output distribution. The LLM-as-judge graders were calibrated against the old model's failure modes. The snapshot fixtures captured the old model's wording. The team's intuition for "good output" was trained on the old model's stylistic tells. None of that survives the swap intact.

On-Call at 3am for an AI Feature That Didn't 500

· 12 min read
Tian Pan
Software Engineer

The pager goes off at 3:02 AM. You squint at your phone expecting the usual: a database failover, a CDN edge that wandered off, a 500 spike from a service nobody touched in eight months. Instead the alert reads: summarizer.eval-on-traffic.helpfulness rolling-1h: 4.21 → 4.05 (Δ -0.16). No HTTP error. No latency spike. No service is down. Every request the system served in the last hour returned a 200 with a body that parsed cleanly. And yet something is unmistakably worse than it was at midnight, and the rotation expects you to figure out what.

This is the on-call shift the standard runbook wasn't written for. The thing that broke didn't break — it regressed. The error budget you've been tracking for years is denominated in availability and latency, and the failure mode that paged you isn't visible in either. The page is real, the customer impact is real, and your usual diagnostic loop — check the deploy log, check the dependency graph, find the bad release, roll it back — runs into a wall the moment you realize that "the bad release" might be a 30-line system-prompt diff that landed at 4 PM yesterday and looked completely innocuous in code review.
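A rough sketch of the two pieces this kind of page implies: a check that fires on a drop in a rolling quality score rather than on errors, and a lookup that treats prompt diffs as release candidates alongside code deploys. The thresholds, the `kind` labels, and the timestamps are hypothetical. Only the 4.21 and 4.05 values come from the alert text above.

```python
from datetime import datetime, timedelta
from statistics import mean

def should_page(baseline_scores, current_scores, max_drop=0.10, min_samples=30):
    """Page when the current window's mean is max_drop or more below baseline."""
    if min(len(baseline_scores), len(current_scores)) < min_samples:
        return False  # too few graded requests to trust the delta
    return mean(current_scores) - mean(baseline_scores) <= -max_drop

def suspect_changes(changes, alert_time, lookback=timedelta(hours=24)):
    """Changes of any kind inside the lookback window, newest first."""
    window_start = alert_time - lookback
    recent = [c for c in changes if window_start <= c["at"] <= alert_time]
    return sorted(recent, key=lambda c: c["at"], reverse=True)

# Hypothetical timeline: a prompt diff at 4 PM, a page at 3:02 AM the next day.
alert_time = datetime(2025, 1, 15, 3, 2)
changes = [
    {"kind": "code_deploy", "at": datetime(2025, 1, 10, 11, 0)},
    {"kind": "prompt_diff", "at": datetime(2025, 1, 14, 16, 0)},
]

print(should_page([4.21] * 60, [4.05] * 60))
print([c["kind"] for c in suspect_changes(changes, alert_time)])
```

The point of `suspect_changes` is that the change log it reads has to include prompt and model-config edits at all. If those land outside the deploy log, the lookup returns nothing and the diagnostic loop stalls exactly as described.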

The Prompt Graph Inside Your Agent: Cross-Prompt Regression Chains Nobody Mapped

· 11 min read
Tian Pan
Software Engineer

A senior engineer ships a four-word edit to the planner prompt — "if uncertain, ask first." The planner's own eval set, which grades whether plans are reasonable, moves up by half a point. They merge. Two weeks later, the verifier's eval shows a three-point pass-rate regression and nobody can repro it. The root cause turns out to be that the planner now asks more clarifying questions, the executor receives shorter task descriptions on the second turn, the verifier's rubric was implicitly tuned against the previous executor's longer outputs, and an edit nobody flagged as risky has shifted three downstream distributions at once.

This is what happens when you treat the prompts inside an agent as a flat folder of files instead of as a graph with edges. The prompts have owners. The edges between them have nobody.
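One way to make the edges explicit is a small adjacency map from each prompt to the prompts that consume its output, walked on every edit to decide which eval sets must run. This is a sketch under assumptions: the three node names mirror the planner, executor, and verifier in the example above, and the graph itself is hypothetical.

```python
from collections import deque

# Hypothetical edges: prompt -> prompts that consume its output.
PROMPT_GRAPH = {
    "planner": ["executor"],
    "executor": ["verifier"],
    "verifier": [],
}

def downstream(graph, changed):
    """All prompts reachable from `changed`, in breadth-first order."""
    seen, order, queue = {changed}, [], deque([changed])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order

def evals_to_run(graph, changed):
    """The edited prompt's own eval set plus every downstream eval set."""
    return [changed] + downstream(graph, changed)

print(evals_to_run(PROMPT_GRAPH, "planner"))
```

Under this map, the four-word planner edit would have triggered the executor and verifier evals before merge instead of two weeks after.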

RAG Against a Phantom Inventory: When Your Corpus Describes Features Your Product Removed

· 11 min read
Tian Pan
Software Engineer

A customer asks your support agent how to do something. The agent retrieves three documentation chunks with high relevance scores, synthesizes a confident answer, and walks the customer through a five-step procedure that ends on a button that hasn't existed for four months. The customer files a ticket. The on-call engineer pulls the eval suite, finds it green, pulls the retrieval traces, finds them green too — the model didn't hallucinate, it faithfully quoted documentation describing a feature your product team renamed in the last quarterly release.

This is the failure mode I want to name: not a hallucination, not a retrieval miss, but a phantom inventory problem. Your retrieval corpus is a snapshot of a product surface that no longer exists. The vector store doesn't know the product changed. The eval suite doesn't know either. The only system that consistently catches it is the support ticket queue, and by the time a ticket is filed the customer has already been told to click a button that isn't there.
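A minimal sketch of a check that could run before the ticket queue does, assuming each corpus chunk carries a hypothetical `features` list naming the UI elements or features it describes, and that the product team can export a list of what is currently live. Neither the metadata field nor the feature names come from a real system.

```python
def phantom_chunks(chunks, live_features):
    """Return chunks that reference at least one feature absent from the live inventory."""
    live = set(live_features)
    flagged = []
    for chunk in chunks:
        missing = sorted(set(chunk["features"]) - live)
        if missing:
            flagged.append({"id": chunk["id"], "missing": missing})
    return flagged

# Hypothetical corpus and inventory.
chunks = [
    {"id": "doc-101", "features": ["export_button", "report_builder"]},
    {"id": "doc-102", "features": ["report_builder"]},
    {"id": "doc-103", "features": []},
]
live_features = ["report_builder", "download_menu"]

print(phantom_chunks(chunks, live_features))
```

Run against every release's feature inventory, a check like this flags `doc-101` as soon as `export_button` disappears, which is the signal neither the retrieval scores nor the eval suite carries.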

Rater Throughput Is the Hidden Bottleneck in Your Eval Pipeline

· 10 min read
Tian Pan
Software Engineer

The team plans an eval suite the way they plan a service: failure modes inventoried, rubric drafted, sample size argued over, judge calibration scheduled. Then they file the rater capacity as a footnote — "we'll get the annotation team to grade a few hundred per week" — and ship the rest. Six weeks later the rater queue is at 4,300 items, eval velocity has collapsed to one judge-calibration cycle per month, and someone in a planning review says the quiet part out loud: nobody capacity-planned the humans.

Rater throughput is the binding constraint on eval velocity in any AI system that takes human grading seriously, and the team that treats annotation as an SRE problem rather than a recruiting one is the team that ships. A human reviewer processes 50–100 examples per hour at expert difficulty, and an expert annotator caps out around 500–1,000 examples per week. Those numbers are not a recruiting problem to be brute-forced with headcount; they are an operational property of the eval system that has to be modeled and budgeted the way you model database IOPS.
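The capacity arithmetic can be sketched as a simple queue model: weekly grading capacity minus weekly inflow gives the net drain rate, and the backlog divided by that rate gives weeks to clear. The function and the inflow figures are hypothetical. The 4,300-item backlog and the 500-per-week rater rate are taken from the numbers above.

```python
import math

def weeks_to_drain(backlog, raters, per_rater_per_week, inflow_per_week):
    """Whole weeks needed to clear the backlog, or math.inf if the queue never shrinks."""
    capacity = raters * per_rater_per_week
    net = capacity - inflow_per_week
    if net <= 0:
        return math.inf  # inflow meets or exceeds capacity: the backlog grows forever
    return math.ceil(backlog / net)

# Two raters at the low end of the weekly range, with an assumed inflow of 300 items/week.
print(weeks_to_drain(4300, raters=2, per_rater_per_week=500, inflow_per_week=300))

# One rater with an assumed inflow of 600 items/week never catches up.
print(weeks_to_drain(4300, raters=1, per_rater_per_week=500, inflow_per_week=600))
```

The second case is the one that produces a 4,300-item queue in the first place: once inflow exceeds capacity, no amount of waiting drains it, and the only levers are fewer items sent for grading or more grading capacity.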