134 posts tagged with "evals"

The Prompt Log Is the Product Roadmap You Threw Away

· 9 min read
Tian Pan
Software Engineer

Somewhere in your observability stack is a table that holds every prompt a user typed into your AI feature last quarter. If your team is like most, that table is used for three things: cost attribution, abuse detection, and the occasional debugging session when a customer reports a bad answer. Nobody on the product team has ever opened it. Nobody on the research team has clustered it. The PM running the AI roadmap has never read a single row.

This is the most expensive oversight in your product organization. The prompts your users typed — especially the ones your feature handled badly — are the highest-resolution form of "what users wish this product did" you will ever collect. You are paying inference costs to generate this signal in real time, and you are throwing it away because nobody decided whose job it was to read it.

The Typo Your Agent Learned to Honor

· 10 min read
Tian Pan
Software Engineer

An insurance carrier fine-tuned a support model on a year of chat transcripts. Within a week of launch, a compliance reviewer flagged something odd: the bot kept writing "deductable" instead of "deductible." Not occasionally — consistently, in roughly the same one-in-eight messages where the word appeared. The model had not invented the misspelling. It had inherited it. A handful of tier-1 reps had been typing it that way for two years, and the corpus reflected what they typed, not what the dictionary said.

This is the unsettling thing about supervised fine-tuning on operational data: the model is not learning your domain. It is learning your corpus. Those two things overlap, but they are not the same, and the gap is where every preventable behavioral defect lives. Frequency in your training data is not a signal of correctness. It is a signal of what your team happened to do enough times for the model to mimic it.

The misspelling is the easy case to spot. The hard cases are the ones nobody bothered to write down as rules, because everyone assumed the model would learn the "professional" version of the work rather than the actual work as performed.

The Verifier Loop That Couldn't Converge

· 11 min read
Tian Pan
Software Engineer

The most expensive bug in an agent system is the one with no error message. Worker proposes a draft. Verifier rejects it with a paragraph of feedback. Worker revises. Verifier rejects again. The loop keeps spinning, the trace keeps growing, the bill keeps climbing, and from the outside the system looks like it is working — diligently, in fact, because both models are doing their assigned job. What nobody priced in is that the verifier's acceptance criteria are not fixed across calls. The target the worker is chasing is moving, and the loop has no convergence guarantee.

You shipped "iterate until satisfied," and you shipped a search through a space whose extrema may not exist.
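One way to picture the alternative is a loop with a frozen rubric and a hard round cap. The sketch below is illustrative only: `revise_until_accepted`, `revise`, `verify`, `rubric`, and `max_rounds` are hypothetical names, and the two callables stand in for whatever worker and verifier model calls a real system makes.

```python
from typing import Callable, Optional


def revise_until_accepted(
    draft: str,
    revise: Callable[[str, str], str],
    verify: Callable[[str, str], Optional[str]],
    rubric: str,
    max_rounds: int = 3,
) -> tuple[str, bool, int]:
    """Run a worker/verifier loop with a fixed rubric and a round cap.

    verify(draft, rubric) returns None to accept, or a feedback string to reject.
    Returns (final_draft, accepted, verifier_calls).
    """
    calls = 0
    while calls < max_rounds:
        calls += 1
        # The same rubric string is passed on every call, so the target
        # the worker is chasing is at least written down and held constant.
        feedback = verify(draft, rubric)
        if feedback is None:
            return draft, True, calls
        # Only pay for another revision if a verifier call remains to check it.
        if calls < max_rounds:
            draft = revise(draft, feedback)
    # Budget exhausted: surface the failure instead of spinning.
    return draft, False, calls
```

The cap does not make the loop converge. It turns a silent, unbounded spend into an explicit "not accepted after N rounds" result that can be logged and escalated.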

The Perf Review Template That Cannot See AI Work

· 11 min read
Tian Pan
Software Engineer

Your strongest AI engineer spent the cycle curating an eval set, calibrating a judge prompt, and killing two features that turned out to be task-shape mismatched. None of that fits a single line on the review template. So the calibration meeting either inflates the artifacts the engineer cares least about — PR count, design docs, on-call shifts — or invents prose to justify a high rating the framework cannot defend. Either way, the rubric and the reality are pulling in different directions, and the engineer can tell.

The template was written for deterministic software. It rewards what you can count: lines of code shipped, services owned, incidents resolved, hours spent on-call. The AI roadmap is moved by a different shape of work: curating a representative eval slice, defending a behavioral envelope under model drift, refusing to ship a feature whose task shape doesn't fit the model, and patiently shrinking the gap between a judge prompt and human intent. Almost none of that produces the artifacts the rubric was built to count.

The AI Literacy Gap Inside Your Own Team Is the Biggest Delivery Risk on Your Roadmap

· 10 min read
Tian Pan
Software Engineer

Your hiring page asks for AI experience. Your launch announcement names the AI features. Your roadmap commits to two more this quarter. And on the team that has to ship and maintain all of it, one engineer actually knows how to debug an eval failure, two can edit a prompt confidently, and twelve treat the LLM call as a black box they hand off whenever it misbehaves.

That distribution is the delivery risk nobody on your leadership team has named, because the team's stated AI capability — the thing that goes on the slide — is the maximum of any individual member's skill, and the team's actual delivery velocity is the median. The slide says one number; production runs on the other.
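The gap between the two numbers is easy to see with a toy calculation. The 1-to-3 skill scale below is an assumption made up for illustration; only the headcounts (one, two, twelve) come from the text above.

```python
from statistics import median

# Hypothetical skill scale: 3 = can debug an eval failure,
# 2 = can edit a prompt confidently, 1 = treats the LLM call as a black box.
team = [3] * 1 + [2] * 2 + [1] * 12

stated_capability = max(team)        # the number that goes on the slide
delivery_capability = median(team)   # the number production runs on

print(stated_capability, delivery_capability)  # prints "3 1"
```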

When the Agent Asks Forgiveness Instead of Permission

· 11 min read
Tian Pan
Software Engineer

Your team gave the agent a tool to refund customers, a tool to escalate to a manager, a tool to update a record in the CRM, and a system prompt that says "use your judgment." Six weeks in, the agent has shortened average resolution time by 40%, the demo to the executive team went beautifully, and the eval scores climbed every sprint. Then the apology emails start. A refund went to the wrong account because the agent didn't double-check the customer ID. An escalation pinged a director's phone at 11pm over a question a tier-one rep could have answered. A CRM update overwrote the "preferred contact channel" field that the field-sales team owns and uses to drive their territory routing. None of these are bugs in the model. They are the model doing exactly what your eval rewarded it for.

The agent learned, correctly, that taking action is scored positively and that asking the user "should I proceed?" is scored as friction. It also learned that an apology after an irreversible action is cheaper, on the metric it was being graded on, than a confirmation that delays a resolution. The act-first-apologize-later default arrived in production without any single engineer choosing it, because the eval set, the system prompt, and the tool surface together described a reward function where that policy wins.

The Chatbot That Inherited Your Support Team's Worst Habits

· 10 min read
Tian Pan
Software Engineer

You fine-tuned on a year of real customer-service transcripts because that is where the domain knowledge lives. The model now sounds like your support team. It also apologizes before it has a reason to, offers a goodwill credit it has no authority to grant, says "I've escalated this to our tier-two queue" — a queue that does not exist for it — and writes back in the half-sentence shorthand your agents use to ping each other in Slack. Domain accuracy on your eval set looks great. Three weeks into production the refunds line is up and legal wants a word.

The chatbot did not go rogue. It learned exactly what you trained it on. The problem is that a transcript is not a record of domain knowledge — it is a record of organizational behavior, and the two are stapled together at the token level in a way that supervised fine-tuning cannot separate. The same gradient step that teaches the model your return policy also teaches it that the appropriate response to a frustrated customer is a reflexive "I'm so sorry to hear that," whether or not the situation warrants apology. Your agents had reasons for those reflexes. The model has only the surface.

The Eval Budget Your CFO Cannot See on a Spreadsheet

· 8 min read
Tian Pan
Software Engineer

Open any quarterly planning spreadsheet and you can find every feature your team shipped, every contractor invoice, every cloud line item. What you will not find is a row for the outage that never happened, the hallucinated refund that was caught before it reached a customer, or the prompt regression that an eval blocked at 2 a.m. Those non-events have no SKU. They generate no ticket, no postmortem, no Slack thread. And so, when the eval budget comes up for renewal, it is competing for headcount against a feature that has a demo — and it loses, almost every time.

This is not a failure of nerve. It is a measurement problem. Eval investment behaves like a safety net and a test suite at the same time: it compounds quietly, it pays out in disasters avoided, and its entire value is counterfactual. Finance is structurally blind to counterfactuals. If you lead an AI team, your job is not to argue that evals are important — everyone already nods at that. Your job is to make a compounding, invisible return legible to people who only trust spreadsheets.

The Eval That Quietly Went Stale: When Your Test Suite Measures a World That No Longer Exists

· 9 min read
Tian Pan
Software Engineer

Your eval suite passed. All 240 cases green, same as last week. You ship. Two days later support tickets spike, and when you read the transcripts you find a failure mode your suite has no opinion about at all — not a case that flipped from pass to fail, but a question your users started asking that your suite never thought to ask.

This is the quiet failure of evals. We treat a green run as a statement about the present: "the system works." It is actually a statement about a past — the moment the eval cases were written. An eval authored six months ago encodes three things as they were that day: the product's scope, the model's failure modes, and the way real users phrase their requests. All three move. The feature grew a new surface. The model got upgraded twice. The input distribution drifted as users learned what the product could do. The suite did not move with them, so a green run increasingly certifies a world that no longer exists.

Nobody notices, because nothing breaks. A stale eval does not throw an error. It keeps passing, confidently, while measuring less and less of what matters.
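A rough way to make staleness visible is to ask how much of recent production traffic falls on intents the suite has any case for. The sketch below assumes each eval case and each production request has already been tagged with an intent label (by a classifier, clustering, or hand review); the `eval_coverage` function and the labels are hypothetical.

```python
from collections import Counter


def eval_coverage(eval_intents, production_intents):
    """Compare eval-suite intents against recent production intents.

    Returns (covered_share, uncovered), where covered_share is the fraction
    of production requests whose intent appears in at least one eval case,
    and uncovered lists (intent, count) pairs the suite has no case for,
    largest first.
    """
    covered = set(eval_intents)
    traffic = Counter(production_intents)
    total = sum(traffic.values())
    if total == 0:
        return 1.0, []
    covered_count = sum(n for intent, n in traffic.items() if intent in covered)
    uncovered = [(i, n) for i, n in traffic.most_common() if i not in covered]
    return covered_count / total, uncovered


# Illustrative labels only: the suite knows two intents, users now ask about three.
share, missing = eval_coverage(
    ["refund", "billing"],
    ["refund"] * 6 + ["billing"] * 2 + ["export"] * 2,
)
print(share, missing)  # prints "0.8 [('export', 2)]"
```

A falling covered share on a suite that is still fully green is the signal described above: the run is passing while measuring less of what users actually ask.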

The AI Feature Sunset Playbook Nobody Writes

· 13 min read
Tian Pan
Software Engineer

Every AI org has a graveyard. Not of services — those get a runbook, a deprecation banner, a 30-day migration window, and a slot on the platform team's quarterly roadmap. The graveyard is of features: the smart-summary beta that never graduated, the auto-categorizer that two enterprise customers actually built workflows around, the agentic flow that demoed beautifully and shipped behind a flag that nobody flipped off. The endpoint is easy to deprecate. The four other things attached to it — the prompt, the judge, the regression set, and the incident memory — are what actually take a quarter, and nobody on the team has written the playbook because nobody has been promoted for retiring something.

This is the gap. Most of the public discourse on "model deprecation" is about vendor-side retirements: GPT-4o leaves on a date, Assistants API beta sunsets on August 26, DALL-E 3 retires on May 12, and your platform team has a notification period to migrate. That problem has playbooks because vendors publish dates, because the migration is forced, and because the work fits in a sprint. The internal version — when you decide a feature you built didn't graduate, and you have to actually take it out — has none of those forcing functions. The deprecation date is whatever you say it is. The migration path is whatever you build. And the artifacts you have to retire are not a single endpoint but a tangled stack of model-adjacent assets that your monitoring barely knows exist.

The Composability Tax: Why Adding Tools Makes Your Planner Worse

· 9 min read
Tian Pan
Software Engineer

The team starts with five tools and a planner that hits the right one 95% of the time on production traffic. Eighteen months later they have fifty-one, the planner is sitting at 26%, and the simple cases the original five handled cleanly — book a meeting, look up a customer, file a ticket — now sometimes route to the wrong tool because there are three plausible-sounding lookalikes in the catalog. Nobody decided to make the planner worse. Every tool addition was individually defensible. The cumulative bill is the composability tax, and it is paid by every product whose tool catalog grows without a retirement discipline.

The tax is a curve, not a cliff. The Berkeley Function Calling Leaderboard measured it directly: on calendar scheduling, accuracy fell from 43% with four tools to 2% with fifty-one across multiple domains. On customer-support style tasks, GPT-4o dropped from 58% (single domain, nine tools) to 26% (seven domains, fifty-one tools). Llama-3.3-70B went from 21% to 0% over the same expansion. The shape repeats across models and task types: every additional tool moves the planner down the curve, and the marginal damage gets worse as the catalog gets larger because new entries are increasingly indistinguishable from incumbents.

The Demo Account Eval Set Your Sales Team Is Running Without You

· 10 min read
Tian Pan
Software Engineer

The most expensive eval set in your company isn't in your repo. It's in a slide deck a sales engineer assembled six months ago, plus three demo accounts named after your top-five logos, plus a half-remembered script that says "click here, ask the agent to summarize last quarter, watch the magic happen." It runs once or twice a week, in front of prospects worth six or seven figures. Nobody on the AI team has ever scored a run.

Then you ship a model migration on a Tuesday. On Thursday at 4 PM, the sales engineer pings the on-call channel: the summary output now starts with "Certainly! Here is a summary…" instead of jumping into the bullet points, the numbers are spelled out instead of digits, and the prospect — a Fortune 500 CFO who scheduled this meeting four weeks ago — just asked whether the product is always this chatty. The release notes called it a 1.2-percentage-point eval lift.