160 posts tagged with "evaluation"

Task Completion Goes Green While Users Quietly Suffer

· 8 min read
Tian Pan
Software Engineer

Your agent dashboard says 94% task completion. Leadership is happy. The roadmap gets funded. And yet support tickets are climbing, power users have gone quiet, and the one engineer who actually watches traces keeps muttering that something is wrong. Both things are true at once. The agent is completing tasks. It is also taking twelve minutes and four thousand tokens to do a two-step job, backtracking three times, and asking the user to confirm a fact it could have inferred from the first message.

Task completion is a binary that hides a distribution. "The agent finished" tells you nothing about the path it took to finish, and the path is most of what users actually experience. A completion-rate dashboard is structurally incapable of seeing a slow, expensive, annoying agent. It will stay green right up until users churn.

This is not a measurement gap you can patch with a better prompt. It is a category error in what you chose to measure. Completion is the easiest thing to instrument and the least of what people are paying for.
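
To make the "binary that hides a distribution" point concrete, here is a minimal sketch that reports path metrics next to the completion rate. The `Run` fields and the sample numbers are hypothetical, not taken from any real trace schema.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Run:
    # Hypothetical per-task trace summary; field names are assumptions.
    completed: bool
    seconds: float
    tokens: int
    backtracks: int


def summarize(runs):
    """Report completion rate alongside the path taken by completed runs."""
    done = [r for r in runs if r.completed]
    if not done:
        raise ValueError("no completed runs to summarize")
    return {
        "completion_rate": len(done) / len(runs),
        "median_seconds": median(r.seconds for r in done),
        "max_seconds": max(r.seconds for r in done),
        "median_tokens": median(r.tokens for r in done),
        "share_with_backtracks": sum(r.backtracks > 0 for r in done) / len(done),
    }


runs = [
    Run(True, 40, 600, 0),
    Run(True, 55, 700, 0),
    Run(True, 720, 4000, 3),   # "completed", but twelve minutes and three backtracks
    Run(False, 90, 900, 1),
]
summary = summarize(runs)
```

In this toy sample the completion rate looks healthy while `max_seconds` and `share_with_backtracks` show the run a user would actually complain about.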

When Your Test Set Leaks Into Fine-Tuning: The Contamination You Cause Yourself

· 9 min read
Tian Pan
Software Engineer

Everyone in AI knows the cautionary tale of benchmark contamination: a model vendor scrapes the open web, GSM8K and MMLU end up in the pretraining corpus, and the reported scores measure recall instead of reasoning. It is treated as somebody else's sin — the foundation lab's problem, an artifact you inherit. So you build your own held-out eval set, keep it in a private repo, and assume you are clean.

You are probably not. The most damaging contamination in a production AI system is rarely inherited. It is manufactured, in-house, by well-meaning engineers following a sensible-looking workflow. Your eval set leaks into your training pipeline through doors you built yourself, and the leak is silent: every dashboard turns green at exactly the moment your benchmark stops measuring anything real.

This is the contamination you cause yourself. It deserves more attention than the kind you inherit, because you are the only one who can detect it — and almost nobody audits for it.
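
One crude way to audit for this is an n-gram overlap check between eval items and whatever text feeds fine-tuning. The sketch below is an illustration under simple assumptions (whitespace tokenization, exact matches only); it will miss paraphrased leaks, and the function names are made up for this example.

```python
def ngrams(text, n=8):
    """Set of lowercase word n-grams; texts shorter than n words yield an empty set."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contaminated(eval_items, training_texts, n=8):
    """Return eval items that share at least one word n-gram with the training texts."""
    train_grams = set()
    for text in training_texts:
        train_grams |= ngrams(text, n)
    return [item for item in eval_items if ngrams(item, n) & train_grams]


# Hypothetical data; n is lowered to 4 only because the sample strings are short.
eval_items = [
    "refund the customer within thirty days of purchase",
    "reset the password from the settings page",
]
training_texts = ["Policy: refund the customer within thirty days, no exceptions"]
leaks = contaminated(eval_items, training_texts, n=4)
```

An empty result does not prove the eval set is clean. It only rules out verbatim copying at the chosen n-gram length.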

When the Cheap Model Is the Expensive One

· 9 min read
Tian Pan
Software Engineer

A finance team flags that the LLM bill is up 18% this quarter. An engineer pulls the usage dashboard, sees that 70% of traffic now hits the budget model instead of the frontier one, and is briefly confused: the routing change was supposed to cut spend. The per-token price went down exactly as the spreadsheet promised. The bill went up anyway.

This is not a billing error. It is the most common way a cost optimization quietly inverts itself. The spreadsheet that justified the downgrade priced one thing — tokens — and the production system pays for something else entirely: finished tasks. A weaker model does not just produce cheaper tokens. It changes the behavior of every component around it, and those second-order effects land on the same invoice.

The trap is seductive because the first-order math is genuinely correct. A budget model can be 10x to 30x cheaper per token than a frontier model, and for a large fraction of traffic it returns an answer that is indistinguishable in quality. The mistake is not the routing decision. The mistake is measuring the routing decision at the wrong boundary.
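
A rough sketch of what measuring at the task boundary could look like. Every number below is hypothetical, and the model is deliberately simple: it assumes a failed attempt is retried independently until one succeeds, so expected attempts are 1 / success_rate.

```python
def cost_per_finished_task(price_per_1k_tokens, tokens_per_attempt, success_rate):
    """Expected spend to get one finished task, assuming independent retries."""
    cost_per_attempt = price_per_1k_tokens * tokens_per_attempt / 1000
    return cost_per_attempt / success_rate


# Hypothetical figures: the budget model is 10x cheaper per token, but it
# uses more tokens per attempt (extra tool calls, longer transcripts) and
# finishes the task less often.
frontier = cost_per_finished_task(0.010, tokens_per_attempt=2_000, success_rate=0.95)
budget = cost_per_finished_task(0.001, tokens_per_attempt=12_000, success_rate=0.50)
```

With these made-up inputs the per-token price drops tenfold while the cost per finished task rises, which is the inversion described above. Real systems add costs this sketch ignores, such as fallbacks to the frontier model and human escalations.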

The PM-Eval Translation Gap: When Ship Decisions Outrun the Vocabulary

· 8 min read
Tian Pan
Software Engineer

The go/no-go meeting for an AI feature is, on the surface, a data-driven ritual. Engineering brings a slate of eval numbers — judge score deltas, slice accuracies, regression-against-baseline percentages — and the room decides. It looks rigorous. It usually isn't.

Here is the failure mode in one sentence: the person with the literacy to weight the eval slices does not have the authority to make the call, and the person with the authority cannot read the slices. The product manager owns the launch. The engineer owns the meaning of the numbers. Between them sits a translation gap, and into that gap rushes whoever speaks most confidently in the meeting.

The tell is that "ship at 87%" and "hold at 87%" are both defensible from the same scorecard, depending on which slice you weight. When a single dataset supports opposite conclusions and the deciding factor is rhetorical confidence rather than evidence, you do not have a data-driven process. You have a debate with a spreadsheet in the background.
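
A small worked example of how one scorecard can support opposite calls. The slice names, accuracies, weights, and the 0.85 ship threshold are all invented for illustration.

```python
def weighted_score(slice_accuracy, weights):
    """Weighted average of per-slice accuracy; weights need not sum to 1."""
    total = sum(weights.values())
    return sum(slice_accuracy[name] * w for name, w in weights.items()) / total


slice_accuracy = {
    "common_queries": 0.95,
    "long_documents": 0.80,
    "billing_disputes": 0.60,
}

# Two defensible weightings of the same scorecard (both hypothetical).
by_traffic = {"common_queries": 80, "long_documents": 15, "billing_disputes": 5}
by_risk = {"common_queries": 20, "long_documents": 20, "billing_disputes": 60}

SHIP_THRESHOLD = 0.85  # assumed bar, agreed before the meeting
traffic_view = weighted_score(slice_accuracy, by_traffic)
risk_view = weighted_score(slice_accuracy, by_risk)
ship_by_traffic = traffic_view >= SHIP_THRESHOLD
ship_by_risk = risk_view >= SHIP_THRESHOLD
```

Weighted by traffic the feature clears the bar. Weighted by risk it does not. Writing the weights down before the meeting turns the argument into a decision about weights rather than a contest of confidence.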

When 'Can the Agent Do X?' Becomes a Ship Commitment

· 10 min read
Tian Pan
Software Engineer

An engineer spends an afternoon poking at a question: can the agent reconcile a customer's invoice against their contract terms? They wire up a quick prompt, run it on five real invoices, and three come back correct. The other two are wrong in ways they don't fully characterize — they close the laptop and move on. In standup the next morning they say "yeah, invoice reconciliation basically works." A PM in the room writes it down. Two weeks later it's a line item on the Q3 roadmap. A month after that, a sales rep promises it to an enterprise account in a renewal call.

Nobody lied. Nobody made a bad decision in isolation. But the team is now contractually committed to a behavior whose eval set does not exist, whose failure modes were never written down, and whose reliability budget was set by a director who saw a demo and interpreted it as a contract. This is the most common way AI features acquire scope: not through a planning meeting, but through a capability probe that nobody ever explicitly promoted.

The industry has a name for the downstream symptom — "POC purgatory," the state where 70 to 80 percent of AI initiatives stall between a working sandbox and a shippable product. But purgatory is the wrong metaphor, because it implies the projects are stuck. They aren't stuck. They're moving — they were committed before anyone checked whether they were ready, and now the team is trying to retrofit reliability onto a promise.

The Internal Eval Set Is a Privacy Boundary Nobody Reviewed

· 11 min read
Tian Pan
Software Engineer

The dataset your AI team calls "the eval set" is, in most companies shipping LLM features, a collection of real customer conversations pulled from production logs. Nobody on the team thinks of it as a privacy event. The data never left the cluster. No new system was provisioned. No vendor was added. An engineer wrote a query, exported a few thousand traces into a labeling tool, and the team started grading model outputs against them. The legal team never heard about it because, from the inside, nothing changed — the same conversations that already lived in the same database were now also being read by a few engineers and a judge model.

That is the privacy boundary nobody reviewed. Customers gave you their messages so you could answer them. They did not give you their messages so you could measure your model against them. The two uses look identical at the storage layer and feel identical at the inference layer, but they are different processing purposes under every modern privacy regime — and the gap between the two is where the next round of compliance pain is going to land.

Repeat-Question Detection: The Session-Level Blind Spot Your Per-Turn Eval Cannot See

· 11 min read
Tian Pan
Software Engineer

A user opens your chat, asks a question, and gets back a response your eval suite would score 4.6 out of 5. Then they ask the same question with different words. Same answer. Same score. They try once more, this time with the kind of hedging language people use when they suspect the machine isn't listening — "what I'm actually trying to do is…" — and then they close the tab. From the model's perspective, three clean Q&A turns. From the dashboard's perspective, an engaged session. From the user's perspective, a product that failed them three times in a row and won't be opened again.

This is the failure mode per-turn evaluation cannot see. Each individual turn looked correct in isolation. The judge gave a thumbs up. The hallucination detector stayed quiet. The relevance score was high. And yet the conversation, as a whole, did not resolve anything — and that's the unit the user was actually evaluating you on.
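
As a sketch of what a session-level check might look like, the snippet below flags user turns that overlap heavily with an earlier turn in the same session. It uses word-set Jaccard similarity as a crude stand-in for semantic similarity; the stopword list, the 0.5 threshold, and the function names are assumptions, and a production version would more likely compare embeddings.

```python
STOPWORDS = {"how", "do", "i", "my", "to", "can", "as", "a", "the",
             "what", "are", "your", "is"}


def content_words(text):
    """Lowercased words with trailing punctuation stripped and stopwords removed."""
    return {w.strip(".,?!").lower() for w in text.split()} - STOPWORDS


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def repeated_questions(user_turns, threshold=0.5):
    """Return (earlier_index, later_index) pairs where a user turn restates an earlier one."""
    flagged = []
    for i in range(1, len(user_turns)):
        for j in range(i):
            if jaccard(content_words(user_turns[i]), content_words(user_turns[j])) >= threshold:
                flagged.append((j, i))
                break  # one match is enough to mark turn i as a repeat
    return flagged


session = [
    "How do I export my invoices to CSV?",
    "Can I export invoices as a CSV file?",
    "Thanks, what are your support hours?",
]
repeats = repeated_questions(session)
```

A session with one or more flagged pairs is a candidate for review even when every individual turn scored well.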

Stale Few-Shot Examples and the Half-Life Your Prompt Repo Ignores

· 10 min read
Tian Pan
Software Engineer

Open the system prompt of any AI feature that has been in production for more than nine months. Scroll past the role description, past the formatting rules, past the safety guardrails. Stop at the block titled <examples> or ## Examples or whatever your team called it the day someone copied the first three good Slack threads into a code block. Read them. There is a 60% chance at least one of them references a feature that has been renamed, a button that no longer exists, or a workflow the product manager quietly killed two quarters ago.

The decay is not visible from the eval dashboard. The eval scores are green. They have been green for months. They are green because the eval set was authored against the same product surface the few-shots reference, and the two have aged together in lockstep. The model is performing a flawless impression of last year's product, on a test set that grades it for being faithful to last year's product, while real users interact with this year's product and quietly tolerate the resulting confabulations. This is the half-life nobody puts in the LLMOps roadmap.
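
One inexpensive check, sketched under the assumption that the team keeps a list of renamed or retired feature names: scan each few-shot example for those terms. The example strings and the retired-term list here are hypothetical.

```python
def stale_examples(examples, retired_terms):
    """Map example index -> retired terms it mentions (case-insensitive substring match)."""
    report = {}
    for index, example in enumerate(examples):
        lowered = example.lower()
        hits = sorted(term for term in retired_terms if term.lower() in lowered)
        if hits:
            report[index] = hits
    return report


few_shots = [
    "Click the Export Wizard button to download your report.",
    "Open Settings and choose Billing.",
]
retired = {"Export Wizard", "Legacy Dashboard"}
stale = stale_examples(few_shots, retired)
```

Substring matching only catches references the team has already recorded as retired, so the list has to be updated when the product changes, not when the prompt does.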

AI Code Review Drift: When Your LLM Reviewer's Standards Mutate Faster Than the Code

· 9 min read
Tian Pan
Software Engineer

The PR-review dashboard has shown green for six weeks. Bot catch rate, comment volume, developer "thumbs up" reactions — all steady. Then a security incident lands in production and the post-mortem points at a missing null-check the bot used to catch and quietly stopped catching about two months ago. Nobody changed the bot. Nobody downgraded the model. The dashboard never moved. The standard moved.

This is the failure mode of automated code review that doesn't show up in any product demo. Teams adopt an LLM reviewer for the consistency win — every PR gets the same checklist, no senior engineer's bad-day variance, fast turnaround for junior contributors — and the consistency is real for about a quarter. Then the system prompt evolves, the model bumps, the few-shot library accumulates, and the bot is reviewing a different codebase against a different rubric using a different model than the one the team validated against. The team's mental model of "what the bot catches" decays into "what the bot caught last week."
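
One way to notice this kind of drift is a pinned canary set: diffs with known defects, replayed against the reviewer on a schedule, with catch rate tracked per defect category. The sketch below assumes such a set exists and that each replay yields (category, caught) pairs; the categories and the 10-point tolerance are illustrative.

```python
from collections import defaultdict


def catch_rate_by_category(results):
    """results: iterable of (category, caught) pairs from one canary replay."""
    caught = defaultdict(int)
    total = defaultdict(int)
    for category, was_caught in results:
        total[category] += 1
        caught[category] += int(was_caught)
    return {category: caught[category] / total[category] for category in total}


def drifted(baseline, current, tolerance=0.10):
    """Categories whose catch rate fell more than `tolerance` below the validated baseline."""
    return {
        category: (baseline[category], current.get(category, 0.0))
        for category in baseline
        if baseline[category] - current.get(category, 0.0) > tolerance
    }


# Hypothetical replays: the reviewer used to catch every seeded null-check bug.
validated = catch_rate_by_category(
    [("null_check", True)] * 4 + [("sql_injection", True)] * 3
)
this_week = catch_rate_by_category(
    [("null_check", True)] + [("null_check", False)] * 3 + [("sql_injection", True)] * 3
)
drift = drifted(validated, this_week)
```

The comparison is against the rates the team originally validated, not against last week, so slow decay does not get absorbed into a moving baseline.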

Prompt Portfolios: Manage a Basket, Not a Single Best Prompt

· 10 min read
Tian Pan
Software Engineer

Most production AI teams talk about prompts the way junior traders talk about stocks: there is one best one, and the job is to find it. So they iterate — a Slack thread, a few eval rows, a new winner, push to main, repeat. The result is a single artifact carrying the entire intent-resolution surface of the product, optimized against a frozen evaluation set, sitting one regrettable edit away from a P1.

The mistake is the singular. A prompt is not a security; it is an allocation. The same user intent can be served well by several variants, each with its own confidence interval, its own per-segment performance, and its own sensitivity to model and corpus drift. The right mental model is not "find the best prompt" — it is "manage a basket of prompts whose composition is itself the product." Quantitative finance figured this out fifty years ago, and the operational machinery transfers almost without modification.

Escalation Rate Is the Eval Signal Your Offline Tests Missed

· 10 min read
Tian Pan
Software Engineer

Every agent feature has a back door. Some teams call it "escalate to support." Some call it "route to a human reviewer." Some call it the templated "I'm not able to help with that — let me connect you to someone who can." Whatever the label, every production agent has a path that gives up on the user's request and hands it to a human, and the rate at which production traffic takes that path is one of the few signals that doesn't depend on labelers, judges, or a hand-built test set. It is the system telling you, in production, that the model could not handle a request the user actually sent.

That signal is almost always being read by the wrong team. Escalation rate is a workforce-planning metric in most companies: it determines how many human agents the queue needs next quarter, and it lives on a dashboard the operations team reviews on a different cadence than the AI team reads its eval scores. A 30% week-over-week escalation increase shows up as a staffing question in a Monday operations review, while the AI team's eval suite stays green and the leadership readout says the feature is healthy. Both teams are looking at the same production system and arriving at opposite conclusions: ops thinks they need more headcount, AI thinks the model is fine.
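
Reading escalation rate as an eval signal can start with something as small as a week-over-week check that alerts the AI team rather than only the staffing plan. The counts, the 20% threshold, and the function name below are assumptions for illustration.

```python
def escalation_alerts(weekly_counts, max_relative_increase=0.20):
    """weekly_counts: ordered list of (label, escalated, total_conversations).

    Returns (label, previous_rate, rate) for each week whose escalation rate
    rose by more than `max_relative_increase` relative to the week before.
    """
    alerts = []
    for (_, prev_esc, prev_total), (label, esc, total) in zip(weekly_counts, weekly_counts[1:]):
        prev_rate = prev_esc / prev_total
        rate = esc / total
        if prev_rate > 0 and (rate - prev_rate) / prev_rate > max_relative_increase:
            alerts.append((label, prev_rate, rate))
    return alerts


# Hypothetical traffic: week2 is a 30% relative jump, week3 is roughly flat.
weekly_counts = [("week1", 100, 1000), ("week2", 130, 1000), ("week3", 135, 1000)]
alerts = escalation_alerts(weekly_counts)
```

The check uses rates rather than raw counts so that a traffic increase alone does not trigger it.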

Agent Branch Coverage: Your Eval Hits the Happy Path, Not the Planner's If-Else

· 8 min read
Tian Pan
Software Engineer

A team I worked with last quarter ran a 240-case eval suite against their support agent. Green across the board for six months. Then they swapped a single sentence in the planner prompt (a tone tweak), and the next day production saw a 3× spike in human-handoff requests. The eval hadn't moved. The handoff branch had simply started firing on borderline cases that used to resolve in-line, and not a single eval case was borderline in that way. The branch existed in the prompt. It existed in production. It did not exist in the eval.

This is the failure mode I want to name: agent branch coverage. Code-coverage tooling has been a debugging staple for forty years, but agentic systems have a runtime control flow — planner branches that pick a tool, condition the response, escalate to a human, refuse to act, retry with a different strategy — and the eval suite touches only the cases the team thought to write. Eighty percent of the planner's decision branches have never executed under test, and a green eval becomes a smoke test wearing a regression-test costume.
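
A minimal sketch of measuring this, assuming the planner's branches can be enumerated by name and each eval trace records which branches it took. The branch names and traces are hypothetical.

```python
def branch_coverage(declared_branches, eval_traces):
    """Return (coverage_fraction, sorted list of declared branches never exercised)."""
    declared = set(declared_branches)
    exercised = set()
    for trace in eval_traces:
        exercised.update(trace)
    exercised &= declared  # ignore branch names the planner does not declare
    never_run = sorted(declared - exercised)
    return len(exercised) / len(declared), never_run


declared = {"answer_inline", "call_tool", "escalate_to_human", "refuse", "retry"}
eval_traces = [
    ["call_tool", "answer_inline"],
    ["answer_inline"],
]
coverage, never_run = branch_coverage(declared, eval_traces)
```

The `never_run` list is the useful output: each entry is a branch that exists in the prompt and in production but that no eval case has ever driven the planner into.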