4 posts tagged with "llm-eval"

Eval Datasets Are Customer Data With a Right Answer Attached

May 10, 2026 · 12 min read

Software Engineer

Your golden eval set is a privacy boundary your security team didn't know existed. It is built by sampling production traces, which means it is a curated collection of real customer queries (often containing names, emails, account numbers, transcripts of frustrated calls, and half-typed credit card digits), each paired with the canonical correct response and then committed to whatever bucket the eval pipeline reads from.

The attached answer is what makes eval data uniquely dangerous. A raw production trace is sensitive because it captures what the customer said. An eval case is sensitive in a new way because it captures what the customer said plus the labeled correct answer. The label is a derivative work that someone, often an annotator or a domain expert, applied with intent. It signals "this is canonical." It also gives the trace a longevity that the original log never had: log retention will eventually rotate the trace out, but the eval case is now a permanent test fixture that the team is committed to keeping green.
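One way to make that boundary visible is to scan eval cases for PII-shaped strings before they are committed. The sketch below is a minimal illustration, not a complete detector: the case schema (`id`, `query`, `expected`), the function name `flag_pii`, and the two regular expressions are all assumptions made for the example, and real traces would need far broader coverage (names, addresses, free-text transcripts).

```python
import re

# Deliberately crude, illustrative patterns (assumptions, not a vetted PII detector).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
DIGIT_RUN = re.compile(r"(?:\d[ -]?){6,}")  # six or more digits, e.g. partial card or account numbers

PATTERNS = (("email", EMAIL), ("digit_run", DIGIT_RUN))


def flag_pii(cases):
    """Return {case_id: [pattern names]} for eval cases whose text matches a pattern.

    Both the customer query and the labeled answer are scanned, because the
    label can quote the sensitive detail back.
    """
    flagged = {}
    for case in cases:
        text = f"{case['query']}\n{case['expected']}"
        hits = [name for name, pattern in PATTERNS if pattern.search(text)]
        if hits:
            flagged[case["id"]] = hits
    return flagged


# Hypothetical eval cases, invented for this example.
cases = [
    {"id": "c1", "query": "Reset password for jane.doe@example.com",
     "expected": "Send a reset link."},
    {"id": "c2", "query": "My card starts with 4111 1111 and then",
     "expected": "Ask the user not to share card numbers."},
    {"id": "c3", "query": "How do I export a report?",
     "expected": "Use the Export button."},
]

flagged = flag_pii(cases)
for case_id, hits in flagged.items():
    print(case_id, hits)
```

A check like this only catches the obvious shapes. It is most useful as a gate that forces a human decision about whether a flagged case should be redacted, synthesized, or dropped before it becomes a permanent fixture.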

Your Inference Chargeback Is Quietly Taxing Eval Discipline

April 28, 2026 · 12 min read

Tian Pan

Software Engineer

The FinOps team rolled out chargeback for AI a year ago. The dashboard is gorgeous. Every feature team can see, to the cent, what their inference bill was last month, and the platform PM has slides showing line-of-business attribution at the SKU level. The org has more AI features than it had a year ago. It also has worse AI quality. Nobody has connected the two facts yet, but they are the same fact.

Here is the failure mode in one sentence: chargeback prices the inference token and silently fails to price the eval token, so every PM on the org chart faces an incentive structure that rewards model upgrades and punishes evaluation discipline. Twelve months later, eval coverage is shrinking while the bill is growing — the precise opposite of what the FinOps initiative thought it was incentivizing. This is not a bug in the dashboard. It is the chargeback model functioning exactly as designed, in a domain where the design assumptions from cloud-cost FinOps no longer hold.

The LLM-Judge Ceiling: Why Your Auto-Eval Stops Correlating With Users at the Score That Matters

April 28, 2026 · 10 min read

Tian Pan

Software Engineer

LLM-as-judge is the productivity unlock that let evaluation coverage scale 10x without growing the human grading team. The problem is that the unlock is not uniform across the score range. The judge's agreement with humans is highest in the muddy middle of the distribution (the answers nobody is going to escalate either way) and collapses on the long tail of high-stakes outputs that actually decide whether a feature ships, gets rolled back, or pages someone at 2am. The dashboard graph stays green through the score range that nobody is ever happy with.

That is the LLM-judge ceiling: a measurement instrument with a non-uniform error profile that the team is reading as a single number. Aggregate agreement of 80% with humans is the headline most vendors put on the page; it is also the number that gets the team to trust the judge most where the judge is least informative.
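To see how a single headline number can hide that error profile, it helps to compute judge-human agreement per stratum alongside the aggregate. The following is a small sketch under stated assumptions: the stratum names (`routine`, `high_stakes`), the pass/fail verdicts, and the counts are invented for illustration and were chosen only so that the aggregate lands on the 80% figure mentioned above.

```python
from collections import defaultdict


def agreement_by_stratum(records):
    """records: iterable of (stratum, human_verdict, judge_verdict).

    Returns (overall_agreement, {stratum: agreement}).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for stratum, human, judge in records:
        totals[stratum] += 1
        hits[stratum] += int(human == judge)
    per_stratum = {s: hits[s] / totals[s] for s in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return overall, per_stratum


# Hypothetical graded sample: 80 routine cases and 20 high-stakes cases.
records = (
    [("routine", "pass", "pass")] * 72
    + [("routine", "pass", "fail")] * 8
    + [("high_stakes", "fail", "fail")] * 8
    + [("high_stakes", "fail", "pass")] * 12
)

overall, per_stratum = agreement_by_stratum(records)
print(f"overall: {overall:.0%}")
for stratum, rate in per_stratum.items():
    print(f"{stratum}: {rate:.0%}")
```

In this made-up sample the aggregate is dominated by the routine cases, so the judge can agree with humans on fewer than half of the high-stakes cases while the headline number barely moves. How to define the strata (by human score band, by escalation history, by feature) is a design choice the sketch leaves open.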

The 70% Reliability Uncanny Valley: Where AI Features Go to Lose User Trust

April 28, 2026 · 12 min read

Tian Pan

Software Engineer

A feature that fails 70% of the time is harmless. The user learns within a week that they have to verify every output, treats the system as an unreliable assistant, and adjusts. A feature that succeeds 70% of the time is worse than that. It is right often enough that the user stops verifying, and wrong often enough that the failures are concentrated, visible, and personal. The user's mental model collapses into "I cannot tell when to trust this" — which, as a product experience, is strictly worse than "I know not to trust this."

This is the 70% uncanny valley, and it is where most AI features built in the last two years live. The team measures aggregate accuracy, watches the number cross some "good enough" threshold, and ships. The realized user experience does not improve monotonically with that number. Between roughly 60% and 85% accuracy, the product gets worse as it gets more accurate, because the cost of a wrong answer the user did not think to check exceeds the value of a right answer they no longer have to verify.

The team that ships at 70% without designing for the predictability problem is not shipping a worse version of a 95% product. They are shipping a different product entirely: one whose primary failure mode is silent.
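The non-monotonic claim can be made concrete with a toy expected-value model. Everything in the sketch below is an assumption introduced for illustration: the linear drop in how often users verify between 60% and 85% accuracy, the benefit of a right answer, the cost of verifying, and the cost of an unchecked wrong answer. It is not a measurement, only a way to show the shape of the argument.

```python
def verify_rate(p, lo=0.60, hi=0.85):
    """Assumed share of outputs the user still checks at accuracy p.

    Users check everything at or below `lo`, nothing at or above `hi`,
    and taper off linearly in between (a modeling assumption).
    """
    if p <= lo:
        return 1.0
    if p >= hi:
        return 0.0
    return (hi - p) / (hi - lo)


def value_per_output(p, benefit=1.0, verify_cost=0.2, miss_cost=10.0):
    """Toy expected value of one output at accuracy p.

    Checked outputs cost `verify_cost` but their errors are caught.
    Unchecked wrong outputs cost `miss_cost`. All parameters are hypothetical.
    """
    q = verify_rate(p)
    unchecked_errors = (1 - q) * (1 - p)
    return p * benefit - q * verify_cost - unchecked_errors * miss_cost


for p in (0.30, 0.60, 0.70, 0.80, 0.95, 0.99):
    print(f"accuracy={p:.2f}  value={value_per_output(p):+.2f}")
```

Under these assumed parameters the value drops once users stop verifying and recovers only when errors become rare. The exact location and depth of the dip depend entirely on the chosen costs, which is the point: aggregate accuracy alone does not determine where a feature sits in that curve.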
