3 posts tagged with "llm-eval"

Your Inference Chargeback Is Quietly Taxing Eval Discipline

· 12 min read
Tian Pan
Software Engineer

The FinOps team rolled out chargeback for AI a year ago. The dashboard is gorgeous. Every feature team can see, to the cent, what their inference bill was last month, and the platform PM has slides showing line-of-business attribution at the SKU level. The org has more AI features than it had a year ago. It also has worse AI quality. Nobody has connected the two facts yet, but they are the same fact.

Here is the failure mode in one sentence: chargeback prices the inference token and silently fails to price the eval token, so every PM on the org chart faces an incentive structure that rewards model upgrades and punishes evaluation discipline. Twelve months later, eval coverage is shrinking while the bill is growing — the precise opposite of what the FinOps initiative thought it was incentivizing. This is not a bug in the dashboard. It is the chargeback model functioning exactly as designed, in a domain where the design assumptions from cloud-cost FinOps no longer hold.
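To make the pricing asymmetry concrete, here is a minimal sketch under one plausible reading of "failing to price the eval token": eval tokens get billed at the same rate as production inference, so the team that evals more simply pays more, with no visible credit for the discipline. The rates, the subsidy, and the `TeamUsage` numbers below are hypothetical, not any real billing code.

```python
from dataclasses import dataclass

# Hypothetical per-1K-token rates and subsidy; a real chargeback would pull these
# from the provider bill and the platform's cost-allocation policy.
INFERENCE_RATE = 0.01   # $ per 1K production inference tokens
EVAL_RATE = 0.01        # $ per 1K eval/judge tokens (same model, same raw price)
EVAL_SUBSIDY = 0.8      # fraction of eval spend absorbed centrally, not by the team

@dataclass
class TeamUsage:
    name: str
    inference_tokens: int   # tokens served to users in production
    eval_tokens: int        # tokens burned on judges, regression suites, offline evals

def chargeback_inference_only(u: TeamUsage) -> float:
    """What many dashboards bill today: eval tokens ride along as 'inference'."""
    return (u.inference_tokens + u.eval_tokens) / 1000 * INFERENCE_RATE

def chargeback_with_eval_line_item(u: TeamUsage) -> float:
    """Split eval usage into its own, partially subsidized line item."""
    inference_cost = u.inference_tokens / 1000 * INFERENCE_RATE
    eval_cost = u.eval_tokens / 1000 * EVAL_RATE * (1 - EVAL_SUBSIDY)
    return inference_cost + eval_cost

team = TeamUsage("search-summaries", inference_tokens=40_000_000, eval_tokens=15_000_000)
print(chargeback_inference_only(team))       # 550.0: every eval token reads as pure cost
print(chargeback_with_eval_line_item(team))  # 430.0: eval discipline stops being the expensive choice
```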

The LLM-Judge Ceiling: Why Your Auto-Eval Stops Correlating With Users at the Score That Matters

· 10 min read
Tian Pan
Software Engineer

LLM-as-judge is the productivity unlock that let evaluation coverage scale 10x without growing the human grading team. The problem is that the unlock is not uniform across the score range. The judge's agreement with humans is highest in the muddy middle of the distribution — the answers nobody is going to escalate either way — and collapses on the long tail of high-stakes outputs that actually decide whether a feature ships, gets rolled back, or pages someone at 2am. The dashboard stays green across exactly the score range nobody was ever going to escalate.

That is the LLM-judge ceiling: a measurement instrument with a non-uniform error profile that the team is reading as a single number. Aggregate agreement of 80% with humans is the headline most vendors put on the page; it is also the number that gets the team to trust the judge most where the judge is least informative.
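To see how a single aggregate number can hide the ceiling, here is a minimal sketch that stratifies judge-versus-human agreement by score band. The bands, thresholds, and counts are made up for illustration, not taken from any real eval run.

```python
from collections import defaultdict

# Hypothetical per-item records: (judge_score, judge_verdict, human_verdict).
# In a real pipeline these would come from the eval harness and a human grading queue.
records = (
    # low band: clear failures, judge mostly agrees with humans
    [(0.2, "fail", "fail")] * 230 + [(0.2, "pass", "fail")] * 20
    # middle band: the muddy middle, agreement still looks healthy
    + [(0.5, "pass", "pass")] * 240 + [(0.5, "fail", "fail")] * 230 + [(0.5, "pass", "fail")] * 80
    # high band: the high-stakes tail, where the judge waves through answers humans reject
    + [(0.9, "pass", "pass")] * 100 + [(0.9, "pass", "fail")] * 100
)

def band(score: float) -> str:
    if score < 0.4:
        return "low"
    return "mid" if score < 0.8 else "high"

hits, totals = defaultdict(int), defaultdict(int)
for score, judge, human in records:
    b = band(score)
    totals[b] += 1
    hits[b] += judge == human

print(f"aggregate agreement: {sum(hits.values()) / sum(totals.values()):.0%}")  # 80%
for b in ("low", "mid", "high"):
    print(f"{b:>4} band agreement: {hits[b] / totals[b]:.0%}")
# low 92%, mid 85%, high 50%: the headline hides exactly the collapse that matters
```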

The 70% Reliability Uncanny Valley: Where AI Features Go to Lose User Trust

· 12 min read
Tian Pan
Software Engineer

A feature that fails 70% of the time is harmless. The user learns within a week that they have to verify every output, treats the system as an unreliable assistant, and adjusts. A feature that succeeds 70% of the time is worse than that. It is right often enough that the user stops verifying, and wrong often enough that the failures are concentrated, visible, and personal. The user's mental model collapses into "I cannot tell when to trust this" — which, as a product experience, is strictly worse than "I know not to trust this."

This is the 70% uncanny valley, and it is where most AI features built in the last two years live. The team measures aggregate accuracy, watches the number cross some "good enough" threshold, and ships. The realized user experience does not improve monotonically with that number. Between roughly 60% and 85% accuracy, the product gets worse as it gets more accurate, because the cost of a wrong answer the user did not think to check exceeds the value of a right answer they no longer have to verify.
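One way to see why the curve is non-monotonic is to write the user's expected value per task as a function of accuracy, with the verification rate modeled as behavior that declines as the feature looks more reliable. The specific costs and the linear verification-rate model below are illustrative assumptions, not measurements from the post.

```python
def verification_rate(accuracy: float) -> float:
    """Hypothetical user behavior: below ~60% accuracy they verify everything,
    above ~90% they verify nothing, linear in between."""
    return min(1.0, max(0.0, (0.90 - accuracy) / 0.30))

def expected_value_per_task(accuracy: float,
                            benefit: float = 1.0,      # value of a correct answer acted on
                            verify_cost: float = 0.3,  # cost of checking an output
                            wrong_cost: float = 4.0) -> float:  # cost of acting on a silent failure
    v = verification_rate(accuracy)
    # Verified tasks: the user pays verify_cost but never acts on a wrong answer.
    # Unverified tasks: correct ones pay off in full, wrong ones are acted on and hurt.
    return accuracy * benefit - v * verify_cost - (1 - v) * (1 - accuracy) * wrong_cost

for a in (0.60, 0.70, 0.75, 0.80, 0.85, 0.90, 0.95):
    print(f"accuracy {a:.0%}: expected value {expected_value_per_task(a):+.2f}")
# 60%: +0.30   70%: +0.10   75%: +0.10   80%: +0.17
# 85%: +0.30   90%: +0.50   95%: +0.75
# The dip between roughly 60% and 85% is the uncanny valley: accuracy rises,
# verification falls faster, and the realized value gets worse before it gets better.
```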

The team that ships at 70% without designing for the predictability problem is not shipping a worse version of a 95% product. They are shipping a different product entirely: one whose primary failure mode is silent.