160 posts tagged with "evaluation"

Building LLM Evals from Sparse Annotations: You Don't Need 10,000 Examples

· 12 min read
Tian Pan
Software Engineer

Teams building LLM applications consistently make the same mistake: they wait for enough labeled data before investing in evaluation infrastructure. They tell themselves they need 5,000 examples. Or 10,000. The eval system stays on the backlog while "vibe checks" substitute for measurement. A ZenML analysis of 1,200 production deployments found that informal vibe checks remain common even in mature deployments — and many teams never graduate to systematic evals at all.

The data-size intuition is borrowed from classical ML, where more labeled examples reliably improve model performance. For LLM evaluation, it is largely wrong. Research on sparse benchmarks demonstrates that 20–40 carefully selected items reliably estimate full-benchmark rankings, and that 100 items produce a mean absolute error below 1% relative to scores computed on the full set of thousands. The problem is not data volume. The problem is that most teams skip the structured process that makes small evaluation sets trustworthy.

This post covers what that process actually looks like: how to select the right examples through active learning, how to generate noisy labels at scale with weak supervision, how to bootstrap with LLM judges, and how to know when your small eval set is ready to use.

Judge Model Independence: Why Your Eval Breaks When the Grader Shares Blind Spots with the Graded

· 9 min read
Tian Pan
Software Engineer

Your eval suite scores 91%. Users report the system feels unreliable. The post-mortem reveals the culprit: you used GPT-4o to both generate responses and grade them. The model was judging its own mirror image, and it liked what it saw.

This is the judge model independence problem. It is more widespread than most teams realize, the score inflation it produces is large enough to matter, and the fix is neither complicated nor expensive. But you have to know to look for it.

Keeping Synthetic Eval Data Honest

· 9 min read
Tian Pan
Software Engineer

A safety model scored 85.3% accuracy on its public benchmark test set. When researchers tested it on novel adversarial prompts not derived from public datasets, that number dropped to 33.8%. The model hadn't learned to reason about safety. It had learned to recognize the evaluation distribution.

This is the problem at the center of synthetic eval data: when the same model family generates both your training data and your test cases, passing the eval means conforming to a shared statistical prior—not demonstrating actual capability. It's a feedback loop that looks like quality assurance until production traffic arrives and the numbers don't hold.

The failure is structural, not incidental. And fixing it requires more than adding more synthetic examples.

Multi-Session Eval Design: Catching the AI Feature That Gets Worse Over Time

· 11 min read
Tian Pan
Software Engineer

Your AI feature passed every eval at launch. Six weeks in, churn in the cohort that talks to it most has doubled, and your CSAT dashboard shows a flat line that no one can explain. The prompts haven't changed, the model hasn't been swapped, the retrieval index has grown but nobody thinks it's broken. What shipped was fine on turn one. What rots is what happens on turn four hundred, in session seventeen, three weeks after signup.

Most teams' eval suites can't see this failure. They test single-turn accuracy on a fixed dataset, maybe single-session multi-turn if they're ambitious, and then declare the feature shippable. The failure mode that matters — quality that degrades as the system accumulates state about a user — lives in a temporal dimension the eval harness was never built to cover. Researchers call it "self-degradation" in the memory literature: a clear, sustained performance decline after the initial phase, driven by memory inflation and the accumulation of flawed memories. Production engineers call it the reason their retention cohort silently bleeds.

The Prompt Entropy Budget: Measuring Output Variance as a First-Class Production Metric

· 11 min read
Tian Pan
Software Engineer

When your LLM feature ships, your monitoring dashboard probably tracks accuracy, latency, and error rate. What it almost certainly does not track is variance — how wildly different the output is each time a user sends the same prompt. That gap is where production AI features quietly collapse.

Variance determines whether your product feels trustworthy or capricious. A feature that scores 88% on your eval suite but delivers a two-sentence answer 40% of the time and a ten-paragraph essay the other 60% will erode user trust faster than one that scores 80% but behaves consistently. Teams optimizing exclusively for accuracy are solving the wrong half of the reliability problem.

The prompt entropy budget is the concept that fills this gap: a structured approach to measuring, budgeting, and controlling the distribution of outputs your model produces over identical inputs, treated the same way you treat p99 latency or an error budget in your SLO framework.
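One way to make the idea concrete is to sample the same prompt several times and summarize how much the outputs differ. The sketch below is illustrative only: the two metrics (Shannon entropy over normalized distinct outputs, and the coefficient of variation of output length) and the threshold values are assumptions chosen for the example, not a definition taken from the post.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def output_entropy_bits(outputs):
    """Shannon entropy (in bits) over distinct outputs after light normalization."""
    counts = Counter(o.strip().lower() for o in outputs)
    total = len(outputs)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def length_cv(outputs):
    """Coefficient of variation of output length, measured in whitespace tokens."""
    lengths = [len(o.split()) for o in outputs]
    avg = mean(lengths)
    return pstdev(lengths) / avg if avg else 0.0


def within_budget(outputs, max_entropy_bits=1.0, max_length_cv=0.5):
    """Hypothetical budget check; both thresholds are placeholders to tune per feature."""
    return (
        output_entropy_bits(outputs) <= max_entropy_bits
        and length_cv(outputs) <= max_length_cv
    )


# Outputs collected by sending one prompt repeatedly (made-up sample data).
samples = ["Yes.", "yes.", "No.", "Yes."]
print(output_entropy_bits(samples), length_cv(samples), within_budget(samples))
```

Exact-match entropy only suits short, categorical answers. For free-form text, a semantic clustering step would have to replace the `Counter` over raw strings, which this sketch does not attempt.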

Why Your AI Demo Always Outperforms Your Launch

· 8 min read
Tian Pan
Software Engineer

The demo was spectacular. The model answered every question fluently, summarized documents without hallucination, and handled every edge case you threw at it. Stakeholders were impressed. The launch date was set.

Three weeks after shipping, accuracy was somewhere around 60%. Users were confused. Tickets were piling up. The model that aced your showcase was stumbling through production traffic.

This is not a story about a bad model. It is a story about a mismatch that almost every team building LLM features encounters: the inputs you tested on are not the inputs your users send.

Your LLM Eval Is Lying to You: The Statistical Power Problem

· 9 min read
Tian Pan
Software Engineer

Your team spent three days iterating on a system prompt. The eval score went from 82% to 85%. You ship it. Three weeks later, production metrics are flat. What happened?

The short answer: your eval lied to you. Not through malice, but through insufficient sample size and ignored variance. A 3-point accuracy lift on a 100-example test set is well within the noise floor of most LLM systems: the standard error of a single 82% accuracy estimate on 100 examples is already about 3.8 points. You cannot tell signal from randomness at that scale, yet almost no one does the math to verify this before acting on results.

This is the statistical power problem in LLM evaluation, and it is quietly corrupting the iteration loops of most teams building AI products.
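The arithmetic behind the 82% to 85% example can be checked in a few lines. This is a minimal sketch that assumes the two scores come from independent 100-example runs and uses a normal-approximation two-proportion z-test. The function names, the 0.05 significance level, and the 80% power target are conventional choices made for illustration, not values from the post.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(p_old, p_new, n_old, n_new):
    """Two-sided z-test for the difference between two independent accuracies."""
    pooled = (p_old * n_old + p_new * n_new) / (n_old + n_new)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_old + 1 / n_new))
    z = (p_new - p_old) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def examples_needed_per_arm(p_old, p_new, alpha=0.05, power=0.80):
    """Approximate examples per arm needed to detect the lift (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_old * (1 - p_old) + p_new * (1 - p_new)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p_new - p_old) ** 2)


z, p_value = two_proportion_z_test(0.82, 0.85, 100, 100)
n_required = examples_needed_per_arm(0.82, 0.85)
print(f"z = {z:.2f}, p = {p_value:.2f}, examples needed per arm = {n_required}")
```

Under these assumptions the p-value lands far above 0.05, and the required sample size is in the low thousands per arm rather than 100. If both prompts are scored on the same examples, a paired test such as McNemar's is the better fit and usually needs fewer examples.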

Goodhart's Law in Your LLM Eval Suite: When Optimizing the Score Breaks the System

· 9 min read
Tian Pan
Software Engineer

Andrej Karpathy put it bluntly: AI labs were "overfitting" to Arena rankings. One major lab privately evaluated 27 model variants before their public release, publishing only the top performer. Researchers estimated that selective submission alone could artificially inflate leaderboard scores by up to 112%. The crowdsourced evaluation system that everyone pointed to as ground truth had become a target — and once it became a target, it stopped being a useful measure.

This is Goodhart's Law in action: when a measure becomes a target, it ceases to be a good measure. It's been well-understood in economics and policy for decades. In LLM engineering, it's actively destroying eval suites right now, often without the teams building them realizing it.

Spec-to-Eval: Translating Product Requirements into Falsifiable LLM Criteria

· 9 min read
Tian Pan
Software Engineer

Most AI features are specified in prose and evaluated in prose. The PM writes "the assistant should respond helpfully and avoid harmful content." The engineer ships a prompt that, at demo time, produces output that seems to match. The team agrees at standup. They disagree at launch — when edge cases surface, when different engineers assess the same output differently, and when "helpful" turns out to mean seven different things depending on who's reviewing.

This isn't a tooling problem. It's a translation problem. The spec stayed abstract; the evaluation criteria were never made concrete. Spec-to-eval is the discipline of converting English requirements into falsifiable criteria before you write a single prompt — and doing it upfront changes everything about how fast you iterate.

The Requirements Gap: How to Write Specs for AI Features When 'Correct' Is a Distribution

· 10 min read
Tian Pan
Software Engineer

Here is a spec that ships broken AI features on a predictable schedule: "The assistant should accurately answer customer questions and maintain a helpful tone." Every stakeholder nodded, the PRD was approved, and six months later the team is arguing in a post-mortem about whether an 87% accuracy rate was acceptable — a question nobody thought to answer before launch.

The failure is not technical. The model may have been fine. The failure is that the requirements format imported directly from traditional software left no room for the defining property of AI outputs: they are probabilistic. "Correct" is not a state; it is a distribution. And you cannot specify a distribution with a user story.
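One hedged way to write a requirement that acknowledges the distribution is to state a minimum pass rate together with the confidence required before declaring it met. The sketch below uses a one-sided Wilson lower bound. The 87% figure echoes the example above, while the 200-example sample size, the candidate thresholds, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def wilson_lower_bound(passes, n, confidence=0.95):
    """One-sided Wilson lower confidence bound on the true pass rate."""
    z = NormalDist().inv_cdf(confidence)
    p_hat = passes / n
    denom = 1 + z * z / n
    centre = p_hat + z * z / (2 * n)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return (centre - margin) / denom


def meets_spec(passes, n, required_rate, confidence=0.95):
    """True only if the lower bound on the pass rate clears the required rate."""
    return wilson_lower_bound(passes, n, confidence) >= required_rate


# Hypothetical eval run: 174 of 200 responses judged correct (87% observed).
print(round(wilson_lower_bound(174, 200), 3))
print(meets_spec(174, 200, required_rate=0.85))
print(meets_spec(174, 200, required_rate=0.80))
```

Written this way, the spec forces the "is 87% acceptable?" question to be answered before launch: the team has to pick both the required rate and the sample size that can actually demonstrate it.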

The Second Opinion Economy: When Dual-Model Verification Actually Pays Off

· 10 min read
Tian Pan
Software Engineer

The most seductive idea in AI engineering is that you can make any LLM system more reliable by running a second LLM to check the first one's work. On paper, it's obvious. In practice, teams that deploy this pattern naively often end up with 2x inference costs and a false sense of confidence — their "verification" is just the original model's biases running twice.

Done right, dual-model verification produces real accuracy gains: 6–18% on reasoning tasks, measurable improvements in RAG faithfulness, and meaningful catches in code correctness. Done wrong, it leaves you with two models agreeing on the same wrong answer, which is worse than one model failing, because now you've also disabled your uncertainty signal.

This post is about knowing the difference.

The Five Gates Your AI Demo Skipped: A Launch Readiness Checklist for LLM Features

· 12 min read
Tian Pan
Software Engineer

There's a pattern that repeats across AI feature launches: the demo wows the room, the feature ships, and within two weeks something catastrophic happens. Not a crash — those are easy to catch. Something subtler: the model confidently generates wrong information, costs spiral three times over projection, or latency spikes under real load make the feature unusable. The team scrambles, the feature gets quietly disabled, and everyone agrees to "do it better next time."

The problem isn't that the demo was bad. The problem is that the demo was the only test that mattered.