Skip to main content

4 posts tagged with "llm-as-judge"

View all tags

LLM-as-Judge Drift: When Your Evaluator Upgrades and All Your Numbers Move

· 11 min read
Tian Pan
Software Engineer

A regression suite that flips green-to-red without a single prompt change is usually one of three things: a broken test harness, a flaky retrieval store, or a judge that learned new taste over the weekend. The third one is the most common and the least debugged, because no commit in your repo caused it. The scoring model got a silent quality refresh, and every score you compare against last month's dashboard is now denominated in a different currency.

This is the uncomfortable part of LLM-as-judge: you have two moving models, not one. The candidate model is the thing you ship; the judge model is the thing that tells you how the candidate is doing. When both evolve independently, score deltas stop meaning what they used to, and the dashboard that your PM refreshes every morning quietly lies.

Judge Model Independence: Why Your Eval Breaks When the Grader Shares Blind Spots with the Graded

· 9 min read
Tian Pan
Software Engineer

Your eval suite scores 91%. Users report the system feels unreliable. The post-mortem reveals the culprit: you used GPT-4o to both generate responses and grade them. The model was judging its own mirror image, and it liked what it saw.

This is the judge model independence problem. It is more widespread than most teams realize, the score inflation it produces is large enough to matter, and the fix is neither complicated nor expensive. But you have to know to look for it.

LLM Evals: What Actually Works and What Wastes Your Time

· 10 min read
Tian Pan
Software Engineer

Most teams building LLM applications fall into one of two failure modes. The first is building no evals at all and shipping features on vibes. The second is building elaborate evaluation infrastructure before they understand what they're actually trying to measure. Both are expensive mistakes.

The teams that do evals well share a common approach: they start by looking at data, not by building systems. Error analysis comes before evaluation automation. Human judgment grounds the metrics before any automated judge is trusted. And they treat evaluation not as a milestone to cross but as a continuous discipline that evolves alongside the product.

This is what evals actually look like in practice — the decisions that matter, the patterns that waste effort, and the tradeoffs that aren't obvious until you've been burned.

Why Your LLM Evaluators Are Miscalibrated — and the Data-First Fix

· 9 min read
Tian Pan
Software Engineer

Most teams build their LLM evaluators in the wrong order. They write criteria, then look at data. That inversion is the root cause of miscalibrated evals, and it's almost universal in teams shipping their first AI product. The criteria sound reasonable on paper — "the response should be accurate, helpful, and concise" — but when you apply them to real model outputs, you discover the rubric doesn't match what you actually care about. You end up with an evaluator that grades things you're not measuring and misses failures that matter.

The fix isn't a better rubric. It's a different workflow: look at the data first, define criteria second, and then validate your evaluator against human judgment before trusting it to run unsupervised.