3 posts tagged with "llm-judge"

The Eval Rubric Pulled By Two Drift Vectors

· 9 min read
Tian Pan
Software Engineer

Your composite eval score went up two points last quarter. Nobody can tell you whether the system got better, whether the human cohort that scores it got more lenient, or whether the judge model you upgraded in March started weighting verbosity differently. The number moved. The thing the number is supposed to measure did not necessarily move with it.

This is what happens when an eval rubric is read by two populations at once — humans and an LLM judge — and both populations drift on different axes for different reasons. The composite score blends their motion together, and unless you have a measurement protocol that holds one fixed while the other moves, you have shipped a metric whose changes are not attributable to anything.
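One way to hold one population fixed while the other moves is to re-score a frozen anchor set of outputs with each rater population at both points in time. The sketch below is an illustration under assumptions, not a protocol taken from the post: the function name `attribute_delta`, the anchor-set lists, and the linear blend controlled by `human_weight` are all hypothetical, and it assumes the composite is a weighted average of the human and judge scores.

```python
from statistics import mean


def attribute_delta(observed_delta, human_then, human_now,
                    judge_old, judge_new, human_weight=0.5):
    """Split a composite score change into rater drift and a residual.

    All four score lists are ratings of the SAME frozen anchor outputs,
    so any change in their means reflects the raters, not the system.
    Assumes (hypothetically) composite = w * human + (1 - w) * judge.
    """
    human_drift = mean(human_now) - mean(human_then)
    judge_drift = mean(judge_new) - mean(judge_old)
    rater_drift = human_weight * human_drift + (1 - human_weight) * judge_drift
    return {
        "human_drift": human_drift,
        "judge_drift": judge_drift,
        # Whatever the raters do not explain is the part that may be
        # attributable to the system itself.
        "system_residual": observed_delta - rater_drift,
    }


# Illustrative numbers only: a two-point composite gain that rater
# drift on the anchor set fully accounts for.
result = attribute_delta(
    observed_delta=2.0,
    human_then=[70, 80], human_now=[71, 81],
    judge_old=[60, 70], judge_new=[63, 73],
)
```

With these made-up inputs the residual is zero, which is the scenario the opening paragraph describes: the number moved, and none of the movement can be credited to the system.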

The Eval That Converges, Then Quietly Collapses

· 11 min read
Tian Pan
Software Engineer

Your weekly eval dashboard has gone flat. The line that used to wobble between 0.71 and 0.78 has tightened to a hairline around 0.84 for three release cycles. The team reads it as a ceiling — the model is as good as the rubric allows, and further work needs a harder eval. Someone schedules a planning meeting to "design eval v2."

That reading is plausible, and sometimes correct. But there is a second explanation that produces the same picture and quietly destroys your release-gating signal: your labelers, human or LLM-judge, have homogenized around the same opinions, and the eval is no longer measuring the model. It is measuring how well the model produces the shape of output your labelers have learned to call correct.
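A rough way to tell the two explanations apart is to track how often labelers disagree on the same item across release cycles. The sketch below assumes each item carries labels from several raters (human or LLM judge); the function name and data layout are hypothetical. A disagreement rate that falls toward zero is consistent with homogenization, but it is also consistent with a clearer rubric, so treat it as a prompt to audit rather than as proof.

```python
from itertools import combinations
from statistics import mean


def pairwise_disagreement(labels_by_item):
    """Mean fraction of rater pairs that disagree, averaged over items.

    labels_by_item: list of lists, one inner list of labels per item.
    Items with fewer than two labels are skipped.
    """
    rates = []
    for labels in labels_by_item:
        pairs = list(combinations(labels, 2))
        if not pairs:
            continue
        rates.append(sum(a != b for a, b in pairs) / len(pairs))
    return mean(rates) if rates else 0.0


# Illustrative labels only: an earlier cycle with some dissent,
# and a later cycle where every rater agrees on every item.
earlier_cycle = [[1, 1, 0], [1, 1, 1]]
later_cycle = [[1, 1, 1], [1, 1, 1]]

earlier_rate = pairwise_disagreement(earlier_cycle)
later_rate = pairwise_disagreement(later_cycle)
```

Plotting this rate next to the headline score makes it easier to see whether the score tightened at the same time the raters stopped disagreeing.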

Rater Throughput Is the Hidden Bottleneck in Your Eval Pipeline

· 10 min read
Tian Pan
Software Engineer

The team plans an eval suite the way they plan a service: failure modes inventoried, rubric drafted, sample size argued over, judge calibration scheduled. Then they file rater capacity as a footnote ("we'll get the annotation team to grade a few hundred per week") and ship the rest. Six weeks later the rater queue is at 4,300 items, eval velocity has collapsed to one judge-calibration cycle per month, and someone in a planning review says the quiet part out loud: nobody capacity-planned the humans.

Rater throughput is the binding constraint on eval velocity in any AI system that takes human grading seriously, and the teams that ship are the ones that treat annotation as an SRE problem rather than a recruiting one. A human reviewer processes 50–100 examples per hour at expert difficulty, and an expert annotator caps out around 500–1,000 examples per week. Those limits are not a recruiting problem to be brute-forced with headcount. They are an operational property of the eval system that has to be modeled and budgeted the way you model database IOPS.
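The budgeting idea can be sketched as a simple queue model: items arrive each week, raters clear a fixed number, and the difference accumulates. The function name and the input figures below are hypothetical and chosen only for illustration; the per-rater rate of 750 per week is simply a point inside the 500–1,000 range quoted above.

```python
def backlog_by_week(arrivals_per_week, raters, items_per_rater_per_week,
                    weeks, starting_backlog=0):
    """Return the rater queue depth at the end of each week.

    Capacity is modeled as a flat raters * per-rater rate, which ignores
    ramp-up, fatigue, and item difficulty (a deliberate simplification).
    """
    capacity = raters * items_per_rater_per_week
    backlog = starting_backlog
    history = []
    for _ in range(weeks):
        # The queue cannot go negative: spare capacity is simply unused.
        backlog = max(0, backlog + arrivals_per_week - capacity)
        history.append(backlog)
    return history


# Illustrative scenario: 1,500 items per week arriving against one
# expert rater who clears 750 per week.
queue = backlog_by_week(arrivals_per_week=1500, raters=1,
                        items_per_rater_per_week=750, weeks=6)
```

Under these assumed inputs the queue grows by 750 items a week and passes 4,000 by week six, the same order of magnitude as the backlog in the story above. Running the model before launch shows whether to cut the sampling rate, narrow what goes to humans, or budget more rater capacity.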