Your LLM Judge Has a Length Bias, a Position Bias, and a Format Bias — and Nobody Is Auditing Yours
A team I worked with last quarter watched their LLM-as-judge score climb from 78% to 91% over six weeks of prompt iteration. They shipped. Users hated it. The new prompt produced longer, more formatted, more confident-sounding answers — and the judge loved every one of them. The team had not built a smarter prompt. They had reverse-engineered their judge's biases.
This is the failure mode nobody on the team is auditing. LLM-as-judge has well-documented systematic biases: longer answers score higher regardless of quality, the first option in pairwise comparisons wins more often than chance, and outputs that look like the judge's own training distribution outscore outputs that do not. If you wired up an LLM judge twelve months ago and have never re-validated it against humans, your scores are not a quality signal — they are a measurement of how well your prompt has learned to game its own evaluator.
The depressing part is that the audit methodology to catch this is straightforward, the calibration discipline that prevents it is cheap, and almost no team runs either.
The Three Biases That Show Up in Every Production Judge
The bias literature on LLM evaluators is by now extensive enough to be embarrassing. A 2024 systematic study of position bias across six LLM judges and 22 tasks found that position bias is not random noise — it varies significantly across judges and is most pronounced when the quality gap between candidates is small, which is exactly the regime your eval is asking the judge to discriminate in. A separate framework cataloging biases in LLM-as-judge enumerated twelve distinct categories. The three that bite hardest in production are length, position, and format.
Length (verbosity) bias. LLM judges prefer longer responses. The effect is robust enough that AlpacaEval 2.0 explicitly built a regression-based length-controlled win rate to debias the leaderboard, and the original AlpacaEval was known to favor models that simply generated more tokens. The mechanism is partly explained by training distribution — RLHF tends to reward more thorough-looking responses — and partly by surface-level cues like hedging and elaboration that judges read as "more careful." In production, this means a prompt that adds three sentences of throat-clearing will score higher than a prompt that gets to the point.
Position bias. In pairwise comparisons (A vs. B), the position of the candidate in the prompt changes the verdict. Some judges favor the first option, some the second, and the bias intensity depends on the model and the task. The clean signal is order asymmetry: run A-then-B, then B-then-A, and count only the verdicts where both orderings agree. The disagreement rate is your bias floor. On a closely matched pair, you commonly see 20–40% of verdicts flip when you swap positions, which means that in a single-order run up to 40% of your verdicts were decided by ordering rather than by content.
Format bias (also called familiarity or self-preference bias). Judges score outputs higher when those outputs look like what the judge itself would have produced. Self-preference bias has been measured directly: GPT-4 systematically rates GPT-4 outputs higher than humans rate them, and the effect correlates with perplexity — judges prefer text that is statistically familiar to them. In practice, this shows up as a markdown bias (formatted bullet lists score higher than equivalent prose), a structure bias (numbered headings score higher than flowing arguments), and a stylistic family bias (a judge from one provider quietly down-scores outputs from another, even when humans prefer the latter).
Each of these biases is small in isolation. Compose them across a six-week prompt-iteration loop and you produce a Goodhart machine.
The Audit Methodology That Catches This in an Afternoon
Three controls are enough to tell you whether your judge is reliable right now.
Length-controlled pairs. Take a sample of your existing eval set. For each item where the judge picked one response over another, generate a length-matched variant: truncate the longer response to the shorter one's token count, or expand the shorter one with neutral filler. Re-run the judge. The fraction of verdicts that flip after length-matching is your length-bias rate. Anything above 10% means length is a meaningful confound in your scores. The fix is not to ban long responses — it is to make the judge score against a length-controlled rubric, or to apply an AlpacaEval-style regression correction.
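A minimal sketch of the length-matched re-run, under stated assumptions: `judge` is a hypothetical callable that takes a prompt and two responses and returns "A" or "B", token counts are approximated by whitespace splitting, and only the truncation variant is shown (not the neutral-filler one). None of these names come from a real library.

```python
from typing import Callable, List, Tuple

# Hypothetical judge interface: (prompt, response_a, response_b) -> "A" or "B".
Judge = Callable[[str, str, str], str]


def truncate_to_match(longer: str, shorter: str) -> str:
    """Cut `longer` down to the whitespace-token count of `shorter`."""
    n = len(shorter.split())
    return " ".join(longer.split()[:n])


def length_bias_rate(pairs: List[Tuple[str, str, str]], judge: Judge) -> float:
    """Fraction of verdicts that flip once both responses have equal length."""
    if not pairs:
        return 0.0
    flips = 0
    for prompt, a, b in pairs:
        original = judge(prompt, a, b)
        # Truncate whichever response is longer (assumption: whitespace tokens).
        if len(a.split()) > len(b.split()):
            a_matched, b_matched = truncate_to_match(a, b), b
        else:
            a_matched, b_matched = a, truncate_to_match(b, a)
        if judge(prompt, a_matched, b_matched) != original:
            flips += 1
    return flips / len(pairs)
```

Truncation can cut off content the shorter response never had room for, so a flip here is evidence of a length confound rather than proof of it. A real audit would likely use the judge model's own tokenizer instead of whitespace splitting.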
Position-swapped controls. For every pairwise prompt, run it twice: A-then-B, and B-then-A. Discard verdicts where the two orderings disagree. The discarded fraction is your position-bias rate. The kept verdicts are your real signal. This roughly doubles your judge cost and cuts your sample size, and it is non-negotiable for any pairwise comparison you put weight on. Single-direction pairwise scores in production-grade evals are an unforced error.
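The swap-and-discard procedure can be sketched as follows. The `judge` callable is again a hypothetical stand-in that returns "A" when it prefers the response in the first slot and "B" when it prefers the second, and the function names are illustrative only.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical judge interface: returns "A" for the first slot, "B" for the second.
Judge = Callable[[str, str, str], str]


def swap_consistent_verdict(prompt: str, a: str, b: str, judge: Judge) -> Optional[str]:
    """Return the verdict only if both orderings agree, else None."""
    forward = judge(prompt, a, b)   # a shown first
    backward = judge(prompt, b, a)  # b shown first
    # Map the swapped verdict back onto the original labels.
    backward_mapped = "A" if backward == "B" else "B"
    return forward if forward == backward_mapped else None


def position_audit(pairs: List[Tuple[str, str, str]], judge: Judge) -> Tuple[float, List[str]]:
    """Return (position-bias rate, list of order-consistent verdicts)."""
    kept: List[str] = []
    discarded = 0
    for prompt, a, b in pairs:
        verdict = swap_consistent_verdict(prompt, a, b, judge)
        if verdict is None:
            discarded += 1
        else:
            kept.append(verdict)
    rate = discarded / len(pairs) if pairs else 0.0
    return rate, kept
```

A judge that always picks the first slot yields a rate of 1.0 and no kept verdicts, which is the degenerate case this control is meant to expose.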
Format-stripped comparisons. Render each candidate response through a Markdown-stripping normalizer that removes headings, bullets, bold, and tables, leaving only plain prose. Re-run the judge against the stripped versions. The score delta between formatted and stripped is your format-bias contribution. If the delta is large, your judge is rewarding presentation more than substance, and any time your prompt iteration adds more bullets or headings, you will see "improvement" that is not there.
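One possible shape for the normalizer and the delta, as a rough sketch: the regex-based stripper below covers only common Markdown (headings, bullets, numbered lists, bold, italics, pipe tables) and is not a full Markdown parser, and `score` is a hypothetical callable that returns the judge's numeric score for one response.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical scoring interface: (prompt, response) -> numeric judge score.
Score = Callable[[str, str], float]


def strip_markdown(text: str) -> str:
    """Reduce common Markdown to plain prose (approximate, regex-based)."""
    out = []
    for line in text.splitlines():
        is_table_row = line.lstrip().startswith("|")
        # Drop table separator rows such as |---|:---:|
        if is_table_row and "-" in line and re.fullmatch(r"[\s|:-]+", line):
            continue
        line = re.sub(r"^\s{0,3}#{1,6}\s+", "", line)         # headings
        line = re.sub(r"^\s*(?:[-*+]|\d+[.)])\s+", "", line)  # list markers
        line = re.sub(r"(\*\*|__)(.+?)\1", r"\2", line)       # bold
        line = re.sub(r"(?<!\w)([*_])(?=\S)(.+?)(?<=\S)\1(?!\w)", r"\2", line)  # italics
        if is_table_row:
            cells = [c.strip() for c in line.strip().strip("|").split("|")]
            line = ", ".join(c for c in cells if c)
        out.append(line.strip())
    return re.sub(r"\s+", " ", " ".join(l for l in out if l)).strip()


def format_bias_delta(items: List[Tuple[str, str]], score: Score) -> float:
    """Mean of (score on formatted text) minus (score on stripped text)."""
    if not items:
        return 0.0
    deltas = [score(p, r) - score(p, strip_markdown(r)) for p, r in items]
    return sum(deltas) / len(deltas)
```

A positive delta means the judge scored the formatted version higher than the same content as plain prose. Flattening a table into comma-separated cells loses some structure, so table-heavy responses are worth spot-checking by hand.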
These three audits each take a few hundred extra judge calls and a couple of hours of work. They tell you the noise floor of your evaluator. Without them, you cannot tell whether a prompt-iteration delta is a real improvement or sits below the bias floor, where it is indistinguishable from evaluator artifacts.
The Calibration Discipline That Keeps It Honest
A one-time audit at setup is not enough. Two things drift.
The first is the judge itself. Provider-side model updates change behavior: the same judge prompt against the same eval set on gpt-4o-2024-08-06 and gpt-4o-2024-11-20 is not guaranteed to produce the same scores. If you pin a judge model, you accept staleness. If you let it float, you accept silent drift. Either way, you need a recurring calibration check.
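A recurring check of this kind might look like the sketch below. It assumes you have already collected verdicts from two judge snapshots on the same frozen eval items, plus a set of human labels for those items. The function names, the dictionary keys, and the 0.05 tolerance are illustrative assumptions, not values from any study.

```python
from typing import Dict, Sequence, Union


def agreement_rate(x: Sequence[str], y: Sequence[str]) -> float:
    """Fraction of items on which two verdict lists agree."""
    if len(x) != len(y):
        raise ValueError("verdict lists must cover the same items")
    if not x:
        return 0.0
    return sum(a == b for a, b in zip(x, y)) / len(x)


def drift_report(
    old_verdicts: Sequence[str],
    new_verdicts: Sequence[str],
    human_labels: Sequence[str],
    tolerance: float = 0.05,  # hypothetical threshold; tune for your eval
) -> Dict[str, Union[float, bool]]:
    """Compare two judge snapshots against each other and against humans."""
    old_vs_human = agreement_rate(old_verdicts, human_labels)
    new_vs_human = agreement_rate(new_verdicts, human_labels)
    return {
        "snapshot_agreement": agreement_rate(old_verdicts, new_verdicts),
        "old_vs_human": old_vs_human,
        "new_vs_human": new_vs_human,
        # Flag when the new snapshot agrees with humans noticeably less.
        "needs_recalibration": (old_vs_human - new_vs_human) > tolerance,
    }
```

Low snapshot agreement alone says the judge changed. Only the comparison against human labels says whether it changed for the worse.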
The second is the task. The eval set you wrote at launch reflects the failure modes you knew about then. As your product matures, the distribution of real user inputs shifts, and so do the failure modes. Six months in, your judge may be perfectly calibrated against an eval set that no longer represents the workload.
The discipline that handles both is a held-out human-preference panel, refreshed quarterly. The structure is simple:
- https://aclanthology.org/2025.ijcnlp-long.18.pdf
- https://llm-judge-bias.github.io/
- https://arxiv.org/abs/2406.07791
- https://arxiv.org/html/2410.21819v2
- https://openreview.net/forum?id=CybBmzWBX0
- https://arxiv.org/html/2404.04475v1
- https://arxiv.org/html/2407.01085v3
- https://www.adaptive-ml.com/post/fair-fight
- https://eugeneyan.com/writing/llm-evaluators/
- https://www.evidentlyai.com/llm-guide/llm-as-a-judge
- https://arize.com/llm-as-a-judge/
- https://arxiv.org/abs/2403.16950
