Why You Can't Trust LLM-as-a-Judge Scores (Yet)

A 0.82 that means nothing

Take a state-of-the-art GPT-4 judge, hand it two answers, and ask which is better. Now swap the order of those two answers and ask again. It changes its mind about a third of the time — the original MT-Bench study measured the same verdict on just 65% of order-swapped pairs. Weaker judges were worse than a coin flip: Claude-v1 held its verdict only 23.8% of the time.

Here's the one that should keep you up at night. On AlpacaEval, researchers moved a model from a 22.9% to a 64.3% win rate without improving a single answer — they just told it to write longer responses. The judge rewarded verbosity, not quality.

If you're running LLM-as-a-judge evals in production — and you should be — none of this means the technique is broken. It means a raw judge score is a measurement with no error bars. A "0.82 helpfulness" reads like a fact. Until you know how that judge behaves under bias, it's closer to a vibe with a decimal point.

This post is about closing that gap: what the biases actually are, how to measure whether your judge agrees with reality, and the handful of fixes that turn an LLM judge from a confident guesser into an instrument you can calibrate.

The biases have names (and numbers)

LLM judges fail in specific, reproducible ways. The 2024 "Justice or Prejudice?" survey catalogued twelve distinct bias types. Four matter most in practice.

Position bias. The order you present answers in changes the verdict. As above, GPT-4 agreed with itself on only 65% of swaps; the effect is strongest on close calls, which are exactly the cases where you most need the judge to be right. Practitioner audits report 20–40% verdict flips on closely matched pairs (treat those specific numbers as illustrative — the peer-reviewed anchor is the 65%/35% overall figure).

Verbosity bias. Longer looks better. In MT-Bench's "repetitive list attack" (rephrasing an answer to be longer without adding information), GPT-3.5 and Claude-v1 were fooled 91% of the time. Note the asymmetry: GPT-4 fell for it only 8.7% of the time. Frontier judges are far more length-robust than mid-tier ones, so "judges love length" is model-dependent, not a law.
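One rough way to check your own judge for length sensitivity is to look at how often the longer answer wins in logged pairwise verdicts. The sketch below assumes you have stored both answers and the verdict; `PairVerdict` and `longer_answer_win_rate` are hypothetical names, not part of any library.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PairVerdict:
    """One logged pairwise judgment (hypothetical record shape)."""
    answer_a: str
    answer_b: str
    winner: str  # "a", "b", or "tie"


def longer_answer_win_rate(verdicts: List[PairVerdict]) -> Optional[float]:
    """Share of decisive verdicts won by the longer answer.

    Ties and equal-length pairs are skipped. Returns None when nothing
    decisive is left to count.
    """
    longer_wins = 0
    decisive = 0
    for v in verdicts:
        len_a, len_b = len(v.answer_a), len(v.answer_b)
        if v.winner == "tie" or len_a == len_b:
            continue
        decisive += 1
        longer = "a" if len_a > len_b else "b"
        if v.winner == longer:
            longer_wins += 1
    return longer_wins / decisive if decisive else None
```

A rate far above 0.5 is a prompt to investigate, not proof of bias: longer answers are sometimes genuinely better, so compare the number against the same statistic computed on human verdicts if you have them.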

Self-preference. Judges tend to like outputs from their own model family. MT-Bench measured GPT-4 giving its own completions a ~10% win-rate bump and Claude-v1 a ~25% bump. Again, not universal — GPT-3.5 showed no self-favoritism — but enough that grading GPT-4o output with a GPT-4o judge deserves a skeptical eye.

Sycophancy. Push back on a judge and it caves. A 2025 EMNLP study found sustained argumentative pressure triggered sycophantic reversals at roughly 3× the rate of a neutral question. This matters most in conversational and agentic evals where the "user" turn can lean on the model.

The throughline: every one of these is a property of the judge as a measuring instrument, not of the outputs it's measuring. Which means you can characterize it — and once you can measure a bias, you can correct for it.

Your judge score needs a denominator

Before you fix anything, you need to know how wrong your judge is today. The honest metric is agreement with human labels, chance-corrected — Cohen's kappa, not raw agreement percentage. Raw agreement flatters you because it counts the times the judge and a human both said "good" by luck.
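To make the gap between raw agreement and kappa concrete, here is a minimal standard-library sketch of Cohen's kappa for two raters. The function name and the toy labels are illustrative assumptions; the formula is the standard one, (observed - expected) / (1 - expected).

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(judge: Sequence[Hashable], human: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between judge labels and human labels."""
    if len(judge) != len(human) or len(judge) == 0:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    # Raw agreement: fraction of items where both raters gave the same label.
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Agreement expected by chance, from each rater's own label frequencies.
    judge_counts = Counter(judge)
    human_counts = Counter(human)
    expected = sum(judge_counts[k] * human_counts[k] for k in judge_counts) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout: kappa is undefined.
        return float("nan")
    return (observed - expected) / (1.0 - expected)


# Toy data: 100 items, both raters say "good" 90% of the time.
judge = ["good"] * 85 + ["good"] * 5 + ["bad"] * 5 + ["bad"] * 5
human = ["good"] * 85 + ["bad"] * 5 + ["good"] * 5 + ["bad"] * 5
raw = sum(j == h for j, h in zip(judge, human)) / len(judge)
print(raw)                                   # 0.9
print(round(cohens_kappa(judge, human), 3))  # 0.444
```

In this toy case the judge and the human agree on 90% of items, yet kappa is only about 0.44, because two raters who each say "good" 90% of the time would agree on 82% of items by chance alone.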

What does good look like? The 2025 "Judge's Verdict" study put the best judges at kappa 0.78–0.82 against a human-human baseline of 0.80, right at the boundary between "substantial" and "almost perfect" agreement. The canonical MT-Bench result is that GPT-4 reaches 85% agreement with humans, slightly above the 81% humans reach with each other (both are raw agreement rates, not kappa). So a well-built judge genuinely can be as consistent as a human annotator.

The catch is that "well-built" is doing a lot of work in that sentence, and you only know which side of the line you're on if you've actually computed the number. The practitioner consensus is to require 85–90% judge-vs-expert agreement before trusting a judge for anything that gates a release, and to treat kappa below 0.6 as a production alarm.

You cannot get that number from the judge alone. You need a set of human-labeled examples to grade the judge against — a gold set. That requirement turns out to be the whole game, and we'll come back to it.

The fixes that actually move the number

The good news from 2026's research is that the cheap, obvious mitigations stack. The April 2026 "Judging the Judges" benchmark tested nine strategies and found that combining position-swapping, chain-of-thought, and a rubric lifted Claude Sonnet 4 agreement by 11.2 points (p < 0.0001), the single biggest gain. Eighteen of twenty configurations improved over baseline. Here's what's worth doing, roughly in order of return.

Make the judge reason before it scores. Chain-of-thought was "universally beneficial" in that benchmark — +7.2 points alone, +13 on adversarial data. Ask for the reasoning first, then the verdict, never the other way around (a score followed by post-hoc justification is theater).

Replace the 1-to-10 score with a rubric. Bare numeric scores are unstable and drift between runs. Decompose quality into explicit, separable criteria and have the judge fill in each — the G-Eval approach. This is also what makes a score auditable: "0.4 because it missed criterion 3" beats "0.4 because."

# Fragile: invites verbosity and position bias
Rate this answer's helpfulness from 1 to 10.

# Better: criteria-based, reasoning-first, scoped
For each criterion, give a verdict and one sentence of evidence
from the answer, THEN a 0/1 score:
  1. Directly addresses the user's question
  2. Factually grounded in the provided context
  3. No unsupported claims
Output JSON: {reasoning, criteria: [...], pass: bool}
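On the consuming side, it can help to validate the judge's JSON and recompute the pass flag from the per-criterion scores instead of trusting the judge's own summary. The sketch below assumes each entry in `criteria` is an object with a 0/1 `score` field; that shape, and the name `parse_rubric_verdict`, are assumptions for illustration, since the prompt above only fixes the top-level keys.

```python
import json
from typing import Any, Dict, List


def parse_rubric_verdict(raw: str, n_criteria: int = 3) -> Dict[str, Any]:
    """Validate a rubric-style judge response and derive pass/fail from it."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    criteria = data.get("criteria")
    if not isinstance(criteria, list) or len(criteria) != n_criteria:
        raise ValueError(f"expected {n_criteria} criteria in judge output")
    scores: List[int] = []
    for i, item in enumerate(criteria, start=1):
        # Assumed entry shape: {"score": 0 or 1, "evidence": "..."}
        score = item.get("score") if isinstance(item, dict) else None
        if score not in (0, 1):
            raise ValueError(f"criterion {i} has no 0/1 score")
        scores.append(int(score))
    return {
        "reasoning": data.get("reasoning", ""),
        "scores": scores,
        # Recomputed here so a judge that writes pass=true over a failed
        # criterion cannot slip through.
        "passed": all(s == 1 for s in scores),
        "failed_criteria": [i for i, s in enumerate(scores, start=1) if s == 0],
    }
```

Keeping `failed_criteria` alongside the verdict is what lets a dashboard say which criterion was missed, not just that the answer failed.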

Swap positions and require agreement. For pairwise comparisons, call the judge twice with the answers in both orders. Only count a win if it survives both; otherwise it's a tie. This directly neutralizes position bias for the cost of one extra call.
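The swap-and-agree rule can be sketched in a few lines. The `judge` callable here is a stand-in for whatever function wraps your model call; its signature and the "first"/"second"/"tie" return values are assumptions for illustration.

```python
from typing import Callable

# Hypothetical judge signature: (question, first_answer, second_answer)
# -> "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def swap_consistent_verdict(
    judge: Judge, question: str, answer_a: str, answer_b: str
) -> str:
    """Return "a" or "b" only if the same answer wins in both orders."""
    forward = judge(question, answer_a, answer_b)
    backward = judge(question, answer_b, answer_a)
    # Translate positional verdicts back to the underlying answers.
    forward_winner = {"first": "a", "second": "b"}.get(forward, "tie")
    backward_winner = {"first": "b", "second": "a"}.get(backward, "tie")
    if forward_winner == backward_winner:
        return forward_winner
    return "tie"  # the judge disagreed with itself, so no win is counted
```

A judge that always picks whatever it sees first will produce a tie on every pair under this rule, which is the intended behavior: the positional preference is turned into an abstention instead of a win.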

Pick pointwise or pairwise deliberately. Pairwise tracks human preference better but carries position bias. Pointwise (score each answer alone) is easier to trend on a dashboard because every score stands on its own, but the absolute numbers fluctuate run to run. A common split: pointwise for continuous monitoring, pairwise with position-swapping for release gates.

Use a panel, not a pundit. "Replacing Judges with Juries" showed a panel of three smaller models from different families, with majority vote, beating a single large judge, with less per-model bias and over 7× lower cost. A diverse panel dilutes family-specific self-preference in a way one big judge never can.
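A majority-vote panel is mostly orchestration. In this minimal sketch each judge is a callable returning a label such as "pass" or "fail"; the signature, the label names, and the "no_majority" fallback are illustrative assumptions, not the setup used in the cited paper.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical judge signature: (question, answer) -> label string.
PanelJudge = Callable[[str, str], str]


def panel_verdict(judges: Sequence[PanelJudge], question: str, answer: str) -> str:
    """Return the label backed by a strict majority of the panel."""
    if not judges:
        raise ValueError("panel needs at least one judge")
    votes = Counter(judge(question, answer) for judge in judges)
    label, count = votes.most_common(1)[0]
    # Require more than half the panel; otherwise report the split explicitly.
    return label if count > len(judges) / 2 else "no_majority"
```

With three judges and two possible labels a strict majority always exists, so the fallback only matters for even-sized panels or richer label sets.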

None of these require new infrastructure. They're prompt-and-orchestration changes. The piece that does require infrastructure is the one that makes all of them trustworthy over time.

The part everyone skips: calibration and drift

Every technique above improves your judge at a point in time. The problem is that judges don't stay put. Providers update models under you, your traffic distribution shifts, and a judge prompt that scored a clean 0.8 kappa in March silently inflates or deflates by summer. Braintrust and others put the practical drift window at 60–90 days without recalibration.

So a trustworthy eval isn't a judge prompt. It's a loop:

1. Maintain a gold set: 200–500 human-labeled traces per rubric, drawn from real traffic, not synthetic fixtures. Refresh the stalest chunk quarterly.
2. Score the gold set with your judge on a schedule. Monthly is typical.
3. Compute judge-vs-human kappa. Alert if it drops below 0.6.
4. When it drifts, recalibrate: adjust the rubric, swap the judge model, or add few-shot examples until agreement recovers.

This is the 2026 best-practice consensus, and it maps onto a three-tier strategy that keeps cost sane:

Tier 1 — deterministic checks on everything: JSON validity, format, length, forbidden strings. Free, instant, perfectly reliable.
Tier 2 — LLM judge on a 5–20% sample, where a real rubric exists.
Tier 3 — humans on a small gold set, for high-stakes calls and to calibrate Tiers 1 and 2.
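The first two tiers can be sketched as a small router: run the deterministic checks on every response, then sample a fraction of the survivors for the LLM judge. Function names, the length limit, the forbidden string, and the choice to skip the judge when Tier 1 fails are all assumptions made for illustration.

```python
import json
import random
from typing import List, Sequence


def tier1_checks(
    response: str,
    max_chars: int = 4000,
    forbidden: Sequence[str] = ("BEGIN PRIVATE KEY",),
) -> List[str]:
    """Deterministic checks; returns the names of the checks that failed."""
    failures: List[str] = []
    try:
        json.loads(response)
    except json.JSONDecodeError:
        failures.append("invalid_json")
    if len(response) > max_chars:
        failures.append("too_long")
    if any(s in response for s in forbidden):
        failures.append("forbidden_string")
    return failures


def route(
    response: str,
    sample_rate: float = 0.10,
    rng: random.Random = random.Random(),
) -> str:
    """Decide what happens to one response under the tiered scheme."""
    if tier1_checks(response):
        return "fail_tier1"  # assumed policy: no judge call on a broken response
    # Tier 2: judge only a random sample of the responses that passed Tier 1.
    if rng.random() < sample_rate:
        return "send_to_judge"
    return "skip"
```

Passing a seeded `random.Random` makes the sampling reproducible in tests; in production you might instead sample on a hash of the trace ID so the same trace is always in or out of the sample.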

The judge does the volume. The humans don't grade everything — they grade the judge. That's the only way a score with a decimal point earns the trust you're putting in it.

How Grepture helps

Grepture runs LLM-as-a-judge evals on your real production traffic, not a synthetic test suite — which means the substrate for everything above is already in place.

We're not going to pretend the judge is infallible; this whole post is the argument against that. What the gateway gives you is the ability to verify it:

Every score ships with the judge's written reasoning, inline in the traffic log next to model, tokens, cost, and latency. A score you can read the reasoning for is a score you can audit for the biases above — and you can move the judge from a bare 1-to-10 to a criteria-based rubric prompt in the evaluator config.
Sampling and filters are built in, so the Tier 2 "judge on 5–20%" pattern is a slider, not a pipeline you have to build.
Your gold set comes from your own logs. Because the real requests and responses are already captured, you can turn production traffic into labeled datasets and use those human-labeled traces as the ground truth you measure the judge against — the calibration step that's otherwise the hardest to stand up.
Scores follow prompt versions. If you're using prompt management, every prompt change carries a continuous quality signal, so you catch the regression a judge would otherwise paper over.

The gateway already sees every request. Calibration — comparing what the judge said to what a human said, on traffic that actually happened — is a natural extension of that, not a separate system to integrate.

If you want quality scores you can actually defend, start where the traffic already is. See how Grepture's evals work, or compare LLM observability approaches before you build your own.