
Inside the RAG Arena.
When the Judges Don't Agree.

We ran a 200-item RAG arena on the AskTheDoctor corpus across three models and two retrieval configurations. The headline (v2-atd ≈ Llama 4 Scout, both at ~0.58) is interesting. The methodology footnote is more interesting: we then re-judged 415 of those answers with two different LLM judges and got Spearman ρ = 0.55 between them. That number is the case for human calibration.

Building in Public — RAG Arena Diaries


The whole experiment is two equations and one footnote.

Per-variant overall score, mean over all $S$ scorer rubrics applied to all $N$ scored items:

$$ \overline{x}_v \;=\; \frac{1}{N \cdot S} \sum_{i=1}^{N} \sum_{s=1}^{S} x_{v,i,s} \quad\text{where}\quad x_{v,i,s} \in [0, 1]. $$
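A minimal sketch of that aggregation in Python, assuming the per-item, per-scorer scores have already been collected into an N × S array (the array here is illustrative, not the aggregator's real data path):

    import numpy as np

    # scores[i, s] = x_{v,i,s}: score of item i under scorer rubric s, each in [0, 1].
    # Illustrative shape: N = 200 items, S = 4 scorers.
    scores = np.random.rand(200, 4)

    overall = scores.mean()           # the per-variant headline: sum over all N*S cells / (N*S)
    per_scorer = scores.mean(axis=0)  # the per-scorer breakdown reported further down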

Inter-judge agreement, the rank correlation between two LLM judges $j_a$ and $j_b$ scoring the same $n$ items under the same rubric:

$$ \rho(j_a, j_b) \;=\; 1 \;-\; \frac{6 \sum_{i=1}^{n} d_i^{\,2}}{n\,(n^2 - 1)} \quad\text{where}\quad d_i \;=\; \operatorname{rank}(s_{j_a, i}) \;-\; \operatorname{rank}(s_{j_b, i}). $$
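The same quantity via scipy, as a minimal sketch (the score lists are illustrative); note that spearmanr ranks with average ranks, so it also handles the tied scores a discrete rubric produces, which the closed form above technically assumes away:

    from scipy.stats import spearmanr

    # Scores from two judges over the same n items, same rubric, aligned by item id.
    judge_a = [0.75, 1.00, 0.25, 0.50, 0.75, 0.00]
    judge_b = [0.50, 1.00, 0.25, 0.25, 1.00, 0.25]

    rho, p_value = spearmanr(judge_a, judge_b)
    print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")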

The footnote: we measured ρ for two of the judges in our default rotation and got 0.552. Below the 0.85 threshold we use for human-anchored calibration (prior post here). Below the 0.70 floor we’d treat as “moderate” agreement. The two judges aren’t measuring the same underlying quality.

That doesn’t make the headline arena score wrong. It makes it conditional. v2-atd 0.585 vs Llama 4 Scout 0.582 is a real measurement of how gemini-3.1-flash-lite-preview ranks the variants on the AskTheDoctor corpus with EXIT 4× compression. The corrected interpretation isn’t “v2-atd is tied with Llama 4 Scout.” It’s “they’re tied under this judge, and we now know we don’t know what they look like under a different one.”

This is the post.


The setup

We wanted one number. The product question we were trying to answer:

For the AskTheDoctor (ATD) medical Q&A corpus, with the production RAG vector group already deployed, which model + retrieval configuration produces the best answer for a typical patient question?

The arena:

  • Suite: 200 held-out ATD validation items (69ec98bcc410b34ce668849a)
  • Three models: v2-atd (our Modal QLoRA Gemma 4 31B fine-tune), Llama 4 Scout, Opus 4.7
  • Two RAG configurations: uncompressed retrieval (Arena A) and EXIT 4× compressed retrieval (Arena B), with the prod RAG vector group at top-K=5, min-score=0.62
  • Persona prefix attached at the release level (Release A — Fuhrman persona)
  • Four scorers at equal weight = 1.0:
    • llm-correctness — does the response match the gold facts (legacy, strict)
    • llm-completeness-coverage — covers the gold’s key points, length-tolerant (PR #1001)
    • llm-relevance — does the response address the question
    • reference-perplexity:v2-atd-ppl — forced-decoding PPL of the response under v2-atd, length-normalized (PR #1000)
  • Judge for the LLM-judged scorers: gemini-3.1-flash-lite-preview (the default — calibration pending at the time)

EXIT 4× was added because v2-atd’s effective context window for our prod-shaped prompts is ~1600 tokens once the persona prefix and chat scaffolding are subtracted. Uncompressed top-5 retrieval blew past it on most items. The hypothesis we wanted to test was whether this was a v2-atd quality problem or a context-fit problem. If EXIT 4× compression brought retrieval inside the window without dropping accuracy, the answer was the latter.

EXIT 4× compression, simplified: keep the top-K retrieved chunks but compress each chunk by a factor of 4 using an extractive transformer (the EXIT model) before injecting into the prompt. Per the prior internal V&V, the expected accuracy delta is ~0pp at the 4× level for retrieval-augmented Q&A.
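A minimal sketch of the shape of that step, with per-sentence relevance scores standing in for the actual EXIT model (the function, the naive sentence split, and the scores are illustrative, not the production pipeline):

    def compress_chunk(chunk: str, sentence_scores: list[float], ratio: int = 4) -> str:
        # Keep roughly the top 1/ratio of sentences by query-relevance, in original order.
        # sentence_scores[i] stands in for the EXIT model's score of sentence i against the query.
        sentences = chunk.split(". ")
        keep_n = max(1, len(sentences) // ratio)
        ranked = sorted(range(len(sentences)), key=lambda i: sentence_scores[i], reverse=True)
        kept = sorted(ranked[:keep_n])  # restore document order
        return ". ".join(sentences[i] for i in kept)

    # The compressed chunks replace the raw top-K retrieval in the prompt, which is how
    # the same top-K=5 context fits inside v2-atd's ~1600-token effective window.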


The headline result (and the per-scorer breakdown)

Arena B — EXIT 4× compressed RAG, error-filtered:

Variant | n_total | n_errors | n_valid | mean (4-scorer overall)
v2-atd (Modal QLoRA Gemma 4 31B) | 212 | 28 | 212 | 0.585
Llama 4 Scout | 203 | 17 | 203 | 0.582
Opus 4.7 | 219 | 221 | 0 | RETRACTED — every response was a Vertex 404

Per-scorer breakdown — the actually-interesting result:

Scorer | v2-atd | Llama 4 Scout | Δ (v2-atd − Llama)
llm-completeness-coverage (PR #1001, length-tolerant) | 0.565 | 0.534 | +0.031
llm-correctness (legacy strict) | 0.443 | 0.458 | −0.015
llm-relevance | 0.635 | 0.635 | 0.000
reference-perplexity:v2-atd-ppl (PR #1000, forced-decoding PPL) | 0.700 | 0.699 | +0.001

The story the legacy 3-scorer headline missed: when we measure with the length-tolerant completeness scorer (the one specifically designed not to punish thoroughness), v2-atd wins by +3.1pp on completeness-coverage. The legacy llm-correctness scorer (strict and brevity-biased) gives Llama a 1.5pp edge on facts-match (Δ = −0.015). The two largely cancel in the unweighted overall mean (0.585 vs 0.582), but the underlying signal isn’t “tied — they’re equivalent.” The signal is “v2-atd is more thorough per-response; Llama is marginally more strictly accurate on individual facts.”
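Up to rounding of the published per-scorer values, the unweighted means reproduce the headline:

$$ \overline{x}_{\text{v2-atd}} \;=\; \tfrac{0.565 + 0.443 + 0.635 + 0.700}{4} \approx 0.586, \qquad \overline{x}_{\text{Llama}} \;=\; \tfrac{0.534 + 0.458 + 0.635 + 0.699}{4} \approx 0.582. $$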

Reference-perplexity is essentially tied (0.700 vs 0.699) — both models produce text in-distribution under v2-atd’s own decoder. Expected: Llama’s outputs aren’t dramatically different from v2-atd’s stylometrically when both are constrained by the same RAG context.

[Figure: per-scorer arena breakdown, v2-atd (Modal QLoRA Gemma 4 31B) vs Llama 4 Scout on EXIT 4× compressed RAG; same per-scorer values as the table above, with the +3.1pp completeness-coverage edge highlighted.]
The headline tie isn't "they're equivalent." It's "v2-atd is more thorough per-response; Llama is marginally more strictly accurate on individual facts; the unweighted mean averages those two opposite shapes into the same number." The length-tolerant scorer (PR #1001) is what surfaces the underlying signal.

Compression sensitivity: the EXIT 4× claim, validated on a customer corpus

The compression-sensitivity check measures the same model under both retrieval configurations:

Variant | Arena A mean (uncompressed) | Arena B mean (EXIT 4×) | Δ
Llama 4 Scout | 0.578 | 0.582 | +0.004

$$ \Delta_{\text{compression}} \;=\; \overline{x}_v^{\,\text{EXIT}} \;-\; \overline{x}_v^{\,\text{uncompressed}} \;=\; +0.004. $$

Essentially zero. The prior internal “0pp loss at 4×” V&V on synthetic data holds on a real customer corpus. EXIT 4× is doing what it’s supposed to: shrinking the retrieved context to fit downstream models without measurably degrading the answer the next stage produces.

That was the variable that moved v2-atd from “looks like it can’t do RAG” to “looks roughly tied with Llama 4 Scout.” The gap wasn’t the model; it was the prompt budget.


The Opus retraction: a silent-failure mode worth naming

Every Opus 4.7 row in the arena scored ~0.15 overall. Looked plausible — Opus is a strong model, but on an unfamiliar domain corpus a 0.15 mean isn’t impossible.

It was impossible because none of the responses were responses. A regional routing misconfiguration meant every Opus call returned an API error string instead of a generated answer, and the test runner counted non-exception calls as “passed” regardless of whether the response was a real answer or an error blob like “Error generating response from anthropic-vertex: Publisher Model claude-opus-4-7 not found in us-central1”.

The error strings then scored coherently on the rubrics. llm-relevance on a JSON error blob returns ~0 (correctly, the blob doesn’t address the question). But reference-perplexity:v2-atd-ppl returns ~0.79 on any short, well-formed English — and error JSON is short and well-formed. The headline mean for Opus stayed in the “weak but plausible” zone instead of the “obvious failure” zone.

The lesson is the lesson: a measurement pipeline that doesn’t loudly distinguish “model produced a bad answer” from “model wasn’t called and we captured the error as if it were an answer” is producing numbers, not measurements. This is a generic risk for any LLM-as-judge pipeline that scores response text without first validating the response is a response. We retracted Opus from this run and added an error-response filter to the per-variant aggregator before re-publishing the headline. The retracted result wasn’t expensive in dollars; it was expensive in trust — for the few hours before we caught it, we had a number that looked like a finding.
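A minimal sketch of what that error-response filter can look like; the patterns and field names are illustrative, not the aggregator's actual implementation (the real filter has to match the provider's actual error shapes):

    import re

    # Patterns that mark a captured "response" as an upstream error rather than a model answer.
    ERROR_PATTERNS = [
        re.compile(r"^Error generating response from", re.IGNORECASE),
        re.compile(r"Publisher Model .+ not found"),
        re.compile(r'"error"\s*:\s*\{'),  # raw JSON error envelope
    ]

    def is_error_response(text: str) -> bool:
        return any(p.search(text) for p in ERROR_PATTERNS)

    def split_valid_rows(rows: list[dict]) -> tuple[list[dict], int]:
        # Applied before aggregation: error rows are excluded from the mean and counted
        # loudly as n_errors instead of being scored as if they were answers.
        valid = [r for r in rows if not is_error_response(r["response_text"])]
        return valid, len(rows) - len(valid)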


The judge calibration finding: ρ = 0.552

After the arena, we cross-checked ourselves. Sampled 30 (response, gold) pairs stratified across low/mid/high prior-score bins. Re-scored each pair with two judges using the exact llm-completeness-coverage prompt:

  • Trusted baseline: gemini-2.5-flash — the well-calibrated default per our prior judge-calibration work
  • Candidate upgrade: gemini-3.1-pro-preview — a thinking model with ~282 thinking tokens per call, meaningful instruction-following capacity for nuanced rubrics
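A minimal sketch of the stratified draw, assuming each candidate pair carries its prior score (the bin edges, field names, and seed are illustrative):

    import random

    def stratified_sample(pairs: list[dict], k_per_bin: int = 10, seed: int = 7) -> list[dict]:
        # 3 bins x 10 pairs = the 30 (response, gold) pairs re-scored by both judges.
        bins = {"low": [], "mid": [], "high": []}
        for p in pairs:
            s = p["prior_score"]
            bins["low" if s < 0.33 else "mid" if s < 0.67 else "high"].append(p)
        rng = random.Random(seed)
        return [p for b in bins.values() for p in rng.sample(b, min(k_per_bin, len(b)))]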

Then we did the inter-judge ρ matrix on a larger 415-item pool:

Judge | gemini-2.5-flash | gemini-3.1-pro
gemini-2.5-flash | 1.000 (n=415) | 0.552 (n=413)
gemini-3.1-pro | 0.552 (n=413) | 1.000 (n=413)
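A minimal sketch of how a pairwise matrix like this gets assembled; each cell is computed only over the items both judges actually scored, which is why the diagonal reaches n=415 while the off-diagonal pair drops to n=413 (the data layout is illustrative, not the actual script):

    from scipy.stats import spearmanr

    def rho_matrix(scores_by_judge: dict[str, dict[str, float]]) -> dict[tuple[str, str], tuple[float, int]]:
        # scores_by_judge[judge][item_id] = score under the shared rubric.
        out = {}
        for a in scores_by_judge:
            for b in scores_by_judge:
                shared = sorted(set(scores_by_judge[a]) & set(scores_by_judge[b]))
                xs = [scores_by_judge[a][i] for i in shared]
                ys = [scores_by_judge[b][i] for i in shared]
                rho, _ = spearmanr(xs, ys)
                out[(a, b)] = (rho, len(shared))
        return out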

ρ = 0.552 between two flagship Gemini judges scoring the same 413 answers under the same rubric.

For comparison, the threshold we use to trust a judge against a human anchor is ρ ≥ 0.85 (prior post on this). The floor we’d accept as “moderate consensus” is around 0.70. Two judges that nominally do the same job, given the same prompts and the same answers, are landing at 0.552.

[Figure: pairwise Spearman ρ between three Gemini judges scoring the same answers under the same rubric. flash-lite-preview ↔ 2.5-flash: 0.750; 2.5-flash ↔ 3.1-pro on the calibration sample (n=28): 0.679; 2.5-flash ↔ 3.1-pro on the larger pool (n=413): 0.552, the headline; flash-lite-preview ↔ 3.1-pro: 0.658. None clear the 0.85 human-anchor threshold; three of four also fall below the 0.70 moderate-consensus floor. Verdict: don't adopt 3.1-pro as the new judge; re-running the arena would change the headline by an unknown amount and lose comparability with prior data.]
The 0.552 isn't a defect of either judge. It's a measurement of how much "LLM-as-judge" is doing on this rubric. Until a human anchor exists, the arena report ships with its judge identity in the methods caveat — switching judges between runs would change two variables at once.

The direction of disagreement is consistent though:

  • flash-lite-preview is the most lenient (mean ~1.0 on items where both newer judges scored 0.25-0.5).
  • 2.5-flash is moderate (mean 0.268 on the calibration sample).
  • 3.1-pro-preview is the strictest (mean 0.214).

So the “right answer” is somewhere across all three — but we don’t have it without a human anchor.


What this means for the arena number

It does not invalidate the arena number. It contextualizes it. The 0.585 vs 0.582 headline is real — but the interpretation is:

Under gemini-3.1-flash-lite-preview as the judge for the LLM-judged scorers, on the AskTheDoctor 200-item held-out validation set with EXIT 4× compression on the prod RAG vector group, v2-atd scores 0.585 and Llama 4 Scout scores 0.582 on the unweighted 4-scorer mean. Re-running the same evaluation with gemini-2.5-flash would shift both numbers by an unknown amount (likely several percentage points each) and the gap may flip sign.

That’s a defensible measurement. It’s not a judgment-free measurement, and we now have to write that down every time we report it.

The decision we made: don’t re-run the arena under a different judge until we have a human-anchored calibration set. Switching judges between arena runs is methodologically invalid — you’d be changing two variables at once (model under test and yardstick) and any difference is attributable to either. Stay on gemini-3.1-flash-lite-preview as the established (if imperfect) baseline. Use the corrected aggregator (error filter + dynamic scorer enumeration) for all future runs. Add the methods caveat to every published headline.

The sustainable fix is a human-labeled gold benchmark for the completeness-coverage rubric. ~50 (response, gold) pairs with a domain expert’s ratings on a 5-point scale ($\{0, 0.25, 0.5, 0.75, 1.0\}$). Each candidate judge’s ρ vs the human gold gives an actual truth-anchor instead of judge-vs-judge noise. The expert in our case is Dr. Joel Fuhrman; the calibration session is queued; the result will land as an addendum here and in the Calibrating the Judge post.


The routing layer: why the calibration session pays twice

Calibration sessions do two things at once. The first is the one above — they tell us which judge to trust. The second is the part that ships into production: each calibration question carries up to $V$ variants, and the rater picks one as best overall. That winner pick fires an asynchronous side effect:

def on_winner_picked(question_class, winning_variant):
    # Fired asynchronously when the rater picks a winner for a calibration question:
    # bias routing for this question class toward the RAG vector group behind the winning answer.
    routing_layer.update_preference(
        question_class,
        rag_vector_group=winning_variant.rag_group,
    )

The next user query that the platform classifies into the same question_class biases toward the winning RAG vector group. Five winner picks later, the routing layer has learned the rater’s preference for that question class. Fifty winner picks later, it covers the major question classes in the suite. The same human, the same 50 answers, two production systems improved.

Concretely, the loop:

  1. The rater answers the same question against (e.g.) three RAG vector groups: “core nutrition corpus” / “user-submitted Q&A” / “video transcripts.”
  2. They pick the answer from the “core nutrition corpus” as best overall.
  3. The next user asks a similar question → the platform routes there first.
  4. Five ratings later, the routing has learned the rater’s preference for that class.

The upper-bound cost of this is the same 30 minutes of expert time the calibration session already takes. The marginal cost of adding the routing-update side effect is one extra column in the calibration UI and one async write to the routing-preference store per rated answer.
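A minimal sketch of the preference store this implies: one async write per winner pick, one lookup per classified query (the class and method names are illustrative, not the platform's actual routing API):

    from collections import Counter, defaultdict

    class RoutingPreferenceStore:
        """Per-question-class vote counts over RAG vector groups."""

        def __init__(self):
            self._votes = defaultdict(Counter)

        def update_preference(self, question_class: str, rag_vector_group: str) -> None:
            # Called from on_winner_picked above; each calibration winner adds one vote.
            self._votes[question_class][rag_vector_group] += 1

        def preferred_group(self, question_class: str, default: str) -> str:
            # Called at query time, after the platform classifies the incoming question.
            votes = self._votes[question_class]
            return votes.most_common(1)[0][0] if votes else default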


What’s pending

Two research threads we’re following:

Human-anchored judge calibration on the Fuhrman 50. Fifty held-out responses are queued for Dr. Fuhrman to rate on a 5-point scale ($\{0, 0.25, 0.5, 0.75, 1.0\}$). Once his ratings land, we compute Spearman ρ between his vector and each candidate LLM judge’s vector. If ≥ 1 judge clears ρ ≥ 0.85 with n ≥ 30, we have a defensible default judge for nutrition Q&A. If none do, that is the news — it would mean off-the-shelf judges are insufficient for the domain and the next experiment is to fine-tune a custom judge against the human anchor.

Opus 4.7 re-test. Once the regional routing config is fixed, we re-run Arena A and Arena B with Opus included. The retraction was a config bug, not an Opus quality claim, and a three-model arena with both Opus and v2-atd against the same corpus is a more interesting comparison than the two-model fallback we had to publish.


In summary

The arena answered the product question we asked. v2-atd is essentially tied with Llama 4 Scout on the AskTheDoctor corpus when both have RAG access at a configuration that fits v2-atd’s prompt budget. The “v2-atd can’t do RAG” hypothesis was a context-window blocker; EXIT 4× resolves it with measurably zero accuracy cost.

The arena also answered a question we didn’t ask and would have preferred to keep tacit: two flagship LLM judges scoring the same answers under the same rubric land at ρ = 0.552. That’s not a defect of either judge. It’s a measurement of how much “LLM-as-judge” is doing — and it’s the case for every arena report carrying its judge identity in the methods caveat until a human anchor exists.

When that anchor lands, the same calibration session pays for itself twice: the LLM judges get blessed (or not) for scoring the rest of the suite, and the production routing layer learns which RAG vector group the human prefers for each question class. One human session, two systems improved.

The scaffolding to do this — the four-surface calibration session (Web UI, CLI, MCP, Divinci Agent), the four-scorer arena, the dynamic aggregator, the error filter — is shipped. The doctor has the link.


Raw data and reproducibility: arena results in notebooks/arena_results/, corrected aggregation script at reaggregate_2026_04_25.py, inter-judge ρ matrix script at inter_judge_rho_matrix.py. Methods caveat sticks to every headline number until human-anchored calibration lands.

Building Divinci in public. Companion post on the calibration step itself: Calibrating the Judge: The Grader Gets Graded.
