Experiment Results · ICLR Peer Review Simulation

AI Reviewer Persona Comparison Dashboard

100 papers per run · 9 reviewer configurations · ICLR acceptance threshold ≥ 6 · OpenAI API · GPT-4o
Experiment Overview
Nine reviewer configurations were tested across 100 ICLR papers. Methods range from simple role-prompt baselines to dynamic persona generation and mixed-panel AC modes. Each configuration uses 4 reviewer slots (except Aug variants which use 5–6), followed by an Area Chair (AC) that produces a final accept/reject decision.
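A minimal sketch of the nine panel compositions, assuming each configuration can be summarized by its reviewer prompt types (the dict and labels below are illustrative, not the experiment's actual configuration format):

```python
# Illustrative encoding of the nine panel configurations.
# "V" = Vanilla (neutral prompt), "D" = DP-Full (dynamic per-paper persona);
# the FP-* panels list their four fixed roles explicitly.
PANELS = {
    "Vanilla":      ["V", "V", "V", "V"],
    "FP-Attitude":  ["default", "critical", "empiricist", "pedagogical"],
    "FP-Archetype": ["bluffer", "critic", "expert", "skimmer"],
    "DP-Full":      ["D", "D", "D", "D"],
    "Mix(1V+3D)":   ["V", "D", "D", "D"],
    "Mix(2V+2D)":   ["V", "V", "D", "D"],
    "Mix(3V+1D)":   ["V", "V", "V", "D"],
    "Aug(4D+1V)":   ["D", "D", "D", "D", "V"],
    "Aug(4D+2V)":   ["D", "D", "D", "D", "V", "V"],
}
# Each panel's reviews are then passed to one Area Chair (AC) call that
# produces the final accept/reject decision.
```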
Best Accept/Reject Accuracy
80.0%
Mix(2V+2D) — balanced hybrid panel
Best Cohen's Kappa
0.593
Mix(2V+2D) — moderate-strong agreement
Lowest Cost Per Paper
$0.0777
FP-Archetype — no web search overhead
Vanilla
4× identical neutral reviewer prompts. Primary baseline.
FP-Attitude
Fixed attitudinal roles: default, critical, empiricist, pedagogical.
FP-Archetype
Fixed archetypes: bluffer, critic, expert, skimmer.
DP-Full
Dynamic paper-specific persona (280–380 words) via web search.
Mix(1V+3D)
Mixed panel: 1 Vanilla + 3 DP-Full reviewers.
Mix(2V+2D) ★
Mixed panel: 2 Vanilla + 2 DP-Full. Best overall.
Mix(3V+1D)
Mixed panel: 3 Vanilla + 1 DP-Full reviewers.
Aug(4D+1V)
5-reviewer panel: 4 DP-Full + 1 Vanilla. Tests scaling.
Aug(4D+2V)
6-reviewer panel: 4 DP-Full + 2 Vanilla. Larger committee.
Multi-Metric Overview — All Methods
Radar comparison of five key metrics, each normalized to 0–1. Accuracy and Kappa come from Decision Metrics; Spearman from Mean Reviewer Rating alignment; Mean Std reflects opinion diversity; Cost-efficiency is normalized inverse cost.
NORMALIZED · 0–1
Each axis represents a different quality dimension. A larger polygon area indicates better overall performance across all dimensions simultaneously.
Decision Metrics
Binary accept/reject classification performance of the AC against human editorial decisions. ICLR acceptance threshold is rating ≥ 6 on a 1–10 scale. Accuracy measures overall agreement; Balanced Accuracy is robust to class imbalance; Cohen's κ removes chance agreement; Prec(A) and Rec(A) measure precision and recall for accepted papers; Spec(R) is specificity for rejections; F1(A) harmonizes accept precision and recall.
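For reference, these metrics can be computed from the paired human and AC decisions with scikit-learn; the snippet below is a minimal sketch with toy arrays (`y_true`, `y_pred`, and the sample values are illustrative, not experiment data):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             cohen_kappa_score, precision_score,
                             recall_score, f1_score)

# 1 = accept, 0 = reject; a paper counts as "accept" if its rating is >= 6.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # human editorial decisions (toy)
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])  # AC decisions (toy)

acc     = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)       # mean of per-class recall
kappa   = cohen_kappa_score(y_true, y_pred)             # agreement beyond chance
prec_a  = precision_score(y_true, y_pred, pos_label=1)  # Prec(A)
rec_a   = recall_score(y_true, y_pred, pos_label=1)     # Rec(A)
spec_r  = recall_score(y_true, y_pred, pos_label=0)     # Spec(R): recall of the reject class
f1_a    = f1_score(y_true, y_pred, pos_label=1)         # F1(A)
```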
Accept / Reject Accuracy & Balanced Accuracy
Balanced Accuracy (avg of recall for both classes) corrects for accept/reject imbalance. Mix(2V+2D) achieves the highest value on both metrics.
↑ HIGHER IS BETTER
Cohen's Kappa (κ)
Agreement beyond chance. κ > 0.4 = moderate, κ > 0.6 = substantial. Mix(2V+2D) comes closest to the substantial band at κ = 0.593.
Precision · Recall · Specificity · F1 (Accept class)
FP-Archetype and FP-Attitude excel at specificity (rejecting bad papers correctly). Mix(2V+2D) leads F1.
Decision Metrics — Full Table
Click column headers to sort
SORTABLE ↕
Method | Acc ↑ | Bal.Acc ↑ | Kappa ↑ | Prec(A) ↑ | Rec(A) ↑ | Spec(R) ↑ | F1(A) ↑
Score Alignment vs. Human
How closely LLM-generated numeric scores track human reviewer scores. MAE/RMSE measure absolute error (lower is better). Bias (Mean Bias Error) indicates systematic over/under-estimation — positive means the model overestimates, closer to 0 is better. Pearson/Spearman measure linear and rank correlation (higher is better). Aspect scores (Soundness, Presentation, Contribution) are on a 1–4 scale.
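The alignment metrics follow standard definitions; here is a minimal sketch of how they could be computed per method with NumPy/SciPy (the arrays are toy values, not experiment data):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Toy data: mean LLM reviewer rating vs. mean human reviewer rating per paper.
llm   = np.array([5.2, 6.1, 4.8, 7.0, 5.5])
human = np.array([5.0, 6.5, 4.0, 6.8, 5.9])

mae  = np.mean(np.abs(llm - human))           # mean absolute error, lower is better
rmse = np.sqrt(np.mean((llm - human) ** 2))   # root mean squared error, lower is better
bias = np.mean(llm - human)                   # mean bias error; > 0 means over-estimation
pearson,  _ = pearsonr(llm, human)            # linear correlation
spearman, _ = spearmanr(llm, human)           # rank correlation
```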
Mean Reviewer Rating — MAE & Pearson Correlation
Mix(1V+3D) and Mix(2V+2D) achieve the best Pearson correlation (0.55+). FP-Attitude has the lowest MAE. DP-Full shows the largest positive bias and tends to over-score.
REVIEWER RATINGS
Mean Reviewer Rating Alignment
SORTABLE ↕
Method | MAE ↓ | RMSE ↓ | Bias →0 | Pearson ↑ | Spearman ↑
Reviewer Agreement & Opinion Diversity
Intra-paper standard deviation of the four reviewer ratings measures opinion diversity. Higher std = more diverse perspectives, which can improve robustness but lowers consensus. Consensus Rate = fraction of papers where all four reviewers agree on accept/reject direction. There is a fundamental tension: diverse panels surface more disagreement, but may confuse the AC.
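A minimal sketch of both quantities, assuming the ratings sit in a papers × reviewers array and that the population std (ddof=0) is used; both assumptions are mine, not stated in the results:

```python
import numpy as np

THRESHOLD = 6  # ICLR accept threshold on the 1-10 scale

# Toy data: rows = papers, columns = the four reviewers' ratings.
ratings = np.array([
    [5, 6, 5, 6],
    [3, 7, 4, 6],
    [6, 6, 7, 6],
])

# Opinion diversity: std of ratings within each paper, averaged over papers.
mean_std = ratings.std(axis=1, ddof=0).mean()

# Consensus rate: fraction of papers where all reviewers vote the same way.
votes = ratings >= THRESHOLD
consensus_rate = np.mean(votes.all(axis=1) | (~votes).all(axis=1))
```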
Mean Intra-Paper Rating Std (Opinion Diversity)
FP-Archetype produces roughly 3.5× the per-paper rating std of Vanilla (0.77 vs. 0.22), reflecting its heterogeneous reviewer roles (bluffer/critic/expert/skimmer).
Consensus Rate (All Reviewers Agree)
Vanilla achieves 79% consensus (uniform prompts). FP-Archetype drops to 59% — diverse archetypes produce more disagreement.
Diversity vs. Decision Accuracy Trade-off
Scatter plot: x-axis = mean opinion std (diversity), y-axis = Accuracy. The sweet spot is not maximum diversity — Mix(2V+2D) balances moderate diversity (0.33 std) with best accuracy (0.80).
SCATTER · TRADE-OFF
Reviewer Agreement — Full Table
SORTABLE ↕
Method | Mean Std ↑ | Median Std ↑ | Min Std | Max Std | Consensus Rate ↑
Cost & Rating Distribution
API cost includes PDF uploads, 4 reviewer calls, and 1 AC call per paper. DP-Full is most expensive due to web search in persona generation. Mixed-panel methods reuse pre-computed reviews, incurring only the AC call overhead. Rating distribution reveals systematic over/under-estimation relative to ICLR's 1–10 scale (threshold at 6).
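As a rough sketch of how per-paper cost decomposes under this description (the dollar figures below are placeholders, not measured values):

```python
# Placeholder unit costs in USD; real values depend on paper length and model pricing.
PDF_UPLOAD, REVIEWER_CALL, AC_CALL = 0.030, 0.010, 0.008

def per_paper_cost(n_reviewers=4, reuse_reviews=False):
    """Full pipeline cost, or the marginal cost when a mixed panel reuses
    pre-computed reviews and only adds a new AC call."""
    if reuse_reviews:
        return AC_CALL
    return PDF_UPLOAD + n_reviewers * REVIEWER_CALL + AC_CALL

total_for_100_papers = 100 * per_paper_cost()
```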
Total Cost (USD · 100 papers)
FP-Archetype is cheapest at $7.77. DP-Full is the most expensive ($8.13) due to web search. Mixed variants reuse existing reviews, matching Vanilla cost.
Mean Reviewer Rating Distribution
All methods score slightly below the acceptance threshold of 6.0, indicating conservative rating behavior. FP-Archetype is most conservative (4.70 mean).
Cost vs. Decision Accuracy — Value Frontier
The ideal method sits top-left: lowest cost, highest accuracy. Mix(2V+2D) achieves the best accuracy at baseline cost. DP-Full pays more but underperforms.
VALUE FRONTIER
Key Insights & Recommendations
Summary of the most actionable findings across all experimental conditions.
INSIGHT 01
Balanced hybrid panels win overall
Mix(2V+2D) achieves Acc=0.800, κ=0.593, and F1=0.825 — the best across all primary decision metrics. Combining 2 stable Vanilla reviewers with 2 paper-specific Dynamic personas gives the AC a balanced signal: baseline reliability + targeted expertise.
INSIGHT 02
DP-Full alone is a trap: high cost, lower accuracy
Despite generating the most sophisticated per-paper personas (280–380 words + web search), DP-Full's stand-alone accuracy (0.700) is the worst of all methods — lower even than Vanilla. Its Recall(A) is highest (0.818), meaning it over-accepts. The dynamic persona benefit only manifests when diluted by stable reviewers.
INSIGHT 03
Fixed archetypes maximize diversity and specificity
FP-Archetype generates the highest opinion variance (std=0.773) and ties for best Spec(R) (0.867) — the best method for correctly rejecting bad papers. The bluffer/critic/expert/skimmer mix forces AC to resolve genuine disagreement, improving rejection precision while keeping cost at minimum.
INSIGHT 04
FP-Attitude dominates AC score alignment
For the task of predicting the AC's numeric score, FP-Attitude leads with MAE=0.917, Spearman=0.539, and near-zero bias (0.187). Its attitudinal roles (critical, empiricist, pedagogical) produce score distributions closest to how human ACs synthesize reviewer inputs.
INSIGHT 05
Scaling the panel beyond 4 reviewers does not help
Aug(4D+1V) and Aug(4D+2V) both underperform Mix(2V+2D) on every primary metric despite using 5–6 reviewers instead of 4. Adding more Dynamic personas introduces noise rather than signal — the AC appears to be overwhelmed by conflicting high-variance opinions, reducing consensus rate to 56–57%.
INSIGHT 06
The diversity–accuracy sweet spot is around std=0.33
The scatter of diversity vs. accuracy shows a non-linear relationship. Too little diversity (Vanilla, std=0.22) leaves borderline papers unresolved. Too much (FP-Archetype, std=0.77) creates conflicting signals the AC cannot cleanly aggregate. Mix(2V+2D) hits the sweet spot at std=0.333 with best accuracy.
Overall Ranking — Weighted Score (Decision 50% + Alignment 30% + Cost-efficiency 20%)
Composite ranking normalizes each metric to [0,1] and applies practical weights. This is a convenience summary — the optimal method depends on your primary objective (accuracy vs. score alignment vs. cost).
COMPOSITE RANK
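A minimal sketch of the composite score under these weights, assuming simple min-max normalization per metric and that cost-efficiency is the normalized negative cost (the input arrays are illustrative, not the reported results):

```python
import numpy as np

def minmax(x):
    """Normalize a metric to [0, 1] across methods."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

# Illustrative per-method inputs: a decision score, an alignment score,
# and the cost per paper in USD.
decision  = np.array([0.74, 0.80, 0.72])
alignment = np.array([0.48, 0.55, 0.41])
cost      = np.array([0.078, 0.078, 0.081])

composite = (0.5 * minmax(decision)
             + 0.3 * minmax(alignment)
             + 0.2 * minmax(-cost))   # cheaper methods score higher
ranking = np.argsort(-composite)      # best method first
```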