Experiment Results · ICLR Peer Review Simulation

AI Reviewer Persona Comparison Dashboard

100 papers per run · 9 reviewer configurations · ICLR acceptance threshold ≥ 6 · OpenAI API · GPT-4o
Experiment Overview
Nine reviewer configurations were tested across 100 ICLR papers. Methods range from simple role-prompt baselines to dynamic persona generation and mixed-panel AC modes. Each configuration uses 4 reviewer slots (except Aug variants which use 5–6), followed by an Area Chair (AC) that produces a final accept/reject decision.
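A minimal sketch of the nine panel compositions, assuming each configuration can be summarized by its reviewer prompt types (the dict and labels below are illustrative, not the experiment's actual configuration format):

```python
# Illustrative encoding of the nine panel configurations.
# "V" = Vanilla (neutral prompt), "D" = DP-Full (dynamic per-paper persona);
# the FP-* panels list their four fixed roles explicitly.
PANELS = {
    "Vanilla":      ["V", "V", "V", "V"],
    "FP-Attitude":  ["default", "critical", "empiricist", "pedagogical"],
    "FP-Archetype": ["bluffer", "critic", "expert", "skimmer"],
    "DP-Full":      ["D", "D", "D", "D"],
    "Mix(1V+3D)":   ["V", "D", "D", "D"],
    "Mix(2V+2D)":   ["V", "V", "D", "D"],
    "Mix(3V+1D)":   ["V", "V", "V", "D"],
    "Aug(4D+1V)":   ["D", "D", "D", "D", "V"],
    "Aug(4D+2V)":   ["D", "D", "D", "D", "V", "V"],
}
# Each panel's reviews are then passed to one Area Chair (AC) call that
# produces the final accept/reject decision.
```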
Best Accept/Reject Accuracy
80.0%
Mix(2V+2D) — balanced hybrid panel
Best Cohen's Kappa
0.593
Mix(2V+2D) — moderate-strong agreement
Lowest Cost Per Paper
$0.0777
FP-Archetype — no web search overhead
Vanilla
4× identical neutral reviewer prompts. Primary baseline.
FP-Attitude
Fixed attitudinal roles: default, critical, empiricist, pedagogical.
FP-Archetype
Fixed archetypes: bluffer, critic, expert, skimmer.
DP-Full
Dynamic paper-specific persona (280–380 words) via web search.
Mix(1V+3D)
Mixed panel: 1 Vanilla + 3 DP-Full reviewers.
Mix(2V+2D) ★
Mixed panel: 2 Vanilla + 2 DP-Full. Best overall.
Mix(3V+1D)
Mixed panel: 3 Vanilla + 1 DP-Full reviewers.
Aug(4D+1V)
5-reviewer panel: 4 DP-Full + 1 Vanilla. Tests scaling.
Aug(4D+2V)
6-reviewer panel: 4 DP-Full + 2 Vanilla. Larger committee.
Multi-Metric Overview — All Methods
Radar comparison of five key metrics, each normalized to 0–1. Accuracy and Kappa come from Decision Metrics; Spearman from Mean Reviewer Rating alignment; Mean Std reflects opinion diversity; Cost-efficiency is normalized inverse cost.
NORMALIZED · 0–1
Each axis represents a different quality dimension. A larger polygon area indicates better overall performance across all dimensions simultaneously.
Decision Metrics
Binary accept/reject classification performance of the AC against human editorial decisions. ICLR acceptance threshold is rating ≥ 6 on a 1–10 scale. Accuracy measures overall agreement; Balanced Accuracy is robust to class imbalance; Cohen's κ removes chance agreement; Prec(A) and Rec(A) measure precision and recall for accepted papers; Spec(R) is specificity for rejections; F1(A) harmonizes accept precision and recall.
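For reference, these metrics can be computed from the paired human and AC decisions with scikit-learn; the snippet below is a minimal sketch with toy arrays (`y_true`, `y_pred`, and the sample values are illustrative, not experiment data):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             cohen_kappa_score, precision_score,
                             recall_score, f1_score)

# 1 = accept, 0 = reject; a paper counts as "accept" if its rating is >= 6.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # human editorial decisions (toy)
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])  # AC decisions (toy)

acc     = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)       # mean of per-class recall
kappa   = cohen_kappa_score(y_true, y_pred)             # agreement beyond chance
prec_a  = precision_score(y_true, y_pred, pos_label=1)  # Prec(A)
rec_a   = recall_score(y_true, y_pred, pos_label=1)     # Rec(A)
spec_r  = recall_score(y_true, y_pred, pos_label=0)     # Spec(R): recall of the reject class
f1_a    = f1_score(y_true, y_pred, pos_label=1)         # F1(A)
```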
Accept / Reject Accuracy & Balanced Accuracy
Balanced Accuracy (avg of recall for both classes) corrects for accept/reject imbalance. Mix(2V+2D) achieves the highest value on both metrics.
↑ HIGHER IS BETTER
Cohen's Kappa (κ)
Agreement beyond chance. κ > 0.4 = moderate, κ > 0.6 = substantial. Mix(2V+2D) comes closest to the substantial band at κ = 0.593.
Precision · Recall · Specificity · F1 (Accept class)
FP-Archetype and FP-Attitude excel at specificity (rejecting bad papers correctly). Mix(2V+2D) leads F1.
Decision Metrics — Full Table
Click column headers to sort
SORTABLE ↕
Method | Acc ↑ | Bal.Acc ↑ | Kappa ↑ | Prec(A) ↑ | Rec(A) ↑ | Spec(R) ↑ | F1(A) ↑
Score Alignment vs. Human
How closely LLM-generated numeric scores track human reviewer scores. MAE/RMSE measure absolute error (lower is better). Bias (Mean Bias Error) indicates systematic over/under-estimation — positive means the model overestimates, closer to 0 is better. Pearson/Spearman measure linear and rank correlation (higher is better). Aspect scores (Soundness, Presentation, Contribution) are on a 1–4 scale.
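The alignment metrics follow standard definitions; here is a minimal sketch of how they could be computed per method with NumPy/SciPy (the arrays are toy values, not experiment data):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Toy data: mean LLM reviewer rating vs. mean human reviewer rating per paper.
llm   = np.array([5.2, 6.1, 4.8, 7.0, 5.5])
human = np.array([5.0, 6.5, 4.0, 6.8, 5.9])

mae  = np.mean(np.abs(llm - human))           # mean absolute error, lower is better
rmse = np.sqrt(np.mean((llm - human) ** 2))   # root mean squared error, lower is better
bias = np.mean(llm - human)                   # mean bias error; > 0 means over-estimation
pearson,  _ = pearsonr(llm, human)            # linear correlation
spearman, _ = spearmanr(llm, human)           # rank correlation
```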
Mean Reviewer Rating — MAE & Pearson Correlation
Mix(1V+3D) and Mix(2V+2D) achieve the best Pearson correlation (0.55+). FP-Attitude has the lowest MAE. DP-Full shows the largest positive bias and tends to over-score.
REVIEWER RATINGS
Mean Reviewer Rating Alignment
SORTABLE ↕
Method | MAE ↓ | RMSE ↓ | Bias →0 | Pearson ↑ | Spearman ↑
Reviewer Agreement & Opinion Diversity
Intra-paper standard deviation of the four reviewer ratings measures opinion diversity. Higher std = more diverse perspectives, which can improve robustness but lowers consensus. Consensus Rate = fraction of papers where all four reviewers agree on accept/reject direction. There is a fundamental tension: diverse panels surface more disagreement, but may confuse the AC.
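A minimal sketch of both quantities, assuming the ratings sit in a papers × reviewers array and that the population std (ddof=0) is used; both assumptions are mine, not stated in the results:

```python
import numpy as np

THRESHOLD = 6  # ICLR accept threshold on the 1-10 scale

# Toy data: rows = papers, columns = the four reviewers' ratings.
ratings = np.array([
    [5, 6, 5, 6],
    [3, 7, 4, 6],
    [6, 6, 7, 6],
])

# Opinion diversity: std of ratings within each paper, averaged over papers.
mean_std = ratings.std(axis=1, ddof=0).mean()

# Consensus rate: fraction of papers where all reviewers vote the same way.
votes = ratings >= THRESHOLD
consensus_rate = np.mean(votes.all(axis=1) | (~votes).all(axis=1))
```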
Mean Intra-Paper Rating Std (Opinion Diversity)
FP-Archetype produces roughly 3.5× the per-paper rating std of Vanilla (0.77 vs. 0.22), reflecting its heterogeneous reviewer roles (bluffer/critic/expert/skimmer).
Consensus Rate (All Reviewers Agree)
Vanilla achieves 79% consensus (uniform prompts). FP-Archetype drops to 59% — diverse archetypes produce more disagreement.
Diversity vs. Decision Accuracy Trade-off
Scatter plot: x-axis = mean opinion std (diversity), y-axis = Accuracy. The sweet spot is not maximum diversity — Mix(2V+2D) balances moderate diversity (0.33 std) with best accuracy (0.80).
SCATTER · TRADE-OFF
Reviewer Agreement — Full Table
SORTABLE ↕
Method | Mean Std ↑ | Median Std ↑ | Min Std | Max Std | Consensus Rate ↑
Cost & Rating Distribution
API cost includes PDF uploads, 4 reviewer calls, and 1 AC call per paper. DP-Full is most expensive due to web search in persona generation. Mixed-panel methods reuse pre-computed reviews, incurring only the AC call overhead. Rating distribution reveals systematic over/under-estimation relative to ICLR's 1–10 scale (threshold at 6).
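As a rough sketch of how per-paper cost decomposes under this description (the dollar figures below are placeholders, not measured values):

```python
# Placeholder unit costs in USD; real values depend on paper length and model pricing.
PDF_UPLOAD, REVIEWER_CALL, AC_CALL = 0.030, 0.010, 0.008

def per_paper_cost(n_reviewers=4, reuse_reviews=False):
    """Full pipeline cost, or the marginal cost when a mixed panel reuses
    pre-computed reviews and only adds a new AC call."""
    if reuse_reviews:
        return AC_CALL
    return PDF_UPLOAD + n_reviewers * REVIEWER_CALL + AC_CALL

total_for_100_papers = 100 * per_paper_cost()
```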
Total Cost (USD · 100 papers)
FP-Archetype is cheapest at $7.77. DP-Full is the most expensive ($8.13) due to web search. Mixed variants reuse existing reviews, matching Vanilla cost.
Mean Reviewer Rating Distribution
All methods score slightly below the acceptance threshold of 6.0, indicating conservative rating behavior. FP-Archetype is most conservative (4.70 mean).
Cost vs. Decision Accuracy — Value Frontier
The ideal method sits top-left: lowest cost, highest accuracy. Mix(2V+2D) achieves the best accuracy at baseline cost. DP-Full pays more but underperforms.
VALUE FRONTIER
Key Insights & Recommendations
Summary of the most actionable findings across all experimental conditions.
INSIGHT 01
Balanced hybrid panels win overall
Mix(2V+2D) achieves Acc=0.800, κ=0.593, and F1=0.825 — the best across all primary decision metrics. Combining 2 stable Vanilla reviewers with 2 paper-specific Dynamic personas gives the AC a balanced signal: baseline reliability + targeted expertise.
INSIGHT 02
DP-Full alone is a trap: high cost, lower accuracy
Despite generating the most sophisticated per-paper personas (280–380 words + web search), DP-Full's stand-alone accuracy (0.700) is the worst of all methods — lower even than Vanilla. Its Recall(A) is highest (0.818), meaning it over-accepts. The dynamic persona benefit only manifests when diluted by stable reviewers.
INSIGHT 03
Fixed archetypes maximize diversity and specificity
FP-Archetype generates the highest opinion variance (std=0.773) and ties for best Spec(R) (0.867) — the best method for correctly rejecting bad papers. The bluffer/critic/expert/skimmer mix forces AC to resolve genuine disagreement, improving rejection precision while keeping cost at minimum.
INSIGHT 04
FP-Attitude dominates AC score alignment
For the task of predicting the AC's numeric score, FP-Attitude leads with MAE=0.917, Spearman=0.539, and near-zero bias (0.187). Its attitudinal roles (critical, empiricist, pedagogical) produce score distributions closest to how human ACs synthesize reviewer inputs.
INSIGHT 05
Scaling the panel beyond 4 reviewers does not help
Aug(4D+1V) and Aug(4D+2V) both underperform Mix(2V+2D) on every primary metric despite using 5–6 reviewers instead of 4. Adding more Dynamic personas introduces noise rather than signal — the AC appears to be overwhelmed by conflicting high-variance opinions, reducing consensus rate to 56–57%.
INSIGHT 06
The diversity–accuracy sweet spot is around std=0.33
The scatter of diversity vs. accuracy shows a non-linear relationship. Too little diversity (Vanilla, std=0.22) leaves borderline papers unresolved. Too much (FP-Archetype, std=0.77) creates conflicting signals the AC cannot cleanly aggregate. Mix(2V+2D) hits the sweet spot at std=0.333 with best accuracy.
Overall Ranking — Weighted Score (Decision 50% + Alignment 30% + Cost-efficiency 20%)
Composite ranking normalizes each metric to [0,1] and applies practical weights. This is a convenience summary — the optimal method depends on your primary objective (accuracy vs. score alignment vs. cost).
COMPOSITE RANK
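A minimal sketch of the composite score under these weights, assuming simple min-max normalization per metric and that cost-efficiency is the normalized negative cost (the input arrays are illustrative, not the reported results):

```python
import numpy as np

def minmax(x):
    """Normalize a metric to [0, 1] across methods."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

# Illustrative per-method inputs: a decision score, an alignment score,
# and the cost per paper in USD.
decision  = np.array([0.74, 0.80, 0.72])
alignment = np.array([0.48, 0.55, 0.41])
cost      = np.array([0.078, 0.078, 0.081])

composite = (0.5 * minmax(decision)
             + 0.3 * minmax(alignment)
             + 0.2 * minmax(-cost))   # cheaper methods score higher
ranking = np.argsort(-composite)      # best method first
```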