Here are the full results.



## Benchmark Results



| Model | Band | Score | Evidence | Attribution (relevant/cited) | Cross-Attr | Total Time | Avg Adherence Call | $/essay |
|-------|------|-------|----------|------------------------------|------------|------------|--------------------|---------|
| **gpt-4o-mini** | 5 | 0.7562 | 18/18 | 10/18 (56%) | 8 | **42s** | 6.7s | $0.002 |
| **gpt-4.1-mini** | 4 | 0.6063 | 17/17 | **12/17 (71%)** | **5** | 54s | 8.8s | $0.006 |
| **gpt-5.4-mini** | 4 | 0.6063 | 20/19 | 13/19 (68%) | 6 | **26s** | **3.5s** | $0.015 |
| gemini-3-flash | FAIL | n/a | n/a | n/a | n/a | >315s | 11.3s | $0.010 |
| **gemini-3.1-flash-lite** | 4 | 0.6125 | 16/16 | 11/16 (69%) | **5** | **13s** | **2.1s** | $0.005 |
| mistral-small-3.2 | FAIL | n/a | n/a | n/a | n/a | n/a | n/a | $0.001 |



## Key Findings



**Speed winner: Gemini 3.1 Flash Lite at 13 seconds total.** Its adherence calls average 2.1 seconds, about 3x faster than gpt-4o-mini (6.7s) and the fastest of any model that completed. gpt-5.4-mini is second at 26s total.



**Attribution winner: gpt-4.1-mini at 71% relevant.** It produced only 5 cross-attributed records vs gpt-4o-mini's 8. It also produced Band 4 (which we think is more accurate for this essay). But it's the slowest successful model at 54 seconds.
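As a quick sanity check on the attribution figures, here is a minimal sketch that recomputes the percentages from the relevant/cited counts in the table. It assumes the Cross-Attr column is simply cited minus relevant, which matches all four completed rows but is an inference from the numbers, not something the pipeline is confirmed to compute this way. The `attribution` dict and `attribution_summary` helper are illustrative names only.

```python
# (relevant, cited) counts copied from the Attribution column for completed runs.
attribution = {
    "gpt-4o-mini": (10, 18),
    "gpt-4.1-mini": (12, 17),
    "gpt-5.4-mini": (13, 19),
    "gemini-3.1-flash-lite": (11, 16),
}


def attribution_summary(relevant, cited):
    """Return the relevant percentage and the implied cross-attributed count.

    Assumption: cross-attributed = cited - relevant (consistent with the table).
    """
    return {
        "relevant_pct": round(100 * relevant / cited),
        "cross_attributed": cited - relevant,
    }


for model, (relevant, cited) in attribution.items():
    print(model, attribution_summary(relevant, cited))
```

Under that assumption the recomputed values line up with the table: 56%/8, 71%/5, 68%/6, and 69%/5.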



**Best value: Gemini 3.1 Flash Lite.** 13 seconds, $0.005/essay, 69% attribution, Band 4. It's about 3x faster than gpt-4o-mini at 2.5x the price ($0.005 vs $0.002 per essay), with better attribution accuracy (69% vs 56%) and what we believe is a more accurate band.
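The speed and price trade-off can be reproduced with a few lines of arithmetic. This is only a sketch over the numbers in the table. The `results` dict and the `compare` helper are hypothetical names introduced for illustration.

```python
# Total time (seconds) and cost per essay (USD) copied from the results table.
results = {
    "gpt-4o-mini": {"total_s": 42, "usd_per_essay": 0.002},
    "gpt-4.1-mini": {"total_s": 54, "usd_per_essay": 0.006},
    "gpt-5.4-mini": {"total_s": 26, "usd_per_essay": 0.015},
    "gemini-3.1-flash-lite": {"total_s": 13, "usd_per_essay": 0.005},
}


def compare(model, baseline="gpt-4o-mini"):
    """Return how much faster and how much more expensive `model` is than `baseline`."""
    m, b = results[model], results[baseline]
    return {
        "speedup": round(b["total_s"] / m["total_s"], 2),
        "cost_ratio": round(m["usd_per_essay"] / b["usd_per_essay"], 2),
    }


for name in results:
    print(name, compare(name))
```

For Gemini 3.1 Flash Lite this gives a speedup of about 3.2x and a cost ratio of 2.5x relative to gpt-4o-mini.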



**gpt-4o-mini is the outlier giving Band 5.** Every other model that completed produced Band 4. This is consistent with our earlier analysis: gpt-4o-mini is the most generous with cross-attribution (8 cross-attributed records), inflating state_action to 1.0 with First Amendment text.



**The cross-attribution pattern is consistent across all models that completed:** substantive_dp gets First Amendment vagueness text cited as facts/analysis, and the shared "procedural or substantive" sentence gets attributed to whichever DP issue is being scored. This is an essay quality issue, not a model issue.



**Two failures:**

- Gemini 3 Flash hit the runtime budget limit (the feedback renderer call alone took 261 seconds). The budget meter caps wall-clock time at 3x the limit. We would need to either raise the limit or investigate the gateway latency.
- Mistral Small 3.2 was not found on the gateway. It may need a different routing ID.
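To illustrate the kind of check that ended the Gemini 3 Flash run, here is a minimal sketch of a wall-clock budget with a hard cap at a multiple of the soft limit. Everything in it is hypothetical: `WallClockBudget`, `BudgetExceeded`, and `hard_cap_multiplier` are made-up names, not the real budget meter's API. The 105-second limit in the usage note is only inferred from the ">315s" total and the 3x cap, so treat it as an assumption.

```python
import time


class BudgetExceeded(RuntimeError):
    """Raised once elapsed wall-clock time passes the hard cap."""


class WallClockBudget:
    """Hypothetical budget meter: a soft limit plus a hard cap at a multiple of it."""

    def __init__(self, limit_s, hard_cap_multiplier=3.0, clock=time.monotonic):
        self.limit_s = limit_s
        self.hard_cap_s = limit_s * hard_cap_multiplier
        self._clock = clock  # injectable so the meter can be tested without sleeping
        self._start = clock()

    def elapsed(self):
        return self._clock() - self._start

    def check(self):
        """Return True while within the soft limit and False once over it.

        Raises BudgetExceeded once the hard cap has been passed.
        """
        elapsed = self.elapsed()
        if elapsed > self.hard_cap_s:
            raise BudgetExceeded(
                f"{elapsed:.0f}s elapsed, hard cap is {self.hard_cap_s:.0f}s"
            )
        return elapsed <= self.limit_s
```

With an assumed `WallClockBudget(105)`, a single 261-second call would already be past the soft limit but under the 315-second hard cap, so only the calls that follow it would trip `BudgetExceeded`. That is one plausible reading of the failure, not a confirmed trace.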



## Component Scores Comparison



| Model | Spotting | Rule | Facts | Analysis | Conclusion |
|-------|----------|------|-------|----------|------------|
| gpt-4o-mini | 0.925 | 0.700 | 0.700 | 0.700 | 0.925 |
| gpt-4.1-mini | 0.775 | 0.550 | 0.550 | 0.550 | 0.775 |
| gpt-5.4-mini | 0.775 | 0.550 | 0.550 | 0.550 | 0.775 |
| gemini-3.1-flash-lite | 0.800 | 0.500 | 0.575 | 0.575 | 0.800 |



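Gemini 3.1 Flash Lite produces the most differentiated component scores: its facts and analysis scores (0.575) diverge from its rule score (0.500), which suggests it distinguishes evidence quality per dimension more precisely than the OpenAI models. Each OpenAI model assigns identical rule, facts, and analysis scores (0.700 for gpt-4o-mini, 0.550 for gpt-4.1-mini and gpt-5.4-mini).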
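One simple way to quantify "differentiated" is the spread between the highest and lowest of the rule, facts, and analysis scores. The sketch below uses that measure purely as an illustration. It is an assumed metric, not one the scoring pipeline is known to report, and `component_scores` and `spread` are hypothetical names.

```python
# Rule / facts / analysis scores copied from the component table.
component_scores = {
    "gpt-4o-mini": {"rule": 0.700, "facts": 0.700, "analysis": 0.700},
    "gpt-4.1-mini": {"rule": 0.550, "facts": 0.550, "analysis": 0.550},
    "gpt-5.4-mini": {"rule": 0.550, "facts": 0.550, "analysis": 0.550},
    "gemini-3.1-flash-lite": {"rule": 0.500, "facts": 0.575, "analysis": 0.575},
}


def spread(model):
    """Max minus min across rule, facts, and analysis (0.0 means no differentiation)."""
    values = component_scores[model].values()
    return round(max(values) - min(values), 3)


for model in component_scores:
    print(f"{model}: spread = {spread(model)}")
```

By this measure the three OpenAI models have a spread of 0.0, while Gemini 3.1 Flash Lite has a spread of 0.075.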



**My recommendation: Gemini 3.1 Flash Lite is the strongest candidate to replace gpt-4o-mini.** It's faster (13s vs 42s), produces more accurate attribution (69% vs 56%), gives what we believe is the more honest Band 4, and still costs only $0.005/essay (vs $0.002 for gpt-4o-mini). The confidence calibration fix we discussed would further improve scoring accuracy regardless of which model we choose.

