Here's my analysis of the grading run.



## Grading Summary



| Metric | Value |
|--------|-------|
| Band | 5 / 6 ("Exceeds expectations") |
| Raw Score | 0.79 |
| Confidence | 0.8325 |
| Issues Found | 4/4 recognized |
| LLM Calls | 7 |
| Duration | 38.8 seconds |
| Cost | $0.065 |



## Issue-by-Issue Breakdown



| Issue | Strength | Score | Missed Items |
|-------|----------|-------|--------------|
| State Action | Strong | 1.0 | None |
| First Amendment | Strong | 1.0 | None |
| Procedural Due Process | Satisfactory | 0.6 | Rule (0.6 conf), Analysis (0.5 conf) |
| Substantive Due Process | Strong | 1.0 | None |



## Functional Analysis



**What worked well:**

- All 4 issues were correctly identified and recognized
- The checklist scoring correctly flagged that procedural due process was weaker (the student mentioned it but didn't explicitly state the Mathews v. Eldridge balancing test or other procedural due process rules)
- The per-issue feedback narratives are specific and actionable
- Band 5 seems reasonable for a 300-word essay that hits all issues but lacks depth on one



**Problems I see:**



1. **Evidence verification is broken.** 10 of 18 evidence records were `rejected`, but many of those quotes are clearly present in the submission. For example, `"The policy is facially content-based because it allows punishment of speech deemed 'harmful, offensive, or disruptive'"` is a near-exact match to the student's text but was rejected. The normalized substring matcher is too strict: it's failing on minor punctuation differences (smart quotes vs straight quotes, trailing punctuation). This means evidence verification is not meaningfully contributing to scoring right now.



2. **Evidence is being cross-attributed.** The `i_state_action` issue has evidence records citing quotes about "unconstitutional content-based restriction" and "procedural and substantive due process claim". Those quotes are from the First Amendment and Due Process sections, not the state action section. The LLM is pulling quotes from the wrong part of the essay for each issue.



3. **Component scores are suspiciously uniform.** All 5 dimensions (spotting, rule, facts, analysis, conclusion) show 0.79. That's the aggregate `raw_score` repeated across all components, not individual dimension scores. The aggregator is likely averaging the issue scores and copying that single value into every component instead of computing a score per dimension.



4. **Model manifest ID is wrong in the DB.** It recorded `openai-gpt-5-mini-2026-02-14`, the old fictional manifest ID. The bundle's `compatible_model_manifests` still references this. We fixed the `model_manifests` table row to use `gpt-4o-mini` but didn't update the `model_manifest_id` field in that row or the bundle's allowlist. It still worked because the allowlist check passes on the old ID, and `resolveModel` returns the correct routing string from the manifest's `checklist_scorer.name` field.



5. **Calibration is pass-through** (as expected for launch). `calibrated_score = raw_score = 0.79`. This is fine for now.
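
For problem 1, here is a rough Python sketch of a more forgiving quote matcher. It is not the project's actual verifier: the function names, the punctuation set, and the sample submission text are all assumptions made for illustration. The idea is to fold typographic quotes and dashes to ASCII, collapse whitespace, and strip edge punctuation from the quote before the substring check.

```python
import re
import unicodedata

# Fold every quote style to a plain apostrophe and dashes to a hyphen,
# so smart vs. straight (and single vs. double) quotes compare equal.
_FOLD = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": "'", "\u201d": "'", '"': "'",
    "\u2013": "-", "\u2014": "-",
})
_EDGE_PUNCT = " .,;:!?'"  # assumed set of ignorable leading/trailing characters


def normalize(text: str) -> str:
    """Lowercase, fold quotes/dashes, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text).translate(_FOLD)
    return re.sub(r"\s+", " ", text).strip().lower()


def quote_in_submission(quote: str, submission: str) -> bool:
    """True if the quote appears in the submission after normalization."""
    needle = normalize(quote).strip(_EDGE_PUNCT)
    return bool(needle) and needle in normalize(submission)


# Hypothetical submission text: smart double quotes, period inside the quotes.
submission = (
    "The policy is facially content-based because it allows punishment "
    "of speech deemed \u201charmful, offensive, or disruptive.\u201d"
)
quote = (
    "The policy is facially content-based because it allows punishment "
    "of speech deemed 'harmful, offensive, or disruptive'"
)
print(quote_in_submission(quote, submission))  # True
```

Folding single and double quotes together is deliberately lenient. If that turns out to accept too much, a fuzzy-match threshold would be the next thing to try.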

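For problem 3, this is roughly what I'd expect the aggregator to do: average each dimension across issues, then derive the overall score from those. It's a sketch only. The data shape, the function names, and the unweighted mean are assumptions, and the numbers are made up rather than taken from this run.

```python
from statistics import mean

DIMENSIONS = ("spotting", "rule", "facts", "analysis", "conclusion")


def component_scores(per_issue: dict) -> dict:
    """Average each dimension across issues instead of copying one value."""
    return {
        dim: mean(scores[dim] for scores in per_issue.values())
        for dim in DIMENSIONS
    }


def overall_score(components: dict) -> float:
    """Unweighted mean of the dimension scores (weighting is an assumption)."""
    return mean(components.values())


# Illustrative values only; not the scores from the grading run.
per_issue = {
    "issue_a": {"spotting": 1.0, "rule": 1.0, "facts": 1.0,
                "analysis": 1.0, "conclusion": 1.0},
    "issue_b": {"spotting": 1.0, "rule": 0.5, "facts": 1.0,
                "analysis": 0.5, "conclusion": 1.0},
}
components = component_scores(per_issue)
print(components)
print(overall_score(components))
```

With input like this the dimensions come out different from each other, which is the signal that's missing from the current output.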

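For problem 4, a toy Python model of why the run still succeeded. The dict shapes and the `is_allowed` helper are hypothetical, and `resolve_model` is only a stand-in for the real `resolveModel`. The field names and IDs are the ones mentioned above.

```python
def is_allowed(bundle: dict, manifest: dict) -> bool:
    """Allowlist check: compares manifest IDs only."""
    return manifest["model_manifest_id"] in bundle["compatible_model_manifests"]


def resolve_model(manifest: dict) -> str:
    """Routing string comes from the scorer name, not from the manifest ID."""
    return manifest["checklist_scorer"]["name"]


# Stale ID on both sides, but the scorer name was already corrected.
manifest = {
    "model_manifest_id": "openai-gpt-5-mini-2026-02-14",
    "checklist_scorer": {"name": "gpt-4o-mini"},
}
bundle = {"compatible_model_manifests": ["openai-gpt-5-mini-2026-02-14"]}

print(is_allowed(bundle, manifest))  # True
print(resolve_model(manifest))       # gpt-4o-mini
```

Because the two lookups read different fields, the stale ID never surfaces as a failure. Updating the row's ID and the bundle's allowlist together keeps them consistent.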

I'll save the **split-screen results UI** as a separate task. Want me to create a Linear issue for the evidence verification bug and the results page redesign, or fix the evidence verifier now?