# Paddock Drill Matching: Diagnosis, Evidence, and Fix Plan

## 1. What This Document Is

This is a complete technical brief for the drill matching system in SHEP, a legal education platform. It contains:

1. How drills work (architecture)
2. What's broken (diagnosis with measured evidence)
3. What we tested (simulation results)
4. What we plan to do (6-step fix, benchmark-gated)
5. What three internal VP reviewers flagged (incorporated)
6. What two rounds of external architectural review flagged (incorporated)

This document is self-contained. No prior context is needed.

---

## 2. How Drills Work

### The Student Experience

A drill tests a student's ability to identify and apply the elements of a legal rule. The student sees:

- A **rule** (e.g., "A battery occurs when the defendant acts intending to cause harmful contact...")
- A **fact pattern** (a short narrative with named parties and specific events)
- A **Building Blocks panel** on the right -- circles grouped by legal element (Intent, Harmful Contact, Causation, etc.), each circle representing one key fact the student should identify

The student types free-form legal analysis. Circles fill in live as the system detects the student demonstrating each concept.

### The Three-Layer Matching Architecture

Matching runs in three layers. A circle lights if ANY layer says it matched (union semantics). Once lit, a circle stays lit ("once lit stays lit").

**Layer 1 -- Stem Matching (per keystroke, 200ms debounce)**

The student's response is Porter-stemmed. Each keyFact stores `match.groups`: an array of synonym arrays. A group is "hit" if any stem in the group matches any response token (with edit-distance-1 typo tolerance for stems >= 4 chars).

Threshold logic in `checkMatchGroups`:
- 2 groups -> both must hit (`hitCount === 2`) -- strictest
- 3+ groups -> `hitCount >= ceil(n/2)` -- with a proximity bonus if short by exactly 1
- 1 group -> single hit (legacy backward-compat)
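
The threshold rules above can be sketched as follows. This is a minimal illustration, not the production `checkMatchGroups`: the helper names (`withinOneEdit`, `isGroupHitSketch`, `requiredHits`, `checkMatchGroupsSketch`) are hypothetical, inputs are assumed to be stemmed already, and the proximity bonus is left out because its exact rule is not described here.

```typescript
// Hypothetical sketch of the Layer 1 threshold rules; not the production code.
// All stems and tokens are assumed to be Porter-stemmed already.

// True when a and b are equal or differ by one insertion, deletion, or substitution.
function withinOneEdit(a: string, b: string): boolean {
  if (a === b) return true;
  if (Math.abs(a.length - b.length) > 1) return false;
  const shorter = a.length <= b.length ? a : b;
  const longer = a.length <= b.length ? b : a;
  let i = 0;
  let j = 0;
  let edits = 0;
  while (i < shorter.length && j < longer.length) {
    if (shorter[i] === longer[j]) {
      i++;
      j++;
      continue;
    }
    edits++;
    if (edits > 1) return false;
    if (shorter.length === longer.length) i++; // substitution consumes both sides
    j++; // substitution or an extra character in the longer string
  }
  // A leftover trailing character in the longer string counts as one more edit.
  return edits + (longer.length - j) <= 1;
}

// A group is hit if any of its stems matches any response token.
// Typo tolerance applies only to stems of 4 or more characters.
function isGroupHitSketch(group: string[], tokens: string[]): boolean {
  return group.some(stem =>
    tokens.some(tok => (stem.length >= 4 ? withinOneEdit(stem, tok) : stem === tok))
  );
}

// Minimum number of hit groups for a keyFact with n groups.
function requiredHits(n: number): number {
  if (n <= 1) return 1;    // legacy single-group keyFacts
  if (n === 2) return 2;   // strictest: both groups must hit
  return Math.ceil(n / 2); // 3+ groups (proximity bonus not modeled)
}

function checkMatchGroupsSketch(groups: string[][], tokens: string[]): boolean {
  const hitCount = groups.filter(g => isGroupHitSketch(g, tokens)).length;
  return groups.length > 0 && hitCount >= requiredHits(groups.length);
}
```

Under this sketch, `requiredHits` yields 2 for both 3 and 4 groups and 4 for 7 groups, which is the arithmetic the diagnosis in Section 3 relies on.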

**Layer 2 -- Embedding Match (500ms debounce, server round-trip)**

Cosine similarity between a vector embedding of the student's rolling response and pre-computed embeddings of each keyFact display text. Lights if similarity >= 0.50. Catches paraphrase that stems miss.
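
As a rough illustration of the Layer 2 check, cosine similarity against the 0.50 threshold could look like the sketch below. The function and constant names are hypothetical, and how the embeddings are produced (model, server endpoint) is outside this sketch.

```typescript
// Hypothetical sketch of the Layer 2 comparison; names are illustrative.
const EMBEDDING_MATCH_THRESHOLD = 0.5; // cosine threshold stated above

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("Embeddings must be non-empty and of equal dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // treat zero vectors as no match
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// True when the response embedding is close enough to the keyFact embedding.
function embeddingLights(response: number[], keyFact: number[]): boolean {
  return cosineSimilarity(response, keyFact) >= EMBEDDING_MATCH_THRESHOLD;
}
```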

**Layer 3 -- LLM Upgrade (fires after 10 seconds of inactivity)**

When the student pauses with dark circles remaining, an LLM evaluates whether the response semantically covers those facts. Most expensive layer, fires ~1-3 times per drill session.
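
Union semantics and "once lit stays lit" across the three layers amount to a single state update. The sketch below is an assumption about shape only: `updateLitCircles` and the `LayerName` type are hypothetical names, and circles are identified by an arbitrary string id.

```typescript
// Hypothetical sketch: merges per-layer results with union semantics and keeps
// every previously lit circle lit. Names are illustrative.
type LayerName = "stem" | "embedding" | "llm";

function updateLitCircles(
  alreadyLit: Set<string>,
  layerResults: Partial<Record<LayerName, string[]>>
): Set<string> {
  const next = new Set<string>();
  alreadyLit.forEach(id => next.add(id)); // once lit stays lit
  (Object.keys(layerResults) as LayerName[]).forEach(layer => {
    (layerResults[layer] ?? []).forEach(id => next.add(id)); // union across layers
  });
  return next;
}
```

Because nothing is ever removed from the set, a false positive from any single layer is permanent, which is why the later sections treat Layer 1 bleed as costly.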

### How Content Is Structured

Each drill JSON file contains:

```typescript
interface DrillKeyFact {
  display: string;               // Circle label (e.g., "Floyd reached into his jacket pocket")
  match: { groups: string[][] }; // Synonym groups for stem matching
  embedding?: number[];          // Pre-computed sentence embedding (384-dim)
}

interface DrillElement {
  name: string;             // Element name (e.g., "Castle Doctrine")
  definition: string;       // Concise definition
  keyFacts: DrillKeyFact[]; // 1-3 key facts per element
}

interface DrillContent {
  rule: string;             // Legal rule being tested
  factPattern: string;      // Fact scenario with named parties
  parties: string[];        // Named parties (e.g., ["Nate Osei", "Floyd Carver"])
  elements: DrillElement[]; // 2-6 legal elements
}
```

### How Content Is Generated

Drills come from two sources:
1. **Pre-authored content bank** -- 56 drill JSON files across 6 legal domains (Torts, Criminal Law, etc.)
2. **Dynamic generation** -- when a student uploads a syllabus, `run-content-generator.ts` calls an LLM to generate drill content, validated by `validateDrillContent()` in `content-quality-gate.ts`

Both paths produce the same `DrillContent` shape. Both are affected by the bug.

---

## 3. What's Broken

### The Intended Match Group Structure

The exemplar content (used as a few-shot example for the LLM) shows the correct pattern:

```json
{
  "display": "grabbed binder from hands",
  "match": {
    "groups": [
      ["grab", "seiz", "snatch", "pull"],
      ["binder", "folder", "document"]
    ]
  }
}
```

- **2-3 groups**, each representing one distinct semantic concept
- **3-5 synonyms per group** -- alternative words a student might use
- With 2 groups, `checkMatchGroups` requires both to hit -- the student must demonstrate the action (grab/seize) AND the object (binder/folder)

### What Was Actually Generated

80% of keyFacts (288/361) across all 56 drills have this pattern:

```json
{
  "display": "Floyd reached into his jacket pocket",
  "match": {
    "groups": [
      ["floyd", "floyds"],
      ["reach", "reached"],
      ["jacket", "jackets"],
      ["pocket", "pockets"]
    ]
  }
}
```

- **4-13 groups**, each containing one word and its inflection
- Each "group" is a single word, not a synonym set
- For 4 groups: `ceil(4/2) = 2` -- student only needs ANY 2 of 4 words
- For 7 groups: `ceil(7/2) = 4` -- student only needs ANY 4 of 7 words
- Words like "floyd", "nate", "claim", "action", "court" appear in groups across MULTIPLE elements

**Critical note on 4-group keyFacts:** 4 groups is the most dangerous count under `ceil(n/2)`. `ceil(4/2) = 2` means any two shallow hits light the circle. A keyFact with 4 inflection-only groups (e.g., `["reach","reached"], ["jacket","jackets"], ["pocket","pockets"], ["floyd","floyds"]`) can be satisfied by any two common words appearing anywhere in the student's response. This makes 4-group inflectional keyFacts the single most permissive configuration under the unchanged runtime.
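
To make the arithmetic concrete, the sketch below counts group hits for the keyFact shown above against a sentence that never mentions reaching or a pocket. It is illustrative only: `countGroupHits` is a hypothetical helper that does exact matching on hand-written tokens standing in for `stemTokenize` output, and the input sentence is invented for the example.

```typescript
// Illustrative only: exact-match hit counting on hand-written tokens.
// The real matcher also applies stemming and typo tolerance.
const jacketPocketGroups: string[][] = [
  ["floyd", "floyds"],
  ["reach", "reached"],
  ["jacket", "jackets"],
  ["pocket", "pockets"],
];

function countGroupHits(groups: string[][], tokens: string[]): number {
  return groups.filter(group => group.some(stem => tokens.indexOf(stem) !== -1)).length;
}

// Hypothetical off-target sentence: "Floyd had a jacket".
const offTargetTokens = ["floyd", "had", "a", "jacket"];

const hitsFound = countGroupHits(jacketPocketGroups, offTargetTokens);
const hitsNeeded = Math.ceil(jacketPocketGroups.length / 2);
const circleLights = hitsFound >= hitsNeeded;
```

Here two of the four groups hit ("floyd" and "jacket"), the threshold is `ceil(4/2) = 2`, and the circle lights even though the student said nothing about the action the keyFact describes.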

### Why This Causes Cross-Element Bleed

In the Self-Defense drill, "floyd" and "nate" appear as groups in nearly every element:

| Element | KeyFact | Contains "floyd"? | Contains "nate"? |
|---|---|---|---|
| Reasonable Belief | "Floyd reached into his jacket pocket" | Yes | No |
| Imminence | "Floyd had no weapon..." | Yes | No |
| Imminence | "Nate shot Floyd in the shoulder..." | Yes | Yes |
| Proportional Force | "Floyd shoved Nate against the wall..." | Yes | Yes |
| Proportional Force | "Nate believed Floyd was reaching..." | Yes | Yes |
| Duty to Retreat | "Floyd came onto Nate's porch..." | Yes | Yes |
| Castle Doctrine | "Floyd is significantly larger than Nate..." | Yes | Yes |
| Initial Aggressor | "The confrontation occurred on Nate's own porch" | No | Yes |

When a student writes a single sentence mentioning both "Floyd" and "Nate", they hit at least one group in all 8 keyFacts in the table, and two groups in the 5 keyFacts that contain both names. Under `ceil(n/2)`, 2 group hits are enough to light any keyFact with 3 or 4 groups.

---

## 4. Measured Evidence

### Simulation Methodology

We wrote a simulation script (`drill-match-simulation.ts`) that uses the **exact production functions** (`stemTokenize` + `checkMatchGroups` from `matching.ts`) against **actual drill content** from the bank. The simulation feeds realistic student sentences that target ONE specific element at a time and reports which circles light across ALL elements.

### Targeted Element Tests (20 tests across 8 drills, 6 domains)

| Drill | Student targets... | On-Target circles | Off-Target circles | Bleed Rate |
|---|---|---|---|---|
| Self-Defense | Castle Doctrine | 0 | 5 | 63% |
| Personal Jurisdiction | Purposeful Availment | 2 | 3 | 75% |
| Claim Preclusion | Same Transaction | 1 | 3 | 60% |
| Murder | Premeditation | 1 | 3 | 60% |
| Equal Protection | Discriminatory Purpose | 1 | 3 | 60% |
| Consideration | Pre-Existing Duty | 1 | 3 | 60% |
| Battery | Intent | 1 | 2 | 50% |
| Battery | Harmful Contact | 1 | 2 | 50% |
| **TOTAL (20 tests)** | | **18** | **35** | |

**Result: ~2 wrong circles light for every 1 correct circle.** The table lists 8 of the 20 tests; the TOTAL row covers all 20 (35 off-target vs. 18 on-target).

### Generic Filler Test (entire bank)

Four sentences of generic legal prose ("The court must consider the applicable legal standards...") were tested against all 56 drills: 3/361 keyFacts lit (under 1%). Generic filler is not the primary problem -- cross-element vocabulary collision is.

### Full Bank Dry-Run (Gemini Flash repair + blind cross-evaluation)

We built a repair pipeline with three stages:
1. **Call A (Gemini Flash):** Given the drill (without match groups), generate realistic student sentences targeting each element. Blind to match groups.
2. **Call B (Gemini Flash, separate call):** Given the drill, fix element assignment + generate proper 2-3 synonym groups. Blind to test sentences.
3. **Cross-evaluate:** Run Call A's sentences through `checkMatchGroups` with both current and repaired groups.

Results across all 56 drills:

| Category | Drills | Examples |
|---|---|---|
| Perfect (0 off-target after repair) | 8 | claim-preclusion, erie-doctrine, equal-protection, murder |
| Improved | 32 | self-defense (76% reduction), concurrent-ownership (88%), parol-evidence (92%) |
| Flat | 2 | covenants, false-imprisonment |
| **Regressed** | **12** | commerce-clause, takings, pleading-standards, state-action |
| Failed validation | 2 | first-amendment-religion, zoning |

**Aggregate: 504 off-target -> 307 (39% reduction)**

The 12 regressions correlate with drills having 5-6 elements with heavily overlapping legal vocabulary (e.g., Commerce Clause has 5 elements all involving "interstate commerce"; Takings has 6 Penn Central factors all sharing "economic", "government", "property").

---

## 5. Root Causes

| Root Cause | Where | Impact |
|---|---|---|
| LLM generated word-by-word tokenization instead of semantic synonym groups | Content generation (original prompt) | 80% of keyFacts broken |
| Quality gate has no max-groups-per-keyFact check | `content-quality-gate.ts` | Broken content passes validation |
| Quality gate has no party-name-in-groups check | `content-quality-gate.ts` | Party names collide across elements |
| Prompt lacks explicit BAD example of the broken pattern | `drill-prompt.ts` | LLM repeats the pattern on new drills |
| `ceil(n/2)` threshold assumes 2-3 groups of synonyms, not 4-13 single-word groups | `matching.ts` | Threshold is too loose for broken content |
| 4-group inflectional keyFacts are the most dangerous configuration | `matching.ts` + content | `ceil(4/2)=2` means any two shallow hits light the circle |
| KeyFacts sometimes assigned to wrong element | Content (e.g., size disparity under Castle Doctrine instead of Proportional Force) | Even correct matching tests the wrong concept |

---

## 6. Design Decisions

### D1: Fix content, not the algorithm
Do NOT change `ceil(n/2)` in `checkMatchGroups`. The algorithm is correct for well-structured content (2-3 groups of genuine synonyms). The 39% improvement came from content repair alone. However, because D1 stands, the content contract and degenerate guard must absorb the risk that `ceil(n/2)` creates for malformed content -- particularly 4-group inflectional keyFacts where `ceil(4/2)=2`.

### D2: Don't touch `matching.ts` signature
`checkMatchGroups` is a pure function on the per-keystroke hot path. All filtering happens BEFORE calling it, in `DrillView.svelte` at drill load time. The matching function stays clean.

### D3: Party names are discriminators, not contaminants
Party names (e.g., "Brightfield", "Kimura") are the most element-specific vocabulary available. Instead of filtering them out at runtime, enforce ownership in content: each party name may appear in **at most one element's** match groups, and a party name alone must never be sufficient to satisfy a keyFact. This preserves their discriminative power while preventing cross-element collision.

**Note:** An earlier version of this plan proposed global party-stem filtering at runtime (stripping party tokens before calling `checkMatchGroups`). External review identified this as contradicting D3. If party names are useful discriminators, the fix belongs in content contracts, not runtime blanking. This has been corrected.

### D4: Benchmark is the gate, not the repair script
The real failure mode is sibling-element bleed under targeted input, not generic filler matching (which only lit 3/361 keyFacts). The release gate must test what actually breaks: per-element targeted inputs with sibling-element negatives. A per-drill confusion matrix replaces the weaker `factPattern + generic paragraph` cross-evaluation.

**Benchmark determinism:** Test bundles are generated once per drill, versioned alongside the drill JSON, and replayed deterministically. They are NOT regenerated per CI run. Nondeterministic gates erode trust.

### D5: Ship only improvements, never regressions
The batch repair showed 12 regressions out of 56. Those 12 stay on current content under `guarded` or `semantic_review` routing. Only drills that pass the confusion-matrix benchmark ship as `standard`.

### D6: Route drills by risk, don't leave broken content live
Instead of binary ship-or-skip, classify drills into three capability tiers after benchmark evaluation:
- **`standard`** -- passes benchmark, uses normal live matching stack
- **`guarded`** -- has degenerate keyFacts; Layer 1 disabled for those keyFacts, relies on embedding + LLM layers
- **`semantic_review`** -- high-overlap doctrines where stem matching is structurally insufficient; circles are visually and behaviorally provisional (not permanent fills) until pause or submit. This is necessary because "once lit stays lit" makes false positives especially costly -- a provisional state avoids that permanence.

**Routing thresholds must be numeric and explicit** before implementation (defined in Step 2).

### D7: No human review in the loop
All repair, element reassignment detection, and rewrite operations are AI-automated. The benchmark confusion matrix is the quality backstop. AI-repaired content is flagged (`ai_rewritten: true`) for auditability but ships automatically if it passes the benchmark gate.

**Circular validation caveat (Lane B and PR F):** Groups-only repair (Lane A) can auto-ship on benchmark pass. But element reassignment and display-text rewrites carry higher risk of circular validation -- AI-generated tests derived from mistaken content can pass even when the content is doctrinally wrong. Lane B and PR F outputs must be benchmarked from `rule + element.definition + factPattern` (blind to current keyFacts) and default to `semantic_review` routing until they prove out over multiple benchmark cycles.

### D8: Observability before product changes AND before routing
Three things are required: layer attribution (what lit, when, from which layer), rescue-rate telemetry (how often the LLM stuck-check finds matches that stems missed), and false-positive sampling. They must land **before or alongside** routing (not after), because routing thresholds cannot be set confidently without telemetry to validate them.

### D9: Dynamic drills default to `guarded` until benchmarked
Drills generated dynamically (on syllabus upload) are not complete until their confusion-matrix test bundle is created and routing is assigned. Until that happens, they default to `guarded` tier. This ensures the benchmark gate applies to dynamically generated content, not just the pre-authored bank.

---

## 7. The Fix Plan

### Step 1: Authoring Contract (PR A)

Stop creating new bad drills. This is the highest-leverage change.

**Quality gate hardening (`content-quality-gate.ts`):**
- Max 3 groups per keyFact for new/repaired content (`MAX_GROUPS_PER_KEYFACT = 3` shared constant). This is tighter than the previous value of 4, because `ceil(4/2)=2` makes 4-group keyFacts the most dangerous configuration under the unchanged runtime. With 3 groups the threshold is still `ceil(3/2)=2`, but it now means 2 of 3 concepts rather than 2 of 4; with 2 groups, both concepts must be present.
- Min 2 groups per keyFact (`MIN_GROUPS_PER_KEYFACT = 2`)
- Inflection-only group detection using **stem-canonical comparison**: stem each entry in the group, then check `new Set(stemmed).size < 2`. Raw string comparison (`new Set(group).size`) misses pairs like `["reach","reached"]` and `["equal","equals"]` which stem to the same token but have distinct raw strings.
- Party name ownership: each party name stem (from `content.parties`) may appear in at most ONE element's groups
- Party name sufficiency: a keyFact whose groups contain only party names and common legal stems must be flagged
- Update Zod schema to match new bounds

**Shared constants** (new file or added to quality gate):
```typescript
export const MAX_GROUPS_PER_KEYFACT = 3;
export const MIN_GROUPS_PER_KEYFACT = 2;
```
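
A minimal sketch of how the group-count bounds, the stem-canonical inflection check, and party-name ownership might fit together is shown below. Everything here is an assumption for illustration: `validateGroups`, the `Gate*` types, and the injected `Stemmer` are hypothetical, the constants simply mirror the shared ones above, and the party-name sufficiency check is not modeled.

```typescript
// Hypothetical sketch of the hardened checks. `stem` is injected so the sketch
// stays self-contained; production would use the same Porter stemmer as matching.
const MAX_GROUPS = 3; // mirrors MAX_GROUPS_PER_KEYFACT
const MIN_GROUPS = 2; // mirrors MIN_GROUPS_PER_KEYFACT

interface GateKeyFact { display: string; match: { groups: string[][] } }
interface GateElement { name: string; keyFacts: GateKeyFact[] }
interface GateDrill { parties: string[]; elements: GateElement[] }
type Stemmer = (word: string) => string;

// Inflection-only: every entry stems to the same root.
function isInflectionOnly(group: string[], stem: Stemmer): boolean {
  return new Set(group.map(w => stem(w.toLowerCase()))).size < 2;
}

function validateGroups(drill: GateDrill, stem: Stemmer): string[] {
  const errors: string[] = [];
  // Party stems come from every word of every party name ("Nate Osei" -> nate, osei).
  const partyStems = new Set<string>();
  drill.parties.forEach(p =>
    p.split(/\s+/).filter(Boolean).forEach(w => partyStems.add(stem(w.toLowerCase())))
  );
  const ownersByStem = new Map<string, Set<string>>();

  for (const el of drill.elements) {
    for (const kf of el.keyFacts) {
      const n = kf.match.groups.length;
      if (n < MIN_GROUPS || n > MAX_GROUPS) {
        errors.push(`"${kf.display}": ${n} groups (allowed ${MIN_GROUPS}-${MAX_GROUPS})`);
      }
      for (const group of kf.match.groups) {
        if (isInflectionOnly(group, stem)) {
          errors.push(`"${kf.display}": inflection-only group [${group.join(", ")}]`);
        }
        for (const word of group) {
          const s = stem(word.toLowerCase());
          if (!partyStems.has(s)) continue;
          const owners = ownersByStem.get(s) ?? new Set<string>();
          owners.add(el.name); // record which element uses this party stem
          ownersByStem.set(s, owners);
        }
      }
    }
  }
  ownersByStem.forEach((owners, s) => {
    if (owners.size > 1) {
      errors.push(`party stem "${s}" appears in ${owners.size} elements`);
    }
  });
  return errors;
}
```

Returning a list of messages rather than throwing is a design choice of the sketch; it would let `buildRetryFeedback()` report every violation in one retry.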

**Prompt strengthening (`drill-prompt.ts`):**
- Hard Rule: "2-3 groups per keyFact" (using shared constant)
- Party name rule: "may appear in at most one element's groups; must never be the only distinguishing signal"
- Explicit BAD example of word-by-word tokenization
- Self-check for inflection-only groups: "Stem each synonym. If all entries in a group stem to the same root (e.g., 'reach'/'reached' -> 'reach'), it is an inflection group, not a synonym group. Replace it."
- `buildRetryFeedback()`: guidance for new error types

**Prerequisite cleanup:** Remove `isGenericContent()` from `run-content-generator.ts` -- duplicates `validateDrillContent()`.

### Step 2: Benchmark Harness + Degenerate Guard (PR B)

The permanent quality gate for all drill content, current and repaired.

**Per-drill confusion-matrix harness:**
- For each element, generate at least 5 test cases:
  1. **Direct positive** -- sentence targeting this element's concepts
  2. **Paraphrase positive** -- same concepts, different vocabulary
  3. **Sibling-element negative** -- sentence targeting a different element
  4. **Party-collision negative** -- sentence using party names without element-specific concepts
  5. **Generic filler negative** -- common legal prose
- Test cases generated by AI (Gemini Flash), blind to match groups
- Test bundles generated **once per drill**, versioned alongside drill JSON, replayed deterministically
- Run through exact production `stemTokenize` + `checkMatchGroups`
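
One possible shape for the harness is sketched below. The type and function names (`BenchCase`, `runBenchmark`, `LitFn`) are hypothetical, and the matcher is injected as a function; in the real harness it would presumably wrap the production `stemTokenize` + `checkMatchGroups` calls.

```typescript
// Hypothetical sketch of a per-drill confusion matrix. The matcher is injected;
// names and fields are illustrative, not the actual harness.
type CaseKind = "direct" | "paraphrase" | "sibling" | "party_collision" | "filler";

interface BenchCase {
  kind: CaseKind;
  targetElement: string | null; // element the sentence targets; null if none
  sentence: string;
}

interface BenchKeyFact { id: string; element: string }

// Returns the ids of keyFacts whose circles light for a sentence.
type LitFn = (sentence: string) => string[];

interface ConfusionSummary {
  onTargetLit: number;  // diagonal: lit circles in the targeted element
  offTargetLit: number; // off-diagonal: lit circles in any other element
  partyCollisionFalsePositives: number;
  matrix: Record<string, Record<string, number>>; // target -> lit element -> count
}

function runBenchmark(cases: BenchCase[], keyFacts: BenchKeyFact[], lit: LitFn): ConfusionSummary {
  const elementOf = new Map<string, string>();
  keyFacts.forEach(kf => elementOf.set(kf.id, kf.element));
  const summary: ConfusionSummary = {
    onTargetLit: 0, offTargetLit: 0, partyCollisionFalsePositives: 0, matrix: {},
  };
  for (const c of cases) {
    const rowKey = c.targetElement ?? "(none)";
    const row = (summary.matrix[rowKey] = summary.matrix[rowKey] ?? {});
    for (const id of lit(c.sentence)) {
      const element = elementOf.get(id);
      if (element === undefined) continue; // unknown id: ignore
      row[element] = (row[element] ?? 0) + 1;
      if (element === c.targetElement) summary.onTargetLit++;
      else summary.offTargetLit++;
      if (c.kind === "party_collision") summary.partyCollisionFalsePositives++;
    }
  }
  return summary;
}
```

Because the cases are plain data, a bundle of `BenchCase` records can be stored next to the drill JSON and replayed unchanged on every run, which is what keeps the gate deterministic.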

**Release thresholds (per-drill, not per-keyFact):**
- Diagonal recall: must stay flat or improve (on-target circles don't decrease)
- Off-diagonal bleed: must drop (off-target circles decrease)
- Party-collision false positives: zero tolerance
- Numeric cutoffs for tier classification defined here (exact values calibrated against current bank baseline)
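
The three non-numeric release rules could be expressed as a before/after comparison along these lines. This is a sketch under assumptions: `passesReleaseGate` and its field names are hypothetical, and the numeric tier cutoffs are deliberately left out because they are not yet defined.

```typescript
// Hypothetical sketch of the per-drill release comparison. Field names are
// illustrative; numeric tier cutoffs are intentionally not modeled.
interface DrillBenchResult {
  onTargetLit: number;
  offTargetLit: number;
  partyCollisionFalsePositives: number;
}

interface GateDecision { pass: boolean; reasons: string[] }

function passesReleaseGate(current: DrillBenchResult, repaired: DrillBenchResult): GateDecision {
  const reasons: string[] = [];
  if (repaired.onTargetLit < current.onTargetLit) {
    reasons.push("diagonal recall decreased");
  }
  if (repaired.offTargetLit >= current.offTargetLit) {
    reasons.push("off-diagonal bleed did not drop");
  }
  if (repaired.partyCollisionFalsePositives > 0) {
    reasons.push("party-collision false positives present");
  }
  return { pass: reasons.length === 0, reasons };
}
```

The sketch applies "must drop" literally, so a drill that already has zero off-target circles cannot pass a repair. Whether that case should be special-cased is a decision the rules above leave open.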

**Degenerate guard ships in this PR too** (immediate student relief):
- `isDegenerateKeyFact`: catches keyFacts with 3 or more groups where every group is inflection-only. Detection uses the stem-canonical check, not raw group count alone, so exactly-3-group keyFacts are covered as well as those with >3 groups.
- Skip `checkMatchGroups` for flagged keyFacts -- let embedding + LLM layers handle
- Sentry breadcrumb when guard triggers

```typescript
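// Assumes `porterStem` is the same Porter stemmer that `stemTokenize` uses,
// and that `drill` is the loaded DrillContent.

// A group is inflection-only when every entry stems to the same root.
function isInflectionOnlyGroup(group: string[]): boolean {
  const stemmed = new Set(group.map(entry => porterStem(entry)));
  return stemmed.size < 2;
}

// Computed once when the drill loads, keyed by keyFact display text.
const degenerateKeyFacts = new Set<string>();
for (const el of drill.elements) {
  for (const kf of el.keyFacts) {
    const groups = kf.match.groups;
    // 3+ groups, all inflection-only: stem matching carries no semantic signal.
    if (groups.length >= 3 && groups.every(isInflectionOnlyGroup)) {
      degenerateKeyFacts.add(kf.display);
    }
  }
}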
```
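
One way to apply the guard without touching `checkMatchGroups` is to filter the keyFact list once at drill load, so flagged keyFacts never reach Layer 1. The sketch below assumes this approach; `selectLayer1KeyFacts` and the `onGuardTriggered` callback (a stand-in for the Sentry breadcrumb) are hypothetical names.

```typescript
// Hypothetical sketch: choose which keyFacts Layer 1 may evaluate, once at drill
// load. `onGuardTriggered` stands in for whatever logs the breadcrumb.
interface GuardedKeyFact { display: string; match: { groups: string[][] } }

function selectLayer1KeyFacts(
  keyFacts: GuardedKeyFact[],
  degenerate: Set<string>,
  onGuardTriggered: (display: string) => void
): GuardedKeyFact[] {
  const eligible: GuardedKeyFact[] = [];
  for (const kf of keyFacts) {
    if (degenerate.has(kf.display)) {
      onGuardTriggered(kf.display); // this circle is left to embedding + LLM layers
      continue;
    }
    eligible.push(kf);
  }
  return eligible;
}
```

Filtering at load time keeps the per-keystroke path unchanged and reports each skipped keyFact once rather than on every keystroke.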

**Dynamic drill integration:** Dynamically generated drills (syllabus upload) default to `guarded` tier until their confusion-matrix test bundle is generated and routing is assigned. Generation is not complete until the bundle exists.

### Step 3: Content Repair (PR C)

Two automated lanes, both benchmarked per-drill.

**Lane A -- Match groups repair (automated, can auto-ship):**
- Gemini Flash: given drill content, generate proper 2-3 synonym groups per keyFact
- Constrained to `match.groups` only -- display text, embeddings, all other fields frozen
- Validated through hardened quality gate (Step 1)
- Benchmarked through confusion matrix (Step 2)
- **Can auto-ship if benchmark passes** -- groups-only repair has low circular validation risk

**Lane B -- Element reassignment detection (automated, ships as `semantic_review` initially):**
- Separate AI call: evaluate whether each keyFact semantically belongs under its assigned element
- Benchmarked from `rule + element.definition + factPattern`, **blind to current keyFacts** -- this prevents circular validation where AI-generated tests derived from mistaken content pass even when the assignment is doctrinally wrong
- If misassignment detected: propose move + generate new groups for new position
- Must pass sibling-negative tests in BOTH old and new element positions
- Flagged with `ai_reassigned: true` for auditability
- **Defaults to `semantic_review` routing** until proven out over multiple benchmark cycles

**Shipping rule:** Lane A drills ship if they pass the benchmark. Lane B drills default to `semantic_review` and earn their way to `standard` through repeated benchmark consistency.

### Step 4: Observability (PR D -- lands before or with routing)

Foundation for all future matching decisions. **Must land before routing thresholds are finalized.**

**Layer attribution:** For every circle that lights, log which layer(s) triggered it (stem, embedding, LLM). Structured telemetry, not console logs.

**Rescue-rate telemetry:** When LLM stuck-check fires AND finds matches that stems and embeddings missed, log a structured event. This is the canary for false negatives -- if stems become too strict, this rate climbs.

**False-positive sampling:** Periodically sample cases where Layer 1 lit a circle but Layer 2 (embedding) disagreed. These are the highest-signal candidates for content quality issues.
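
A possible event shape and two aggregations over it are sketched below. The field names and the definition of the rescue rate (LLM-only events as a share of all circle-lit events) are assumptions made for illustration, not a settled schema.

```typescript
// Hypothetical event shape and aggregations; field names are illustrative.
type MatchLayer = "stem" | "embedding" | "llm";

interface CircleLitEvent {
  drillId: string;
  keyFactId: string;
  layers: MatchLayer[];      // every layer that reported a match
  embeddingAgreed?: boolean; // set when Layer 2 scored this keyFact
  atMs: number;              // milliseconds since drill start
}

// Share of lit circles that only the LLM layer found (stems and embeddings missed).
function rescueRate(events: CircleLitEvent[]): number {
  if (events.length === 0) return 0;
  const rescued = events.filter(e => e.layers.length === 1 && e.layers[0] === "llm");
  return rescued.length / events.length;
}

// Events where Layer 1 lit a circle but Layer 2 disagreed: candidates for review.
function falsePositiveCandidates(events: CircleLitEvent[]): CircleLitEvent[] {
  return events.filter(e => e.layers.indexOf("stem") !== -1 && e.embeddingAgreed === false);
}
```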

### Step 5: Capability Routing (PR E -- after or merged with observability)

Classify every drill based on benchmark results + observability data.

**Three tiers:**
- **`standard`** -- passes benchmark. Normal live matching stack (all 3 layers). Circles use permanent "once lit stays lit" semantics.
- **`guarded`** -- has degenerate keyFacts that couldn't be repaired. Layer 1 disabled for those specific keyFacts; embedding + LLM layers handle them. Rest of drill uses normal stack. Also the default tier for dynamically generated drills awaiting benchmark.
- **`semantic_review`** -- high-overlap doctrines (commerce clause, takings, etc.) where stem matching is structurally insufficient across the entire drill. Circles are **visually and behaviorally provisional** -- they do NOT use "once lit stays lit" permanence. Instead, they show as suggestions that are confirmed at pause or submit. This is necessary because permanent false positives in these drills are the core UX failure.

**Classification is automated:** benchmark confusion-matrix scores determine tier. Numeric thresholds calibrated against bank baseline with observability data.

**Routing thresholds (to be calibrated during Step 2, finalized with Step 4 data):**
- `standard`: off-diagonal bleed rate < X%, diagonal recall >= Y%
- `guarded`: off-diagonal bleed rate < Z% with degenerate guard active
- `semantic_review`: everything else (bleed rate above thresholds, or structural vocabulary overlap)
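
Since X, Y, and Z are not yet defined, a classification sketch has to take them as parameters. The version below is an assumption about structure only: `classifyDrill`, `DrillScores`, and `TierThresholds` are hypothetical names, and any concrete threshold values passed in are placeholders until calibration.

```typescript
// Hypothetical sketch. The cutoffs X, Y, Z are not yet defined, so they are
// parameters here rather than constants.
type DrillTier = "standard" | "guarded" | "semantic_review";

interface TierThresholds {
  standardMaxBleed: number;  // X: bleed rate must be below this
  standardMinRecall: number; // Y: diagonal recall must be at least this
  guardedMaxBleed: number;   // Z: bleed rate limit with the degenerate guard active
}

interface DrillScores {
  benchmarked: boolean;     // false for dynamic drills without a test bundle
  bleedRate: number;        // off-diagonal bleed, 0..1
  diagonalRecall: number;   // 0..1
  guardedBleedRate: number; // bleed with Layer 1 off for degenerate keyFacts
}

function classifyDrill(s: DrillScores, t: TierThresholds): DrillTier {
  if (!s.benchmarked) return "guarded"; // default until a bundle exists
  if (s.bleedRate < t.standardMaxBleed && s.diagonalRecall >= t.standardMinRecall) {
    return "standard";
  }
  if (s.guardedBleedRate < t.guardedMaxBleed) return "guarded";
  return "semantic_review";
}
```

Keeping the thresholds in one object makes it straightforward to re-run classification over the whole bank whenever Step 4 telemetry suggests different cutoffs.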

### Step 6: AI Rewrite Pipeline for Overlap-Heavy Doctrines (PR F)

The 12 regressed drills (and any future drills classified as `semantic_review`) enter a separate pipeline.

**Two strategies, both automated:**
1. **Split:** AI proposes splitting a 6-element drill into two 3-element drills with cleaner vocabulary boundaries. Each sub-drill covers a subset of elements with less cross-element overlap.
2. **Rewrite:** AI rewrites keyFact display text to increase discriminability while preserving legal accuracy (constrained by rule text and fact pattern).

**Validation against circular risk:** Benchmark tests for rewritten drills are generated from `rule + element.definition + factPattern` only, blind to the rewritten keyFacts. This prevents the AI from generating tests that validate its own mistakes.

**All outputs benchmarked** through the confusion matrix before shipping. Flagged with `ai_rewritten: true` for auditability. **Default to `semantic_review` routing** until they prove out.

**The proposition engine remains V2.** These drills get controlled downgrade + AI rewrite now. Proposition-based matching (typed slots, evidence spans, polarity detection) is the right long-term architecture for structurally hard doctrines, but it is roughly 10x the scope of this fix.

---

## 8. Implementation Order

| Step | What | PR | Dependency |
|------|------|-----|------------|
| 1 | Authoring contract: gate + prompt + schema + constants + remove duplicate validator | PR A | None |
| 2 | Benchmark harness + release thresholds + degenerate guard + dynamic drill default | PR B | PR A merged |
| 3 | Content repair: Lane A (groups, can auto-ship) + Lane B (reassignment, defaults to semantic_review) | PR C | PR B merged |
| 4 | Observability: layer attribution + rescue-rate + false-positive sampling | PR D | Can run in parallel with PR C |
| 5 | Capability routing: standard / guarded / semantic_review with numeric thresholds | PR E | PR D merged (needs telemetry to finalize thresholds) |
| 6 | AI rewrite pipeline for overlap-heavy doctrines | PR F | PR B merged (needs benchmark) |

**Note:** Observability (PR D) now lands before routing (PR E), not after. This resolves the dependency: routing thresholds cannot be set confidently without telemetry to validate them.

---

## 9. Success Criteria

| Criterion | How Measured |
|---|---|
| No new drill ships with >3 groups per keyFact | Quality gate blocks at generation time |
| No inflection-only groups pass validation (stem-canonical check) | Quality gate catches `["reach","reached"]` patterns |
| No party name collides across elements in new content | Quality gate catches at generation time |
| Per-drill confusion matrix shows improvement before shipping repairs | Benchmark harness (diagonal recall flat/up, off-diagonal bleed down) |
| Degenerate keyFacts fall back gracefully at runtime | isDegenerateKeyFact guard (stem-canonical) + Sentry breadcrumbs |
| Dynamic drills default to guarded until benchmarked | Generation pipeline produces test bundle + assigns routing |
| Every drill classified into standard/guarded/semantic_review | Automated routing based on benchmark + observability data |
| No regressions shipped | Per-drill benchmark gate; only improvements pass |
| semantic_review circles are provisional, not permanent | No "once lit stays lit" for provisional circles |
| Overlap-heavy doctrines get controlled downgrade, not broken live behavior | semantic_review routing with deferred confirmation |
| Lane B and PR F outputs default to semantic_review | Higher-risk AI changes earn their tier through repeated benchmark cycles |
| Layer attribution enables data-driven product decisions | Structured telemetry for every circle-light event |
| Recall regression is detectable | Rescue-rate telemetry (LLM stuck-check finds what stems missed) |
| Benchmark is deterministic and versioned | Test bundles generated once, replayed on CI, not regenerated |

---

## 10. What This Does NOT Address (Future Work)

- **Proposition-based matching** -- for drills with 5-6 highly overlapping elements, stem matching is inherently insufficient regardless of content quality. A proposition engine with typed slots, evidence spans, and polarity detection is the right long-term architecture. Scope as a V2 milestone.
- **Embedding threshold tuning** -- stricter stems change the distribution of what reaches Layer 2. The cosine threshold (0.50) may need recalibration against repaired content. Observability (Step 4) will provide the data.
- **"Once lit stays lit" policy for standard drills** -- union semantics mean false positives from any layer are permanent in `standard` tier. Considered out of scope for this fix. (`semantic_review` tier addresses this with provisional circles.)
- **Warm circle state** -- postponed until observability data justifies it. If needed, scoped to guarded/semantic_review drills, not global.

---

## 11. Review History

**Internal VP Review (2026-04-20):** VP Product, VP Engineering, VP Design. All recommendations incorporated. Key changes: D1 (fix content not algorithm), D2 (don't touch matching.ts), D3 (party names as discriminators), added warm circle state concept, added observability requirement.

**External Architectural Review, Round 1 (2026-04-21):** Identified six issues in the original plan:
1. **Weak evaluation gate** -- original Phase 4 used `factPattern + generic paragraph` as the cross-evaluation, but generic filler is not the real failure mode. Replaced with per-drill confusion matrix testing sibling-element bleed directly.
2. **D3/Phase 2 contradiction** -- D3 treats party names as discriminators, but Phase 2 proposed filtering them from runtime tokens. Resolved: enforcement moved to content contracts, runtime filtering dropped.
3. **Phase 4 contradiction** -- claimed to fix element assignment while also claiming only `match.groups` change. Resolved: split into Lane A (groups-only) and Lane B (element reassignment detection) with separate, stricter gates.
4. **Binary ship-or-skip insufficient** -- 12 regressed drills left as known-bad with tracking tickets. Replaced with three-tier capability routing (standard/guarded/semantic_review).
5. **Warm state as hedge** -- "warm" circle state was being used to cushion stricter stems rather than being justified by data. Postponed until observability proves need; scoped to affected tiers only.
6. **Center of gravity shift** -- moved from "repair content" to "benchmark and route by risk." The benchmark confusion matrix is the quality backstop for all decisions.

**External Architectural Review, Round 2 (2026-04-21):** Tightened five points:
1. **Degenerate guard too narrow** -- original guard only caught >4 groups; 4-group inflectional keyFacts are the most dangerous under `ceil(4/2)=2`. Resolved: lowered `MAX_GROUPS_PER_KEYFACT` to 3 for new content, broadened guard to use stem-canonical inflection detection.
2. **Inflection check needs canonical stemming** -- `new Set(group).size < 2` on raw strings misses `["reach","reached"]`. Resolved: gate stems each entry before checking set uniqueness.
3. **Lane B and PR F circular validation risk** -- AI tests derived from mistaken content can pass even when content is doctrinally wrong. Resolved: Lane B/PR F benchmarked from `rule + element.definition + factPattern` blind to current keyFacts; default to `semantic_review` until proven out.
4. **Benchmark must be deterministic + integrated into dynamic generation** -- test bundles generated once, versioned, replayed. Dynamic drills default to `guarded` until benchmarked.
5. **Routing needs explicit thresholds + observability first** -- `semantic_review` circles must be provisional (not permanent fills). Observability (PR D) lands before routing (PR E). Numeric cutoffs defined and calibrated with telemetry.

---

## 12. Appendix: Key Files

| File | Role |
|---|---|
| `apps/web-svelte/src/lib/practice/matching.ts` | Production matching functions (stemTokenize, checkMatchGroups, isGroupHit) |
| `apps/web-svelte/src/lib/server/paddock/content-quality-gate.ts` | Content validation (validateDrillContent) |
| `apps/web-svelte/src/lib/server/paddock/drill-prompt.ts` | LLM prompt for drill generation (buildDrillPrompt, buildRetryFeedback) |
| `apps/web-svelte/src/lib/server/paddock/run-content-generator.ts` | Dynamic drill generation pipeline (generateDrillContent) |
| `apps/web-svelte/src/lib/drills/DrillView.svelte` | Client-side drill UI with matching loop |
| `apps/web-svelte/src/lib/drills/BuildingBlocksPanel.svelte` | Circle rendering UI |
| `apps/web-svelte/src/lib/practice/content/paddock/**/*.json` | 56 drill content files |
| `apps/web-svelte/src/lib/practice/content/paddock/exemplars/*.json` | Gold-standard exemplar content |
| `scripts/drill-match-simulation.ts` | Simulation harness (production functions against real content) |
| `scripts/repair-drill-content.ts` | Batch repair pipeline (Gemini Flash + validation + cross-evaluation) |
