# Paddock Drill Matching Fix -- Revised Plan

**Reviewed:** 2026-04-20 (VP Product, VP Engineering, VP Design)
**External review:** 2026-04-21 (two rounds of architectural review, all findings incorporated)
**Status:** Approved -- benchmark-and-route approach adopted, all five tightening points applied

---

## Problem Statement

80% of keyFacts (288/361) in the drill content bank have broken match groups -- word-by-word tokenization (7-13 single-word groups per keyFact) instead of 2-3 semantic synonym groups. Combined with the `ceil(n/2)` runtime threshold, this causes:

- **Cross-element bleed:** Writing about one element lights circles in 3-4 other elements
- **Party name triggering:** Mentioning "Floyd" or "Nate" lights 5+ circles across all elements
- **4-group inflection keyFacts:** The most dangerous configuration -- `ceil(4/2)=2` means any two shallow hits light the circle
- **Generic filler matching:** Common legal terms satisfy thresholds (minor -- 1% of keyFacts)
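
To make the threshold arithmetic concrete, here is a minimal sketch. `sketchCheckMatchGroups` is a hypothetical stand-in written for this example, not the production `checkMatchGroups`: it compares raw tokens and skips stemming. The group contents and the sample sentence are invented for illustration.

```typescript
// Hypothetical stand-in for the production matcher (assumption, not the real code).
// A group counts as "hit" when any of its entries appears among the tokens.
function sketchCheckMatchGroups(groups: string[][], tokens: string[]): boolean {
  const tokenSet = new Set(tokens);
  const hits = groups.filter(group => group.some(entry => tokenSet.has(entry))).length;
  return hits >= Math.ceil(groups.length / 2);
}

// Broken shape: one single-word group per word of the keyFact (invented content).
const wordByWord: string[][] = [["defendant"], ["reasonably"], ["believed"], ["force"]];

// Repaired shape: each group holds synonyms for one concept (invented content).
const semantic: string[][] = [
  ["reasonable", "justified", "warranted"],
  ["belief", "perception", "apprehension"],
];

// A sentence about a different element that happens to share two common words.
const offTarget = ["the", "defendant", "used", "deadly", "force", "at", "home"];

// 2 of 4 groups hit, threshold ceil(4/2) = 2.
const brokenLights = sketchCheckMatchGroups(wordByWord, offTarget);
// 0 of 2 groups hit, threshold ceil(2/2) = 1.
const repairedLights = sketchCheckMatchGroups(semantic, offTarget);
```

Under these assumptions the word-by-word keyFact lights on an off-target sentence, while the two-group version stays dark with the threshold unchanged.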

Drills are also generated dynamically when students upload syllabi. The fix must prevent new drills from being broken.

**Measured impact (simulation with production matcher):**
- 20 targeted tests: 35 false positive circles vs 18 correct matches (~2 wrong per 1 right)
- Self-Defense drill: targeting Castle Doctrine lit 5 circles in other elements
- Full bank: 504 total off-target circles across 56 drills

---

## Design Decisions

### D1: Fix content, not the algorithm
Do NOT change the `ceil(n/2)` threshold in `checkMatchGroups`. The matching algorithm is correct for well-structured content (2-3 genuine synonym groups). The 39% improvement in the dry run came entirely from content repair. Because D1 stands, the content contract and degenerate guard must absorb the risk -- particularly for 4-group inflectional keyFacts where `ceil(4/2)=2`.

### D2: Don't touch `matching.ts` signature
`checkMatchGroups` is a pure, zero-dependency client-side function on the per-keystroke hot path. It receives only `groups[][]` and `tokens[]`. All filtering (degenerate detection) happens BEFORE calling it -- in `DrillView.svelte` at drill load time.

### D3: Party names are discriminators, not contaminants
Party names are the most element-specific vocabulary available. Enforce ownership in content: each party name may appear in **at most one element's** match groups, and a party name alone must never be sufficient to satisfy a keyFact. No runtime filtering of party stems -- that would contradict the principle that party names are useful signals.

### D4: Benchmark is the gate, not the repair script
The real failure mode is sibling-element bleed under targeted input (35 off-target in 20 tests). Generic filler only lit 3/361 keyFacts. The release gate must test what actually breaks: a per-drill confusion matrix with element-targeted positives and sibling-element negatives. Test bundles are generated once, versioned alongside drill JSON, and replayed deterministically.

### D5: Ship only improvements, never regressions
Per-drill benchmark: diagonal recall flat or improved, off-diagonal bleed decreased. Drills that don't pass stay on current content under appropriate routing.

### D6: Route drills by risk
Three capability tiers based on benchmark results:
- **`standard`** -- passes benchmark, normal live matching, permanent circle fills
- **`guarded`** -- has degenerate keyFacts; Layer 1 is disabled for those keyFacts, and embedding + LLM handle them. Also the default tier for dynamic drills awaiting benchmark.
- **`semantic_review`** -- high-overlap doctrines; circles are **visually and behaviorally provisional** (not permanent fills) until pause/submit

### D7: Fully automated -- no human review in the loop
All repair, element reassignment, and rewrite operations are AI-automated. The benchmark confusion matrix is the quality backstop. AI outputs are flagged for auditability but ship automatically if they pass the benchmark. Lane B (element reassignment) and PR F (rewrite) carry higher circular validation risk and default to `semantic_review` until proven out.

### D8: Observability lands before routing
Layer attribution, rescue-rate telemetry, and false-positive sampling must ship before routing thresholds are finalized. You cannot set confident tier cutoffs without telemetry data.

### D9: Dynamic drills default to `guarded` until benchmarked
Dynamically generated drills (syllabus upload) are not complete until their confusion-matrix test bundle is created and routing is assigned. Until that happens, they run as `guarded`.

---

## Implementation Plan

### Step 1: Authoring Contract (PR A)

**Goal:** Stop creating new bad drills. Highest-leverage single change.

**Files:** `content-quality-gate.ts`, `drill-prompt.ts`, schema files, shared constants

**Quality gate additions to `validateDrillContent()`:**
- Max groups per keyFact: `MAX_GROUPS_PER_KEYFACT` (3) -- tightened from 4 because `ceil(4/2)=2` makes 4-group keyFacts the most dangerous configuration under the unchanged runtime
- Min groups per keyFact: `MIN_GROUPS_PER_KEYFACT` (2)
- Inflection-only group detection using **stem-canonical comparison**: stem each entry, then `new Set(stemmed).size < 2`. Raw string comparison misses pairs like `["reach","reached"]` which stem to the same token.
- Party name ownership: each party name stem (from `content.parties`) may appear in at most ONE element's groups
- Party name sufficiency: flag keyFacts whose groups contain only party names + common legal stems
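
The group-count and party-ownership rules above could be checked roughly as follows. This is a sketch under stated assumptions, not the actual `validateDrillContent()`: the `ElementSketch` and `KeyFactSketch` shapes, the error strings, and `toyStem` (a naive suffix stripper standing in for the real stemmer) are all hypothetical. The party-sufficiency and inflection-only checks are left out for brevity.

```typescript
// Hypothetical shapes; the real drill schema may differ.
interface KeyFactSketch { display: string; groups: string[][]; }
interface ElementSketch { name: string; keyFacts: KeyFactSketch[]; }

const MAX_GROUPS_PER_KEYFACT = 3;
const MIN_GROUPS_PER_KEYFACT = 2;

// Toy stemmer for illustration only: lowercases and strips a few suffixes.
function toyStem(word: string): string {
  return word.toLowerCase().replace(/(ing|ed|es|s)$/, "");
}

function validateGroupsSketch(elements: ElementSketch[], parties: string[]): string[] {
  const errors: string[] = [];
  const partyStems = new Set(parties.map(toyStem));
  const partyOwner = new Map<string, string>(); // party stem -> first element that used it

  for (const el of elements) {
    for (const kf of el.keyFacts) {
      const n = kf.groups.length;
      if (n < MIN_GROUPS_PER_KEYFACT || n > MAX_GROUPS_PER_KEYFACT) {
        errors.push(`group-count: "${kf.display}" has ${n} groups`);
      }
      for (const group of kf.groups) {
        for (const entry of group) {
          const stem = toyStem(entry);
          if (!partyStems.has(stem)) continue;
          const owner = partyOwner.get(stem);
          if (owner === undefined) {
            partyOwner.set(stem, el.name); // first use claims ownership
          } else if (owner !== el.name) {
            errors.push(`party-collision: "${stem}" used by ${owner} and ${el.name}`);
          }
        }
      }
    }
  }
  return errors;
}

// Invented sample: "Floyd" appears under two elements, and one keyFact has 4 groups.
const sampleElements: ElementSketch[] = [
  { name: "Castle Doctrine", keyFacts: [{ display: "kf-1", groups: [["Floyd"], ["home", "dwelling"]] }] },
  { name: "Imminence", keyFacts: [{ display: "kf-2", groups: [["Floyd"], ["immediate"], ["threat"], ["harm"]] }] },
];
const sampleErrors = validateGroupsSketch(sampleElements, ["Floyd"]);
```

With this sample the sketch reports one group-count error and one party collision, both on the second element.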

**Shared constants:**
```typescript
export const MAX_GROUPS_PER_KEYFACT = 3;
export const MIN_GROUPS_PER_KEYFACT = 2;
```

**Zod schema update:**
Add `groups` array max length (3) to `DrillContentSchema` so gate, schema, and prompt agree on bounds.

**Prompt changes (`drill-prompt.ts`):**
- Hard Rule: "2-3 groups per keyFact" (using shared constant)
- Party name rule: "may appear in at most one element's groups; must never be the only distinguishing signal"
- Explicit BAD example of word-by-word tokenization
- Self-check for inflection-only groups: "Stem each synonym. If all entries stem to the same root (e.g., 'reach'/'reached' -> 'reach'), it is an inflection group, not a synonym group."
- `buildRetryFeedback()`: guidance for new error types (too many groups, party name collision, inflection-only)

**Prerequisite cleanup (separate commit):**
Remove `isGenericContent()` from `run-content-generator.ts` -- duplicates `validateDrillContent()`.

### Step 2: Benchmark Harness + Degenerate Guard (PR B)

**Goal:** Permanent quality gate for all drill content + immediate student relief.

**Benchmark harness:**

Each drill gets its own confusion matrix. For each element, Gemini Flash generates test cases without seeing the match groups:

| Test Type | What It Tests | Expected Result |
|-----------|--------------|-----------------|
| Direct positive | Sentence targeting this element | On-target circles light |
| Paraphrase positive | Same concepts, different words | On-target circles light |
| Sibling-element negative | Sentence targeting a different element | This element's circles stay dark |
| Party-collision negative | Party names without element concepts | This element's circles stay dark |
| Generic filler negative | Common legal prose | All circles stay dark |

**Test bundle versioning:** Bundles generated once per drill, versioned alongside drill JSON. Replayed deterministically on CI. Never regenerated per run.

**Release thresholds (per-drill, numeric):**
- Diagonal recall: flat or improved
- Off-diagonal bleed: decreased, below tier-specific cutoff
- Party-collision false positives: zero
- Exact cutoffs for tier classification calibrated against current bank baseline
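
One way the release thresholds might be computed from replayed bundle results is sketched below. The `ReplayResult` shape, the scoring rules, and the `bleedCutoff` parameter are assumptions for illustration; the plan leaves the exact cutoffs to calibration.

```typescript
// Hypothetical record of one replayed test case.
interface ReplayResult {
  target: string | null; // element the sentence targets; null for party-only and filler cases
  partyOnly: boolean;    // true for party-collision negatives
  lit: string[];         // elements with at least one circle lit
}

interface DrillScore {
  diagonalRecall: number;      // share of targeted cases whose target element lit
  offDiagonalBleed: number;    // count of lit elements that were not the target
  partyFalsePositives: number; // elements lit by party names alone
}

function scoreDrill(results: ReplayResult[]): DrillScore {
  let targeted = 0;
  let onTarget = 0;
  let bleed = 0;
  let partyFp = 0;
  for (const r of results) {
    if (r.target !== null) {
      targeted += 1;
      if (r.lit.indexOf(r.target) !== -1) onTarget += 1;
      bleed += r.lit.filter(e => e !== r.target).length;
    } else if (r.partyOnly) {
      partyFp += r.lit.length;
    } else {
      bleed += r.lit.length; // generic filler lighting anything counts as bleed here
    }
  }
  return {
    diagonalRecall: targeted === 0 ? 0 : onTarget / targeted,
    offDiagonalBleed: bleed,
    partyFalsePositives: partyFp,
  };
}

// Literal reading of the thresholds: recall flat or up, bleed strictly down and
// below the cutoff, zero party false positives. A baseline with zero bleed can
// never pass "decreased" under this reading; how to treat that case is left open.
function passesGate(baseline: DrillScore, candidate: DrillScore, bleedCutoff: number): boolean {
  return candidate.diagonalRecall >= baseline.diagonalRecall
    && candidate.offDiagonalBleed < baseline.offDiagonalBleed
    && candidate.offDiagonalBleed < bleedCutoff
    && candidate.partyFalsePositives === 0;
}

// Invented replay data for a two-element drill.
const baselineScore = scoreDrill([
  { target: "castle", partyOnly: false, lit: ["castle", "imminence", "proportionality"] },
  { target: "imminence", partyOnly: false, lit: ["imminence"] },
  { target: null, partyOnly: true, lit: ["castle"] },
]);
const candidateScore = scoreDrill([
  { target: "castle", partyOnly: false, lit: ["castle"] },
  { target: "imminence", partyOnly: false, lit: ["imminence"] },
  { target: null, partyOnly: true, lit: [] },
]);
const shipsRepair = passesGate(baselineScore, candidateScore, 1);
```

Because the bundle is replayed rather than regenerated, the same inputs yield the same scores on every CI run.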

**Degenerate guard (ships in this PR for immediate relief):**
```typescript
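// True when every entry stems to the same root, e.g. ["reach", "reached"].
// An empty group also returns true, since its stem set has size 0.
function isInflectionOnlyGroup(group: string[]): boolean {
  const stemmed = new Set(group.map(entry => porterStem(entry)));
  return stemmed.size < 2;
}

// Built once at drill load time in DrillView.svelte, keyed by keyFact display text.
const degenerateKeyFacts = new Set<string>();
for (const el of drill.elements) {
  for (const kf of el.keyFacts) {
    const allInflectional = kf.match.groups.every(isInflectionOnlyGroup);
    // Degenerate: 3 or more groups, none of which is a genuine synonym group.
    if (allInflectional && kf.match.groups.length >= 3) {
      degenerateKeyFacts.add(kf.display);
    }
  }
}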
```
Skip `checkMatchGroups` for degenerate keyFacts -- embedding + LLM layers handle them. Sentry breadcrumb on trigger.
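
For a version of the guard that can be run in isolation, the sketch below swaps `porterStem` for `toyStem`, a naive suffix stripper written only for this example, and uses invented keyFacts. `layer1Candidates` is a hypothetical helper showing where the skip could happen; the real call site may look different.

```typescript
// Toy stemmer standing in for porterStem (assumption, illustration only).
function toyStem(word: string): string {
  return word.toLowerCase().replace(/(ing|ed|es|s)$/, "");
}

function isInflectionOnlyGroup(group: string[]): boolean {
  return new Set(group.map(toyStem)).size < 2;
}

// Hypothetical keyFact shape, reduced to the fields the guard reads.
interface KeyFactSketch {
  display: string;
  match: { groups: string[][] };
}

function isDegenerate(kf: KeyFactSketch): boolean {
  return kf.match.groups.length >= 3 && kf.match.groups.every(isInflectionOnlyGroup);
}

// Hypothetical helper: only these keyFacts would be passed to checkMatchGroups.
// Degenerate keyFacts fall through to the embedding + LLM layers.
function layer1Candidates(keyFacts: KeyFactSketch[]): KeyFactSketch[] {
  return keyFacts.filter(kf => !isDegenerate(kf));
}

// Invented content: four groups, each only inflections of one word.
const wordByWordKf: KeyFactSketch = {
  display: "Offer reached and accepted on terms",
  match: { groups: [["reach", "reached", "reaching"], ["accept", "accepted"], ["offer", "offers"], ["term", "terms"]] },
};

// Invented content: two genuine synonym groups.
const healthyKf: KeyFactSketch = {
  display: "Reasonable belief",
  match: { groups: [["reasonable", "justified"], ["belief", "perception"]] },
};

const kept = layer1Candidates([wordByWordKf, healthyKf]);
```

Here `kept` holds only the healthy keyFact, so Layer 1 never evaluates the word-by-word one.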

**Dynamic drill integration:** Dynamically generated drills default to `guarded` until their test bundle is created and routing is assigned.

### Step 3: Content Repair (PR C)

**Goal:** Fix existing content bank. Two automated lanes, different trust levels.

**Lane A -- Match groups repair (can auto-ship on benchmark pass):**
- Gemini Flash: generate proper 2-3 synonym groups per keyFact
- Only `match.groups` change -- display text, embeddings, all other fields frozen
- Validated through hardened quality gate (Step 1)
- Benchmarked through confusion matrix (Step 2)
- Low circular validation risk -- groups-only changes are well-tested by the benchmark

**Lane B -- Element reassignment detection (defaults to `semantic_review`):**
- Separate AI call: evaluate whether each keyFact belongs under its assigned element
- Benchmark tests generated from `rule + element.definition + factPattern`, **blind to current keyFacts** -- prevents circular validation
- If misassignment detected: propose move + generate new groups for new position
- Must pass sibling-negative tests in BOTH old and new positions
- Flagged `ai_reassigned: true` for auditability
- **Defaults to `semantic_review` routing** until proven out over multiple benchmark cycles

**Shipping rule:** Lane A drills auto-ship on benchmark pass. Lane B drills default to `semantic_review` and earn `standard` through repeated benchmark consistency.

### Step 4: Observability (PR D -- lands before routing)

**Goal:** Foundation for all matching decisions. **Required before routing thresholds are finalized.**

**Layer attribution:** For every circle that lights, log which layer(s) triggered it. Structured telemetry.

**Rescue-rate telemetry:** Log each case where the LLM stuck-check fires AND finds matches that the cheaper layers missed. This is the canary for false negatives.

**False-positive sampling:** Sample cases where Layer 1 lit a circle but Layer 2 disagreed. These are the highest-signal candidates for quality review.
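
The three signals could share one structured event stream, sketched below. Everything here is an assumption: the event names, the field names, the layer labels (`layer1` = match groups, `layer2` = embedding, `layer3` = LLM), and the definition of rescue rate as the share of stuck-checks that rescued at least one keyFact.

```typescript
// Assumed layer naming: layer1 = match groups, layer2 = embedding, layer3 = LLM.
type MatchLayer = "layer1" | "layer2" | "layer3";

// One event per circle that lights, recording every layer that triggered it.
interface CircleLitEvent {
  kind: "circle_lit";
  drillId: string;
  keyFact: string;
  layers: MatchLayer[];
}

// One event per LLM stuck-check, listing keyFacts it matched that cheaper layers missed.
interface StuckCheckEvent {
  kind: "stuck_check";
  drillId: string;
  rescuedKeyFacts: string[];
}

type MatchTelemetryEvent = CircleLitEvent | StuckCheckEvent;

// Approximates "Layer 1 lit but Layer 2 disagreed" as: layer1 present, layer2 absent.
function isFalsePositiveCandidate(event: CircleLitEvent): boolean {
  return event.layers.indexOf("layer1") !== -1 && event.layers.indexOf("layer2") === -1;
}

// One possible definition: share of stuck-checks that rescued at least one keyFact.
function rescueRate(events: MatchTelemetryEvent[]): number {
  let checks = 0;
  let rescues = 0;
  for (const e of events) {
    if (e.kind !== "stuck_check") continue;
    checks += 1;
    if (e.rescuedKeyFacts.length > 0) rescues += 1;
  }
  return checks === 0 ? 0 : rescues / checks;
}

// Invented sample events.
const litByLayer1Only: CircleLitEvent = { kind: "circle_lit", drillId: "d1", keyFact: "kf-a", layers: ["layer1"] };
const litByBoth: CircleLitEvent = { kind: "circle_lit", drillId: "d1", keyFact: "kf-b", layers: ["layer1", "layer2"] };
const sampleEvents: MatchTelemetryEvent[] = [
  litByLayer1Only,
  litByBoth,
  { kind: "stuck_check", drillId: "d1", rescuedKeyFacts: ["kf-c"] },
  { kind: "stuck_check", drillId: "d1", rescuedKeyFacts: [] },
];
const sampleRescueRate = rescueRate(sampleEvents);
```

Keeping `layers` as a list rather than a single value lets the same event answer both the attribution question and the false-positive sampling question.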

**Data-driven decisions (after ~2 weeks):**
- Finalize routing tier thresholds
- Evaluate whether "warm" circle state is needed (and for which tiers)
- Evaluate whether cosine threshold (0.50) needs recalibration

### Step 5: Capability Routing (PR E -- after observability)

**Goal:** Every drill gets appropriate runtime behavior based on quality + telemetry.

**Three tiers, assigned by benchmark scores + observability data:**

| Tier | Criteria | Runtime Behavior |
|------|----------|-----------------|
| `standard` | Passes full benchmark; bleed rate < X% | Normal 3-layer matching; permanent circle fills |
| `guarded` | Has degenerate keyFacts or awaiting benchmark | Layer 1 skipped for degenerate keyFacts; embedding + LLM for those |
| `semantic_review` | High-overlap doctrine; bleed above thresholds | Circles are **provisional** (not permanent); confirmed at pause/submit |

**`semantic_review` UI semantics:** Circles do NOT use "once lit stays lit." They show as suggestions -- visually distinct from confirmed matches -- and are confirmed (or not) at pause or submit. This is necessary because "once lit stays lit" makes false positives permanent, and permanence is the core UX failure for these drills.
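
The difference between permanent and provisional circles can be sketched as a small state transition. The state names and function names are hypothetical, and treating `guarded` like `standard` for fill permanence is an assumption the plan does not spell out.

```typescript
type Tier = "standard" | "guarded" | "semantic_review";

// "lit" is a confirmed, permanent fill; "suggested" is a provisional circle.
type CircleState = "dark" | "lit" | "suggested";

// Called when live matching reports a hit for this circle.
function onLiveMatch(tier: Tier, current: CircleState): CircleState {
  if (current === "lit") return "lit"; // confirmed fills never downgrade
  // Assumption: only semantic_review withholds the permanent fill.
  return tier === "semantic_review" ? "suggested" : "lit";
}

// Called at pause or submit; `confirmed` is the verdict for this circle.
function onPauseOrSubmit(current: CircleState, confirmed: boolean): CircleState {
  if (current !== "suggested") return current; // nothing provisional to resolve
  return confirmed ? "lit" : "dark";
}

// A false positive in a semantic_review drill is cleared instead of sticking.
const afterMatch = onLiveMatch("semantic_review", "dark");
const afterReview = onPauseOrSubmit(afterMatch, false);
```

In this sketch a wrong live hit in a `semantic_review` drill ends up `dark` after review, whereas the same hit in a `standard` drill would already be a permanent fill.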

**Classification is automated.** Benchmark confusion-matrix scores + observability data determine tier. Numeric thresholds calibrated from Step 2 baseline + Step 4 telemetry.
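
A classification function along these lines might look like the sketch below. The input fields, the precedence between rules, and the threshold parameter are assumptions; the numeric value used in the example is a placeholder, not a calibrated cutoff. Promotion of Lane B and PR F drills after repeated benchmark cycles is not modeled.

```typescript
type Tier = "standard" | "guarded" | "semantic_review";

// Hypothetical summary of what the benchmark and content pipeline know about a drill.
interface RoutingInput {
  benchmarked: boolean;           // false until a test bundle exists and has been replayed
  passesBenchmark: boolean;
  bleedRate: number;              // off-diagonal bleed as a fraction of test cases
  hasDegenerateKeyFacts: boolean;
  aiReassigned: boolean;          // Lane B output
  aiRewritten: boolean;           // PR F output
}

// Cutoffs are passed in because the plan calibrates them later.
interface RoutingThresholds {
  standardMaxBleedRate: number;
}

function assignTier(input: RoutingInput, thresholds: RoutingThresholds): Tier {
  if (!input.benchmarked) return "guarded"; // dynamic drills awaiting benchmark
  if (input.aiReassigned || input.aiRewritten) return "semantic_review"; // default until proven out
  if (input.hasDegenerateKeyFacts) return "guarded";
  if (input.passesBenchmark && input.bleedRate < thresholds.standardMaxBleedRate) return "standard";
  return "semantic_review";
}

// Placeholder threshold for the example only.
const exampleThresholds: RoutingThresholds = { standardMaxBleedRate: 0.05 };

const freshDynamicDrill: RoutingInput = {
  benchmarked: false, passesBenchmark: false, bleedRate: 0,
  hasDegenerateKeyFacts: false, aiReassigned: false, aiRewritten: false,
};
const cleanDrill: RoutingInput = {
  benchmarked: true, passesBenchmark: true, bleedRate: 0.01,
  hasDegenerateKeyFacts: false, aiReassigned: false, aiRewritten: false,
};
const reassignedDrill: RoutingInput = {
  benchmarked: true, passesBenchmark: true, bleedRate: 0.01,
  hasDegenerateKeyFacts: false, aiReassigned: true, aiRewritten: false,
};

const tiers = [freshDynamicDrill, cleanDrill, reassignedDrill].map(d => assignTier(d, exampleThresholds));
```

Passing thresholds as data keeps the function stable while the cutoffs are tuned from the baseline and telemetry.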

### Step 6: AI Rewrite Pipeline for Overlap-Heavy Doctrines (PR F)

**Goal:** Fix the 12+ drills where stem matching is structurally insufficient.

**Two automated strategies:**

1. **Split:** AI proposes splitting a 6-element drill into two 3-element drills with cleaner vocabulary boundaries
2. **Rewrite:** AI rewrites keyFact display text to increase discriminability (constrained by rule text and fact pattern for legal accuracy)

**Anti-circular validation:** Tests generated from `rule + element.definition + factPattern` only, blind to rewritten keyFacts. Prevents AI from validating its own mistakes.
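
The blindness constraint can be enforced structurally by building the test-generation input from an explicit allowlist of fields. The sketch below assumes hypothetical `DrillSketch` and `BlindTestInput` shapes; `elementIndex` is included only to key results back to an element and carries no drill content.

```typescript
// Hypothetical drill shape, reduced to the fields relevant here.
interface DrillElementSketch {
  definition: string;
  keyFacts: { display: string; groups: string[][] }[];
}
interface DrillSketch {
  rule: string;
  factPattern: string;
  elements: DrillElementSketch[];
}

// Everything the test generator is allowed to see for one element.
interface BlindTestInput {
  elementIndex: number;
  rule: string;
  factPattern: string;
  elementDefinition: string;
}

// Copies allowlisted fields only, so keyFact text and match groups cannot leak in.
function buildBlindInputs(drill: DrillSketch): BlindTestInput[] {
  return drill.elements.map((el, index) => ({
    elementIndex: index,
    rule: drill.rule,
    factPattern: drill.factPattern,
    elementDefinition: el.definition,
  }));
}

// Invented drill content.
const sampleDrill: DrillSketch = {
  rule: "Sample rule text",
  factPattern: "Sample fact pattern",
  elements: [
    {
      definition: "Sample element definition",
      keyFacts: [{ display: "REWRITTEN-KEYFACT-TEXT", groups: [["alpha", "beta"], ["gamma", "delta"]] }],
    },
  ],
};
const blindInputs = buildBlindInputs(sampleDrill);
const serializedInputs = JSON.stringify(blindInputs);
```

Constructing a fresh object, rather than deleting fields from a copy of the drill, means a newly added drill field stays hidden from the generator by default.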

All outputs benchmarked before shipping. Flagged `ai_rewritten: true`. **Default to `semantic_review`** until proven out.

**Proposition engine remains V2.** These drills get controlled downgrade + AI rewrite now. Proposition-based matching is the right long-term architecture for structurally hard doctrines, scoped as a separate milestone.

---

## Implementation Order

| Step | What | PR | Dependency |
|------|------|-----|------------|
| 1 | Authoring contract: gate + prompt + schema + constants + remove duplicate | PR A | None |
| 2 | Benchmark harness + release thresholds + degenerate guard + dynamic drill default | PR B | PR A merged |
| 3 | Content repair: Lane A (auto-ship) + Lane B (defaults to semantic_review) | PR C | PR B merged |
| 4 | Observability: layer attribution + rescue-rate + false-positive sampling | PR D | Can run in parallel with PR C |
| 5 | Capability routing: standard / guarded / semantic_review with numeric thresholds | PR E | PR D merged |
| 6 | AI rewrite pipeline for overlap-heavy doctrines | PR F | PR B merged |

---

## Success Criteria

| Criterion | Metric |
|---|---|
| No new drill ships with >3 groups per keyFact | Quality gate blocks at generation time |
| No inflection-only groups pass validation | Stem-canonical check catches `["reach","reached"]` |
| No party name collides across elements in new content | Quality gate catches at generation time |
| Per-drill confusion matrix improves before shipping repairs | Diagonal recall flat/up, off-diagonal bleed down |
| 4-group inflectional keyFacts caught by degenerate guard | Stem-canonical detection, not just group-count |
| Dynamic drills default to guarded until benchmarked | Generation pipeline produces test bundle + assigns routing |
| Every drill classified standard/guarded/semantic_review | Automated routing based on benchmark + observability |
| 0 regressions shipped | Per-drill benchmark gate |
| semantic_review circles are provisional, not permanent | No "once lit stays lit" for suggestion circles |
| Lane B and PR F default to semantic_review | Higher-risk AI changes earn tier through repeated cycles |
| Overlap-heavy doctrines get controlled downgrade | semantic_review routing, not broken live behavior |
| Layer attribution enables data-driven decisions | Structured telemetry for every circle-light event |
| Recall regression detectable | Rescue-rate telemetry in production |
| Benchmark is deterministic | Test bundles versioned, replayed, not regenerated |

---

## Resolved Contradictions

### From External Review, Round 1

| Original Issue | Resolution |
|---|---|
| D3 says party names are discriminators; Phase 2 filtered them from runtime tokens | Dropped global party-stem filtering. Enforcement moved to content contracts (ownership + sufficiency rules) |
| Phase 4 claimed to fix element assignment while constraining repair to `match.groups` only | Split into Lane A (groups-only) and Lane B (element reassignment detection) with separate gates |
| Release gate tested generic prose instead of actual failure mode (sibling bleed) | Replaced with per-drill confusion matrix using element-targeted test cases |
| Binary ship-or-skip left 12 drills as known-bad with tracking tickets | Three-tier capability routing: standard/guarded/semantic_review |
| "Warm" circle state used as hedge against stricter stems | Postponed until observability data justifies; scoped to affected tiers |

### From External Review, Round 2

| Original Issue | Resolution |
|---|---|
| Degenerate guard only caught >4 groups; 4-group inflectional keyFacts are the most dangerous under `ceil(4/2)=2` | Lowered `MAX_GROUPS_PER_KEYFACT` to 3; broadened guard to stem-canonical inflection detection for any group count >= 3 |
| Inflection check used raw string uniqueness; misses `["reach","reached"]` | Gate stems each entry before checking set uniqueness |
| Lane B and PR F auto-ship despite circular validation risk from AI-generated tests | Lane B/PR F benchmarked blind to current keyFacts; default to `semantic_review` until proven out |
| Benchmark nondeterministic; not integrated into dynamic drill generation | Test bundles generated once, versioned, replayed. Dynamic drills default to `guarded` until benchmarked |
| Routing lacked numeric thresholds; `semantic_review` used permanent circle fills; observability landed after routing | Observability (PR D) before routing (PR E). `semantic_review` circles are provisional. Numeric cutoffs calibrated with telemetry |

---

## Follow-up Items (Linear issues to create)

- [ ] Audit cosine similarity threshold (0.50) against repaired content -- data from Step 4
- [ ] Evaluate proposition-based matching as V2 milestone for structurally hard doctrines
- [ ] Warm circle state decision after 2 weeks of observability data
- [ ] Per-drill classification review after Step 6 rewrites ship
- [ ] Lane B drills: periodic benchmark re-evaluation for tier promotion from semantic_review to standard
