# Drill Matching Fix — Remaining Work

**Status:** Steps 1-6 shipped across PRs #2138, #2139, #2140 (all MERGED)
**Date:** 2026-04-22
**Context documents:**
- `Drill Matching Fix - Complete Brief.md` (same directory — full diagnosis, evidence, design decisions)
- `Drill Matching Fix - Revised Plan.md` (same directory — 6-step plan, all steps shipped)

## Post-Merge Corrections

- **PR #2140 is fully merged.** Some items that were originally described as "deferred review comments" were actually fixed before merge.
- **`semantic_review` no longer has provisional circles.** PR #2140 removed the provisional/confirmed circle treatment before merge; the current UI is dark/light only.
- **`DrillTier` is runtime-only today.** `standard` / `guarded` / `semantic_review` are computed heuristically in `DrillView.svelte` from degenerate keyFacts and used for guard behavior/telemetry. They are not persisted per drill.
- **Operational scripts still use direct Gemini keys today.** The app/runtime prefers `AI_GATEWAY_API_KEY`, but `drill-benchmark.ts`, `drill-repair.ts`, and `drill-rewrite.ts` still read `GOOGLE_API_KEY` directly until a shared Node-safe gateway helper lands.

---

## What Was Shipped

| PR | Step | What |
|----|------|------|
| #2138 | 1 | Authoring contract: quality gate (max 3 groups, stem-canonical inflection check, party ownership/sufficiency), prompt strengthening, Zod schema, shared constants, `isGenericContent()` removal |
| #2139 | 2 | Degenerate keyFact guard in DrillView.svelte (skips stem matching for broken content), benchmark confusion-matrix harness (`scripts/drill-benchmark.ts`) |
| #2140 | 3-6 | Content repair pipeline (`scripts/drill-repair.ts`), layer attribution telemetry, DrillTier computation, AI rewrite pipeline (`scripts/drill-rewrite.ts`) |

---

## Remaining Work (in execution order)

### Issue 1: Reactive Matching — Circles Respond to Text Deletion

**Priority:** High
**Type:** Feature / behavior change
**Why it matters:** Currently, once a circle lights via embedding or LLM, it stays lit permanently ("once lit stays lit") even if the student deletes the text that triggered it. The only way circles go dark is a blanket character-count reset (text < 12 chars for embeddings, < 30 for LLM). This is wrong — if a student deletes a sentence about Intent, the Intent circle should go dark.

**What to change:**

1. **Make embedding maps non-sticky.** In `DrillView.svelte`, `runEmbeddingMatch()` currently only records positive results:
   ```typescript
   // Current (sticky — only adds true):
   if (m.lit) next.set(key, true);
   
   // Change to (reactive — reflects current text):
   next.set(key, m.lit);
   ```
   The embedding endpoint already returns results for ALL keyFacts. Recording the full state means when the student edits and embedding re-evaluates at 500ms, circles that no longer match go dark.

2. **Clear stale LLM matches when embedding disagrees.** When the embedding endpoint returns `lit: false` for a key that the LLM previously marked `true`, clear that LLM entry. Embedding is a cheaper proxy for semantic relevance — if it says the concept is gone, the LLM's old judgment is stale.
   ```typescript
   // In runEmbeddingMatch, after setting the new embedding state.
   // staleLlmKeys is a Set<string> declared before the loop (illustrative name).
   if (!m.lit && llmFactMatches.has(key)) {
       // Embedding no longer matches: mark this key's LLM result as stale
       staleLlmKeys.add(key);
   }
   ```
   Then, if `staleLlmKeys` is non-empty, create a new `llmFactMatches` map without those keys.

3. **Simplify reset logic.** The blanket character-count resets (< 12 chars, < 30 chars) become less critical since individual circles are now reactive. Keep a minimal full-reset on truly empty text (< 5 chars) as defensive behavior, but the per-circle reactivity handles the main case.
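
Steps 1 and 2 could fit together roughly as follows. This is a minimal sketch, assuming the embedding endpoint returns one `{ key, lit }` entry per keyFact and that both result maps are `Map<string, boolean>`. The names `applyEmbeddingResults`, `EmbeddingMatch`, and `MatchState` are hypothetical; the real `runEmbeddingMatch()` in `DrillView.svelte` may be structured differently.

```typescript
// Hypothetical shape of one embedding result; the real endpoint payload may differ.
interface EmbeddingMatch {
    key: string;
    lit: boolean;
}

interface MatchState {
    embeddingMatches: Map<string, boolean>;
    llmFactMatches: Map<string, boolean>;
}

// Builds fresh maps from the latest embedding response without mutating the
// inputs, so the caller can assign the results back to component state.
function applyEmbeddingResults(
    results: EmbeddingMatch[],
    prevLlm: Map<string, boolean>
): MatchState {
    const embeddingMatches = new Map<string, boolean>();
    const staleLlmKeys = new Set<string>();

    for (const m of results) {
        // Reactive: record false as well as true, so circles can go dark.
        embeddingMatches.set(m.key, m.lit);
        if (!m.lit && prevLlm.has(m.key)) {
            staleLlmKeys.add(m.key);
        }
    }

    // Copy the LLM map, dropping entries the embedding layer now contradicts.
    const llmFactMatches = new Map<string, boolean>();
    prevLlm.forEach((value, key) => {
        if (!staleLlmKeys.has(key)) llmFactMatches.set(key, value);
    });

    return { embeddingMatches, llmFactMatches };
}
```

Returning new maps rather than mutating in place keeps the update easy to test in isolation and avoids relying on in-place `Map` mutation being observed by the component.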

**Expected UX:** Student deletes text → stems fail instantly → embedding re-evaluates at 500ms and circle goes dark. The 500ms gap is barely perceptible.

**Files to modify:** `apps/web-svelte/src/lib/drills/DrillView.svelte`

**Test approach:** Update `DrillView.reset.test.ts` — the existing reset tests should verify that circles go dark when embedding returns false, not just when text drops below thresholds.

**Key constraint:** Do NOT touch `matching.ts` or `checkMatchGroups`. This is purely about how DrillView manages the embedding and LLM result maps.

---

### Issue 2: Residual Cleanup After PR #2140

**Priority:** Medium
**Type:** Cleanup / hardening

This section in the original handoff is partially stale. Several items originally listed here were fixed before PR #2140 merged, including:

- benchmark coverage validation in `drill-repair.ts` and `drill-rewrite.ts`
- party collision rate inclusion in the benchmark gate
- rewrite extractive checks grounded in `factPattern`
- parse-failure hard failure in repair/rewrite flows
- out-of-bounds / malformed rewrite-response handling
- explicit `syncStemAttributionFromCurrentMatches()` before submit breadcrumb emission

Any remaining cleanup should be confirmed against the actual merged code and unresolved PR threads, not this pre-merge checklist.

**Still-reasonable cleanup candidates:**

**DrillView.svelte:**
- Extract `layerAttribution`, `syncStemAttributionFromCurrentMatches`, and breadcrumb emission into a helper module if `DrillView.svelte` continues to grow.

**Operational scripts:**
- Migrate `drill-benchmark.ts`, `drill-repair.ts`, and `drill-rewrite.ts` to the same gateway-first model/key resolution strategy used elsewhere in paddock (`AI_GATEWAY_API_KEY` first, direct provider fallback second) via a Node-safe shared helper.
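
One possible shape for the shared helper is sketched below. It only covers key resolution (gateway first, direct Gemini key second). The function name `resolveModelCredentials`, the return shape, and the error message are assumptions for illustration, not an existing paddock API.

```typescript
// Hypothetical helper: names and return shape are illustrative only.
type CredentialSource = "gateway" | "google-direct";

interface ResolvedCredentials {
    source: CredentialSource;
    apiKey: string;
}

// Takes the environment as a parameter so the helper has no framework-specific
// env imports and can be unit-tested without touching the real environment.
function resolveModelCredentials(
    env: Record<string, string | undefined>
): ResolvedCredentials {
    // Preferred path: the AI gateway key.
    const gatewayKey = env.AI_GATEWAY_API_KEY?.trim();
    if (gatewayKey) return { source: "gateway", apiKey: gatewayKey };

    // Fallback path: the direct provider key the scripts read today.
    const googleKey = env.GOOGLE_API_KEY?.trim();
    if (googleKey) return { source: "google-direct", apiKey: googleKey };

    throw new Error("Set AI_GATEWAY_API_KEY or GOOGLE_API_KEY before running this script.");
}
```

Each script could then call `resolveModelCredentials(process.env)` once at startup and branch on `source` when constructing its model client.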

**Approach:** Keep this as a narrow cleanup PR. Do not reopen the removed provisional-circle work or assume pre-merge review comments are still outstanding.

**Files to modify:** `DrillView.svelte`, `scripts/drill-benchmark.ts`, `scripts/drill-repair.ts`, `scripts/drill-rewrite.ts`

---

### Issue 3: Run Operational Pipelines — Generate Benchmarks, Repair Content, Analyze Overlaps

**Priority:** High
**Type:** Operational / content

This issue runs the scripts built in Steps 2-6; it involves no new code.

**Execution order:**

1. **Generate benchmark bundles for all drills:**
   ```bash
   GOOGLE_API_KEY=<key> npx tsx scripts/drill-benchmark.ts generate
   ```
   This creates `.benchmark.json` files alongside each `drill-*.json`. Takes ~5 minutes (56 drills, 5 test cases per element, Gemini Flash calls). Review a few bundles to verify quality.

2. **Evaluate current content against benchmarks:**
   ```bash
   npx tsx scripts/drill-benchmark.ts evaluate
   ```
   No API calls — runs production `stemTokenize` + `checkMatchGroups` against the test bundles. Produces confusion matrices showing diagonal recall, off-diagonal bleed, and party collision rates for each drill. This is the baseline.

3. **Run repair pipeline (dry run first):**
   ```bash
   GOOGLE_API_KEY=<key> npx tsx scripts/drill-repair.ts repair --all --dry-run
   ```
   Shows what would change without writing. Review the proposed synonym groups. Then:
   ```bash
   GOOGLE_API_KEY=<key> npx tsx scripts/drill-repair.ts repair --all --write
   ```
   Only drills that pass validation AND show benchmark improvement get written. Drills without benchmark bundles can use `--skip-benchmark` for the initial pass.

4. **Analyze overlap-heavy drills:**
   ```bash
   npx tsx scripts/drill-rewrite.ts analyze
   ```
   Identifies drills with high vocabulary overlap between elements (Jaccard similarity). These are candidates for the rewrite pipeline or manual splitting.

5. **Commit repaired content:**
   The `--write` flag modifies drill JSON files in place (only `match.groups` changes — embeddings, display text, everything else preserved). Commit as a single batch:
   ```
   feat(paddock): batch repair drill match groups via benchmark-gated pipeline
   ```
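
For reading the step 2 output, the sketch below shows one plausible way to derive diagonal recall and off-diagonal bleed from a confusion matrix. The matrix layout, the metric definitions, and the name `summarizeConfusion` are assumptions; `drill-benchmark.ts` may compute these differently.

```typescript
interface ElementMetrics {
    element: string;
    diagonalRecall: number;   // share of an element's own cases that lit its own circle
    offDiagonalBleed: number; // share of (case, other element) pairs that lit wrongly
}

// Assumed layout: counts[i][j] = number of test cases written for element i
// that lit the circle for element j. A case may light more than one circle.
function summarizeConfusion(
    elements: string[],
    counts: number[][],
    casesPerElement: number
): ElementMetrics[] {
    const others = Math.max(elements.length - 1, 1);
    return elements.map((element, i) => {
        const row = counts[i];
        // Sum every hit in this row except the diagonal entry.
        const wrongHits = row.reduce((sum, n, j) => (j === i ? sum : sum + n), 0);
        return {
            element,
            diagonalRecall: row[i] / casesPerElement,
            offDiagonalBleed: wrongHits / (casesPerElement * others),
        };
    });
}
```

Under these definitions, a healthy drill has recall near 1 and bleed near 0 for every element.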

**Expected outcome:** ~40 drills improved, ~14 flagged as overlap-heavy for rewrite/split treatment.
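
The overlap analysis in step 4 relies on Jaccard similarity. As a rough illustration of the measure, assuming each element's vocabulary is available as a set of stems, pairwise overlap could be scored as below. The names `jaccard` and `overlappingPairs` and the threshold parameter are hypothetical; the cutoff `drill-rewrite.ts analyze` actually uses is not stated here.

```typescript
// Jaccard similarity: |A ∩ B| / |A ∪ B| over two vocabularies (e.g. stem sets).
function jaccard(a: Set<string>, b: Set<string>): number {
    if (a.size === 0 && b.size === 0) return 0;
    let shared = 0;
    a.forEach((term) => {
        if (b.has(term)) shared++;
    });
    return shared / (a.size + b.size - shared);
}

// Returns every element pair whose overlap meets the threshold, highest first.
function overlappingPairs(
    vocab: Map<string, Set<string>>,
    threshold: number
): { a: string; b: string; score: number }[] {
    const names = Array.from(vocab.keys());
    const pairs: { a: string; b: string; score: number }[] = [];
    for (let i = 0; i < names.length; i++) {
        for (let j = i + 1; j < names.length; j++) {
            const score = jaccard(vocab.get(names[i])!, vocab.get(names[j])!);
            if (score >= threshold) pairs.push({ a: names[i], b: names[j], score });
        }
    }
    return pairs.sort((x, y) => y.score - x.score);
}
```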

**Key files:**
- `scripts/drill-benchmark.ts` — benchmark generation and evaluation
- `scripts/drill-repair.ts` — repair pipeline
- `scripts/drill-rewrite.ts` — rewrite/split analysis
- `apps/web-svelte/src/lib/practice/content/paddock/**/drill-*.json` — the 56 drill content files

---

### Issue 4: Create Follow-Up Linear Issues for Future Work

**Priority:** Low
**Type:** Planning

After the operational pipeline runs (Issue 3), create Linear issues for:

1. **Cosine similarity threshold audit** — Stricter stems change what reaches Layer 2 (embedding). The current threshold (0.50) may need recalibration against repaired content. Use the layer attribution telemetry (shipped in PR #2140) to measure how often embeddings confirm vs contradict stem results. Wait for ~2 weeks of production data.

2. **Proposition-based matching V2** — For drills with 5-6 heavily overlapping elements (commerce clause, takings, etc.), stem matching is structurally insufficient regardless of content quality. A proposition engine with typed slots, evidence spans, and polarity detection is the right long-term architecture. Scope as a V2 milestone. The `drill-rewrite.ts analyze` output identifies which drills need this.

3. **Per-drill classification review** — After the rewrite pipeline runs, review the tier assignments (standard/guarded/semantic_review). Drills that were rewritten should be re-benchmarked to see if they can graduate from `semantic_review` to `standard`.

4. **Warm circle state decision** — The original plan included a "warm" intermediate circle state for embedding similarity 0.35-0.49. This was deferred until telemetry justifies it. After 2 weeks of layer attribution data, decide whether it's needed. If so, scope to guarded/semantic_review drills only.
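
To make the thresholds in items 1 and 4 concrete, here is a small sketch of cosine similarity plus a banding function. The 0.50 and 0.35 values come from this document. The function names, the `warmEnabled` flag, and the use of `>=` at each boundary are assumptions, since the production comparison is not shown here.

```typescript
type CircleState = "dark" | "warm" | "lit";

const LIT_THRESHOLD = 0.5;   // current embedding threshold per this document
const WARM_THRESHOLD = 0.35; // lower bound of the deferred "warm" band

// Standard cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
    if (a.length !== b.length || a.length === 0) {
        throw new Error("Vectors must be non-empty and of equal length");
    }
    let dot = 0;
    let normA = 0;
    let normB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    // A zero vector has no direction; treat it as no similarity.
    if (normA === 0 || normB === 0) return 0;
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// With warmEnabled = false this reproduces a dark/light-only UI.
function classifySimilarity(score: number, warmEnabled: boolean): CircleState {
    if (score >= LIT_THRESHOLD) return "lit";
    if (warmEnabled && score >= WARM_THRESHOLD) return "warm";
    return "dark";
}
```

Keeping both cutoffs as named constants would make a later recalibration a one-line change once the telemetry is in.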

**These are tracking issues, not implementation work.** Create them after Issue 3 produces results so each issue has specific data (which drills, what metrics, what gaps).

---

## Key Files Reference

| File | What it does |
|------|-------------|
| `apps/web-svelte/src/lib/drills/DrillView.svelte` | Client-side drill UI with 3-layer matching loop, degenerate guard, layer attribution |
| `apps/web-svelte/src/lib/drills/BuildingBlocksPanel.svelte` | Circle rendering (dark/light only) |
| `apps/web-svelte/src/lib/drills/types.ts` | `DrillTier` type |
| `apps/web-svelte/src/lib/practice/matching.ts` | `stemTokenize`, `checkMatchGroups`, `porterStem` |
| `apps/web-svelte/src/lib/practice/drill-constants.ts` | `MAX_GROUPS_PER_KEYFACT=3`, `MIN_GROUPS_PER_KEYFACT=2`, `COMMON_LEGAL_STEMS` |
| `apps/web-svelte/src/lib/server/paddock/content-quality-gate.ts` | `validateDrillContent()` — 6-section validator |
| `apps/web-svelte/src/lib/server/paddock/drill-prompt.ts` | `buildDrillPrompt()`, `buildRetryFeedback()` |
| `apps/web-svelte/src/lib/practice/content/paddock/**/drill-*.json` | 56 drill content files |
| `scripts/drill-benchmark.ts` | Benchmark generation + evaluation |
| `scripts/drill-repair.ts` | Lane A (groups repair) + Lane B (assignment detection) |
| `scripts/drill-rewrite.ts` | Analyze overlaps, propose splits, rewrite keyFacts |

## Design Decisions (still in effect)

- **D1:** Fix content, not algorithm. `checkMatchGroups` threshold unchanged.
- **D3:** Party names enforced in content (ownership), NOT filtered at runtime.
- **D4:** Benchmark confusion matrix is the gate. Test bundles versioned, replayed deterministically.
- **D7:** No human review. AI-automated with benchmark as backstop.
- **D8:** Observability (layer attribution) informs future routing decisions.
