Paddock Drill Matching: System Overview, Content Quality Failure, and Remediation Plan

1. What Drills Are: The Student Experience

A drill is a focused practice exercise built around a single legal doctrine (e.g., Battery, Claim Preclusion, Adverse Possession). The student is shown:

- A rule statement (concise, 100-220 chars)
- A fact pattern (a short narrative, 200-400 chars) involving named parties
- A Building Blocks panel on the right: a panel of circles organized by element, each circle representing one key fact the student should identify

The student types a free-form legal analysis into a text area. As they type, circles in the Building Blocks panel fill in. The feedback is live, per-keystroke, and feels like the system is "listening." A filled circle signals: you've demonstrated this piece of the analysis. An empty circle signals: you haven't yet addressed this.

After submitting, the student sees which circles lit and gets progressive disclosure of the full rule, definitions, and formula scaffold.

The core UX promise: circles light when the student has actually demonstrated the concept, not just when they've typed related words. This promise is broken.

---

2. The Three-Layer Matching Architecture

Matching runs in three layers, each adding signal to the student's running response. A circle lights if any layer says it matched (union semantics). Once lit, a circle stays lit; this "once lit stays lit" rule prevents flicker as the student continues typing.

Layer 1: Stem Matching (checkMatchGroups in matching.ts)

Per keystroke (debounced 200ms), the student's response is tokenized and Porter-stemmed via stemTokenize. The resulting token set is compared against the pre-stemmed match.groups stored in each keyFact.

isGroupHit: A group is "hit" if any stem in the group exactly matches any response token, or, for stems ≥4 chars, is within edit distance 1 of a response token (typo tolerance).
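The isGroupHit rule can be illustrated with a minimal sketch. This is not the code in matching.ts: the function bodies, the helper name withinEditDistanceOne, and the use of plain arrays for the response tokens are assumptions made for illustration. It only follows the rule as stated here: exact match for any stem, plus a one-edit tolerance for stems of 4 or more characters.

```typescript
// Illustrative sketch only; the real isGroupHit in matching.ts may differ.

/** True if a and b are equal or differ by one insertion, deletion, or substitution. */
function withinEditDistanceOne(a: string, b: string): boolean {
  if (a === b) return true;
  if (Math.abs(a.length - b.length) > 1) return false;
  // `short` is the shorter (or equal-length) string.
  const [short, long] = a.length <= b.length ? [a, b] : [b, a];
  let i = 0;
  let j = 0;
  let edits = 0;
  while (i < short.length && j < long.length) {
    if (short[i] === long[j]) {
      i++;
      j++;
      continue;
    }
    edits++;
    if (edits > 1) return false;
    if (short.length === long.length) i++; // substitution: advance both strings
    j++; // otherwise treat it as an extra character in `long`
  }
  // A leftover trailing character in `long` counts as one more edit.
  return edits + (long.length - j) <= 1;
}

/** A group is hit if any of its stems matches any response token (assumed rule). */
function isGroupHit(group: string[], responseTokens: string[]): boolean {
  return group.some((stem) => {
    if (responseTokens.indexOf(stem) !== -1) return true; // exact match
    if (stem.length < 4) return false; // short stems get no typo tolerance
    return responseTokens.some((token) => withinEditDistanceOne(stem, token));
  });
}
```

Under these assumptions, a 3-character stem such as "pai" matches only exactly, while a 5-character stem such as "first" also accepts a token like "fist".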
checkMatchGroups applies a threshold:

- 2 groups → both must hit (hitCount === 2). Strictest.
- 3+ groups → hitCount >= ceil(n/2). If short by exactly 1, a proximity bonus (two stems from different groups appearing within 6 tokens of each other) can push it over.
- 1 group → single hit suffices (legacy backward-compat path).
- 0 groups → vacuously true.

Layer 2: Embedding Match (/api/paddock/drill-match)

Per keystroke (debounced 500ms), the student's response text is sent to a server endpoint. The server computes a vector embedding of the rolling response and compares it via cosine similarity against the pre-computed embedding stored in each DrillKeyFact.embedding (a 384-dimensional float vector, built at content-generation time). A fact lights if similarity ≥ 0.50. This catches paraphrase: a student who writes "offensive dignitary affront" can light a circle whose display text says "humiliated in front of colleagues," even if no stems overlap.

Layer 3: LLM Upgrade (10-second pause gate)

If the student pauses typing for 10+ seconds with at least one dark circle remaining, the system asks an LLM whether the student's response semantically covers the unlit facts. This is the most expensive layer and is designed to fire ~1-3 times per drill session, catching the long tail of doctrinal paraphrase that neither stems nor embeddings caught.

---

3. The Intended Content Structure

Each DrillKeyFact has a match descriptor containing groups: string[][].
The design intent, expressed in the exemplar content used to train the LLM:

    {
      "display": "grabbed binder from hands",
      "match": {
        "groups": [
          ["grab", "seiz", "snatch", "pull"],
          ["binder", "folder", "document"]
        ]
      }
    }

Design intent:

- 2-4 groups, each representing one distinct semantic concept the keyFact requires
- 3-5 synonyms per group: alternative words a student might use to express that concept
- AND semantics across groups: the student must demonstrate ALL concepts (or a majority for 3+)
- OR semantics within a group: any synonym suffices

For a 2-group keyFact, checkMatchGroups requires hitCount === 2: the student must mention something from BOTH concept buckets. This is the strictest mode, and intentionally so. A circle should require demonstrating the action (grabbed/seized) AND the object (binder/folder). Mentioning only one is not enough.

The MatchDescriptor type comment in types.ts even says: "Groups of synonym stems — one stem per concept, all groups must match for a key fact hit."

---

4. What Was Actually Generated: The Broken Pattern

About 80% of keyFacts in the content library (288 of 361, across all 63 drill files in 6 domains) were generated with the following pattern:

    {
      "display": "Equal Pay Act claim was not raised in the first action",
      "match": {
        "groups": [
          ["equal", "equals"],
          ["pai", "pay"],
          ["act", "acts"],
          ["claim", "claims"],
          ["rais", "raised"],
          ["first", "firsts"],
          ["action", "actions"]
        ]
      }
    }

What happened: The LLM tokenized the display text word-by-word and created one group per word, with each group containing only a Porter stem and a single inflected form. This is the opposite of the intended structure.
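One way to spot this pattern mechanically is sketched below. It is a hypothetical heuristic, not part of the existing quality gate: the function names, the stopword list, the two-letter prefix test, and the "more than 4 groups" cutoff are all assumptions chosen for illustration. The idea is that a word-by-word keyFact has many groups, and every group's entries trace back to a single word of the display text.

```typescript
// Hypothetical heuristic for illustration; not the logic in content-quality-gate.ts.

// Small illustrative stopword list (an assumption, not the project's list).
const STOPWORDS = ["a", "an", "the", "was", "were", "is", "not", "in", "of", "for", "from", "to", "on"];

/** Lowercased content words of the display text, with stopwords removed. */
function displayWords(display: string): string[] {
  return display
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((w) => w.length > 0 && STOPWORDS.indexOf(w) === -1);
}

/**
 * Crude proxy for "stem + inflection of one display word": every entry in the
 * group starts with the same two characters as a single display word.
 */
function isInflectionOnlyGroup(group: string[], words: string[]): boolean {
  return words.some((word) =>
    group.every((entry) => entry.slice(0, 2) === word.slice(0, 2))
  );
}

/** Flags keyFacts that have many groups, all of them inflection-only. */
function looksWordByWord(display: string, groups: string[][]): boolean {
  const words = displayWords(display);
  const inflectionOnly = groups.filter((g) => isInflectionOnlyGroup(g, words)).length;
  return groups.length > 4 && inflectionOnly === groups.length;
}
```

The prefix test is deliberately rough and would need tuning against real content before being used for anything beyond triage.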
Structural comparison:

┌──────────────────────────┬───────────────────────────────┬─────────────────────────────┐
│ Property                 │ Intended                      │ Actual                      │
├──────────────────────────┼───────────────────────────────┼─────────────────────────────┤
│ Groups per keyFact       │ 2-4                           │ 5-13                        │
├──────────────────────────┼───────────────────────────────┼─────────────────────────────┤
│ Entries per group        │ 3-5 synonyms                  │ 2 (stem + inflection)       │
├──────────────────────────┼───────────────────────────────┼─────────────────────────────┤
│ Semantic scope of group  │ One concept, many expressions │ One specific word           │
├──────────────────────────┼───────────────────────────────┼─────────────────────────────┤
│ Within-group OR coverage │ Wide (paraphrase-tolerant)    │ Narrow (near-zero synonyms) │
└──────────────────────────┴───────────────────────────────┴─────────────────────────────┘

---

5. Why This Makes Matching Too Easy

The runtime threshold ceil(n/2) was designed for keyFacts with 3-4 groups of synonyms. With the broken content, n commonly falls in the 7-13 range, which makes the threshold dangerously loose: roughly half of the groups can be missed entirely.

Worked example: "Equal Pay Act claim was not raised in the first action" has 7 groups (numbered 0-6 in the order shown in section 4).

- Runtime threshold: ceil(7/2) = 4
- Student only needs to trigger any 4 of 7 word-checks
- A student who writes: "The court's first action on this claim was to apply the act"
  - Hits first → group 5 ✓
  - Hits action → group 6 ✓ (note: the Porter stem of "action" is "action")
  - Hits claim → group 3 ✓ (but "claim" is in COMMON_LEGAL_STEMS; the runtime doesn't filter these)
  - Hits act → group 2 ✓ ("act" is in COMMON_LEGAL_STEMS)
  - hitCount = 4, threshold = 4 → circle lights
- The student never mentioned "Equal Pay Act" as a federal statute, never said it "was not raised," never engaged with the preclusion issue. The circle lit on filler legal prose.

Second example: "Prior state court action: breach of contract for $40,000 bonus — summary judgment for Brightfield" has 12 groups.
- Threshold: ceil(12/2) = 6
- Words like "prior", "state", "court", "action" appear in virtually any legal analysis
- A student who writes "The prior state court action involved contract breach" hits 6 groups (prior, state, court, action, contract, breach), assuming one group per word as in the pattern above. That already meets the threshold without touching any of ["40", "000", "bonus", "summary", "judgment", "brightfield"]
- This circle lights from an extremely superficial engagement

The proximity bonus makes it worse. If the student is short by exactly 1 (hitting threshold - 1 groups), two matching stems from different groups appearing within 6 tokens of each other push them over. With 7+ single-word groups, almost any sentence that hits two groups satisfies that window, so the bonus fires constantly.

---

6. The Common-Stem Compounding Problem

Many generated groups contain stems from COMMON_LEGAL_STEMS, the set used by collision detection to exempt ubiquitous legal vocabulary (defend, plaintiff, court, act, claim, rule, etc.).

In the stem-matching layer, COMMON_LEGAL_STEMS is not applied. The runtime's isGroupHit checks all stems in the group, including common ones.

So a group ["court", "courts"] lights when the student writes any sentence containing "court", which is nearly every legal answer ever written. Similarly, the "act" group lights on "act," "acts," "acted," and any other word whose Porter stem is "act" ("action" keeps the stem "action," as noted in section 5, and is caught by its own group). If the 4-character minimum for typo tolerance described in section 2 holds, the 3-character stem "pai" (from "pay") matches only exactly; the fuzzy-match exposure lies in the longer single-word stems, each of which also accepts any response token one edit away.

---

7. The Quality Gate: What It Catches, What It Misses

validateDrillContent in content-quality-gate.ts enforces the following checks on keyFact match groups:

Checks that fire correctly:

- Minimum 2 groups per keyFact
- Minimum 2 entries per group
- Semantic independence: at least one pair of groups must have disjoint non-common stems (so groups covering the same concept are rejected)
- Cross-element collision: no distinctive stem shared between two elements' match groups
- Generic content markers, party name validation, extractive key facts

Checks missing that would catch the broken pattern:

- No maximum groups per keyFact.
  A keyFact with 12 single-word groups passes the gate. A ceiling of 4 would have rejected all 288 broken keyFacts at generation time; given the 5-13 range in the comparison table, a ceiling of 5 would still pass any keyFact with exactly 5 groups.
- No minimum synonyms per group enforcing conceptual breadth. The gate requires ≥2 entries per group, which the broken content satisfies with ["word", "words"]. The intent was ≥2 synonyms, but the gate cannot distinguish a stem+inflection pair from genuine synonyms.
- Independence check does not catch word-by-word tokenization. Each single-word group has a unique stem (after filtering common ones), so any two groups appear independent to the gate. A 7-group word-by-word keyFact passes the independence check trivially.

---

8. The Embedding Layer: Partial Mitigation, Not a Fix

Layer 2 (cosine similarity against pre-computed embeddings) is semantically aware and does not have the word-frequency problem. A student who writes only filler prose will likely score below the 0.50 threshold on specific key fact embeddings.

However:

- Embeddings measure sentence-level similarity: a response that happens to be semantically adjacent to many keyFact displays will score above threshold on multiple facts simultaneously
- The union semantics mean stem-matching false positives cannot be corrected by embeddings. Once stems light a circle, the embedding result is irrelevant
- Layer 3 (LLM) fires only after 10 seconds of inactivity with unlit circles, meaning it adds signal but never removes stem-lit false positives

The embedding layer mitigates the opposite problem (paraphrase not recognized), not the problem identified here (superficial prose matching too easily).

---

9. The Correct Content Structure: What Regeneration Must Produce

The target for all regenerated match groups:

    {
      "display": "grabbed binder from hands",
      "match": {
        "groups": [
          ["grab", "seiz", "snatch", "pull"],
          ["binder", "folder", "document"]
        ]
      }
    }

Rules:

1. 2-4 groups, each corresponding to one distinct semantic concept in the keyFact display
2. 3-5 Porter-stemmed synonyms per group: words a student might plausibly write to express that concept
3. No word-by-word tokenization: a 5-word display should still have 2-3 groups, not 5 groups
4. No "inflection-only" groups: ["pai", "pay"] is a stem plus its inflection, not a synonym set
5. Groups must cover different vocabulary territory: overlapping stems across groups of the same keyFact are a design failure (caught by the independence check)
6. No common legal stems as the sole content of a group: ["court", "courts"] provides no discrimination

---

10. Proposed Remediation: Three Tracks

Track A: Quality Gate Hardening (content-quality-gate.ts)

Add a group-count ceiling: max 4 or 5 groups per keyFact. This closes the primary gap in the gate. With the ceiling at 5, as in the check below, it rejects every broken keyFact that has 6 or more groups:

    // Reject keyFacts that look like one-group-per-word tokenization.
    if (kf.match.groups.length > 5) {
      errors.push(`Key fact "${kf.display}" in "${el.name}" has ${kf.match.groups.length} match groups; maximum is 5. Use 2-4 groups of synonyms, not one group per word.`);
    }

Add a minimum-synonyms guidance check: warn when all groups have exactly 2 entries (stem + inflection but no real synonyms). This is advisory rather than blocking, since some keyFacts legitimately need only a proper noun + variant.

Track B: Prompt Strengthening (drill-prompt.ts)

Add an explicit BAD example for the word-by-word pattern to Hard Rule 4 and the SELF-CHECK:

▎ BAD: "Equal Pay Act claim was not raised" → groups: [["equal","equals"], ["pai","pay"],
▎ ["act","acts"], ["claim","claims"], ...] ✗ This is word-by-word tokenization. You have 5
▎ groups of 1 word each. Instead, identify 2-3 distinct concepts the fact requires and
▎ group synonyms under each.
▎
▎ GOOD: Same display → groups: [["equal", "equaliz"], ["pai", "compensat", "wage"], ["rais",
▎ "assert", "brought"]] ✓ 3 groups: one for equality, one for pay/compensation, one for
▎ "raised in prior action."

Cap group count in the hard rules: update Hard Rule 4 to say "2-4 groups" (not just "≥2").
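Returning to Track A, the two gate checks (the blocking ceiling and the advisory synonym warning) can be sketched together. The interfaces, the function name checkGroupShape, and the warning wording below are assumptions for illustration and do not reflect the real types in types.ts or the structure of validateDrillContent.

```typescript
// Sketch of the two Track A checks. The types here are simplified stand-ins
// (assumptions), not the real definitions from types.ts.
interface KeyFactLike {
  display: string;
  match: { groups: string[][] };
}

interface GroupShapeReport {
  errors: string[]; // blocking problems
  warnings: string[]; // advisory only
}

// Assumed ceiling, matching the value used in the Track A snippet.
const MAX_GROUPS = 5;

function checkGroupShape(kf: KeyFactLike, elementName: string): GroupShapeReport {
  const errors: string[] = [];
  const warnings: string[] = [];
  const groups = kf.match.groups;

  // Blocking: too many groups suggests one group per word.
  if (groups.length > MAX_GROUPS) {
    errors.push(
      `Key fact "${kf.display}" in "${elementName}" has ${groups.length} match groups; maximum is ${MAX_GROUPS}.`
    );
  }

  // Advisory: every group having exactly 2 entries suggests stem + inflection pairs.
  if (groups.length > 0 && groups.every((g) => g.length === 2)) {
    warnings.push(
      `Key fact "${kf.display}" in "${elementName}": every group has exactly 2 entries; check for stem + inflection pairs.`
    );
  }

  return { errors, warnings };
}
```

Keeping the second check as a warning follows the reasoning above: a keyFact built around a proper noun and its variant can be legitimate, so it should be surfaced for review, not rejected.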
Track C: Content Library Regeneration

63 drill JSON files across 6 domains need their match.groups fields replaced. The embeddings are correct and should not change; the rebuild script should only regenerate match.groups for each keyFact, then re-validate through the updated gate.

The regeneration pipeline already exists (run-content-generator.ts + quality gate). The task is to run it against all existing drill IDs, or write a targeted "repair" script that:

1. Reads each drill file
2. Sends just the display text and factPattern through the prompt with explicit instruction to generate correct synonym groups
3. Validates through the updated gate (with the new group-count ceiling)
4. Writes back only the match.groups field, preserving embedding, embeddingHash, and all other content

Risk: Regenerating full drills risks changing factPattern/rule/elements. A targeted repair prompt that only produces match.groups for a given display text is safer. The field is self-contained enough to repair in isolation.

---

11. Open Question: Runtime Threshold Alignment (UNIVERS-152)

The ceil(n/2) threshold was designed assuming 3-4 groups of real synonyms, where missing one concept group is acceptable if the student mentions everything else. After Track A+B, keyFacts will have 2-4 groups. For 2-group keyFacts, the runtime requires both, which is strict and correct. For 3-4 group keyFacts, ceil(3/2) = 2 (student must hit 2 of 3) and ceil(4/2) = 2 (student must hit 2 of 4), meaning even correctly structured content has some latitude. Whether ceil(n/2) is the right threshold for 3-4 synonym-group keyFacts, or whether it should be tighter (e.g., all-but-one), is a separate design question tracked in UNIVERS-152.
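To make the open question concrete, the sketch below compares the required hit counts under the current rule and under an all-but-one rule for the 2-4 group range. The function names are hypothetical, the all-but-one policy is only the example alternative mentioned above, and the proximity bonus is deliberately left out.

```typescript
// Illustrative comparison of threshold policies; not the code in matching.ts.
// The proximity bonus is not modeled here.
type ThresholdPolicy = (groupCount: number) => number;

// Current rule as described in section 2: 0-2 groups require all of them,
// 3+ groups require ceil(n/2).
const majorityThreshold: ThresholdPolicy = (n) => (n <= 2 ? n : Math.ceil(n / 2));

// Hypothetical tighter rule: 3+ groups may miss at most one.
const allButOneThreshold: ThresholdPolicy = (n) => (n <= 2 ? n : n - 1);

function compareThresholds(groupCounts: number[]): string[] {
  return groupCounts.map(
    (n) =>
      `${n} groups: ceil(n/2) requires ${majorityThreshold(n)}, ` +
      `all-but-one requires ${allButOneThreshold(n)}`
  );
}

console.log(compareThresholds([2, 3, 4]).join("\n"));
// 2 groups: ceil(n/2) requires 2, all-but-one requires 2
// 3 groups: ceil(n/2) requires 2, all-but-one requires 2
// 4 groups: ceil(n/2) requires 2, all-but-one requires 3
```

Within the 2-4 group range the two policies differ only for 4-group keyFacts (2 versus 3 required hits), so the design question reduces to how strict 4-group keyFacts should be.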