  ---
  Paddock Drill Matching: System Overview, Content Quality Failure, and Remediation Plan

  1. What Drills Are — The Student Experience

  A drill is a focused practice exercise built around a single legal doctrine (e.g., Battery,
  Claim Preclusion, Adverse Possession). The student is shown:

  - A rule statement (concise, 100-220 chars)
  - A fact pattern (a short narrative, 200-400 chars) involving named parties
  - A Building Blocks panel on the right — a panel of circles organized by element, each
  circle representing one key fact the student should identify

  The student types a free-form legal analysis into a text area. As they type, circles in the
  Building Blocks panel fill in. The feedback is live, per-keystroke, and feels like the
  system is "listening." A filled circle signals: you've demonstrated this piece of the
  analysis. An empty circle signals: you haven't yet addressed this.

  After submitting, the student sees which circles lit and gets progressive disclosure of the
  full rule, definitions, and formula scaffold.

  The core UX promise: circles light when the student has actually demonstrated the concept,
  not just when they've typed related words. This promise is broken.

  ---
  2. The Three-Layer Matching Architecture

  Matching runs in three layers, each adding signal to the student's running response. A
  circle lights if any layer says it matched (union semantics). Once lit, a circle stays lit,
  which prevents flicker as the student continues typing.

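The union and "once lit stays lit" rules can be sketched as a single merge step. This is a minimal illustration, not the project's code: `FactId` and `mergeLit` are hypothetical names, and each layer is assumed to report the set of fact IDs it matched on the current pass.

```typescript
// Hypothetical helper; FactId and mergeLit are illustrative names, not from the codebase.
type FactId = string;

/**
 * Union semantics plus "once lit stays lit": the result contains every fact that
 * was already lit and every fact any layer matched on this pass. Nothing is removed.
 */
function mergeLit(
  previouslyLit: ReadonlySet<FactId>,
  ...layerMatches: ReadonlySet<FactId>[]
): Set<FactId> {
  const lit = new Set<FactId>();
  previouslyLit.forEach((id) => lit.add(id));
  for (const layer of layerMatches) {
    layer.forEach((id) => lit.add(id));
  }
  return lit;
}

// Pass 1: the stem layer matches fact "a"; the embedding layer matches fact "b".
const afterPass1 = mergeLit(new Set<FactId>(), new Set<FactId>(["a"]), new Set<FactId>(["b"]));
// Pass 2: the student rewrites the sentence and no layer matches anything.
const afterPass2 = mergeLit(afterPass1, new Set<FactId>(), new Set<FactId>());
```

Because the merge only ever adds, a false positive from any one layer cannot be undone by the others.
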
  Layer 1 — Stem Matching (checkMatchGroups in matching.ts)

  Per keystroke (debounced 200ms), the student's response is tokenized and Porter-stemmed via
  stemTokenize. The resulting token set is compared against the pre-stemmed match.groups
  stored in each keyFact.

  isGroupHit: A group is "hit" if any stem in the group exactly matches any response token, or
   — for stems ≥4 chars — is within edit distance 1 (typo tolerance).

  checkMatchGroups applies a threshold:
  - 2 groups → both must hit (hitCount === 2). Strictest.
  - 3+ groups → hitCount >= ceil(n/2). If short by exactly 1, a proximity bonus (two stems
  from different groups appearing within 6 tokens of each other) can push it over.
  - 1 group → single hit suffices (legacy backward-compat path).
  - 0 groups → vacuously true.
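
The rules above can be sketched as follows. This is a reconstruction from the description only; the actual isGroupHit and checkMatchGroups in matching.ts may differ. `withinEditDistance1` and `hasProximityBonus` are hypothetical helper names, and the response tokens are assumed to be already stemmed.

```typescript
/** True if a and b are equal or differ by one insertion, deletion, or substitution. */
function withinEditDistance1(a: string, b: string): boolean {
  if (a === b) return true;
  if (Math.abs(a.length - b.length) > 1) return false;
  const short = a.length <= b.length ? a : b;
  const long = a.length <= b.length ? b : a;
  let i = 0;
  while (i < short.length && short[i] === long[i]) i++;
  // Skip the differing character: in both strings for a substitution,
  // only in the longer string for an insertion or deletion.
  const restShort = short.length === long.length ? short.slice(i + 1) : short.slice(i);
  return restShort === long.slice(i + 1);
}

/** A group is hit on an exact stem match, or a fuzzy match for stems of 4+ chars. */
function isGroupHit(group: string[], tokens: string[]): boolean {
  return group.some((stem) =>
    tokens.some((tok) => tok === stem || (stem.length >= 4 && withinEditDistance1(stem, tok))),
  );
}

/** True if tokens hitting two different groups sit within `window` positions of each other. */
function hasProximityBonus(groups: string[][], tokens: string[], window = 6): boolean {
  const hits: { pos: number; group: number }[] = [];
  tokens.forEach((tok, pos) => {
    groups.forEach((group, g) => {
      if (isGroupHit(group, [tok])) hits.push({ pos, group: g });
    });
  });
  return hits.some((a) =>
    hits.some((b) => a.group !== b.group && Math.abs(a.pos - b.pos) <= window),
  );
}

function checkMatchGroups(groups: string[][], tokens: string[]): boolean {
  const n = groups.length;
  if (n === 0) return true; // vacuously true
  const hitCount = groups.filter((g) => isGroupHit(g, tokens)).length;
  if (n === 1) return hitCount === 1; // legacy single-group path
  if (n === 2) return hitCount === 2; // both groups required
  const threshold = Math.ceil(n / 2);
  if (hitCount >= threshold) return true;
  // Short by exactly one: the proximity bonus can make up the difference.
  return hitCount === threshold - 1 && hasProximityBonus(groups, tokens);
}
```

For example, `checkMatchGroups([["grab", "seiz"], ["binder", "folder"]], ["he", "grab", "the", "binder"])` is true, while dropping "binder" from the tokens makes it false.
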

  Layer 2 — Embedding Match (/api/paddock/drill-match)

  Per keystroke (debounced 500ms), the student's response text is sent to a server endpoint.
  The server computes a vector embedding of the rolling response and compares it via cosine
  similarity against pre-computed embeddings stored in each DrillKeyFact.embedding
  (384-dimensional floats, built at content-generation time). A fact lights if similarity ≥
  0.50. This catches paraphrase — a student who writes "offensive dignitary affront" can light
   a circle whose display text says "humiliated in front of colleagues," even if no stems
  overlap.
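
A minimal sketch of the comparison step, assuming both vectors are already available. How the server produces the response embedding is not described here, so it is taken as an input; `cosineSimilarity`, `embeddingLights`, and `EMBEDDING_THRESHOLD` are illustrative names.

```typescript
// Mirrors the 0.50 cutoff described above; the constant name is an assumption.
const EMBEDDING_THRESHOLD = 0.5;

function cosineSimilarity(a: readonly number[], b: readonly number[]): number {
  if (a.length !== b.length) throw new Error("Embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // zero vector: treat as no similarity
  return dot / Math.sqrt(normA * normB);
}

/** True if the response embedding is close enough to the fact embedding to light it. */
function embeddingLights(response: readonly number[], fact: readonly number[]): boolean {
  return cosineSimilarity(response, fact) >= EMBEDDING_THRESHOLD;
}
```

In production the inputs would be the 384-dimensional vectors; the function itself is dimension-agnostic.
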

  Layer 3 — LLM Upgrade (10-second pause gate)

  If the student pauses typing for 10+ seconds with at least one dark circle remaining, the
  system asks an LLM whether the student's response semantically covers the unlit facts. This
  is the most expensive layer and is designed to fire ~1-3 times per drill session, catching
  the long tail of doctrinal paraphrase that neither stems nor embeddings caught.

  ---
  3. The Intended Content Structure

  Each DrillKeyFact has a match descriptor containing groups: string[][]. The design intent,
  expressed in the exemplar content used to train the LLM:

  {
    "display": "grabbed binder from hands",
    "match": {
      "groups": [
        ["grab", "seiz", "snatch", "pull"],
        ["binder", "folder", "document"]
      ]
    }
  }

  Design intent:
  - 2-4 groups, each representing one distinct semantic concept the keyFact requires
  - 3-5 synonyms per group — alternative words a student might use to express that concept
  - AND semantics across groups: the student must demonstrate ALL concepts (or a majority for
  3+)
  - OR semantics within a group: any synonym suffices

  For a 2-group keyFact, checkMatchGroups requires hitCount === 2 — the student must mention
  something from BOTH concept buckets. This is the strictest mode, and intentionally so: a
  circle should require demonstrating the action (grabbed/seized) AND the object
  (binder/folder). Mentioning only one is not enough.

  The MatchDescriptor type comment in types.ts even says: "Groups of synonym stems — one stem
  per concept, all groups must match for a key fact hit."

  ---
  4. What Was Actually Generated — The Broken Pattern

  80% of keyFacts in the content library (288 of 361, across all 63 drill files in 6 domains)
  were generated with the following pattern:

  {
    "display": "Equal Pay Act claim was not raised in the first action",
    "match": {
      "groups": [
        ["equal", "equals"],
        ["pai", "pay"],
        ["act", "acts"],
        ["claim", "claims"],
        ["rais", "raised"],
        ["first", "firsts"],
        ["action", "actions"]
      ]
    }
  }

  What happened: The LLM tokenized the display text word-by-word and created one group per
  word, with each group containing only the Porter stem and the original inflection. This is
  the opposite of the intended structure.

  Structural comparison:

  ┌──────────────────────────┬───────────────────────────────┬─────────────────────────────┐
  │         Property         │           Intended            │           Actual            │
  ├──────────────────────────┼───────────────────────────────┼─────────────────────────────┤
  │ Groups per keyFact       │ 2-4                           │ 5-13                        │
  ├──────────────────────────┼───────────────────────────────┼─────────────────────────────┤
  │ Entries per group        │ 3-5 synonyms                  │ 2 (stem + inflection)       │
  ├──────────────────────────┼───────────────────────────────┼─────────────────────────────┤
  │ Semantic scope of group  │ One concept, many expressions │ One specific word           │
  ├──────────────────────────┼───────────────────────────────┼─────────────────────────────┤
  │ Within-group OR coverage │ Wide (paraphrase-tolerant)    │ Narrow (near-zero synonyms) │
  └──────────────────────────┴───────────────────────────────┴─────────────────────────────┘

  ---
  5. Why This Makes Matching Too Easy

  The runtime threshold ceil(n/2) was designed for keyFacts with 3-4 groups of synonyms. With
  the broken content, n is 5-13 (7 and 12 in the examples below), so any half of the display's
  words, in any order and any context, is enough to light the circle.

  Worked example: "Equal Pay Act claim was not raised in the first action" — 7 groups.

  - Runtime threshold: ceil(7/2) = 4
  - Student only needs to trigger any 4 of 7 word-checks
  - A student who writes: "The court's first action on this claim was to apply the act"
    - Hits first → group 5 ✓ (group numbers are zero-based indices into the array above)
    - Hits action → group 6 ✓ (Porter leaves "action" unchanged, so it matches exactly)
    - Hits claim → group 3 ✓ ("claim" is in COMMON_LEGAL_STEMS, but the runtime does not
  filter common stems)
    - Hits act → group 2 ✓ ("act" is also in COMMON_LEGAL_STEMS)
    - hitCount = 4, threshold = 4 → circle lights
  - The student never mentioned "Equal Pay Act" as a federal statute, never said it "was not
  raised," never engaged with the preclusion issue. The circle lit on filler legal prose.
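
The count in this worked example can be reproduced with a few lines. The response stems are a hand-written approximation of what stemTokenize would produce (an assumption), and only exact matching is modeled; the edit-distance tolerance does not change the result for this sentence.

```typescript
// The seven groups from the Equal Pay Act keyFact in section 4.
const groups: string[][] = [
  ["equal", "equals"], ["pai", "pay"], ["act", "acts"], ["claim", "claims"],
  ["rais", "raised"], ["first", "firsts"], ["action", "actions"],
];

// Hand-written approximation of the stemmed response tokens (an assumption; the
// real stemTokenize output may differ in tokens that do not affect the result).
const responseStems = new Set([
  "the", "court", "first", "action", "on", "thi", "claim", "wa", "to", "appli", "act",
]);

// Zero-based indices of the groups with at least one exact stem match.
const hitIndices = groups
  .map((group, index) => (group.some((stem) => responseStems.has(stem)) ? index : -1))
  .filter((index) => index >= 0);

const threshold = Math.ceil(groups.length / 2);
const lights = hitIndices.length >= threshold;
// hitIndices is [2, 3, 5, 6]; threshold is 4; lights is true.
```
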

  Second example: "Prior state court action: breach of contract for $40,000 bonus — summary
  judgment for Brightfield" — 12 groups.

  - Threshold: ceil(12/2) = 6
  - Words like "prior", "state", "court", "action" appear in virtually any legal analysis
  - A student who writes "The prior state court action involved contract breach" hits 6
  groups (prior, state, court, action, contract, breach), which already meets the threshold
  without touching any of ["40", "000", "bonus", "summary", "judgment", "brightfield"]
  - This circle lights on extremely superficial engagement

  The proximity bonus makes it worse. When hitCount is exactly threshold - 1, two matched
  stems from different groups within 6 tokens of each other push the keyFact over. With 7+
  single-word groups taken from one sentence, matched words tend to sit that close together,
  so the bonus is likely to fire whenever it is eligible.

  ---
  6. The Common-Stem Compounding Problem

  Many generated groups contain stems from COMMON_LEGAL_STEMS — the set used by collision
  detection to exempt ubiquitous legal vocabulary (defend, plaintiff, court, act, claim, rule,
   etc.).

  In the stem-matching layer, COMMON_LEGAL_STEMS is not applied. The runtime's isGroupHit
  checks all stems in the group, including common ones. So a group ["court", "courts"] lights
  when the student writes any sentence containing "court" — which is every legal answer ever
  written.

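  Similarly, the "act" group lights on "acted," "acting," "acts," and any other word whose
  Porter stem is "act." (Porter leaves "action" as "action" and reduces "activity" to
  "activ," so those two do not hit this group.) The stem "pai" (from "pay") is only three
  characters, so it matches exactly only. Stems of four or more characters, such as "rais" or
  "first," also match any response token within edit distance 1, which widens each
  single-word group further.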

  ---
  7. The Quality Gate — What It Catches, What It Misses

  validateDrillContent in content-quality-gate.ts enforces the following checks on keyFact
  match groups:

  Checks that fire correctly:
  - Minimum 2 groups per keyFact
  - Minimum 2 entries per group
  - Semantic independence: at least one pair of groups must have disjoint non-common stems (so
   groups covering the same concept are rejected)
  - Cross-element collision: no distinctive stem shared between two elements' match groups
  - Generic content markers, party name validation, extractive key facts

  Checks missing that would catch the broken pattern:
  - No maximum groups per keyFact. A keyFact with 12 single-word groups passes the gate.
  A ceiling of 4 would have rejected all 288 broken keyFacts at generation time; a ceiling of
  5 would still admit any with exactly 5 groups, the low end of the range in section 4.
  - No minimum synonyms per group enforcing conceptual breadth. The gate requires ≥2 entries
  per group, which the broken content satisfies with ["word", "words"]. The intent was ≥2
  synonyms, but the gate cannot distinguish a stem+inflection pair from genuine synonyms.
  - Independence check does not catch word-by-word tokenization. Each single-word group has a
  unique stem (after filtering common ones), so any two groups appear independent to the gate.
   A 7-group word-by-word keyFact passes the independence check trivially.

  ---
  8. The Embedding Layer — Partial Mitigation, Not a Fix

  Layer 2 (cosine similarity against pre-computed embeddings) is semantically aware and does
  not have the word-frequency problem. A student who writes only filler prose will likely
  score below the 0.50 threshold on specific key fact embeddings.

  However:
  - Embeddings measure sentence-level similarity, so a response that happens to be
  semantically adjacent to many keyFact displays will score above threshold on multiple facts
  simultaneously
  - The union semantics mean stem-matching false positives cannot be corrected by embeddings —
   once stems light a circle, the embedding result is irrelevant
  - Layer 3 (LLM) fires only after 10 seconds of inactivity with unlit circles, meaning it
  adds signal but never removes stem-lit false positives

  The embedding layer mitigates the opposite problem (paraphrase not recognized), not the
  problem identified here (superficial prose matching too easily).

  ---
  9. The Correct Content Structure — What Regeneration Must Produce

  The target for all regenerated match groups:

  {
    "display": "grabbed binder from hands",
    "match": {
      "groups": [
        ["grab", "seiz", "snatch", "pull"],
        ["binder", "folder", "document"]
      ]
    }
  }

  Rules:
  1. 2-4 groups, each corresponding to one distinct semantic concept in the keyFact display
  2. 3-5 Porter-stemmed synonyms per group — words a student might plausibly write to express
  that concept
  3. No word-by-word tokenization — a 5-word display should still have 2-3 groups, not 5
  groups
  4. No inflection-only groups: ["pai", "pay"] is one word in two forms, not a synonym set
  5. Groups must cover different vocabulary territory. Overlapping stems across groups of the
   same keyFact are a design failure (caught by the independence check)
  6. No common legal stems as the sole content of a group — ["court", "courts"] provides no
  discrimination
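
Rules 1, 2, and 6 are mechanical enough to check in code. The sketch below is illustrative only: `checkGroupShape` and `isCommon` are hypothetical names, the `COMMON_LEGAL_STEMS` set here is a small stand-in for the real one, and the trailing-"s" handling is a crude substitute for proper stemming. A real gate might treat the entries-per-group rule as a warning rather than an error.

```typescript
// Illustrative subset only; the real COMMON_LEGAL_STEMS set is larger.
const COMMON_LEGAL_STEMS = new Set(["defend", "plaintiff", "court", "act", "claim", "rule"]);

/** Crude check: the entry itself, or the entry minus a trailing "s", is a common stem. */
function isCommon(entry: string): boolean {
  if (COMMON_LEGAL_STEMS.has(entry)) return true;
  return entry.endsWith("s") && COMMON_LEGAL_STEMS.has(entry.slice(0, -1));
}

/** Returns a list of rule violations for one keyFact; empty means it passes. */
function checkGroupShape(display: string, groups: string[][]): string[] {
  const problems: string[] = [];
  // Rule 1: 2-4 groups.
  if (groups.length < 2 || groups.length > 4) {
    problems.push(`"${display}": ${groups.length} groups, expected 2-4`);
  }
  groups.forEach((group, i) => {
    // Rule 2: 3-5 entries per group.
    if (group.length < 3 || group.length > 5) {
      problems.push(`"${display}" group ${i}: ${group.length} entries, expected 3-5`);
    }
    // Rule 6: a group must not consist solely of common legal stems.
    if (group.length > 0 && group.every(isCommon)) {
      problems.push(`"${display}" group ${i}: only common legal stems`);
    }
  });
  return problems;
}

const goodProblems = checkGroupShape("grabbed binder from hands", [
  ["grab", "seiz", "snatch", "pull"],
  ["binder", "folder", "document"],
]);
const brokenProblems = checkGroupShape("Equal Pay Act claim was not raised in the first action", [
  ["equal", "equals"], ["pai", "pay"], ["act", "acts"], ["claim", "claims"],
  ["rais", "raised"], ["first", "firsts"], ["action", "actions"],
]);
```

The target structure produces no problems, while the broken keyFact from section 4 is flagged for group count, for every two-entry group, and for the "act" and "claim" groups.
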

  ---
  10. Proposed Remediation — Three Tracks

  Track A: Quality Gate Hardening (content-quality-gate.ts)

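  Add a group-count ceiling of 4 or 5 groups per keyFact. This closes the primary gap in the
  gate. The check below uses 5, which would still pass a broken keyFact with exactly 5 groups
  (the low end of the range in section 4); a limit of 4 would have rejected all 288: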

  if (kf.match.groups.length > 5) {
    // Build the message on one logical line so no stray newline ends up in the error text.
    errors.push(
      `Key fact "${kf.display}" in "${el.name}" has ${kf.match.groups.length} match groups; ` +
        `maximum is 5. Use 2-4 groups of synonyms, not one group per word.`,
    );
  }

  Add a minimum-synonyms guidance check — warn when all groups have exactly 2 entries (stem +
  inflection but no real synonyms). This is advisory rather than blocking since some keyFacts
  legitimately need only a proper noun + variant.

  Track B: Prompt Strengthening (drill-prompt.ts)

  Add an explicit BAD example for the word-by-word pattern to Hard Rule 4 and the SELF-CHECK:

  ▎ BAD: "Equal Pay Act claim was not raised" → groups: [["equal","equals"], ["pai","pay"],
  ▎ ["act","acts"], ["claim","claims"], ...] ✗ (this is word-by-word tokenization). You have 5
  ▎ groups of 1 word each. Instead, identify 2-3 distinct concepts the fact requires and
  ▎ group synonyms under each.
  ▎
  ▎ GOOD: Same display → groups: [["equal", "equaliz"], ["pai", "compensat", "wage"], ["rais",
  ▎  "assert", "brought"]] ✓ — 3 groups: one for equality, one for pay/compensation, one for
  ▎ "raised in prior action."

  Cap group count in the hard rules: update Hard Rule 4 to say "2-4 groups" (not just "≥2").

  Track C: Content Library Regeneration

  63 drill JSON files across 6 domains need their match.groups fields replaced. The embeddings
   are correct and should not change — the rebuild script should only regenerate match.groups
  for each keyFact, then re-validate through the updated gate.

  The regeneration pipeline already exists (run-content-generator.ts + quality gate). The task
   is to run it against all existing drill IDs, or write a targeted "repair" script that:
  1. Reads each drill file
  2. Sends just the display text and factPattern through the prompt with explicit instruction
  to generate correct synonym groups
  3. Validates through the updated gate (with the new group-count ceiling)
  4. Writes back only the match.groups field, preserving embedding, embeddingHash, and all
  other content

  Risk: Regenerating full drills risks changing factPattern/rule/elements. A targeted repair
  prompt that only produces match.groups for a given display text is safer. The field is
  self-contained enough to repair in isolation.
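
Step 4 of the repair script (write back only match.groups) might look like the sketch below. The `Drill`, `DrillElement`, and `KeyFact` shapes model only the fields named in this document and are assumptions about the file layout; `applyRepairedGroups` is a hypothetical name, and keying the repaired groups by display text is one possible choice.

```typescript
// Assumed shapes: only fields mentioned in this document are modeled.
interface KeyFact {
  display: string;
  match: { groups: string[][] };
  embedding?: number[];
  embeddingHash?: string;
}
interface DrillElement { name: string; keyFacts: KeyFact[] }
interface Drill { factPattern: string; elements: DrillElement[] }

/**
 * Returns a copy of the drill in which match.groups is replaced for every keyFact
 * that has an entry in `repaired` (keyed by display text). All other fields,
 * including embedding and embeddingHash, are carried over unchanged.
 */
function applyRepairedGroups(drill: Drill, repaired: Map<string, string[][]>): Drill {
  return {
    ...drill,
    elements: drill.elements.map((el) => ({
      ...el,
      keyFacts: el.keyFacts.map((kf) => {
        const groups = repaired.get(kf.display);
        // Leave the keyFact untouched when no repaired groups were produced for it.
        return groups ? { ...kf, match: { ...kf.match, groups } } : kf;
      }),
    })),
  };
}

const original: Drill = {
  factPattern: "example fact pattern",
  elements: [{
    name: "Contact",
    keyFacts: [{
      display: "grabbed binder from hands",
      match: { groups: [["grab", "grabbed"], ["binder", "binders"]] },
      embedding: [0.1, 0.2],
      embeddingHash: "h1",
    }],
  }],
};
const repaired = new Map<string, string[][]>([
  ["grabbed binder from hands", [["grab", "seiz", "snatch", "pull"], ["binder", "folder", "document"]]],
]);
const fixed = applyRepairedGroups(original, repaired);
```

Returning a copy rather than mutating in place makes it easy to diff the result against the original before writing the file back.
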

  ---
  11. Open Question — Runtime Threshold Alignment (UNIVERS-152)

  The ceil(n/2) threshold was designed assuming 3-4 groups of real synonyms, where missing one
   concept group is acceptable if the student mentions everything else. After Track A+B,
  keyFacts will have 2-4 groups. For 2-group keyFacts, the runtime requires both — strict and
  correct. For 3-4 group keyFacts, ceil(3/2)=2 (student must hit 2 of 3) and ceil(4/2)=2
  (student must hit 2 of 4) — meaning even correctly structured content has some latitude.

  Whether ceil(n/2) is the right threshold for 3-4 synonym-group keyFacts, or whether it
  should be tighter (e.g., all-but-one), is a separate design question tracked in UNIVERS-152.
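
To make the trade-off concrete, the two candidate rules can be compared for the 2-4 group range. `currentRequired` restates the runtime rules from section 2 (ignoring the proximity bonus); `allButOneRequired` is a hypothetical formulation of the tighter alternative, assuming the 2-group rule stays as it is.

```typescript
/** Hits required by the runtime rules described in section 2 (proximity bonus ignored). */
function currentRequired(n: number): number {
  if (n <= 2) return n; // 0 → vacuous, 1 → single hit, 2 → both
  return Math.ceil(n / 2);
}

/** Hypothetical tighter rule: for 3+ groups, allow at most one missed group. */
function allButOneRequired(n: number): number {
  if (n <= 2) return n;
  return n - 1;
}

const comparison = [2, 3, 4].map((n) => ({
  groups: n,
  current: currentRequired(n),
  allButOne: allButOneRequired(n),
}));
// comparison:
//   groups 2 → current 2, allButOne 2
//   groups 3 → current 2, allButOne 2
//   groups 4 → current 2, allButOne 3
```

Under these assumptions the two rules differ only for 4-group keyFacts, so the question mainly concerns how often regenerated content uses 4 groups.
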