Sprint 4 Design Pack — GWP-265

Story narrative

ADR-0009 (token prediction) is Accepted v1.0 and Extended by ADR-0016 with cart-grammar contributions. The implementation in kn86-emulator/src/t9_rank.c ranks predictive-palette candidates by combining five terms — legal-form, vocabulary boost (the +5 bonus from ADR-0009 §3), local-id boost (cart-grammar contributions per ADR-0016), recency (last-N-keystrokes ring), popularity (lifetime keystroke counts). Unit tests in tests/test_t9_prediction.c verify the math. What’s missing is observability. During a play session, Josh and PM cannot tell whether the +5 vocabulary boost is firing, whether one term is dominating the score, or why a particular candidate floated to slot 1. The dev overlay (F11, GWP-226) has no T9 panel.

This is straightforward debug-instrumentation work: add a non-mutating t9_rank_explain() entry point that returns the same top-N candidates the ranker just chose, but with each term’s per-candidate contribution exposed. Wire that into the F11 overlay as a new “T9 Ranker” tab. Add a short “Debugging” subsection to the token-prediction reference.

The load-bearing design constraint is no behavior change. The explain entry point must call into the same ranking math as the production path; if explain and production drift, the panel becomes a lie. This is best done by extracting the per-term scoring into a shared helper and having both t9_rank() and t9_rank_explain() call it — one returns a sorted top-N list, the other returns the same list plus the per-term breakdown. The test (test_t9_explain) validates that the sorted candidate list from t9_rank_explain matches t9_rank byte-for-byte.
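The shared-helper shape described above can be sketched roughly as follows. This is a minimal illustration, not the code in t9_rank.c: the candidate pool, the `_sketch` function names, the T9Row layout, the plain-sum combination of the five terms, and the lexical tie-break are all assumptions made so the example is self-contained. Only the field names and the +5 vocabulary boost come from the text.

```c
#include <stdlib.h>
#include <string.h>

#define TOP_N 8
#define TOKEN_MAX 16

/* Per-candidate breakdown; field names follow the acceptance criteria. */
typedef struct {
    char string[TOKEN_MAX];
    int legal_form_score, vocab_score, local_id_score;
    int recency_score, popularity_score, final_score;
} T9Row;

typedef struct {
    int candidate_count;
    T9Row rows[TOP_N];
} T9Explanation;

/* Hypothetical stand-in for the ranker's read-only inputs. */
typedef struct {
    const char *token;
    int legal, in_vocab, local_id, recency, popularity;
} DemoCandidate;

static const DemoCandidate demo_pool[] = {
    {"CAR", 10, 1, 0, 2, 1},
    {"BAR", 10, 0, 0, 0, 3},
    {"CAP", 10, 0, 4, 0, 0},
};
#define DEMO_POOL_LEN (sizeof demo_pool / sizeof demo_pool[0])

/* Single source of truth for the per-term math (assumed: a plain sum). */
static void score_terms(const DemoCandidate *c, T9Row *row)
{
    memset(row, 0, sizeof *row);
    strncpy(row->string, c->token, TOKEN_MAX - 1);
    row->legal_form_score = c->legal;
    row->vocab_score = c->in_vocab ? 5 : 0; /* +5 per ADR-0009 section 3 */
    row->local_id_score = c->local_id;
    row->recency_score = c->recency;
    row->popularity_score = c->popularity;
    row->final_score = row->legal_form_score + row->vocab_score +
                       row->local_id_score + row->recency_score +
                       row->popularity_score;
}

/* Descending score; ties broken lexically (an assumed rule). */
static int row_cmp(const void *a, const void *b)
{
    const T9Row *ra = a, *rb = b;
    if (ra->final_score != rb->final_score)
        return ra->final_score < rb->final_score ? 1 : -1;
    return strcmp(ra->string, rb->string);
}

/* Shared core: score every candidate, sort, keep the top N. Reads only. */
static void rank_core(T9Explanation *out)
{
    T9Row all[DEMO_POOL_LEN];
    for (size_t i = 0; i < DEMO_POOL_LEN; i++)
        score_terms(&demo_pool[i], &all[i]);
    qsort(all, DEMO_POOL_LEN, sizeof all[0], row_cmp);
    out->candidate_count = DEMO_POOL_LEN < TOP_N ? (int)DEMO_POOL_LEN : TOP_N;
    memcpy(out->rows, all, (size_t)out->candidate_count * sizeof all[0]);
}

/* Production-style entry point: returns the sorted strings only. */
int t9_rank_sketch(const char *buffer, int position,
                   char out[TOP_N][TOKEN_MAX])
{
    T9Explanation full;
    (void)buffer; (void)position; /* ignored by this toy pool */
    rank_core(&full);
    for (int i = 0; i < full.candidate_count; i++)
        memcpy(out[i], full.rows[i].string, TOKEN_MAX);
    return full.candidate_count;
}

/* Explain entry point: the same list plus the per-term breakdown. */
void t9_rank_explain_sketch(const char *buffer, int position,
                            T9Explanation *out)
{
    (void)buffer; (void)position;
    rank_core(out);
}
```

Because both entry points go through rank_core(), the two candidate lists cannot diverge unless someone edits one wrapper, which is exactly what the byte-for-byte test is meant to detect.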

Acceptance criteria expanded (≥4 testable items with file paths)

kn86-emulator/src/t9_rank.c gains t9_rank_explain(const char *buffer, int position, T9Explanation *out), an additive entry point with no change to the existing t9_rank() signature or behavior. The output struct carries the top-8 candidates × {string, legal_form_score, vocab_score, local_id_score, recency_score, popularity_score, final_score}. The existing t9_rank() is refactored so that its per-term math moves into a shared helper that both entry points call.
kn86-emulator/src/debug.c gains a “T9 Ranker” tab in the F11 dev overlay (alongside the existing tabs from GWP-226). Tab shows the explanation for the cursor position. Layout: 12-row × 80-col panel, header row with column titles (CANDIDATE, LEGAL, VOCAB, LOCAL, RECNCY, POP, SCORE), then 8 candidate rows, then 3 footer rows for a “currently dominant term” hint. If no input buffer is active, panel shows (no input — type to populate).
kn86-emulator/tests/test_t9_explain.c (new file) asserts:
- Explanation top-8 list matches t9_rank() top-8 list byte-for-byte (no scoring drift).
- When the buffer matches a vocab-boosted token, the vocab_score cell is non-zero on that row.
- When the buffer is short (1–2 keys) and the recency ring is empty, recency_score is 0 across all rows.
- When a cart-grammar local-id contribution applies, local_id_score is non-zero on the matching row(s).
- Reading the explanation does NOT mutate the recency ring, the popularity counters, or any other ranker state (call t9_rank_explain() twice in a row and confirm identical output).
docs/software/api-reference/editor-tools/token-prediction.md gains a “Debugging” subsection citing F11 → T9 Ranker tab as the inspection surface, with one screenshot or ASCII mockup of the panel layout. The subsection should also document the explanation column meanings (a one-line gloss per term: legal-form is “valid characters per T9 mapping,” vocab is “+5 if cart vocabulary lists this token,” etc.) so a cart author can read the panel without re-reading ADR-0009.
Behavior unchanged: tests/test_t9_prediction.c continues to pass with zero modifications. (If it doesn’t, the helper extraction broke something — fix and re-run, don’t update the existing tests.)
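The panel layout in the debug.c criterion could be formatted along these lines. This is a sketch under assumptions: the column widths (24 characters for the candidate, 8 per numeric column, 72 of the 80 columns used), the PanelRow struct, the function names, and the leftmost-wins rule for the dominant-term hint are all hypothetical. Only the column titles and the row counts come from the criteria.

```c
#include <stdio.h>

#define PANEL_COLS 80

/* Hypothetical view of one explanation row, as the renderer sees it. */
typedef struct {
    const char *string;
    int legal_form_score, vocab_score, local_id_score;
    int recency_score, popularity_score, final_score;
} PanelRow;

/* Header row: a 24-char candidate column plus six 8-char numeric columns. */
int panel_format_header(char *line, size_t cap)
{
    return snprintf(line, cap, "%-24s%8s%8s%8s%8s%8s%8s",
                    "CANDIDATE", "LEGAL", "VOCAB", "LOCAL",
                    "RECNCY", "POP", "SCORE");
}

/* Candidate row: same widths; long tokens are truncated to 24 chars. */
int panel_format_row(char *line, size_t cap, const PanelRow *r)
{
    return snprintf(line, cap, "%-24.24s%8d%8d%8d%8d%8d%8d",
                    r->string, r->legal_form_score, r->vocab_score,
                    r->local_id_score, r->recency_score,
                    r->popularity_score, r->final_score);
}

/* Footer hint: the column with the largest contribution.
 * Ties go to the leftmost column (an assumed rule). */
const char *panel_dominant_term(const PanelRow *r)
{
    static const char *names[] = {"LEGAL", "VOCAB", "LOCAL", "RECNCY", "POP"};
    int v[5];
    int best = 0;
    v[0] = r->legal_form_score;
    v[1] = r->vocab_score;
    v[2] = r->local_id_score;
    v[3] = r->recency_score;
    v[4] = r->popularity_score;
    for (int i = 1; i < 5; i++)
        if (v[i] > v[best])
            best = i;
    return names[best];
}
```

Sharing the same width specifiers between the header and the rows keeps the columns aligned without any per-column bookkeeping in the tab renderer.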

Edge cases (≥2)

Cursor position outside any input buffer. Player is on the bare-deck terminal HUD with no active text-entry surface. Panel shows (no input — type to populate) rather than an empty 8-row dump. The t9_rank_explain() entry point returns an out->candidate_count = 0 sentinel; debug.c renders the placeholder string. No null-pointer hazard.
Tied scores. When two candidates have identical final_score, the panel must render them in a deterministic order matching t9_rank()’s tie-breaking rule. The “no scoring drift” assertion in test_t9_explain.c above covers this, but only if test_t9_explain includes a tied-score input explicitly. Recommended: a 1-key input with the recency ring and popularity counters both empty, so every candidate scores on legal-form alone and ties are as frequent as possible.
Cart-grammar contribution active but cart not loaded. If ADR-0016’s cart-grammar local-id table is non-empty but the cart that contributed it has been ejected (per ADR-0019 hot-swap), the local_id_score cell should still render correctly (the table is owned by the runtime, not the cart, post-load — see ADR-0016). Add a test case for this if it’s quick; otherwise flag as edge case for QA.
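For the tied-score case, a test input can be built so that every candidate has the same final_score and only the tie-break decides the order. The sketch below assumes a lexical tie-break purely for illustration; the real rule is whatever t9_rank() already does, and the Scored struct and sort_candidates() name are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical minimal row: just enough to exercise ordering. */
typedef struct {
    const char *string;
    int final_score;
} Scored;

/* Descending score, then ascending strcmp as the assumed tie-break.
 * A total order makes qsort's result independent of the input order,
 * provided candidate strings are unique. */
static int scored_cmp(const void *a, const void *b)
{
    const Scored *sa = a, *sb = b;
    if (sa->final_score != sb->final_score)
        return sa->final_score < sb->final_score ? 1 : -1;
    return strcmp(sa->string, sb->string);
}

void sort_candidates(Scored *rows, size_t n)
{
    qsort(rows, n, sizeof rows[0], scored_cmp);
}
```

Feeding the same tied candidates in two different enumeration orders and checking that both sort to the same sequence is a cheap way to confirm that the order does not depend on how the pool happened to be walked.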

Engineering hand-off notes

Files owned: kn86-emulator/src/t9_rank.c (additive — explain entry point + extracted shared helper).
Files added-to: kn86-emulator/src/debug.c (new tab), kn86-emulator/tests/test_t9_explain.c (new file), docs/software/api-reference/editor-tools/token-prediction.md (new subsection).
Files NOT touched: ranker scoring logic itself (no behavior change). tests/test_t9_prediction.c continues passing without modification.
Expected PR size: ~80 lines in t9_rank.c (extract helper + add explain entry point), ~120 lines in debug.c (new tab renderer), ~150 lines in test_t9_explain.c (5–6 cases), ~30 lines in docs. Single-engineer task, ~half a day with TDD.
Test strategy: TDD as constrained. Write test_t9_explain.c first asserting the explain output matches the ranker’s top-N. Implement the explain entry point + helper extraction to make tests pass. Then build the debug.c tab.
Dispatch shape: single C engineer, additive. Independent of all other Sprint 4 work — no ordering constraints. Could pair sensibly with GWP-236 (also F11 dev overlay polish) in the same agent’s queue, but doesn’t have to.
Watch for: the helper extraction is the riskiest step. If t9_rank()’s scoring math is currently inlined in a way that touches the recency ring or popularity counters as a side effect (it should not, but verify by reading the file), the extraction has to preserve that behavior in t9_rank() without letting t9_rank_explain() inherit the mutation. The no-mutation assertion in test_t9_explain.c catches this.
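The no-mutation check can be sketched as a small harness that snapshots ranker state, calls explain twice, and compares both the outputs and the state. Everything here is hypothetical scaffolding: the RankerState layout, the simplified Explanation struct, and the function-pointer indirection exist only so the example compiles on its own. It also assumes the test can read ranker state directly, which is stronger than the output-only comparison the acceptance criteria ask for.

```c
#include <string.h>

#define RING_LEN 8
#define KEY_COUNT 10
#define TOP_N 8

/* Hypothetical ranker state: a recency ring plus lifetime counters. */
typedef struct {
    unsigned char recency_ring[RING_LEN];
    unsigned long popularity[KEY_COUNT];
} RankerState;

/* Simplified explanation output, for comparison purposes only. */
typedef struct {
    int candidate_count;
    int final_score[TOP_N];
} Explanation;

typedef void (*ExplainFn)(RankerState *state, Explanation *out);

/* Returns 1 if two consecutive calls agree and the state is untouched. */
int explain_is_read_only(RankerState *state, ExplainFn explain)
{
    RankerState before;
    Explanation first, second;
    memcpy(&before, state, sizeof before);
    memset(&first, 0, sizeof first);
    memset(&second, 0, sizeof second);
    explain(state, &first);
    explain(state, &second);
    return memcmp(&first, &second, sizeof first) == 0 &&
           memcmp(&before, state, sizeof before) == 0;
}

/* Well-behaved stand-in: reads state, writes only to out. */
void explain_pure(RankerState *state, Explanation *out)
{
    out->candidate_count = 1;
    out->final_score[0] = 10 + (int)state->popularity[0];
}

/* Buggy stand-in: bumps a popularity counter as a side effect. */
void explain_leaky(RankerState *state, Explanation *out)
{
    explain_pure(state, out);
    state->popularity[0]++;
}
```

Running the harness against the deliberately leaky stand-in first is a quick sanity check that the test can actually fail before it is pointed at the real t9_rank_explain().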

Open questions

None — task is well-bounded; ADR-0009 and ADR-0016 are stable; the test surface is clear. This is a clean, small instrumentation task.