Discipline · live · Companion · Field Notes № 0020

The Honest Sign-Language Research Playbook

The discipline behind Parley's notebooks: question-first contracts, signer-holdout splits, multi-seed floors, failure-modes-first, and a Deaf-community honesty checklist. The rules that make a 45% you can trust beat an 85% you can't.

  • Pre-registered hypotheses + signer-holdout eval, from real notebooks
  • Report the per-signer distribution, not the pooled mean
GitHub publication pending.

Sign-language AI has a credibility problem, and it is self-inflicted. The field reports accuracy numbers in the 80s and 90s; deaf users try the products and watch them fail. The gap is not fraud. It is methodology — splits that leak signers, single-seed results dressed up as findings, pooled averages that hide who the model fails. Parley is a small public research arm, and its only real asset is that its numbers are trustworthy. This is the discipline that keeps them that way. None of it is novel. All of it is skippable, and skipping any of it is how an honest 45% turns into a dishonest 85%.

1. Question first: the notebook contract

No notebook starts with code. It starts with a contract: one question, the hypotheses, and the exit criteria, written down and locked before the first model runs. The exit criteria are binary and testable — not "understand the data" but "an unfamiliar reader can describe the dataset's shape, missing-data conventions, and signer coverage after one read."

The load-bearing move is pre-registering hypotheses. Before running a study, write down what you expect and the threshold that would confirm or refute it. In the signer-robustness study we registered three: that the per-signer spread would exceed 25 points, that the worst signer would fall below 30%, and that handshape complexity would predict per-signer degradation. The first two confirmed. The third was refuted at an r-squared of 0.008 — and that refutation was the most valuable result in the notebook, because it killed the explanation we would otherwise have reached for. You only get that protection if the prediction was on record before the data came in. A hypothesis written after the result is just a description.
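A pre-registration can be as small as a frozen list of predictions, each paired with the test that decides it. The sketch below is one way to write that down, not Parley's actual notebook code. The `Hypothesis` class, the metric keys, and the 0.25 r-squared bar for the third hypothesis are assumptions made for illustration; only the 25-point and 30% thresholds and the 0.008 r-squared come from the study described above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass(frozen=True)  # frozen: a registered hypothesis cannot be edited in place
class Hypothesis:
    name: str
    prediction: str
    confirmed_if: Callable[[Dict[str, float]], bool]


# Written down before the first model runs.
REGISTERED = (
    Hypothesis("H1", "per-signer spread exceeds 25 points",
               lambda m: m["spread_points"] > 25.0),
    Hypothesis("H2", "worst signer falls below 30%",
               lambda m: m["worst_signer_acc"] < 30.0),
    # ASSUMPTION: the 0.25 bar is illustrative; the text does not state one.
    Hypothesis("H3", "handshape complexity predicts per-signer degradation",
               lambda m: m["complexity_r2"] >= 0.25),
)


def evaluate(hypotheses: Iterable[Hypothesis],
             metrics: Dict[str, float]) -> Dict[str, str]:
    """Apply each registered test to the measured metrics."""
    return {
        h.name: "confirmed" if h.confirmed_if(metrics) else "refuted"
        for h in hypotheses
    }


if __name__ == "__main__":
    # Placeholder measurements for illustration; only complexity_r2 = 0.008
    # is a figure reported in the text.
    measured = {"spread_points": 38.0, "worst_signer_acc": 25.0,
                "complexity_r2": 0.008}
    for name, verdict in evaluate(REGISTERED, measured).items():
        print(name, verdict)
```

The point of the frozen tuple is that the thresholds live in the record, so a refutation cannot be softened by adjusting the bar after the data arrives.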

2. The signer firewall

The single decision that separates honest sign-recognition numbers from theater is the split. Most reported accuracy is measured on splits where the same signers appear in both training and test. The model memorizes how specific people sign and is rewarded for recognizing them again. The number this produces is real, but it measures the wrong thing.

The fix is a signer-holdout split: signers are partitioned so that no one in the test set contributed any training example. We lock a 17/2/2 split (train/validation/test by signer) as the default, and for fairness work we run leave-one-signer-out: hold each signer out in turn, train on the rest, and report every fold. The split decision is recorded as a one-line architecture decision so it cannot quietly drift later. When a result depends on a methodological choice, the choice goes in writing next to the result.
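Both splits reduce to partitioning by signer ID rather than by example. The following is a minimal standard-library sketch, assuming each example carries a signer ID in a parallel list; the function names, the seed argument, and the synthetic 21-signer data are hypothetical and not taken from Parley's notebooks.

```python
import random
from typing import Dict, Iterator, List, Sequence, Tuple


def signer_holdout_split(signer_ids: Sequence[str], n_val: int = 2,
                         n_test: int = 2, seed: int = 0) -> Dict[str, List[int]]:
    """Assign example indices to train/val/test so no signer spans two sets."""
    signers = sorted(set(signer_ids))
    if len(signers) <= n_val + n_test:
        raise ValueError("not enough signers to leave any for training")
    random.Random(seed).shuffle(signers)  # seeded, so the split is reproducible
    test = set(signers[:n_test])
    val = set(signers[n_test:n_test + n_val])
    split: Dict[str, List[int]] = {"train": [], "val": [], "test": []}
    for i, s in enumerate(signer_ids):
        key = "test" if s in test else "val" if s in val else "train"
        split[key].append(i)
    return split


def leave_one_signer_out(
    signer_ids: Sequence[str],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_signer, train_indices, test_indices), one fold per signer."""
    for held_out in sorted(set(signer_ids)):
        train = [i for i, s in enumerate(signer_ids) if s != held_out]
        test = [i for i, s in enumerate(signer_ids) if s == held_out]
        yield held_out, train, test


if __name__ == "__main__":
    # Synthetic data: 21 signers with 10 examples each (illustrative only).
    ids = [f"signer{k % 21:02d}" for k in range(210)]
    split = signer_holdout_split(ids)
    print({name: len({ids[i] for i in idx}) for name, idx in split.items()})
    print(sum(1 for _ in leave_one_signer_out(ids)), "folds")
```

With 21 signers the defaults give the 17/2/2 partition. scikit-learn's group-aware splitters (`GroupShuffleSplit`, `LeaveOneGroupOut`) cover the same ground if that library is already a dependency.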

3. The statistical floor

A single training run is not a result. The floor is three seeds minimum, reported as mean and standard deviation. This is not pedantry. Train the same landmark architectures three times, changing only the seed, and some of them reach usable accuracy on one seed and collapse to near-random on the others. A single-seed report cannot distinguish a model that works from a model that only occasionally works and happened to draw a lucky seed. The standard deviation is what exposes the lottery ticket.
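The seed floor is easy to enforce mechanically. Here is a small sketch using Python's `statistics` module; the function name and the accuracy values are made up for illustration, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the text does not say which form is reported.

```python
import statistics
from typing import Dict

MIN_SEEDS = 3  # the floor stated above


def summarize_seeds(acc_by_seed: Dict[int, float]) -> Dict[str, float]:
    """Summarize accuracy (in percent) across seeds; refuse to report below the floor."""
    if len(acc_by_seed) < MIN_SEEDS:
        raise ValueError(f"need at least {MIN_SEEDS} seeds, got {len(acc_by_seed)}")
    values = list(acc_by_seed.values())
    return {
        "n_seeds": len(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation (assumption)
        "min": min(values),
        "max": max(values),
    }


if __name__ == "__main__":
    # Illustrative numbers only, not measured results.
    stable = {0: 44.0, 1: 45.0, 2: 46.0}
    lottery = {0: 45.0, 1: 1.0, 2: 2.0}  # one lucky seed, two collapses
    print("stable ", summarize_seeds(stable))
    print("lottery", summarize_seeds(lottery))
```

Reported from seed 0 alone, the two hypothetical models above look nearly identical. The mean and standard deviation across all three seeds are what separate them.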

A floor also means baselines. Every result is reported against random chance, a majority-class predictor, and a trivial model, so a number can be read as "how far above doing nothing" rather than as an absolute. On 250 classes, random is 0.4% — context that makes a 45% mean look very different from how it would read alone.
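The first two baselines need no model at all. A sketch of how they might be computed follows, with hypothetical function names and toy labels; the trivial-model baseline is left out because the text does not say which model fills that role.

```python
from collections import Counter
from typing import Hashable, Sequence


def chance_accuracy(n_classes: int) -> float:
    """Expected accuracy (percent) of uniform random guessing over n_classes."""
    return 100.0 / n_classes


def majority_class_accuracy(train_labels: Sequence[Hashable],
                            test_labels: Sequence[Hashable]) -> float:
    """Accuracy (percent) of always predicting the most frequent training label."""
    majority, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for y in test_labels if y == majority)
    return 100.0 * hits / len(test_labels)


if __name__ == "__main__":
    print(f"chance on 250 classes: {chance_accuracy(250):.1f}%")
    # Toy labels for illustration only.
    train = ["hello", "hello", "thanks", "please"]
    test = ["hello", "thanks", "thanks", "please"]
    print(f"majority baseline: {majority_class_accuracy(train, test):.1f}%")
```

The majority label is taken from the training set, not the test set, so the baseline gets no more information than a trained model would.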

4. Failure modes are first-class output

A notebook that only reports its headline number is hiding most of what it learned. The deliverable includes the failures: confusion matrices, the signs that collapse, the per-signer breakdown, the runs that did not train. We report collapsed runs rather than discarding them, because a collapse you quietly drop is a collapse that ships. The "what's next" section is grounded in those failures, not in a wishlist — the next study is whatever the current failure modes make most urgent.
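Two of those failure views can be pulled straight from the prediction lists. This is a minimal sketch under stated assumptions: the function names are hypothetical, and "collapsed" is defined here as a sign with zero correct predictions, which is one reasonable reading of the term rather than a definition given in the text.

```python
from collections import Counter
from typing import Hashable, List, Sequence, Tuple


def top_confusions(y_true: Sequence[Hashable], y_pred: Sequence[Hashable],
                   k: int = 5) -> List[Tuple[Tuple[Hashable, Hashable], int]]:
    """Most frequent (true, predicted) pairs among the errors."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(k)


def collapsed_classes(y_true: Sequence[Hashable],
                      y_pred: Sequence[Hashable]) -> List[Hashable]:
    """Classes present in y_true that were never predicted correctly (assumed definition)."""
    recognized = {t for t, p in zip(y_true, y_pred) if t == p}
    return sorted(set(y_true) - recognized)


if __name__ == "__main__":
    # Toy predictions for illustration only.
    truth = ["mom", "mom", "dad", "dad", "eat"]
    preds = ["mom", "dad", "mom", "mom", "eat"]
    print(top_confusions(truth, preds, k=2))
    print(collapsed_classes(truth, preds))
```

Both outputs belong in the notebook next to the headline number, including for runs whose headline number is embarrassing.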

5. Report the distribution, not the mean

This is the rule the fairness study earned. A model that averages 42% across signers can be serving everyone moderately, or serving half its signers well and failing a quarter of them below 30%. The average cannot tell those apart; the per-signer distribution can. For an accessibility technology the hidden part of an average is exactly the part that matters — who gets left out. So the unit of report is the distribution: per-signer, per-class, with the spread stated plainly. A 38-point gap between your best- and worst-served user is not a footnote to the accuracy number. It is the more important number.
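The two cases in that example are easy to construct. The sketch below builds a per-signer report and applies it to two hypothetical models that share a 42% mean; the function names, the 30% floor parameter, and all accuracy values are illustrative assumptions, not results from the fairness study.

```python
import statistics
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def per_signer_accuracy(signer_ids: Sequence[str], y_true: Sequence[Hashable],
                        y_pred: Sequence[Hashable]) -> Dict[str, float]:
    """Accuracy in percent for each signer, computed over that signer's examples."""
    hits: Dict[str, List[int]] = defaultdict(list)
    for s, t, p in zip(signer_ids, y_true, y_pred):
        hits[s].append(1 if t == p else 0)
    return {s: 100.0 * sum(h) / len(h) for s, h in hits.items()}


def distribution_report(per_signer: Dict[str, float],
                        floor: float = 30.0) -> Dict[str, object]:
    """Mean plus the parts of the distribution the mean hides."""
    values = list(per_signer.values())
    return {
        "mean": statistics.mean(values),
        "min": min(values),
        "max": max(values),
        "spread": max(values) - min(values),
        "below_floor": sorted(s for s, a in per_signer.items() if a < floor),
    }


if __name__ == "__main__":
    # Constructed examples: same mean, very different distributions.
    even = {"s1": 42.0, "s2": 42.0, "s3": 42.0, "s4": 42.0}
    uneven = {"s1": 55.0, "s2": 55.0, "s3": 33.0, "s4": 25.0}
    print("even  ", distribution_report(even))
    print("uneven", distribution_report(uneven))
```

Both constructed models report the same mean. Only the `spread` and `below_floor` fields show that the second one fails a quarter of its signers.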

6. The Deaf-community honesty checklist

Parley is hearing-built, which means it is at constant risk of repeating the patterns the Deaf community has critiqued for a decade. The checklist is run as a self-audit on every public artifact:

  • **Accuracy theater:** Do we ever quote a number measured on a split that shares signers? If yes, it is not a real-world number and must be labeled.
  • **Capturing motion, calling it translation:** When we describe what the model does, do we state what it cannot do — the facial grammar and non-manual markers landmarks miss — with equal weight?
  • **Interpreter-replacement framing:** Have we implied the system replaces a human interpreter? It does not, and the copy must never say so.
  • **Nothing about us without us:** Is the direction shaped by Deaf input, or by hearing-engineer guesses dressed as defaults?

Failing any line means stopping and correcting, not shipping with a caveat.

7. Cadence and completion

The arm runs one notebook a month in its research phases, and a started notebook gets finished before the next one begins. Slow-but-complete beats fast-but-abandoned, because a half-finished study teaches nothing and a backlog of them is just guilt. Published means public — exploratory scratch work stays out of the published set, so the public record is only results that cleared the floor. The cadence is the forcing function; the floor is the quality gate; the contract is what makes both legible a month later when you have forgotten the details.

The point

These rules are why Parley can publish a 45% and call it the honest number. Each one removes a way to fool yourself, and the field's inflated numbers are mostly the sum of those self-deceptions left unremoved. The discipline is portable: the split logic, the seed floor, the pre-registered hypotheses, and the distribution-over-mean rule apply to any model that has to work for a population, not just to sign language. The only hard part is running them when a bigger number is one shortcut away.