45%, not 90%: the only sign-recognition number I trust
Our best landmark-only sign model scores 45% on signers it has never seen. The field routinely reports numbers twice that high. The lower number is the honest one, and it is the one we publish.

Parley shipped its first modeling phase, and the headline number is 45%. That is the top-1 accuracy of our best landmark-only model on the Google sign dataset, evaluated on signers the model never saw in training. The full writeup is the landmark-only ceiling paper in the research section. This note is about why I am publishing 45% when I could publish almost double.
Here is the mechanism that produces the bigger number. Most sign-recognition results are measured on a split where the same people appear in training and test. The model memorizes how those specific signers move and gets graded on recognizing them again. Reported that way, this task lands in the high eighties or low nineties. It is a real measurement of the wrong thing. It tells you the model can re-identify known signers, not that it can recognize signs from a stranger.
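The difference between the two splits is easier to see in code. What follows is a minimal sketch, not the Parley pipeline: the sample layout, the `signer_id` key, and all three function names are assumptions made for illustration, and the data is a toy. It builds one split that holds signers out entirely and one that shuffles samples regardless of signer, then counts how many signers appear on both sides.

```python
import random

def split_by_signer(samples, test_fraction=0.2, seed=0):
    """Held-out-signer split: no signer appears in both train and test.

    `samples` is a list of dicts with a hypothetical "signer_id" key.
    """
    signers = sorted({s["signer_id"] for s in samples})
    random.Random(seed).shuffle(signers)
    n_test = max(1, round(len(signers) * test_fraction))
    test_signers = set(signers[:n_test])
    train = [s for s in samples if s["signer_id"] not in test_signers]
    test = [s for s in samples if s["signer_id"] in test_signers]
    return train, test

def split_mixed(samples, test_fraction=0.2, seed=0):
    """Per-sample random split: the same signers land on both sides."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def shared_signers(train, test):
    """Signers that appear in both train and test."""
    return {s["signer_id"] for s in train} & {s["signer_id"] for s in test}

# Toy data: 10 made-up signers, 50 samples each. Not the real dataset.
samples = [
    {"signer_id": f"signer_{i % 10}", "sign": f"sign_{i % 25}"}
    for i in range(500)
]

train, test = split_by_signer(samples)
mixed_train, mixed_test = split_mixed(samples)

print("held-out split, shared signers:", len(shared_signers(train, test)))
print("mixed split, shared signers:", len(shared_signers(mixed_train, mixed_test)))
```

The held-out split shares no signers by construction. The mixed split shares them, and that overlap is where the bigger number comes from. If you already use scikit-learn, `GroupShuffleSplit` with the signer ID as the group does the same job as the hand-rolled function here.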
When we held the signers out, our best stable model scored 44.7%, averaged over three training runs. Hand-shape features alone got 36%. Several architectures that looked fine on a single run collapsed to near-random when we changed the random seed, which means their good number was luck that a single run happened to catch. None of that is visible if you report one seed on a signer-mixed split. All of it is visible the moment you stop.
I sat with the choice for a while, because 45% is not a number you put on a slide to raise money. The honest version of the work is less impressive than the dishonest version, and the dishonest version is the industry default. But Parley exists partly as a corrective to that default. The Deaf community has watched a decade of sign-language AI report inflated accuracy, get deployed, and fail the people it claimed to serve. If Parley is going to publish at all, the one thing it cannot do is add another inflated number to that pile.
So the rule for every Parley notebook is fixed. Evaluate on held-out signers. Report mean and standard deviation over several seeds. Publish the failures next to the successes. The number gets smaller and the work gets more trustworthy, and for a research arm whose entire value is credibility, that is the trade I want every time.
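The seed half of that rule can be sketched in a few lines. This is an illustration under stated assumptions, not our notebook code: the function names are hypothetical, the accuracies are made-up placeholders rather than results from the paper, chance is taken as 1 / num_classes (which assumes balanced classes), and the collapse threshold of five times chance is an arbitrary choice.

```python
import statistics

def summarize_runs(accuracies):
    """Return (mean, sample standard deviation) of per-seed accuracies."""
    if len(accuracies) < 2:
        raise ValueError("need at least two seeds to report a spread")
    return statistics.mean(accuracies), statistics.stdev(accuracies)

def collapsed(accuracies, num_classes, factor=5.0):
    """Flag a model if any seed lands near chance.

    Chance is 1 / num_classes, which assumes balanced classes.
    `factor` is an arbitrary illustrative threshold, not a Parley setting.
    """
    chance = 1.0 / num_classes
    return min(accuracies) < factor * chance

def report(name, accuracies, num_classes):
    """Format one line: mean, spread, seed count, and an instability flag."""
    mean, std = summarize_runs(accuracies)
    flag = " [UNSTABLE]" if collapsed(accuracies, num_classes) else ""
    return f"{name}: {mean:.1%} +/- {std:.1%} over {len(accuracies)} seeds{flag}"

# Made-up placeholder accuracies, NOT results from the Parley paper.
stable_runs = [0.30, 0.34, 0.32]
lucky_runs = [0.31, 0.006, 0.33]

print(report("model_a", stable_runs, num_classes=250))
print(report("model_b", lucky_runs, num_classes=250))
```

Reporting only the first seed of `model_b` would make it look no worse than `model_a`. The mean, the spread, and the flag together show that it is not.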
There is a version of this that sounds like false modesty, and it is not that. 45% on held-out signers for a 250-sign vocabulary is hard-won, well above the baselines, and a real starting line for the next phase. The point is not that the number is low. The point is that it is true. A true 45% is worth more to me, and to anyone who builds on it, than a 90% that evaporates the first time a new person signs.
That is the whole discipline. Report the number a Deaf user would actually experience, and start the next phase from there instead of from the flattering one.