TPL-2026-024 · preprint · 2026-05-31

Signer-Dialect Robustness in Landmark-Only Sign Recognition: A 21-Fold Leave-One-Signer-Out Study

TruPath Labs Research · TruPath Ventures · Stanley, NC

CV · sign-language · fairness · signer-holdout · evaluation

Abstract

A single cross-signer accuracy number hides who a sign-recognition model actually serves. We run a 21-fold leave-one-signer-out study on the Google ISLR 250-sign dataset, holding out each of the 21 signers in turn and reporting the full per-signer distribution rather than a pooled average. Top-1 accuracy ranges from 25.6% on the worst-served signer to 64.2% on the best, a spread of 38.6 percentage points, with a mean of 41.7% and a per-signer standard deviation of 11.4 points, more than ten times the seed-to-seed deviation of the same architecture. Five of the 21 signers fall below 30%. We pre-registered three hypotheses: that the spread would exceed 25 points (confirmed, 38.6), that the worst signer would fall below 30% (confirmed, 25.6%), and that handshape complexity would predict per-signer degradation (refuted, r-squared 0.008). The degradation is signer-driven, not sign-feature-driven: accuracy on the same sign swings by 60 to 82 points between the best and worst signers. We argue that the per-signer distribution, not the pooled mean, is the honest unit of report for sign recognition, and that signer-level metadata must be captured in any future data collection.

Figure: Visual summary of the per-signer fairness gap. The 38-point signer fairness gap: per-signer top-1 ranges from 25.6% to 64.2%; five of 21 signers fall below 30%; the per-signer standard deviation dwarfs the seed-to-seed deviation; and a pre-registered hypothesis that handshape complexity drives degradation was refuted, so the degradation is signer-driven, not sign-feature-driven. Synthesized from the TruPath Labs sign-language research substrate.

1. Introduction

A companion study established that the honest ceiling for landmark-only isolated-sign recognition, measured on held-out signers, is about 45% top-1 rather than the 80 to 90% reported on signer-mixed splits [7]. That number is a pooled average across a held-out test set. This study asks the question the average hides: when the model meets a signer it has never seen, how much does performance depend on which signer it is?

The answer is: enormously. When the model is evaluated on each of the 21 signers in turn, with that signer held out of training, top-1 accuracy ranges from 25.6% on the worst-served signer to 64.2% on the best. The mean of 41.7% describes almost none of them. For a technology whose entire purpose is to work for deaf people in the wild, a 38-point spread across signers is not a footnote to the accuracy number. It is the more important number.
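
The protocol described above can be sketched in a few lines. This is a minimal illustration, not the study's code: the function names `leave_one_signer_out` and `summarize_per_signer` are hypothetical, the 30% floor is taken from the second pre-registered hypothesis, and the use of the sample standard deviation is an assumption, since this excerpt does not say which estimator was used. The toy values are placeholders and are not results from the paper.

```python
from collections import defaultdict
from statistics import mean, stdev


def leave_one_signer_out(signer_ids):
    """Yield (held_out_signer, train_indices, test_indices), one fold per signer.

    signer_ids[i] is the signer label of sample i.
    """
    by_signer = defaultdict(list)
    for idx, signer in enumerate(signer_ids):
        by_signer[signer].append(idx)
    for held_out in sorted(by_signer):
        test_idx = by_signer[held_out]
        # Training set is every sample from every other signer.
        train_idx = [i for s, idxs in by_signer.items() if s != held_out for i in idxs]
        yield held_out, train_idx, test_idx


def summarize_per_signer(acc_by_signer, floor=30.0):
    """Summarize a per-signer accuracy distribution (values in percent)."""
    values = list(acc_by_signer.values())
    return {
        "mean": mean(values),
        "std": stdev(values),  # assumption: sample standard deviation
        "min": min(values),
        "max": max(values),
        "spread": max(values) - min(values),
        "n_below_floor": sum(v < floor for v in values),
    }


# Toy example with placeholder labels and accuracies (not the paper's data).
folds = list(leave_one_signer_out(["a", "a", "b", "c", "c"]))
summary = summarize_per_signer({"a": 20.0, "b": 40.0, "c": 60.0})
```

Reporting the whole dictionary returned by `summarize_per_signer`, rather than its `mean` entry alone, is the kind of per-signer report the paper argues for.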

We pre-registered three hypotheses before running the folds, and the one that failed is the most informative. We expected handshape complexity to predict which signers the model struggles with. It does not. The variable that predicts whether a sign is recognized is who is signing it, not how hard the handshape is.
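
For readers unfamiliar with the statistic behind the refuted hypothesis, the sketch below computes r-squared for a one-predictor linear fit, which equals the squared Pearson correlation. It assumes a simple linear relationship between a complexity score and an accuracy measure; the exact regression setup used in the study is not given in this excerpt, and the function name `r_squared` and the toy inputs are hypothetical.

```python
def r_squared(x, y):
    """Coefficient of determination for a one-predictor least-squares line.

    For simple linear regression this equals the squared Pearson correlation.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("r-squared is undefined when either variable is constant")
    return (sxy * sxy) / (sxx * syy)


# Placeholder inputs (not the paper's data): a perfectly linear pair
# and a pair with no linear relationship.
perfect = r_squared([1, 2, 3, 4], [2, 4, 6, 8])
unrelated = r_squared([1, 2, 3, 4], [1, -1, -1, 1])
```

A value near 1 would mean the predictor accounts for almost all of the variance in the outcome; the reported 0.008 means handshape complexity accounts for under 1% of it.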
