TPL-2026-023 · preprint · 2026-05-31

The Honest Ceiling on Landmark-Only Isolated-Sign Recognition: A Signer-Holdout Audit

CV · sign-language · evaluation · signer-holdout · methodology

Abstract

Isolated-sign recognition results are routinely reported in the 80-90% range, but most are measured on signer-mixed splits where the same signers appear in both training and test data. We audit the real landmark-only ceiling on a strict signer-holdout split of the Google ISLR dataset (250 signs, a 17/2/2 train/val/test signer split, mean and standard deviation over three seeds). The best stable architecture, a frame-level transformer, reaches 44.7% top-1 (+/- 1.0); hand-shape features alone reach 36.4%, so the full body-and-face landmark set adds only about eight points over the hands. Three of seven architecture families collapsed to near-random accuracy on two or more of three seeds, an instability that single-seed reporting would hide entirely. We argue that signer-holdout evaluation with multi-seed reporting is the minimum honest protocol for this task, and that the gap between leaky-split and signer-holdout numbers is the single most important figure a sign-recognition result can disclose.

The accuracy illusion. Signer-mixed (leaky) splits report 80-90% by measuring user memorization, while strict signer-holdout evaluation reveals a real ceiling near 45%; hand-shape features alone reach 36%; several architectures collapse on some seeds, so multi-seed reporting is mandatory. Visual summary of the accuracy gap, synthesized from the TruPath Labs sign-language research substrate.
The signer firewall. A random split lets the same signers leak across training and test; a strict signer-holdout split keeps them separate, which is what makes the accuracy honest.
Architectural fragility and model collapse. Under strict holdout, the same model can climb to high validation accuracy or collapse toward the random-chance floor depending only on the random seed, which is why we report mean and standard deviation over several seeds.

1. Introduction

A reader scanning isolated-sign recognition results comes away with the impression that the problem is mostly solved. Accuracy numbers in the 80-90% range are common, both in published work and on public leaderboards for datasets like Google ISLR [1]. The impression is misleading, and the reason is the evaluation split. Most of those numbers are measured on signer-mixed data, where the same people who appear in training also appear in test. The model is rewarded for memorizing how a specific person signs, and the score it reports is not the score a deaf user in the wild would experience [3].

The honest version of the question is: how well does a landmark-only model recognize signs from people it has never seen? We answer it on the Google ISLR dataset, restricted to 250 signs, with a strict signer-holdout split and results reported as mean and standard deviation over three random seeds. We are not proposing a new architecture. We are establishing a credible ceiling and documenting the failure modes an optimistic evaluation would have hidden.
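To make the split concrete, here is a minimal sketch of a signer-holdout partition. It is not the code used for this audit: the function name `signer_holdout_split`, the `(signer_id, example)` pair layout, and the toy data are all assumptions for illustration. The one property it is meant to show is that signer IDs, not individual clips, are the unit that gets shuffled and divided.

```python
import random


def signer_holdout_split(samples, n_val=2, n_test=2, seed=0):
    """Split (signer_id, example) pairs so each signer lands in exactly one split.

    Hypothetical helper: the pair layout and the default 2/2 holdout sizes
    are assumptions chosen to mirror a 17/2/2 split over 21 signers.
    """
    signers = sorted({signer for signer, _ in samples})
    if len(signers) <= n_val + n_test:
        raise ValueError("need more signers than the val and test holdouts combined")

    # Shuffle signer IDs, not samples: the unit of the split is the person.
    random.Random(seed).shuffle(signers)
    test_ids = set(signers[:n_test])
    val_ids = set(signers[n_test:n_test + n_val])

    splits = {"train": [], "val": [], "test": []}
    for signer, example in samples:
        if signer in test_ids:
            splits["test"].append((signer, example))
        elif signer in val_ids:
            splits["val"].append((signer, example))
        else:
            splits["train"].append((signer, example))
    return splits


# Toy data: 21 hypothetical signers with 3 placeholder clips each.
toy = [(s, f"clip_{s}_{i}") for s in range(21) for i in range(3)]
splits = signer_holdout_split(toy)
signers_in = {name: {s for s, _ in rows} for name, rows in splits.items()}
print({name: len(ids) for name, ids in signers_in.items()})
```

With 21 signers and the default holdout sizes, the three splits contain 17, 2, and 2 signers, and no signer ID appears in more than one of them. A random split over clips, by contrast, would almost certainly place every signer on both sides of the train/test boundary.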

The headline result is that the best stable model we trained reaches 44.7% top-1 on held-out signers, roughly half of what signer-mixed reporting suggests. Two further findings matter as much as the number itself: hand-shape features alone account for most of that accuracy (36.4% of the 44.7%), and several otherwise-reasonable architectures collapsed to near-random performance on a majority of seeds.
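The multi-seed reporting can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the audit's own tooling: `summarize_seeds` is a hypothetical helper, the collapse threshold (five times chance) is an arbitrary choice made here, the sample standard deviation is assumed, and the accuracy values are invented toy numbers rather than results from this paper. The only fixed quantity is the chance floor, which for 250 equally likely signs is 1/250 = 0.4% top-1.

```python
from statistics import mean, stdev

NUM_CLASSES = 250
CHANCE = 1.0 / NUM_CLASSES  # 0.004, i.e. 0.4% top-1 for uniform guessing


def summarize_seeds(accuracies, collapse_factor=5.0):
    """Summarize per-seed top-1 accuracies (as fractions in [0, 1]).

    A seed counts as collapsed when its accuracy is below
    collapse_factor * CHANCE; that threshold is an assumption of this sketch.
    """
    if len(accuracies) < 2:
        raise ValueError("need at least two seeds to report a standard deviation")
    threshold = collapse_factor * CHANCE
    n_collapsed = sum(1 for acc in accuracies if acc < threshold)
    return {
        "mean": mean(accuracies),
        "std": stdev(accuracies),  # sample standard deviation (n - 1)
        "n_collapsed": n_collapsed,
        "n_seeds": len(accuracies),
    }


# Invented toy values, not results from this paper.
stable_runs = [0.40, 0.42, 0.41]
fragile_runs = [0.41, 0.005, 0.004]
print(summarize_seeds(stable_runs))
print(summarize_seeds(fragile_runs))
```

The second toy case shows why a single seed is not enough: reporting only the first run would look identical to a stable model, while the summary over three seeds exposes two collapsed runs and a standard deviation far larger than the stable case.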
