TPL-2026-025·preprint·2026-06-08

Recipe, Not Architecture: Three "Failed" Sign-Recognition Models Recover Under Per-Architecture Training

CVsign-languagetraining-recipearchitecturereproducibilitymethodology

Abstract

Architecture comparisons in sign-language recognition are routinely run under a single shared training recipe, and the resulting ranking is reported as a property of the architectures. We test that assumption directly. In an earlier signer-holdout study on the Google ISLR dataset (250 signs, 21 signers), three landmark-sequence architectures appeared to fail under a shared recipe: Squeezeformer-small collapsed to chance, and BiGRU and SPOTER were lottery-ticket, with one of three seeds training and the other two collapsing. Here we re-run each architecture under both the original shared recipe and a per-architecture tuned recipe (linear learning-rate warmup plus gradient clipping), with the full ladder on one hardware (CUDA, RTX 4090), three seeds each, every rung trained to early-stop convergence. Under a pre-registered recovery criterion (seed dispersion std/mean <= 0.10 and tuned >= 2x old), all three architectures recover. Two of them, SPOTER (45.1%) and Squeezeformer-small (46.4%), reach the competition-winning frame-transformer ceiling (45.8%) on the same honest split. A permuted-label control trains to 0.39% (chance), confirming no label leak. The reported architecture failures were substantially training-recipe artifacts, not architecture limits, and the original ranking measured the recipe more than the model. We make no state-of-the-art claim; the contribution is a methodological correction with a worked example.

Visual abstract: under a shared training recipe three landmark architectures collapse to near-chance; a per-architecture warmup and gradient clip recover all three, and two reach the competition-winning ceiling. A shuffled-label control confirms the recovery is real signal.
Visual summary of the postmortem. Synthesized from the TruPath Labs sign-language research substrate.

1. The question this paper closes

This is the fourth installment from Parley's research arm on the Google ISLR dataset. The first three are public: a hand-shape baseline, an architecture ceiling-search that put honest cross-signer accuracy at 44.7% against the roughly 83% random-split leaderboard number [9], and a 21-fold leave-one-signer-out study that found a 38.57-point fairness spread across signers.

The ceiling-search left a loose end. Of the seven architectures on the ladder, several did not behave. Squeezeformer-small collapsed to chance and stayed there. BiGRU and SPOTER were what the lottery-ticket literature [1] would call untrainable-by-init: one seed of three found a working basin, the other two sat at chance. Under a shared training recipe, one learning rate, one schedule, one set of regularization choices applied uniformly to every architecture, these looked like architecture failures. The honest version of that finding raised a question instead of answering one: were these architecture limits, or were they recipe artifacts that a shared recipe could not have distinguished from limits?

The cast, briefly: BiGRU is a bidirectional gated-recurrent sequence model that reads the landmark stream forward and backward in time; SPOTER [6] is a transformer built for sign language from body and hand pose; and Squeezeformer-small [5] is a convolution-augmented, downsample-recover transformer adapted from automatic speech recognition. The reference ceiling is the competition-winning frame-level landmark transformer.

The three failures under the shared recipe: Squeezeformer-small at 1.25%, SPOTER at 3.26%, BiGRU lottery-ticket at about 11% mean.
The three failures under the shared recipe. Squeezeformer dead at chance, SPOTER negligible, BiGRU a lottery ticket that trains on one seed of three.

That question is the whole paper. It matters beyond Parley because shared-recipe architecture comparisons are the default in applied sign-language work, and a shared recipe that happens to suit one architecture's optimization geometry will systematically flatter it and bury the others.

Measured performance equals architecture capability plus recipe fit. Holding the recipe fixed does not remove the second term; it bakes it into the result.
The confound in one line. Hold the recipe fixed across architectures and you don't remove recipe fit, you bake it into the result and label it the architecture.
Subscribers only · continued

The rest of TPL-2026-025 is for subscribers.

Recipe, Not Architecture: Three "Failed" Sign-Recognition Models Recover Under Per-Architecture Training

  • Every Expert-tier lesson — diagnostic prompts, transcripts, prompt kits, full homework
  • Every research paper — methodology, figures, tables, reproducibility appendices
  • New Expert lessons + papers as they ship (quarterly cadence)
  • Foundations + Operating lessons stay free; bundles on GitHub stay free; this tier is the deep stuff

Free while the early catalog ships. Paid tier comes later — subscribe now and you’re grandfathered in.