Teardown · 2026-05-31 · 6 min

Recipe beats architecture: lottery tickets in sign models

We trained seven landmark architectures three times each. Two of them worked on one seed and collapsed to near-random on the other two, and a third collapsed on every seed. A single-seed comparison would have reported the two lucky runs as results.

When we measured the ceiling for landmark-only sign recognition, we trained seven architectures: a frame transformer, a conformer, a dilated temporal conv, a bidirectional GRU, a graph conv, a SPOTer-style pose transformer, and a squeezeformer. Same data, same signer-holdout split, same training recipe for all of them, three random seeds each. The plan was a clean head-to-head. What we got was a lesson about how easy it is to publish luck.

What the three seeds exposed

The frame transformer was stable: 45.7%, 44.6%, 43.7% across the three seeds. You can trust that. The conformer averaged about the same but swung six points between seeds, which is already a warning. Then it got strange. The bidirectional GRU trained to 33% on one seed and collapsed to 0.4% (random chance on 250 classes) on the other two. SPOTer did the same: one seed at 36%, two dead. The squeezeformer collapsed on all three.
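The spread and the chance floor quoted above can be checked in a few lines. This is a minimal sketch that uses only the frame-transformer accuracies and the 250-class count stated in the text; the variable names are illustrative, and using the sample standard deviation (n - 1 denominator) is an assumption about how one would summarize three seeds.

```python
import statistics

# Frame transformer accuracies (%) for the three seeds, as quoted above.
frame_transformer = [45.7, 44.6, 43.7]

mean_acc = statistics.mean(frame_transformer)
# Sample standard deviation (n - 1 denominator); an assumption for this sketch.
std_acc = statistics.stdev(frame_transformer)

# Uniform guessing over 250 classes gives the chance floor, in percent.
num_classes = 250
chance_pct = 100.0 / num_classes

print(f"mean={mean_acc:.1f}% std={std_acc:.1f} chance={chance_pct:.1f}%")
```

A mean near 44.7% with a standard deviation of about one point is what "stable" looks like here, and 0.4% is exactly what uniform guessing over 250 classes would score.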

If I had run one seed and reported it, I could have written "BiGRU reaches 33%" or "SPOTer reaches 36%" and put them on a leaderboard next to the frame transformer. Both claims would be true for that seed and false two times out of three. A single-seed number cannot distinguish an architecture that works from one that works only occasionally, on a seed where you happened to get lucky.

Architecture or recipe?

The honest reading of this is that under one shared recipe, most of these architectures are fragile, and a single shared recipe is not a fair test of any of them. The squeezeformer is a close cousin of the conformer, yet the conformer trained reliably and the squeezeformer never did. That is almost certainly a recipe interaction (learning-rate warmup, gradient clipping, initialization), not a statement that the squeezeformer is a worse model. The shared recipe favored one architecture and starved the others.

Which means the head-to-head I wanted to run is not actually answerable from this experiment. I can report which architectures are stable under a shared recipe, and I can report the ceiling the best stable one reaches. I cannot report "architecture A beats architecture B," because B may just need a recipe I did not give it. That distinction is the whole finding.
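For readers unfamiliar with the recipe knobs named above, here is a minimal pure-Python sketch of two of them: linear learning-rate warmup and gradient clipping by global norm. The post does not say which warmup shape or clipping rule was used, so both functions, their names, and the example values are assumptions for illustration only.

```python
import math

def warmup_lr(step, base_lr, warmup_steps):
    """Linear warmup (hypothetical schedule): ramp up to base_lr, then hold."""
    if warmup_steps <= 0:
        return base_lr
    return base_lr * min(1.0, (step + 1) / warmup_steps)

def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm == 0.0 or norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

# Example values are made up for the sketch.
early_lr = warmup_lr(step=0, base_lr=1e-3, warmup_steps=10)
late_lr = warmup_lr(step=100, base_lr=1e-3, warmup_steps=10)
clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(early_lr, late_lr, clipped)
```

Both knobs limit how large an early update can be, which is one plausible reason a setting that suits one architecture can leave another stuck at chance.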

The rule it leaves

Two things carry forward. Report mean and standard deviation over several seeds, always, because a collapse you can hide is a collapse you will eventually trip over. And treat a single shared recipe across very different architectures as a confound, not a control: the next notebook is a pre-registered tuning study that gives each architecture its own recipe and asks whether the lottery-ticket collapses were the model or the setup.
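One way to make the first rule mechanical is to report a collapse count next to the mean and standard deviation, since a mean over one good seed and two dead ones hides what happened. A minimal sketch: the function name `summarize_seeds` and the "within one point of chance" collapse threshold are assumptions for illustration, not the notebook's code. The accuracies are the ones quoted earlier, with the BiGRU's good seed taken as 33%.

```python
import statistics

def summarize_seeds(accs, num_classes=250, collapse_margin=1.0):
    """Summarize per-seed accuracies (in percent).

    A seed counts as collapsed when it lands within `collapse_margin`
    points of uniform-chance accuracy. That threshold is a hypothetical
    choice for this sketch.
    """
    chance = 100.0 / num_classes
    collapsed = sum(1 for a in accs if a <= chance + collapse_margin)
    return {
        "mean": statistics.mean(accs),
        "std": statistics.stdev(accs) if len(accs) > 1 else 0.0,
        "collapsed": collapsed,
        "n": len(accs),
    }

runs = {
    "frame_transformer": [45.7, 44.6, 43.7],
    "bigru": [33.0, 0.4, 0.4],
}

for name, accs in runs.items():
    s = summarize_seeds(accs)
    print(f"{name}: {s['mean']:.1f} +/- {s['std']:.1f} "
          f"({s['collapsed']}/{s['n']} seeds collapsed)")
```

For the BiGRU numbers, the mean comes out near 11% with a standard deviation near 19, a figure that describes no run that actually happened. The collapse count is what makes that visible.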

That study is running now, and the first architecture through it already answered. BiGRU, which collapsed on two of three seeds under the shared recipe, trains stably across all three seeds once it gets its own learning-rate warmup and gradient clipping, landing around 37% every time with no collapse. For BiGRU the collapse was the setup, not the model. SPOTer and the squeezeformer are still training, so the full verdict is pending. But the narrow claim from the original notebook holds regardless: under one shared recipe these architectures did not train reliably, and a one-seed paper would never have told you.