Pattern · 2026-05-31 · 7 min

The 38-point gap: one accuracy number, twenty-one very different users

Our sign model averages 42% across signers. That average hides a range from 26% to 64% — and the thing that decides where a person lands is not the signs they make, it is who they are.

Parley's next notebook asked a question the headline accuracy number cannot answer. We had a model that scores about 45% on held-out signers, averaged across a test set. Fine. But an average across signers is a promise that the signers are interchangeable, and deaf people are not interchangeable. So we ran the model against each of the 21 signers in the dataset one at a time, holding that person completely out of training, and looked at the spread instead of the mean.

The spread is more than 38 points. The worst-served signer gets 25.6% of their signs recognized. The best-served gets 64.2%. Five of the twenty-one signers fall below 30%: for nearly a quarter of the people we tested, the model recognizes fewer than three signs in ten. The mean of 42% describes almost none of them.
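For readers who want to reproduce this kind of evaluation, here is a minimal sketch of a leave-one-signer-out loop and the spread summary it feeds. It is not the notebook's actual code. The sample layout `(signer_id, features, label)`, the `train_fn` and `predict_fn` callables, and the 30% floor as a parameter are all assumptions made for illustration.

```python
from collections import defaultdict


def leave_one_signer_out(samples, train_fn, predict_fn):
    """Return {signer_id: accuracy}, training without that signer each time.

    samples: list of (signer_id, features, label) tuples (assumed layout).
    train_fn(rows) -> model, predict_fn(model, features) -> label are
    hypothetical callables standing in for the real training code.
    """
    by_signer = defaultdict(list)
    for signer_id, features, label in samples:
        by_signer[signer_id].append((features, label))

    per_signer = {}
    for held_out, test_rows in by_signer.items():
        # Every row from every other signer; nothing from the held-out one.
        train_rows = [
            row
            for signer_id, rows in by_signer.items()
            if signer_id != held_out
            for row in rows
        ]
        model = train_fn(train_rows)
        correct = sum(predict_fn(model, f) == y for f, y in test_rows)
        per_signer[held_out] = correct / len(test_rows)
    return per_signer


def summarize(per_signer, floor=0.30):
    """Report the distribution alongside the mean."""
    accs = sorted(per_signer.values())
    return {
        "mean": sum(accs) / len(accs),
        "worst": accs[0],
        "best": accs[-1],
        "spread": accs[-1] - accs[0],
        "below_floor": sum(a < floor for a in accs),
    }
```

The point of the design is that the grouping key is the person, not the sample: a random split would let the same signer appear on both sides and hide exactly the spread described above.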

The number that gave it away

Here is the statistic that reframed the whole thing for me. When you train the same model several times and change only the random seed, the accuracy moves by about a point. When you change which signer is held out, it moves by eleven points. The variance is not in the training noise. It is in the people. A model that looks stable when you measure it the easy way is wildly unstable across the population it is supposed to serve, and the easy measurement hides that completely.
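One way to put the two kinds of movement side by side is to compute the same dispersion statistic over both sets of runs. The sketch below assumes the "about a point" and "eleven points" figures are standard deviations, which the text does not state, and the function name and input lists are hypothetical.

```python
from statistics import stdev


def compare_variability(seed_runs, signer_folds):
    """Compare accuracy dispersion across seeds with dispersion across signers.

    seed_runs: accuracies from retraining with only the random seed changed.
    signer_folds: accuracies from changing which signer is held out.
    Both are assumed to be in the same units (for example, percent).
    """
    seed_sd = stdev(seed_runs)
    signer_sd = stdev(signer_folds)
    # Guard against identical seed runs, which would give a zero denominator.
    ratio = signer_sd / seed_sd if seed_sd > 0 else float("inf")
    return {"seed_sd": seed_sd, "signer_sd": signer_sd, "ratio": ratio}
```

A ratio near 1 would mean the population adds nothing beyond training noise. A ratio near 10 is the situation described here: the instability lives in who is being tested.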

The hypothesis I was sure of, and was wrong about

Before running the folds I wrote down three predictions, the way every Parley notebook does. Two were easy calls and both held: the spread would be large, and the worst signer would land under 30%. The third was the interesting one. I predicted that the signs the model fails on would share a property (small handshape differences, or signs made near the face where the landmarks are noisy) and that a handshape-complexity score would predict which signers struggle.

It did not. The correlation between that complexity score and recognition accuracy came back at essentially zero. The degraded signs do cluster into rough categories, but those categories are not the cause. The same sign swings enormously depending only on who signs it: "feet" is recognized 86% of the time for the best signers and under 4% for the worst. "orange," "scissors," and "cry" all swing more than 70 points. The predictor of whether a sign is recognized is the signer, not the sign.
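The per-sign swing can be computed from a flat log of predictions. This is a rough sketch under assumptions: the record layout `(signer_id, sign, correct)` is hypothetical, and it compares the single lowest and highest signer for each sign, whereas the figures above may compare groups of best and worst signers.

```python
from collections import defaultdict


def per_sign_swing(records):
    """Return {sign: (lowest_rate, highest_rate, swing)} across signers.

    records: iterable of (signer_id, sign, correct) where correct is a bool.
    """
    # (signer_id, sign) -> [number correct, number attempted]
    tally = defaultdict(lambda: [0, 0])
    for signer_id, sign, correct in records:
        cell = tally[(signer_id, sign)]
        cell[0] += int(correct)
        cell[1] += 1

    # Collect one recognition rate per signer for each sign.
    rates_by_sign = defaultdict(list)
    for (signer_id, sign), (hits, total) in tally.items():
        rates_by_sign[sign].append(hits / total)

    return {
        sign: (min(rates), max(rates), max(rates) - min(rates))
        for sign, rates in rates_by_sign.items()
    }
```

If the sign itself were the cause, the swing for a hard sign would be small and its rates uniformly low. A large swing on the same sign points at the signer instead.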

Why this is the honest unit

This is the concrete version of a critique the Deaf community has made for years: a published accuracy number implies the technology works, and then it does not work when an actual deaf person uses it. The mechanism is exactly this gap. The number is real and the number is an average, and the average is computed over a population the model serves very unevenly. A model that averages 42% by serving everyone moderately and a model that averages 42% by serving half the people well and failing the other half badly are not the same product. The mean cannot tell them apart. The per-signer distribution can.
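A toy illustration of that last point, using invented numbers rather than anything from the dataset: two hypothetical populations of 20 signers with the same mean and very different distributions. The `describe` helper and the 30% floor are assumptions for the sketch.

```python
from statistics import mean


def describe(accuracies, floor=0.30):
    """Summarize per-signer accuracies: the mean plus what the mean hides."""
    return {
        "mean": mean(accuracies),
        "worst": min(accuracies),
        "best": max(accuracies),
        "share_below_floor": sum(a < floor for a in accuracies) / len(accuracies),
    }


# Hypothetical populations of 20 signers each, invented for illustration only.
even_model = [0.42] * 20                  # everyone served moderately
split_model = [0.74] * 10 + [0.10] * 10   # half served well, half failed

even_summary = describe(even_model)
split_summary = describe(split_model)
```

Both summaries report a mean of about 0.42. Only the worst-case and below-floor fields separate the two models, which is why those fields belong in the report.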

So that is what we report now: the distribution, not the mean. And the practical consequence lands on the next phase. If we collect our own data, it has to carry signer-level metadata, because a model tuned to a pooled average will keep getting better at the signers it already serves and the gap will quietly widen. You cannot close a gap your evaluation cannot see.
