Expert · Lesson 20 — ASR ground-truth pipelines

Building an ASR eval set from scratch when there’s no public benchmark. The Parley methodology for sign language and speech recognition.

25 min read · 2 hours apply · prereq: Expert 14 (AR research sprints)

The benchmark gap

Parley’s Phase 1, marked complete on April 26, produced the finding that defines the project’s research contribution: the top architecture (frame_transformer) achieves 0.4467 ± 0.0097 on signer-holdout evaluation, approximately 35 percentage points below the random-split Kaggle leaderboard scores for the same dataset. Phase 1 shipped Notebooks 00, 01, and 02: 7 architectures, 3 seeds each, the full ladder on Google ISLR.
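
For reference, the ± figure is the sample standard deviation across the three seed runs. A minimal sketch of that aggregation, with made-up per-seed accuracies rather than the actual notebook outputs:

    from statistics import mean, stdev

    def summarize_seeds(accs: list[float]) -> str:
        """Report mean ± sample std across seed runs (hypothetical helper)."""
        return f"{mean(accs):.4f} ± {stdev(accs):.4f}"

    # Illustrative per-seed signer-holdout accuracies (made-up values,
    # not the actual frame_transformer results):
    frame_transformer_accs = [0.4370, 0.4467, 0.4564]
    print(summarize_seeds(frame_transformer_accs))  # 0.4467 ± 0.0097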

The 35-point gap is not a failure. It’s the finding. Random-split evaluation on ISLR — which is what the Kaggle leaderboard measures — allows training data from a signer to leak into the test set. If Signer 12 appears in both the training and test portions, the model memorizes Signer 12’s style and is evaluated on it. That’s not generalization. Signer-holdout evaluation holds out all clips from a subset of signers entirely. The model never sees those signers during training. The 0.4467 measures what the field hasn’t measured: generalization to novel signers.
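
One common way to implement the holdout, assuming clips are indexed by a signer_id column (the schema here is an assumption, not Parley’s actual one), is to split on signer groups rather than rows, e.g. with scikit-learn’s GroupShuffleSplit:

    import pandas as pd
    from sklearn.model_selection import GroupShuffleSplit

    clips = pd.DataFrame({
        "path":      ["a.npy", "b.npy", "c.npy", "d.npy", "e.npy", "f.npy"],
        "signer_id": [12, 12, 7, 7, 31, 31],   # hypothetical signer labels
        "label":     ["hello", "thanks", "hello", "yes", "no", "thanks"],
    })

    # Hold out ~33% of *signers* (not clips): every clip from a held-out
    # signer stays out of training entirely.
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=0)
    train_idx, test_idx = next(splitter.split(clips, groups=clips["signer_id"]))

    train, test = clips.iloc[train_idx], clips.iloc[test_idx]
    assert set(train["signer_id"]).isdisjoint(test["signer_id"])  # no leakage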

The same gap exists in speech recognition. LibriSpeech clean produces WER numbers that don’t translate to conversational speech in ambient environments. The Parley use case — real-time transcription of conversational speech rendered on an AR HUD for a deaf wearer — is acoustically harder than any standard ASR benchmark. The lesson below covers how to build the ground-truth pipeline that measures the real performance, not the benchmark performance.
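
For orientation, WER is word-level edit distance divided by reference length, which is why a clean-read benchmark number says little about conversational, ambient-noise transcripts. A self-contained sketch of the metric (not Parley code):

    def wer(reference: str, hypothesis: str) -> float:
        """Word error rate: (substitutions + insertions + deletions) / ref words."""
        ref, hyp = reference.split(), hypothesis.split()
        # Levenshtein distance over words via dynamic programming.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + cost) # substitution
        return d[len(ref)][len(hyp)] / len(ref)

    print(wer("turn the hud brightness down", "turn the hub rightness down"))  # 0.4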

Why benchmarks lie

Standard ASR and sign language recognition benchmarks lie in three systematic ways: train-test leakage, condition mismatch, and population mismatch. Understanding all three is the prerequisite to building a ground-truth pipeline that doesn’t.

  • Train-test leakage. How it inflates the number: a random split allows the same signer/speaker to appear in both training and test, so the model memorizes individual style and the test measures memorization. The fix: signer/speaker holdout, where all clips from held-out individuals stay out of training entirely.
  • Condition mismatch. How it inflates the number: test conditions (studio recording, read speech, controlled lighting) are easier than deployment conditions (ambient noise, conversational speech, AR camera placement). The fix: build a deployment-realistic eval tier and acknowledge the gap between benchmark and deployment conditions explicitly.
  • Population mismatch. How it inflates the number: the benchmark population (ISLR signers, LibriSpeech readers) doesn’t match the deployment population (diverse signing dialects, conversational speakers in ambient environments). The fix: stratify the eval set by demographic and dialect diversity and measure per-group accuracy, not just aggregate.

For Parley, all three failure modes are present in the existing benchmarks. The signer-holdout protocol addresses leakage. The tiered eval set (Tier 1 controlled, Tier 2 conversational, Tier 3 AR-hardware-realistic) addresses condition mismatch. The deaf-community advisor engagement — currently waiting on the Kanban — is the population mismatch fix: native signers reviewing the evaluation methodology and the published numbers before community distribution.
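
The same stratified-reporting pattern serves both the tiers and the population fix: score each group separately so a weak subgroup can’t hide inside the aggregate. A minimal sketch, with illustrative field names and records that are not Parley’s schema:

    from collections import defaultdict

    def per_group_accuracy(examples: list[dict]) -> dict[str, float]:
        """Accuracy per group tag (tier, dialect, demographic), not just overall."""
        correct, total = defaultdict(int), defaultdict(int)
        for ex in examples:
            total[ex["group"]] += 1
            correct[ex["group"]] += int(ex["prediction"] == ex["label"])
        return {g: correct[g] / total[g] for g in total}

    # Illustrative records: an aggregate accuracy of 0.75 hides a 0.50 subgroup.
    results = [
        {"group": "dialect_a", "label": "hello",  "prediction": "hello"},
        {"group": "dialect_a", "label": "thanks", "prediction": "thanks"},
        {"group": "dialect_b", "label": "hello",  "prediction": "yes"},
        {"group": "dialect_b", "label": "no",     "prediction": "no"},
    ]
    print(per_group_accuracy(results))  # {'dialect_a': 1.0, 'dialect_b': 0.5}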
