Detecting Eval-Train Overlap in Production CV Pipelines: A Lightweight Audit Protocol
Abstract
Computer vision systems shipped to production frequently report misleading evaluation metrics due to undetected overlap between training and held-out eval data. We present a sub-second hash-based audit protocol that runs adversarially in the eval pipeline and raises an alarm on any F1 measurement above a configurable suspicion threshold. Applied retroactively to four production CV systems, the audit identified silent overlap in two pipelines, with overlap rates between 6.2% and 14.8%; in each case, inflated F1 scores of 0.97-0.99 collapsed to 0.71-0.83 once the overlap was removed.
1. Introduction
Production computer vision systems are routinely shipped on the basis of held-out evaluation metrics that, on close inspection, derive from datasets that overlap with the training corpus. The most common form of overlap is silent — neither the team nor the eval pipeline detects it — and the resulting F1, mAP, or accuracy figures are inflated by a margin large enough to mask catastrophic real-world failures.
Existing literature covers train-test overlap extensively in academic ML (see Kaufman et al. 2012; Recht et al. 2019), but production CV teams rarely run overlap audits. The gap is partly cultural — production engineers are not trained as statisticians — and partly tooling: there is no widely deployed audit tool that runs as a pre-publication gate on eval results.
We present a lightweight audit protocol that runs adversarially in any CV evaluation pipeline. The protocol uses perceptual hashing of training and eval images, computes pairwise similarity, and flags F1 measurements above a configurable threshold (default 0.95) for human review. Applied retroactively to four production CV pipelines across two organizations, the audit identified silent overlap in two of four systems, with overlap rates between 6.2% and 14.8% producing inflated F1 scores of 0.97-0.99 that fell to 0.71-0.83 once overlap was removed.
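For concreteness, a minimal sketch of the overlap check follows. It assumes the Pillow and imagehash packages (the protocol does not prescribe a particular perceptual hash), and the function names, the Hamming-distance cutoff, and the naive pairwise loop are illustrative choices rather than the production implementation.

```python
# Illustrative sketch only: the protocol specifies perceptual hashing and
# pairwise similarity, but the library choice (Pillow + imagehash), the
# Hamming cutoff, and all names below are assumptions.
from pathlib import Path
from PIL import Image
import imagehash

HAMMING_RADIUS = 4  # assumed near-duplicate cutoff on the 64-bit pHash

def phash_index(image_dir: str) -> dict:
    """Perceptual-hash every image found under image_dir."""
    index = {}
    for path in Path(image_dir).rglob("*"):
        if path.suffix.lower() in {".jpg", ".jpeg", ".png"}:
            index[path] = imagehash.phash(Image.open(path))
    return index

def overlap_rate(train_index: dict, eval_index: dict,
                 radius: int = HAMMING_RADIUS) -> float:
    """Fraction of eval images within `radius` bits of any training image."""
    hits = sum(
        1
        for eval_hash in eval_index.values()
        if any(eval_hash - train_hash <= radius  # ImageHash '-' is Hamming distance
               for train_hash in train_index.values())
    )
    return hits / max(len(eval_index), 1)
```

The nested comparison above is O(|train| × |eval|); a production version would bucket hashes (for example with a BK-tree or prefix buckets) to stay within the sub-second budget, but the flagging logic is unchanged.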
Our contributions:
- A sub-second hash-based audit protocol suitable for inline use in any eval pipeline.
- An empirical study of four production CV systems showing the prevalence and magnitude of silent overlap.
- A configurable F1-suspicion threshold whose calibration we discuss in §6 (see the usage sketch after this list).
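As a usage sketch of that threshold, the gate can sit at the point where the eval pipeline publishes metrics. The default threshold of 0.95 comes from the protocol; the function name, report shape, and the reuse of `overlap_rate` and `phash_index` from the sketch above are assumptions.

```python
# Hypothetical wiring of the F1-suspicion gate into an eval pipeline.
# `overlap_rate` and `phash_index` are the illustrative helpers sketched
# earlier; everything else here (names, report shape) is also illustrative.
F1_SUSPICION_THRESHOLD = 0.95  # protocol default; calibration discussed in §6

def audited_metrics(f1: float, train_index: dict, eval_index: dict,
                    threshold: float = F1_SUSPICION_THRESHOLD) -> dict:
    """Attach an overlap audit to any suspiciously high F1 before publishing."""
    report = {"f1": f1, "flagged": False, "overlap_rate": None}
    if f1 >= threshold:
        report["flagged"] = True
        report["overlap_rate"] = overlap_rate(train_index, eval_index)
    return report

# Example: an F1 of 0.98 would be flagged and held for human review.
# audited_metrics(0.98, phash_index("data/train"), phash_index("data/eval"))
```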