Training a custom CV model with Claude: the data quality lesson we learned the hard way.
Clean data beats model size. Every time. Don't upgrade the model until you've audited the labels.

April 16, 11pm. I'm sitting in front of the Mac Studio looking at two numbers. v11 training run: 72.5% mAP50. v15 training run: 99.3% mAP50. Same RT-DETR-S architecture. Same training code. Same hyperparameters, same learning rate schedule. Different dataset.
v15 ran on 505 frames. I hand-annotated 195 of them myself. v11 ran on a larger legacy dataset that had never been properly audited.
That comparison, 99.3 versus 72.5 on the same model with the same code, is the only argument I need when someone asks me whether data quality matters. It's a 26.8 percentage point gap from dataset hygiene alone.
How we got there
The first training runs in March were disasters, but for reasons that had nothing to do with data. We used the wrong Docker image on Vast.ai, the wrong learning rate, the wrong CLI. Six consecutive failed Lambda Labs runs. Another six on Vast.ai. The postmortem for March 17 is just a list of configuration errors. Embarrassing to read now.
When we finally got a successful run (v4, March 18, 97.8% mAP on the holdout set), I thought we were in good shape. That number looked clean. It wasn't. We hadn't validated sensor match. We'd trained almost entirely on iPhone footage, but the production camera was a Reolink. When we checked the dataset on March 28, only 42 of 4,148 training images, about 1%, had come from the Reolink. The model had never learned what a Reolink frame looked like. In testing at the board it failed completely on the deployment sensor.
That's the 47-fuck-count postmortem session. The lesson written into the project after that: validate dataset-camera match before any training run. Doesn't matter how clean the labels are if the sensor distribution is wrong.
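A check like this is cheap to run before every training job. The following is a minimal sketch, not the script we actually used: it assumes you can produce one camera label per training frame (from a manifest, EXIF data, or a filename convention), and the `min_share` threshold of 50% is an arbitrary placeholder you would tune for your own project. The function names are hypothetical.

```python
from collections import Counter


def camera_share(sources, target):
    """Return the fraction of frames whose camera label equals `target`."""
    counts = Counter(sources)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("empty dataset")
    return counts[target] / total


def check_sensor_match(sources, target, min_share=0.5):
    """Raise if the deployment camera is under-represented in the dataset.

    `min_share` is an assumed threshold, not a value from the project.
    """
    share = camera_share(sources, target)
    if share < min_share:
        raise ValueError(
            f"only {share:.1%} of frames come from {target!r}; "
            f"need at least {min_share:.0%}"
        )
    return share


# The March 28 counts: 42 Reolink frames out of 4,148.
# Everything else is lumped under "other" for illustration.
march_sources = ["reolink"] * 42 + ["other"] * (4148 - 42)
print(f"{camera_share(march_sources, 'reolink'):.1%}")  # 1.0%

try:
    check_sensor_match(march_sources, "reolink")
except ValueError as err:
    print(f"refusing to train: {err}")
```

Run against the March dataset, a gate like this would have stopped the v4 run before it started.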
The Ground Zero fix
The class mapping crisis hit the same week. For four days, March 25 through 29, every throw was scored wrong. The model thought bags were holes and holes were bags. The labels in the training data weren't wrong. The problem was that `detector.py` still had the class index mapping hardcoded from the old v8 model, while our RT-DETR export from Roboflow orders classes alphabetically. CoreML doesn't expose `.names` at runtime, so nothing threw an error. The fix was three lines. The diagnosis took four days. We now call it the Ground Zero fix: `{0: "bags", 1: "board", 2: "hole"}`, lowercase, alphabetical, never deviate.
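Since nothing at runtime will complain about a wrong mapping, the guard has to live in your own code. Here is a minimal sketch of one, assuming the class names can be read from the dataset export at build or startup time. `validate_class_map` is a hypothetical helper, not something from `detector.py`, and the alphabetical-order rule is the convention described above, which you should confirm against your own export.

```python
# The Ground Zero mapping: lowercase, alphabetical.
CLASS_NAMES = {0: "bags", 1: "board", 2: "hole"}


def validate_class_map(class_map, dataset_classes):
    """Check that `class_map` lists the dataset classes in alphabetical order.

    `dataset_classes` is the list of class names from the dataset export,
    in any order. Raises ValueError on any mismatch.
    """
    expected = sorted(name.lower() for name in dataset_classes)
    if sorted(class_map) != list(range(len(expected))):
        raise ValueError("class indices must be 0..N-1 with no gaps")
    actual = [class_map[i] for i in range(len(expected))]
    if actual != expected:
        raise ValueError(
            f"class map {actual} does not match alphabetical order {expected}"
        )
    return class_map


# Passes silently: the mapping matches the alphabetical convention.
validate_class_map(CLASS_NAMES, ["hole", "board", "bags"])
```

Any mapping in a different order, for example one carried over from an older model, raises immediately instead of silently scoring throws against the wrong class.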
After Ground Zero the model was structurally correct, but the scores were still unreliable in demo conditions. Board flicker during bag landings. Double-counts. Phantom detections when I walked up to retrieve bags.
The v15 dataset
The training data wasn't clean. We had decorated boards (floral wraps, sponsor logos), and the model had been trained on plain boards almost exclusively. Every Happy Hour-style board with decorations produced phantom bag detections. Logo C7 wraps triggered the hole class constantly. So we started explicit negative mining: pull frames from those exact board types, label the decorated surfaces as "board" so they act as negatives for the bag and hole classes, retrain. That's what went into the v15 dataset alongside the 195 frames I annotated by hand.
505 frames. All verified. Every frame from the Reolink. Every decorated board variation we owned represented. Every retrieval motion (me walking up to the board) included and labeled correctly as background.
99.3% mAP50.
The annotation workflow we used was slower than automated labeling but more accurate for our specific problem. Roboflow for the annotation interface, manual review for every board/hole region, explicit pass-through of any frame with potential class confusion. I can annotate about 60 frames an hour at that level of care. 195 frames took a bit over three hours spread across two sessions. That investment paid back in model accuracy within the same day we ran the training.
What Claude helped with and what it didn't
It helped debug training runs. When a Vast.ai onstart script was silently failing, Claude diagnosed the dependency resolution order in about ten minutes. When eval metrics looked wrong, it spotted the sensor leakage in the test split after I described the distribution. It's excellent at this kind of structured diagnostic work where the problem has a shape.
It didn't help with annotation judgment. Which bags in a cluster are occluded, where exactly the board edge terminates in a low-light frame, whether a rounded bag corner against the hole rim should be labeled as "bag" or as "hole region" — those calls required a human who understood the scoring rules. I made every one of those manually. The model reflects my annotation decisions, including my mistakes.
The rule
Before you change the model architecture, change the dataset. Before you increase epochs, audit the labels. Before you add more data, check that the data you have matches the sensor you're deploying to. The performance ceiling isn't usually where you think it is.
v11 was held back by bad labels, not by architecture. A v12 would have hit the same ceiling. We didn't need a bigger model. We needed a cleaner 505 frames.
---

*Filed from the QC lab. The three hours I spent annotating v15 by hand were the highest-ROI three hours on this project so far.*