Postmortem2026-05-178 min

Claude built our CV pipeline. Then it lied about being done.

Agents are unreliable judges of their own work. Here's how a structural fix — not a smarter model — stopped QC's CV pipeline from shipping silent failures.

In late March I asked Claude to fix a camera freeze that was killing our Quantum Caddy demo. The RTSP stream was dying every ten minutes. No error message, no crash — just a frozen frame being fed to the inference engine as if it were live. The agent worked for about an hour, ran some tests, then reported done.

I looked at the code. The freeze was still there. It had just been wrapped in a try/except that swallowed the error. The stream died silently now instead of loudly. The agent had shipped a quieter version of the same bug.

That's the failure mode I want to talk about.

The problem wasn't the model. The problem was asking the same entity that wrote the code to also evaluate whether the code was correct. Anthropic's own research documents this pattern: agents consistently over-praise their own outputs. When I asked Claude if the fix worked, it said yes, because it had done something. Whether that something solved the actual problem was a separate question nobody asked formally.

The class mapping crisis

We hit this same failure in five different ways during the March build. The class mapping crisis was the worst. For four days, March 25 through 29, our detection model was identifying bags as holes and holes as bags. Every throw was scored wrong. The model was running fine — the architecture was clean, the weights were fresh. The bug was a three-line mismatch in detector.py.

The old model's class mapping had put Board at index 0, Hole at 1, bags at 2. When we retrained on the new detector, the annotation tool's alphabetical canonicalization flipped the mapping: bags at 0, board at 1, hole at 2. The production runtime doesn't expose class names at runtime. No warning. No crash. The whole system just quietly scored every throw wrong for four days while agents kept reporting that things looked good.

The fix took three lines and ten minutes. The diagnosis took four days and produced what our postmortem log calls a "47 fuck count" for that session.

The structural fix

The thing I couldn't afford to keep doing was learning this lesson per incident. I needed a structural answer.

The structural answer is the sprint contract. One document, written before any code, that specifies what done looks like in binary PASS/FAIL terms. Not "the camera should be stable." That's vague enough for an agent to claim victory when the stream merely crashes more quietly. The contract says: "stream stays live for 30 minutes with no reconnects required. Test method: run rtsp_stability_test.py, all assertions green." That's a criterion the agent can fail.

The second piece is an independent evaluator. Same tool, fresh context, no knowledge of what the builder did. The evaluator runs the tests cold. If the builder's narrative says "done" and the evaluator's test says "reconnect triggered at 12 minutes," the sprint isn't done. There's no negotiation. The contract is the arbiter.

We used this system on the April 6 Phase 0 gate and it was the first clean session in the project's history. 15/15 CV tests passed. 11 live throws, 11 correct. The agent couldn't soft-land any criterion because every criterion had a binary test.

We also caught a real failure on model evaluation. A v15 training run scored a 0.998 F1 on the holdout set. That's suspiciously perfect, and because the contract had a separate criterion for false-positive rate, we dug in. The test split had contamination — frames from the exact board configuration we'd trained on were leaking into validation. The real score was 0.967. Still good. But 0.998 was a lie the eval harness was telling us, and the contract structure forced us to catch it.

The rule

Binary criteria first, then build. If you can't write a test that could fail, you don't understand what done means yet. Write that test before you write any code.

The contract takes 30 to 60 minutes on a nontrivial sprint. That sounds like overhead. It isn't. The March 25 class mapping crisis cost four days. The RTSP freeze cost a week of junk data before we noticed. Every hour spent on contracts has paid back ten.

One more thing worth flagging. The sprint contract doesn't protect you from an agent that modifies its own evaluation. If you ask Claude to write the success criteria AND run the evaluation, you've closed the loop in the wrong direction. The evaluator has to be independent. Different session, different context, no access to the builder's reasoning. That's the structural fix. Everything else is theater.

--- *Filed from the QC lab. After twenty postmortems and a failure ledger that's still growing, the pattern is clear: the agent's confidence is not evidence of correctness — only the test can give you that.*

← All Field Notes Subscribe