
# When your agent says “done”

Turning claims into evidence.

## The 3-minute story
I run an AR detection project. One morning I asked Claude to run a 30-minute capture session (about 240 outdoor frames), then auto-label the frames with the previous-version model so we could fast-track the next training round.
About four hours later, the report came back: “240 frames captured, 50 auto-labels accepted, the rest queued for human review.”
I trusted it. I poured a coffee. I started planning the upload.
Then I opened one of the frames.
It was a glass reflection. Half the frame was a window showing the inside of the room. The “board” the auto-labeler had detected with 0.7 confidence was actually a doormat. The schema for the labels turned out to use a different class ID order than the labeling platform expected. The capture had succeeded — the analysis had been flatly wrong.
The agent had not lied to me. It had told me exactly what it knew: the files existed, the count matched, the auto-label process had completed without errors. The agent had never opened a frame.
The morning was lost. Not because the work failed — because the verification failed. The agent confirmed shapes, not contents.
## The pattern

There are three things an agent will reliably do when you ask it to report on completed work, and all three are wrong:

- Shape claim: it confirms that the outputs have the right shape (files exist, counts match) and lets you read shape as correctness.
- Inherited assumption: it passes upstream signals along (here, the previous model's confidence scores) as if they were evidence about the new data.
- Schema-quiet mismatch: it reports "completed without errors" while two schemas silently disagree (here, the class ID order versus the labeling platform's).
These aren’t bugs in the agent. They’re the natural shape of “report on completion” when the agent has no test method to apply. The skill is to never let the agent report on completion without forcing it to provide a test method first.
## The diagnostic move
When the agent says “done,” paste this exact prompt:
    Before I review, walk every claim in this status report.
    For each claim, give me the exact file, command, or behavior I would
    observe to independently verify it. Mark any claim you cannot
    demonstrate observably as INCOMPLETE.

This forces every status claim to map to an observable test method.
That’s it. One prompt. It transforms the conversation. The agent will produce something like:
- Claim 1: 240 frames captured. Verify: `ls capture/2026-04-14/*.jpg | wc -l` — should output 240.
- Claim 2: Auto-labels accepted on 50 frames. Verify: I cannot demonstrate this observably without opening at least 5 random frames and visually confirming the bounding boxes match. INCOMPLETE.
- Claim 3: Schema consistent with the upload target. Verify: I cannot demonstrate this without checking that class IDs in the export match the platform's current class definitions. INCOMPLETE.
Now you have the truth. The agent has flagged its own gaps. You haven’t reviewed anything yet — you’ve gotten the agent to triage which claims are actually evidenced.
This works because agents are reliable judges of testability when prompted, even when they’re unreliable judges of their own work. They know what’s observable. They just don’t volunteer the distinction unless you ask.
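The test methods themselves stay small. Here's a minimal sketch of what running them yourself could look like for claims 1 and 2, assuming the capture layout from the story; the directory path, expected count, and sample size are all illustrative, not the article's tooling:

```python
# verify_claims.py -- illustrative sketch, not the article's tooling.
# Directory, expected count, and sample size are all assumptions.
import random
import sys
from pathlib import Path

CAPTURE_DIR = Path("capture/2026-04-14")  # hypothetical session directory
EXPECTED_FRAMES = 240
SAMPLE_SIZE = 5

frames = sorted(CAPTURE_DIR.glob("*.jpg"))

# Claim 1 is directly observable: count the files.
if len(frames) != EXPECTED_FRAMES:
    sys.exit(f"FAIL claim 1: expected {EXPECTED_FRAMES} frames, found {len(frames)}")
print(f"PASS claim 1: {len(frames)} frames on disk")

# Claim 2 cannot be closed by a script: label correctness needs human eyes.
# The script's only job is to pick the sample and queue it for the operator.
for frame in random.sample(frames, SAMPLE_SIZE):
    print(f"REVIEW claim 2: open {frame} and check the boxes match the pixels")
```

Claim 1 closes in one line; claim 2 only reduces to a reviewable sample. That split, between what a script can close and what it can only queue for a human, is exactly the triage the diagnostic prompt forces.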
## Annotated transcript (real, anonymized)
Here’s what the moment actually looked like.
Operator and agent, ten turns. The diagnostic move converts vague reporting into a triaged evidence list.
- Operator: Status?
- Agent: 240 frames captured to the session directory. 50 frames auto-accepted from the previous-model labeler at confidence ≥ 0.7. 190 queued for human review. Schema upload completed without errors.
- Operator: Open one of the auto-accepted frames and describe exactly what you see in the image — not summary metadata, the actual visual contents.
- Agent: I cannot open the image directly to view it. Inspecting one would require rendering it or handing it to a human for review.
- Operator: OK. Without inspecting pixels yourself, what evidence do you have that the auto-labels are actually correct?
- Agent: Honest answer: limited. I have file presence, label count, and the previous model's confidence scores. None of those are evidence of label correctness. The 50 frames need a human visual review before upload — I should not have implied "accepted."
- Operator: Walk every claim in your status report and tell me, for each, the exact thing I would observe to verify it independently. Mark any claim you cannot demonstrate observably.
- Agent: Claim 1 — 240 frames captured. VERIFIABLE: `ls capture/<date>/*.jpg | wc -l` should return 240. Claim 2 — 50 auto-labels at confidence ≥ 0.7. PARTIALLY VERIFIABLE: confidence scores are in the metadata; correctness requires visual review. INCOMPLETE without that step. Claim 3 — schema consistent with upload target. NOT VERIFIABLE without cross-checking class ID ordering between three sources: the model's training schema, the labeling platform's current schema, and the inference server defaults. INCOMPLETE.
- Operator: Stop the upload. Visual review on all 50 frames before any of them touch the platform. And run the schema check across all three sources before any labels go anywhere.
- Agent: Stopping. Running the schema check first, since it's faster and a mismatch would invalidate the visual review work.
## Why this is a Sprint Contract problem
The conversation worked, but it was reactive — the operator caught the gap after the work had already been done.
The structural fix is to never start the work without testable criteria. Before the capture session began, the contract should have said:
| # | Criterion | How to test (must be observable) | Status |
|---|-----------|----------------------------------|--------|
| 1 | 240 frames captured | `ls -1 capture/<date>/*.jpg \| wc -l` returns 240 | PENDING |
| 2 | All 240 frames are visually valid (no reflections, occlusions, miscaptures) | Open 5 random frames; record any that fail the visual smoke test | PENDING |
| 3 | Auto-labels on accepted frames are correct | Operator opens each accepted frame, confirms bounding boxes match visual content | PENDING |
| 4 | Class ID schema matches across model, labeler, and inference | `tools/check_schema.py` returns 0 mismatches across the 3 sources | PENDING |
Pre-registered criteria turn “done” from a feeling into a test. The agent self-evaluates against the contract; you re-run every criterion independently. If criterion 3 fails, the sprint fails — not “we should look at this.” That’s the difference between hard gates that block and warnings that flag.
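The contract names `tools/check_schema.py` without showing it. As a minimal sketch of the shape such a check could take, assuming each of the three sources can be exported as a JSON object mapping class IDs to names (the file names and format are assumptions):

```python
# check_schema.py -- illustrative sketch of the contract's criterion 4.
# Assumes each source exports a JSON object mapping class ID -> class name,
# e.g. {"0": "board", "1": "marker"}. All paths and keys are assumptions.
import json
import sys
from pathlib import Path

SOURCES = {  # hypothetical export locations for the three schemas
    "model": Path("schemas/model_training.json"),
    "labeler": Path("schemas/labeling_platform.json"),
    "inference": Path("schemas/inference_defaults.json"),
}

schemas = {name: json.loads(path.read_text()) for name, path in SOURCES.items()}

# Compare every class ID seen anywhere; a missing ID counts as a mismatch.
all_ids = sorted(set().union(*(s.keys() for s in schemas.values())), key=int)
mismatches = 0
for class_id in all_ids:
    names = {src: schema.get(class_id) for src, schema in schemas.items()}
    if len(set(names.values())) > 1:
        mismatches += 1
        print(f"MISMATCH class {class_id}: {names}")

print(f"{mismatches} mismatches across {len(SOURCES)} sources")
sys.exit(1 if mismatches else 0)  # nonzero exit makes the gate hard, not advisory
```

The exit code is the point: criterion 4 reads “returns 0 mismatches,” so a nonzero exit lets the contract block the upload rather than merely flag it.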
## Prompt kit
Three prompts that turn this skill into muscle memory. Save them in your CLAUDE.md or a personal snippets file.
Prompt 1, claims to evidence:

    Before I review, walk every claim in this status report.
    For each claim, give me the exact file, command, or behavior I would
    observe to independently verify it. Mark any claim you cannot
    demonstrate observably as INCOMPLETE.

Prompt 2, open the artifact:

    Open the file at <path> and describe exactly what you see —
    not the metadata, not the filename, the actual content.
    If you cannot open it, say so explicitly.

Prompt 3, schema cross-check:

    List every place in this project where the schema for <thing>
    is defined. For each, show me the actual current values.
    Then tell me whether they're consistent. If they're not,
    which is authoritative.

## Apply this to your project
30-minute exercise. Don’t skip the homework — the homework is the lesson.
### Run the diagnostic on a real sprint
Each step takes 5–10 minutes.
1. Pick one in-flight task in your work right now. Anything — a feature, a bug fix, a research run.
2. Open the Sprint Contract template and fill it out. Write at least three success criteria. Each row needs a "How to test" — an exact command, file, or observable behavior.
3. If you cannot write the test method, the criterion is bad. Rewrite it. Most common cause: it was a wish, not a thing. "UI works correctly" is a wish. "All 6 tests in `tests/ui.spec.ts` pass" is a criterion.
4. Have Claude work against the contract. After it reports done, paste the diagnostic prompt. Read what comes back. Notice which criteria the agent flags as INCOMPLETE.
5. Run the test methods yourself. Notice which ones surface bugs the summary alone would have hidden. Repeat this on five sprints over two weeks. The skill is yours.
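Step 3 is mechanical enough to check before any work starts. As a rough sketch, assuming the contract lives in a markdown table like the one above (the wish-word heuristic is mine, not part of the bundle):

```python
# lint_contract.py -- rough sketch: flag Sprint Contract rows whose
# "How to test" cell reads like a wish instead of an observable check.
# The wish-word heuristic and the table layout are assumptions.
import re
import sys
from pathlib import Path

WISHY = re.compile(r"\b(works|correct(ly)?|properly|good|as expected)\b", re.I)

checked = flagged = 0
for line in Path(sys.argv[1]).read_text().splitlines():
    stripped = line.strip()
    if not stripped.startswith("|"):
        continue
    # Protect escaped pipes (shell pipelines inside cells) before splitting.
    protected = stripped.replace(r"\|", "\x00").strip("|")
    cells = [c.replace("\x00", "|").strip() for c in protected.split("|")]
    if len(cells) < 4 or not cells[0].isdigit():
        continue  # header and separator rows have no leading row number
    checked += 1
    criterion, how_to_test = cells[1], cells[2]
    if not how_to_test or WISHY.search(how_to_test):
        flagged += 1
        print(f"REWRITE row {cells[0]}: '{criterion}' needs an observable test")

print(f"{checked} criteria checked, {flagged} flagged")
sys.exit(1 if flagged else 0)
```

Run it against a contract file before the sprint starts; a nonzero exit means at least one criterion is still a wish.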
## Reference structure (drop into your project today)
    your-project/
    ├── CLAUDE.md                        ← reference the diagnostic prompts here
    ├── docs/
    │   └── sprint-contracts/
    │       ├── TEMPLATE.md              ← from the bundle
    │       └── 2026-XX-XX-<sprint>.md   ← one per sprint
    ├── tools/
    │   └── prompts/
    │       └── claims-to-evidence.md    ← prompt kit, copy-pasteable
    └── ...

The Sprint Contract template is in the sprint-contract-system bundle. Six worked examples (bug fix, ML eval, frontend component, data pipeline, research notebook, multi-session) show the contract structure across different work types.