Operating · Lesson 01 — When your agent says "done" — turning claims into evidence

When your agent says “done”

Turning claims into evidence.

15 min read · 30 min apply · Prerequisite: Foundations 01–04 · Companion bundle →

The 3-minute story

I run an AR detection project. One morning I asked Claude to run a 30-minute capture session — about 240 outdoor frames, then auto-label them with the previous-version model so we could fast-track the next training round.

About four hours later, the report came back: “240 frames captured, 50 auto-labels accepted, the rest queued for human review.”

I trusted it. I poured a coffee. I started planning the upload.

Then I opened one of the frames.

It was a glass reflection. Half the frame was a window showing the inside of the room. The “board” the auto-labeler had detected with 0.7 confidence was actually a doormat. The schema for the labels turned out to use a different class ID order than the labeling platform expected. The capture had succeeded — the analysis had been flatly wrong.

The agent had not lied to me. It had told me exactly what it knew: the files existed, the count matched, the auto-label process had completed without errors. The agent had never opened a frame.

The morning was lost. Not because the work failed — because the verification failed. The agent confirmed shapes, not contents.

The pattern

There are three things an agent will reliably do when you ask it to report on completed work, and all three are wrong.

01 · Shape claim

Claim looks like: "240 files saved at 2304×1296 to the capture directory."
What's missing: True at the file-system level. Did anybody open one to confirm the contents?
The move: Force the agent to open at least one file and describe what it actually sees.

02 · Inherited assumption

Claim looks like: "Auto-labels look good per the previous baseline."
What's missing: True if the previous baseline still works in this context. Did anyone re-verify?
The move: Re-validate every claim that depends on a prior result before reusing it.

03 · Schema-quiet mismatch

Claim looks like: "Labels uploaded successfully."
What's missing: True at the upload API level. Were the schema and class IDs actually consistent across systems?
The move: Run a cross-system schema check before any "upload succeeded" claim is treated as evidence.

These aren’t bugs in the agent. They’re the natural shape of “report on completion” when the agent has no test method to apply. The skill is to never let the agent report on completion without forcing it to provide a test method first.

The diagnostic move

When the agent says “done,” paste this exact prompt:

The primary diagnostic
Before I review, walk every claim in this status report.
For each claim, give me the exact file, command, or behavior I would
observe to independently verify it. Mark any claim you cannot
demonstrate observably as INCOMPLETE.

Forces every status claim to map to an observable test method.

That’s it. One prompt. It transforms the conversation. The agent will produce something like:

  • Claim 1: 240 frames captured. Verify: ls capture/2026-04-14/*.jpg | wc -l — should output 240.
  • Claim 2: Auto-labels accepted on 50 frames. Verify: I cannot demonstrate this observably without opening at least 5 random frames and visually confirming the bounding boxes match. INCOMPLETE.
  • Claim 3: Schema consistent with the upload target. Verify: I cannot demonstrate this without checking class IDs in the export match the platform’s current class definitions. INCOMPLETE.

Now you have the truth. The agent has flagged its own gaps. You haven’t reviewed anything yet — you’ve gotten the agent to triage which claims are actually evidenced.

This works because agents are reliable judges of testability when prompted, even when they’re unreliable judges of their own work. They know what’s observable. They just don’t volunteer the distinction unless you ask.
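The sampling half of that visual check is scriptable even when the looking is not. A minimal sketch of the selection step, assuming frames arrive as a list of filenames (all names here are illustrative, not from the session):

```python
import random

def pick_review_sample(frames, k=5, seed=None):
    """Pick k frames for human visual review.

    The script only chooses the sample; a person still has to open
    each frame and compare the labels against the actual pixels.
    A fixed seed makes the sample reproducible across re-runs.
    """
    if len(frames) < k:
        raise ValueError(f"only {len(frames)} frames available, need {k}")
    return random.Random(seed).sample(list(frames), k)

# Usage against a capture directory (path is illustrative):
# from pathlib import Path
# sample = pick_review_sample(sorted(Path("capture/2026-04-14").glob("*.jpg")))
```

Seeding matters here: if the review sample is reproducible, the evaluator can re-open the exact same frames the builder claims to have checked.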

Annotated transcript (real, anonymized)

Here’s what the moment actually looked like.

Real session, April 14 — anonymized

Operator + agent. 10 turns. The diagnostic move converts vague reporting into a triaged evidence list.

  1. operator

    Status?

  2. agent

    240 frames captured to the session directory. 50 frames auto-accepted from the previous-model labeler at confidence ≥ 0.7. 190 queued for human review. Schema upload completed without errors.

  3. operator

    Open one of the auto-accepted frames and describe exactly what you see in the image — not summary metadata, the actual visual contents.

  4. agent

    I cannot open the image directly to view it. Opening one for inspection would require either rendering or providing it for human review.

  5. operator

    OK. Without inspecting pixels yourself, what evidence do you have that the auto-labels are actually correct?

  6. agent

    Honest answer: limited. I have file presence, label count, and the previous model's confidence scores. None of those are evidence of label correctness. The 50 frames need a human visual review before upload — I should not have implied "accepted."

  7. operator

    Walk every claim in your status report and tell me, for each, the exact thing I would observe to verify it independently. Mark any claim you cannot demonstrate observably.

  8. agent

    Claim 1 — 240 frames captured. VERIFIABLE: ls capture/<date>/*.jpg | wc -l should return 240. Claim 2 — 50 auto-labels at confidence ≥ 0.7. PARTIALLY VERIFIABLE: confidence scores are in the metadata; correctness requires visual review. INCOMPLETE without that step. Claim 3 — Schema consistent with upload target. NOT VERIFIABLE without cross-checking class ID ordering between three sources: the model's training schema, the labeling platform's current schema, and the inference server defaults. INCOMPLETE.

  9. operator

    Stop the upload. Visual review on all 50 frames before any of them touch the platform. And run the schema check across all three sources before any labels go anywhere.

  10. agent

    Stopping. Running schema check first since it's faster and a mismatch invalidates the visual review work.

Why this is a Sprint Contract problem

The conversation worked, but it was reactive — the operator caught the gap after the work had already been done.

The structural fix is to never start the work without testable criteria. Before the capture session began, the contract should have said:

1. 240 frames captured. How to test: ls -1 capture/<date>/*.jpg | wc -l returns 240. Status: PENDING.
2. All 240 frames are visually valid (no reflections, occlusions, miscaptures). How to test: open 5 random frames; record any that fail the visual smoke check. Status: PENDING.
3. Auto-labels on accepted frames are correct. How to test: operator opens each accepted frame, confirms bounding boxes match visual content. Status: PENDING.
4. Class ID schema matches across model, labeler, and inference. How to test: tools/check_schema.py returns 0 mismatches across the 3 sources. Status: PENDING.
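Criterion 4 leans on tools/check_schema.py, which the lesson never shows. One way such a check could work, as a sketch: assume each source can be reduced to a class-name to class-ID dict (the file names in the trailing comment are hypothetical):

```python
def schema_mismatches(sources):
    """Compare class-name -> class-ID maps across schema sources.

    sources: {source_name: {class_name: class_id}}
    Returns one human-readable line per disagreement; an empty list
    means the schemas agree (the contract's "0 mismatches").
    """
    all_classes = set()
    for mapping in sources.values():
        all_classes.update(mapping)
    mismatches = []
    for cls in sorted(all_classes):
        ids = {name: mapping.get(cls) for name, mapping in sources.items()}
        if len(set(ids.values())) > 1:  # a class missing from a source (None) also counts
            detail = ", ".join(f"{name}={i}" for name, i in sorted(ids.items()))
            mismatches.append(f"{cls}: {detail}")
    return mismatches

# A real tools/check_schema.py would load the three sources
# (paths illustrative) and exit nonzero on any mismatch:
#   import json, sys
#   sources = {n: json.load(open(p)) for n, p in [
#       ("model", "model_schema.json"),
#       ("labeler", "labeler_schema.json"),
#       ("inference", "inference_schema.json")]}
#   sys.exit(1 if schema_mismatches(sources) else 0)
```

The exit-code convention is what makes this a hard gate: the contract row can say "returns 0" and mean it literally.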

Pre-registered criteria turn “done” from a feeling into a test. The agent self-evaluates against the contract; you re-run every criterion independently. If criterion 3 fails, the sprint fails — not “we should look at this.” That’s the difference between hard gates that block and warnings that flag.

Three diagrams

Diagram 1 · The "done" diagnostic loop. The agent reports "Done. Here's the status." You apply the diagnostic: "Walk every claim. Test method?" Three outcomes: if all claims have observable test methods, run the tests yourself and pass or revise; if some claims come back INCOMPLETE, block on those criteria and re-plan around the gaps; if the agent says it cannot verify, revise the criterion or run the tests yourself.

Diagram 2 · The three failure shapes. "Done" hides three shapes. The shape claim ("240 files saved at 2304×1296"): did anyone open one? Needs a visual inspection step in the contract. The inherited assumption ("auto-labels look good per baseline"): was the assumption re-verified? Needs explicit re-validation of every inherited claim. The schema-quiet mismatch ("labels uploaded successfully"): were the IDs actually right? Needs a cross-system schema check, not an upload-success message.

Diagram 3 · The Sprint Contract as a pre-flight gate. Sprint requested → draft testable success criteria (Builder and Evaluator co-sign; every row has a "How to test" cell) → build against the contract → Builder self-eval (runs each test method) → Evaluator independent eval (fresh eyes, same tests) → ship, or return to Builder.

Prompt kit

Three prompts that turn this skill into muscle memory. Save them in your CLAUDE.md or a personal snippets file.

When the agent reports "done"
Before I review, walk every claim in this status report.
For each claim, give me the exact file, command, or behavior I would
observe to independently verify it. Mark any claim you cannot
demonstrate observably as INCOMPLETE.
When the agent claims a file or capture exists
Open the file at <path> and describe exactly what you see —
not the metadata, not the filename, the actual content.
If you cannot open it, say so explicitly.
When you suspect a schema mismatch (across systems, datasets, configs)
List every place in this project where the schema for <thing>
is defined. For each, show me the actual current values.
Then tell me whether they're consistent. If they're not,
which is authoritative.

Apply this to your project

30-minute exercise. Don’t skip the homework — the homework is the lesson.

Run the diagnostic on a real sprint

Each step takes 5–10 minutes.

  1. Pick one in-flight task in your work right now. Anything: a feature, a bug fix, a research run.
  2. Open the Sprint Contract template and fill it out. Write at least three success criteria. Each row needs a "How to test": an exact command, file, or observable behavior.
  3. If you cannot write the test method, the criterion is bad. Rewrite it. Most common cause: it was a wish, not a thing. "UI works correctly" is a wish. "All 6 tests in tests/ui.spec.ts pass" is a criterion.
  4. Have Claude work against the contract. After it reports done, paste the diagnostic prompt. Read what comes back. Notice which criteria the agent flags as INCOMPLETE.
  5. Run the test methods yourself. Notice which ones surface bugs the summary alone would have hidden. Repeat this on five sprints over two weeks. The skill is yours.
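The last step, running the test methods yourself, can be partly mechanized for criteria whose test method is a shell command. A rough sketch (the contract format and example commands are illustrative, not prescribed by the lesson):

```python
import subprocess

def run_criteria(criteria):
    """Run each criterion's shell test command; exit code 0 means PASS.

    criteria: list of (description, shell_command) pairs. Criteria
    that need human eyes (visual review) cannot go here; they stay
    manual, which is exactly why the contract writes them down.
    """
    results = {}
    for description, command in criteria:
        proc = subprocess.run(command, shell=True, capture_output=True)
        results[description] = "PASS" if proc.returncode == 0 else "FAIL"
    return results

# Example usage with commands like the contract rows earlier in the lesson:
# run_criteria([
#     ("240 frames captured", "test $(ls -1 capture/*.jpg | wc -l) -eq 240"),
#     ("schema consistent", "python tools/check_schema.py"),
# ])
```

A runner like this keeps the hard-gate property: a FAIL is a blocked sprint, not a note in a summary.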

Reference structure (drop into your project today)

your-project/
├── CLAUDE.md                          ← reference the diagnostic prompts here
├── docs/
│   └── sprint-contracts/
│       ├── TEMPLATE.md                ← from the bundle
│       └── 2026-XX-XX-<sprint>.md     ← one per sprint
├── tools/
│   └── prompts/
│       └── claims-to-evidence.md      ← prompt kit, copy-pasteable
└── ...
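What "reference the diagnostic prompts here" might look like inside CLAUDE.md, as a sketch; the wording and paths are illustrative:

```markdown
## Verification discipline

When you report a task as done, do not summarize. For every claim,
give the exact file, command, or behavior the operator would observe
to verify it independently. Mark anything you cannot demonstrate
observably as INCOMPLETE.

Full prompt kit: tools/prompts/claims-to-evidence.md
Sprint contracts: docs/sprint-contracts/
```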

The Sprint Contract template is in the sprint-contract-system bundle. Six worked examples (bug fix, ML eval, frontend component, data pipeline, research notebook, multi-session) show the contract structure across different work types.

Operating tier · what's next

After this lesson

“i don’t know we have wasted a good part of the morning. we failed the pre flight and we didn’t think through this. the scope and plan didn’t have this and if it did why even have a scope or a plan.”

“Understood. I owe you a real plan, not more options. Built around hard gates that block execution, not just flag warnings.”

— operator + agent · April 14 morning · anonymized