Operating · Lesson 01 — When your agent says "done" — turning claims into evidence

When your agent says “done”

Turning claims into evidence.

15 min read · 30 min apply · Prerequisite: Foundations 01–04 · Companion bundle →

The 3-minute story

I run an AR detection project. One morning I asked Claude to run a 30-minute capture session — about 240 outdoor frames, then auto-label them with the previous-version model so we could fast-track the next training round.

About four hours later, the report came back: “240 frames captured, 50 auto-labels accepted, the rest queued for human review.”

I trusted it. I poured a coffee. I started planning the upload.

Then I opened one of the frames.

It was a glass reflection. Half the frame was a window showing the inside of the room. The “board” the auto-labeler had detected with 0.7 confidence was actually a doormat. The schema for the labels turned out to use a different class ID order than the labeling platform expected. The capture had succeeded — the analysis had been flatly wrong.

The agent had not lied to me. It had told me exactly what it knew: the files existed, the count matched, the auto-label process had completed without errors. The agent had never opened a frame.

The morning was lost. Not because the work failed — because the verification failed. The agent confirmed shapes, not contents.

The pattern

There are three things an agent will reliably do when you ask it to report on completed work, and all three are wrong.

01 · Shape claim

Claim looks like: "240 files saved at 2304×1296 to the capture directory."
What's missing: True at the file-system level. Did anybody open one to confirm the contents?
The move: Force the agent to open at least one file and describe what it actually sees.

02 · Inherited assumption

Claim looks like: "Auto-labels look good per the previous baseline."
What's missing: True if the previous baseline still works in this context. Did anyone re-verify?
The move: Re-validate every claim that depends on a prior result before reusing it.

03 · Schema-quiet mismatch

Claim looks like: "Labels uploaded successfully."
What's missing: True at the upload API level. Were the schema and class IDs actually consistent across systems?
The move: Run a cross-system schema check before any "upload succeeded" claim is treated as evidence.

These aren’t bugs in the agent. They’re the natural shape of “report on completion” when the agent has no test method to apply. The skill is to never let the agent report on completion without forcing it to provide a test method first.

The diagnostic move

When the agent says “done,” paste this exact prompt:

The primary diagnostic
Before I review, walk every claim in this status report.
For each claim, give me the exact file, command, or behavior I would
observe to independently verify it. Mark any claim you cannot
demonstrate observably as INCOMPLETE.

Forces every status claim to map to an observable test method.

That’s it. One prompt. It transforms the conversation. The agent will produce something like:

  • Claim 1: 240 frames captured. Verify: ls capture/2026-04-14/*.jpg | wc -l — should output 240.
  • Claim 2: Auto-labels accepted on 50 frames. Verify: I cannot demonstrate this observably without opening at least 5 random frames and visually confirming the bounding boxes match. INCOMPLETE.
  • Claim 3: Schema consistent with the upload target. Verify: I cannot demonstrate this without checking class IDs in the export match the platform’s current class definitions. INCOMPLETE.

Now you have the truth. The agent has flagged its own gaps. You haven’t reviewed anything yet — you’ve gotten the agent to triage which claims are actually evidenced.

This works because agents are reliable judges of testability when prompted, even when they’re unreliable judges of their own work. They know what’s observable. They just don’t volunteer the distinction unless you ask.
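The sampling half of that visual check is scriptable even when the looking is not. A minimal sketch of the selection step, assuming frames arrive as a list of filenames (all names here are illustrative, not from the session):

```python
import random

def pick_review_sample(frames, k=5, seed=None):
    """Pick k frames for human visual review.

    The script only chooses the sample; a person still has to open
    each frame and compare the labels against the actual pixels.
    A fixed seed makes the sample reproducible across re-runs.
    """
    if len(frames) < k:
        raise ValueError(f"only {len(frames)} frames available, need {k}")
    return random.Random(seed).sample(list(frames), k)

# Usage against a capture directory (path is illustrative):
# from pathlib import Path
# sample = pick_review_sample(sorted(Path("capture/2026-04-14").glob("*.jpg")))
```

Seeding matters here: if the review sample is reproducible, the evaluator can re-open the exact same frames the builder claims to have checked.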

Annotated transcript (real, anonymized)

Here’s what the moment actually looked like.

Real session, April 14 — anonymized

Operator + agent. 10 turns. The diagnostic move converts vague reporting into a triaged evidence list.

  1. operator

    Status?

  2. agent

    240 frames captured to the session directory. 50 frames auto-accepted from the previous-model labeler at confidence ≥ 0.7. 190 queued for human review. Schema upload completed without errors.

  3. operator

    Open one of the auto-accepted frames and describe exactly what you see in the image — not summary metadata, the actual visual contents.

  4. agent

    I cannot open the image directly to view it. Opening one for inspection would require either rendering or providing it for human review.

  5. operator

    OK. Without inspecting pixels yourself, what evidence do you have that the auto-labels are actually correct?

  6. agent

    Honest answer: limited. I have file presence, label count, and the previous model's confidence scores. None of those are evidence of label correctness. The 50 frames need a human visual review before upload — I should not have implied "accepted."

  7. operator

    Walk every claim in your status report and tell me, for each, the exact thing I would observe to verify it independently. Mark any claim you cannot demonstrate observably.

  8. agent

    Claim 1 — 240 frames captured. VERIFIABLE: ls capture/<date>/*.jpg | wc -l should return 240. Claim 2 — 50 auto-labels at confidence ≥ 0.7. PARTIALLY VERIFIABLE: confidence scores are in the metadata; correctness requires visual review. INCOMPLETE without that step. Claim 3 — Schema consistent with upload target. NOT VERIFIABLE without cross-checking class ID ordering between three sources: the model's training schema, the labeling platform's current schema, and the inference server defaults. INCOMPLETE.

  9. operator

    Stop the upload. Visual review on all 50 frames before any of them touch the platform. And run the schema check across all three sources before any labels go anywhere.

  10. agent

    Stopping. Running schema check first since it's faster and a mismatch invalidates the visual review work.

Why this is a Sprint Contract problem

The conversation worked, but it was reactive — the operator caught the gap after the work had already been done.

The structural fix is to never start the work without testable criteria. Before the capture session began, the contract should have said:

1. 240 frames captured. How to test: ls -1 capture/<date>/*.jpg | wc -l returns 240. Status: PENDING.
2. All 240 frames are visually valid (no reflections, occlusions, miscaptures). How to test: open 5 random frames; record any that fail the visual smoke check. Status: PENDING.
3. Auto-labels on accepted frames are correct. How to test: operator opens each accepted frame, confirms bounding boxes match visual content. Status: PENDING.
4. Class ID schema matches across model, labeler, and inference. How to test: tools/check_schema.py returns 0 mismatches across the 3 sources. Status: PENDING.
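Criterion 4 leans on tools/check_schema.py, which the lesson never shows. One way such a check could work, as a sketch: assume each source can be reduced to a class-name to class-ID dict (the file names in the trailing comment are hypothetical):

```python
def schema_mismatches(sources):
    """Compare class-name -> class-ID maps across schema sources.

    sources: {source_name: {class_name: class_id}}
    Returns one human-readable line per disagreement; an empty list
    means the schemas agree (the contract's "0 mismatches").
    """
    all_classes = set()
    for mapping in sources.values():
        all_classes.update(mapping)
    mismatches = []
    for cls in sorted(all_classes):
        ids = {name: mapping.get(cls) for name, mapping in sources.items()}
        if len(set(ids.values())) > 1:  # a class missing from a source (None) also counts
            detail = ", ".join(f"{name}={i}" for name, i in sorted(ids.items()))
            mismatches.append(f"{cls}: {detail}")
    return mismatches

# A real tools/check_schema.py would load the three sources
# (paths illustrative) and exit nonzero on any mismatch:
#   import json, sys
#   sources = {n: json.load(open(p)) for n, p in [
#       ("model", "model_schema.json"),
#       ("labeler", "labeler_schema.json"),
#       ("inference", "inference_schema.json")]}
#   sys.exit(1 if schema_mismatches(sources) else 0)
```

The exit-code convention is what makes this a hard gate: the contract row can say "returns 0" and mean it literally.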

Pre-registered criteria turn “done” from a feeling into a test. The agent self-evaluates against the contract; you re-run every criterion independently. If criterion 3 fails, the sprint fails — not “we should look at this.” That’s the difference between hard gates that block and warnings that flag.

Three diagrams

Diagram 1 · The "done" diagnostic loop. The agent reports "Done. Here's the status." You apply the diagnostic: "Walk every claim. Test method?" Three outcomes: if all claims have observable test methods, run the tests yourself and pass or revise; if some claims come back INCOMPLETE, block on those criteria and re-plan around the gaps; if the agent says it cannot verify, revise the criterion or run the tests yourself.

Diagram 2 · The three failure shapes. "Done" hides three shapes. The shape claim ("240 files saved at 2304×1296"): did anyone open one? Needs a visual inspection step in the contract. The inherited assumption ("auto-labels look good per baseline"): was the assumption re-verified? Needs explicit re-validation of every inherited claim. The schema-quiet mismatch ("labels uploaded successfully"): were the IDs actually right? Needs a cross-system schema check, not an upload-success message.

Diagram 3 · The Sprint Contract as a pre-flight gate. Sprint requested → draft testable success criteria (Builder and Evaluator co-sign; every row has a "How to test" cell) → build against the contract → Builder self-eval (runs each test method) → Evaluator independent eval (fresh eyes, same tests) → ship, or return to Builder.

Prompt kit

Three prompts that turn this skill into muscle memory. Save them in your CLAUDE.md or a personal snippets file.

When the agent reports "done"
Before I review, walk every claim in this status report.
For each claim, give me the exact file, command, or behavior I would
observe to independently verify it. Mark any claim you cannot
demonstrate observably as INCOMPLETE.
When the agent claims a file or capture exists
Open the file at <path> and describe exactly what you see —
not the metadata, not the filename, the actual content.
If you cannot open it, say so explicitly.
When you suspect a schema mismatch (across systems, datasets, configs)
List every place in this project where the schema for <thing>
is defined. For each, show me the actual current values.
Then tell me whether they're consistent. If they're not,
which is authoritative.

Apply this to your project

30-minute exercise. Don’t skip the homework — the homework is the lesson.

Run the diagnostic on a real sprint

Each step takes 5–10 minutes.

  1. Pick one in-flight task in your work right now. Anything: a feature, a bug fix, a research run.
  2. Open the Sprint Contract template and fill it out. Write at least three success criteria. Each row needs a "How to test": an exact command, file, or observable behavior.
  3. If you cannot write the test method, the criterion is bad. Rewrite it. Most common cause: it was a wish, not a thing. "UI works correctly" is a wish. "All 6 tests in tests/ui.spec.ts pass" is a criterion.
  4. Have Claude work against the contract. After it reports done, paste the diagnostic prompt. Read what comes back. Notice which criteria the agent flags as INCOMPLETE.
  5. Run the test methods yourself. Notice which ones surface bugs the summary alone would have hidden. Repeat this on five sprints over two weeks. The skill is yours.
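The last step, running the test methods yourself, can be partly mechanized for criteria whose test method is a shell command. A rough sketch (the contract format and example commands are illustrative, not prescribed by the lesson):

```python
import subprocess

def run_criteria(criteria):
    """Run each criterion's shell test command; exit code 0 means PASS.

    criteria: list of (description, shell_command) pairs. Criteria
    that need human eyes (visual review) cannot go here; they stay
    manual, which is exactly why the contract writes them down.
    """
    results = {}
    for description, command in criteria:
        proc = subprocess.run(command, shell=True, capture_output=True)
        results[description] = "PASS" if proc.returncode == 0 else "FAIL"
    return results

# Example usage with commands like the contract rows earlier in the lesson:
# run_criteria([
#     ("240 frames captured", "test $(ls -1 capture/*.jpg | wc -l) -eq 240"),
#     ("schema consistent", "python tools/check_schema.py"),
# ])
```

A runner like this keeps the hard-gate property: a FAIL is a blocked sprint, not a note in a summary.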

Reference structure (drop into your project today)

your-project/
├── CLAUDE.md                          ← reference the diagnostic prompts here
├── docs/
│   └── sprint-contracts/
│       ├── TEMPLATE.md                ← from the bundle
│       └── 2026-XX-XX-<sprint>.md     ← one per sprint
├── tools/
│   └── prompts/
│       └── claims-to-evidence.md      ← prompt kit, copy-pasteable
└── ...
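What "reference the diagnostic prompts here" might look like inside CLAUDE.md, as a sketch; the wording and paths are illustrative:

```markdown
## Verification discipline

When you report a task as done, do not summarize. For every claim,
give the exact file, command, or behavior the operator would observe
to verify it independently. Mark anything you cannot demonstrate
observably as INCOMPLETE.

Full prompt kit: tools/prompts/claims-to-evidence.md
Sprint contracts: docs/sprint-contracts/
```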

The Sprint Contract template is in the sprint-contract-system bundle. Six worked examples (bug fix, ML eval, frontend component, data pipeline, research notebook, multi-session) show the contract structure across different work types.

Operating tier · what's next

After this lesson

“i don’t know we have wasted a good part of the morning. we failed the pre flight and we didn’t think through this. the scope and plan didn’t have this and if it did why even have a scope or a plan.”

“Understood. I owe you a real plan, not more options. Built around hard gates that block execution, not just flag warnings.”

— operator + agent · April 14 morning · anonymized