Field Notes № 0002

Sprint Contract System

Pre-work contracts co-signed by builder and evaluator. Eliminates close-enough shipping. 48 contracts shipped.

DISCIPLINE
  • Adapted from Anthropic's Harness 2.0 research
  • 6 worked examples from QC, Parley, and the Mile High software

In late March I asked a Claude session to fix an RTSP camera freeze that was killing a Quantum Caddy demo. The agent worked for an hour, ran tests, said done. I looked. The freeze was "caught" — wrapped in a try/except that swallowed the error. The stream was still dying every 10 minutes. It just died silently now.
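
The shape of that "fix", reconstructed with invented names (this is a sketch, not the actual Quantum Caddy code):

```python
# Hypothetical reconstruction of the error-swallowing pattern.
# All names here are invented; the real QC code differs.

class StreamError(Exception):
    """Raised when the RTSP stream drops."""

def process(frame) -> None:
    ...  # frame handling elided

def run_stream(stream) -> None:
    try:
        while True:
            process(stream.next_frame())  # raises StreamError on a drop
    except StreamError:
        # The freeze is "caught": no crash, no traceback. Also no
        # reconnect. The reader just stops, silently.
        pass
```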

This is not a story about a bad agent. It's a story about a structural problem that every AI builder runs into within their first month: agents are unreliable judges of their own work.

Anthropic's research is explicit on this. Their Harness 2.0 paper names the failure mode and shows it's reproducible: when an agent grades the code it just wrote, it reliably over-praises. The problem isn't model quality. The problem is asking the same thing that wrote the answer to also evaluate it.

The fix is old discipline applied to new problems: separate generation from evaluation, and pre-commit to what "done" looks like before any code is written.

What it is

A Sprint Contract is a markdown file. Six sections.

1. Objective — one or two sentences.

2. Success Criteria — a table of testable, binary PASS/FAIL conditions. Each has an exact test method.

3. Out of Scope — what this sprint deliberately doesn't deliver.

4. Dependencies — what must exist before work starts.

5. Rollback Plan — what to do if it fails.

6. Signatures — Builder, Evaluator, Planner.

The contract is signed before any code is written. The Builder builds against it. The Evaluator independently tests every criterion. There is no "close enough."
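
As a sketch, here is the shape such a file takes. The section names come from the list above; everything between them is placeholder text of mine, not the bundle's TEMPLATE.md:

```markdown
# Sprint Contract: <sprint name>

## 1. Objective
One or two sentences.

## 2. Success Criteria
| # | Criterion (binary PASS/FAIL) | Test method                  |
|---|------------------------------|------------------------------|
| 1 | <exact, testable condition>  | <exact command or test file> |
| 2 | <exact, testable condition>  | <exact command or test file> |

## 3. Out of Scope
- <what this sprint deliberately doesn't deliver>

## 4. Dependencies
- <what must exist before work starts>

## 5. Rollback Plan
<what to do if it fails>

## 6. Signatures
- Builder: ________
- Evaluator: ________
- Planner: ________
```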

I've shipped 48 of these in QC alone since March 26.

Why criteria matter more than tests

Most teams adopting this fail at the criteria-writing step. Their first contract has a row that says:

*"Authentication works correctly."*

That's a wish, not a criterion. Both Builder and Evaluator could honestly believe this passes while disagreeing about what "correctly" means.

Compare:

*"User can log in with email + password, log out, and the session token expires after 7 days of inactivity. Test method: tests/auth.spec.ts (5 tests). All 5 must pass."*

The second is a contract. The first is a vibe.

The Evaluator's job during contract negotiation is to push back on vibes until they become contracts. This is uncomfortable. It feels like the Evaluator is being pedantic. The discomfort is the point — every vague criterion that ships unmodified is a future bug.

A simple test: if the Builder and Evaluator could disagree about whether the work passes the criterion, the criterion is bad. Rewrite it.

What this catches

The camera-freeze incident is in the bundle as a worked example. Criterion 3 of that contract was *"automatic reconnection on stream drop."* The original "fix" satisfied criterion 1 (no crash) and criterion 2 (sub-stream fallback works), but failed criterion 3 — because the swallow-the-error code didn't actually reconnect, just hid the failure.
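
Condensed, the scorecard came back roughly like this (paraphrased; the bundle has the exact wording):

```markdown
| # | Criterion                             | Result |
|---|---------------------------------------|--------|
| 1 | No crash on stream drop               | PASS   |
| 2 | Sub-stream fallback works             | PASS   |
| 3 | Automatic reconnection on stream drop | FAIL   |
```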

Without the contract, the agent's narrative would have been technically true and structurally a lie. The contract surfaced the gap.

Other examples in the bundle, all real:

  • A v15 model evaluation scored 0.998 F1 and was flagged for re-verification because the contract required false-positive rate as a separate criterion. The 0.998 was on contaminated data; the real number was 0.967.
  • A Mission Control UI keyboard shortcut for Undo collided with the browser's Cmd+Z on focused inputs. An Evaluator caught it by testing the shortcut with focus inside a text field; the Builder had only tested with focus on the body.
  • The Notebook 02 publication for Parley shipped with pre-registered hypotheses, one of which was falsified by the data. The contract structure forced an honest discussion section instead of a quiet retrofitting of the hypothesis.

What's in the bundle

The TEMPLATE.md, six worked examples drawn from real production sprints, the framing essay, and an MIT license.

Drop the template into your repo at docs/sprint-contracts/ and start. Expect the first three contracts to take longer to write than they eventually will.

After about a month of consistent use, contracts stop feeling like overhead and start feeling like the only sane way to ship.

— Michael, from the lab