
Test-first with agents
TDD with an LLM that wants to skip it. Three enforcement patterns that work.
Why agents skip tests
Agents are trained to produce working code. The reward signal in training heavily weights “the code runs and does what was asked.” It does not heavily weight “a test was written first that proved the requirement.”
The result: when you ask an agent to ship a feature, it will write the implementation and then maybe write tests that match. The tests pass — they were written against the implementation — but they don’t encode the requirement. They encode whatever the implementation happened to do.
Real example from the QC repo: the dirt-path P1 blocker for Phase 0 retry. The requirement was “off_board events must fire on pair-window expiration.” If an agent had implemented and then written tests, the tests would have asserted whatever the (possibly buggy) implementation did. Writing the failing test FIRST made the requirement load-bearing — the test was the contract, the implementation was what made it pass.
TDD with agents is not a religion. It’s a mechanical enforcement layer for “the test encodes the requirement, not the implementation.” Once you’ve been burned by tests-after-the-fact (everyone has), the failing-test-first rhythm is what fixes it.
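To see the difference concretely, here is a minimal sketch (all names are hypothetical, loosely modeled on the QC example above) of a test written after a buggy implementation versus the test the requirement actually demands:

```python
# Hypothetical sketch -- not the QC code, just the shape of the failure.
class PairWindow:
    """Buggy implementation: expirations are silently dropped."""

    def __init__(self, timeout_ms: int):
        self.timeout_ms = timeout_ms
        self._open_throws: dict[str, int] = {}

    def start_throw(self, throw_id: str, t_ms: int) -> None:
        self._open_throws[throw_id] = t_ms

    def tick(self, t_ms: int) -> list:
        return []  # bug: never emits an off_board event


# Written AFTER the implementation: it asserts what the code does, so it
# passes -- and enshrines the bug as "expected behavior".
def test_tick_is_quiet_after_timeout():
    pw = PairWindow(timeout_ms=2000)
    pw.start_throw(throw_id="t1", t_ms=0)
    assert pw.tick(t_ms=2500) == []


# Written FIRST, from the requirement: it fails against this implementation,
# which is exactly the point.
def test_off_board_fires_on_pair_window_expiration():
    pw = PairWindow(timeout_ms=2000)
    pw.start_throw(throw_id="t1", t_ms=0)
    assert len(pw.tick(t_ms=2500)) == 1
```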
Three enforcement patterns
Telling the agent “please write tests first” doesn’t work. Mechanical enforcement does. Three patterns, each catching a different failure mode.
| Pattern | Where it lives | What it enforces |
|---|---|---|
| Pre-tool-use hook | .claude/hooks/pre-edit.sh | Edits to src/ require a paired edit in tests/ in the same session. Blocks at the tool-call level. |
| Contract gate | Sprint contract Done Means row | Every Done Means row cites a test by name. The contract closes only when the named test passes. |
| Failing-test-first prompt | CLAUDE.md or session-start prompt | For any new feature, agent writes the test, runs it, confirms failure, then waits for go. Stops the “skip the failing step” pattern. |
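For the contract gate, a Done Means row might look like the sketch below. The column layout is illustrative (your contract's exact format may differ); the test name is the one from the QC walkthrough later in this piece.

| Done Means | Evidence Required |
|---|---|
| off_board fires on pair-window expiration | pytest tests/test_dirt_path.py::test_off_board_fires_on_pair_window_expiration passes |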
Use all three. The hook catches when the agent tries to silently change tests to match implementation. The contract gate catches when the work claims to ship without a named test. The prompt catches the day-to-day rhythm. Skip any one and the discipline degrades.
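A minimal sketch of the pre-tool-use hook, in Python rather than the shell script the table names. It assumes Claude Code's PreToolUse hook interface (the pending tool call arrives as JSON on stdin; exiting with code 2 blocks the call and feeds stderr back to the agent) and a hypothetical per-session log file; check your version's hook docs before relying on the exact field names.

```python
#!/usr/bin/env python3
# .claude/hooks/pre-edit.py -- sketch: require a tests/ edit before src/ edits.
import json
import pathlib
import sys

SESSION_LOG = pathlib.Path(".claude/edited-files.log")  # hypothetical session log

payload = json.load(sys.stdin)  # assumed input shape: {"tool_name": ..., "tool_input": {...}}
file_path = payload.get("tool_input", {}).get("file_path", "")

edited = set(SESSION_LOG.read_text().splitlines()) if SESSION_LOG.exists() else set()

# Block src/ edits until at least one tests/ file has been touched this session.
if file_path.startswith("src/") and not any(p.startswith("tests/") for p in edited):
    print("Blocked: edit a test in tests/ before touching src/.", file=sys.stderr)
    sys.exit(2)  # assumed convention: exit 2 = block the tool call

# Otherwise record the path and let the edit through.
if file_path:
    SESSION_LOG.parent.mkdir(parents=True, exist_ok=True)
    with SESSION_LOG.open("a") as log:
        log.write(file_path + "\n")
sys.exit(0)
```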
Five failure patterns
- Tests written after the fact
- The skip-the-failing-step
- The 100%-coverage theater
- The integration-only loop
- The tolerated flaky test
Pattern #5 (tolerated flakiness) is the meta-failure. Once you tolerate one flaky test, you’ve trained the loop to ignore failures. The next real failure also gets ignored. Quarantine or fix — never tolerate.
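Quarantine has to be mechanical too, not a mental note. One way to do it in pytest, as a sketch: flaky here is a custom marker you register yourself (it is not a pytest built-in), and the tracking-issue id is illustrative.

```python
# pytest.ini -- register the marker once so pytest doesn't warn about it:
#   [pytest]
#   markers =
#       flaky(issue): quarantined flaky test, blocked from contract evidence
import pytest


@pytest.mark.flaky(issue="QC-142")  # tracking issue id is illustrative
def test_pair_window_under_load():
    ...
```

Run the suite that backs sprint-contract evidence with pytest -m "not flaky", so a quarantined test cannot satisfy a Done Means row until the marker comes off.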
The failing-test-first prompt
Paste this before any new feature. The agent writes the test, runs it, confirms the failure, and waits. The wait is the discipline.
```
Before writing any implementation code for this feature, write
the failing test first.
Steps:
1. Write the test that asserts the desired behavior. Be specific
about inputs and expected outputs.
2. Run the test. Confirm it FAILS, and that the failure is for
the right reason (missing capability, not a typo).
3. Show me the failure output verbatim.
4. STOP. Wait for me to confirm before writing implementation.
If you cannot make the test fail (because the implementation is
trivial or the test is poorly scoped), say so — that's a signal
the feature is too small for TDD or the test is wrong.
Output:
- The test code
- The failure output
- Confirmation: "Ready to implement. Awaiting go."
```

The pause between failure and implementation is what makes TDD work.
Notice the third step: “Show me the failure output verbatim.” That’s the proof step. Without it, the agent might claim the test fails without actually running it. The verbatim output is the receipt.
QC dirt-path TDD walkthrough
The actual rhythm on the QC Phase 0 P1 blocker. Failing test → minimum implementation → green → boundary cases → full suite green. Three exchanges, one passing test that becomes contract evidence.
The test is the contract; the implementation is what makes it pass.
- operator
QC Phase 0 retry. The dirt-path trigger fix is the P1 blocker — off_board outcomes need to fire on pair-window expiration. Before you write any code, write the failing test.
- agent
Reading the FSM specs at 04-Quantum-Caddy/specs/fsm-perception.md and the existing test suite at tests/test_pair_window.py to understand the structure.

Draft test:

```python
# tests/test_dirt_path.py
import pytest

from qc.fsm.perception import PairWindow
from qc.events import OffBoardEvent


def test_off_board_fires_on_pair_window_expiration():
    """P1 blocker for Phase 0 retry.

    When a bag is detected leaving the throw zone but no on-board landing
    event arrives before pair-window expires, an off_board event must fire
    automatically.
    """
    pw = PairWindow(timeout_ms=2000)
    pw.start_throw(throw_id="t1", t_ms=0)

    # No landing event arrives. Advance past timeout.
    events = pw.tick(t_ms=2500)

    off_board = [e for e in events if isinstance(e, OffBoardEvent)]
    assert len(off_board) == 1, "Expected exactly one off_board event"
    assert off_board[0].throw_id == "t1"
```

Running it now to confirm it fails:

```
$ pytest tests/test_dirt_path.py -v
FAILED tests/test_dirt_path.py::test_off_board_fires_on_pair_window_expiration
AttributeError: 'PairWindow' object has no attribute 'tick'
```

Good — the test fails. The PairWindow class doesn't have a tick method that emits OffBoardEvent on timeout. That's the implementation gap. I'll write the implementation next.
- operator
Show me the implementation diff. Test passes when?
- agent
Implementation in src/qc/fsm/perception/pair_window.py:

```python
class PairWindow:
    def __init__(self, timeout_ms: int):
        self.timeout_ms = timeout_ms
        self._open_throws: dict[str, int] = {}  # throw_id -> start_t_ms

    def start_throw(self, throw_id: str, t_ms: int) -> None:
        self._open_throws[throw_id] = t_ms

    def tick(self, t_ms: int) -> list[Event]:
        events = []
        expired = [
            tid for tid, start in self._open_throws.items()
            if t_ms - start >= self.timeout_ms
        ]
        for tid in expired:
            events.append(OffBoardEvent(throw_id=tid, t_ms=t_ms))
            del self._open_throws[tid]
        return events
```

Running the test:

```
$ pytest tests/test_dirt_path.py -v
PASSED tests/test_dirt_path.py::test_off_board_fires_on_pair_window_expiration
```

Green. The failing-then-passing transition is the contract met. Now I'll add the missing-cases tests — bag lands before timeout (no off_board), multiple throws in flight, timeout exactly at boundary.
- operator
Add the boundary cases. Then run the full test_pair_window.py suite to make sure nothing else broke.
- agent
Added three more tests:

- test_no_off_board_when_bag_lands_in_window — bag landing event before timeout cancels the off_board
- test_multiple_throws_independent_timeouts — three throws in flight, each times out independently
- test_off_board_fires_at_exact_timeout_boundary — t_ms == timeout_ms boundary

Full suite:

```
$ pytest tests/test_pair_window.py -v
============================= 14 passed in 0.42s =============================
```

14/14 green. The 10 pre-existing tests still pass; the 4 new tests pass. The dirt-path P1 blocker for Phase 0 retry is resolved per the sprint contract Done Means row.
Note: pytest tests/test_dirt_path.py is now in the Phase 0 retry contract's Evidence Required section. Future regressions will trip it.
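The transcript names the boundary tests but doesn't show them. A minimal sketch of two of them, assuming only the PairWindow interface from the walkthrough above (the landing-event cancellation test needs an API the transcript doesn't show, so it is left out here):

```python
from qc.events import OffBoardEvent
from qc.fsm.perception import PairWindow


def test_multiple_throws_independent_timeouts():
    pw = PairWindow(timeout_ms=2000)
    pw.start_throw(throw_id="t1", t_ms=0)
    pw.start_throw(throw_id="t2", t_ms=1000)

    # Only t1 has exceeded its window at t=2500; t1 must not fire a second time.
    first = [e.throw_id for e in pw.tick(t_ms=2500) if isinstance(e, OffBoardEvent)]
    second = [e.throw_id for e in pw.tick(t_ms=3200) if isinstance(e, OffBoardEvent)]

    assert first == ["t1"]
    assert second == ["t2"]


def test_off_board_fires_at_exact_timeout_boundary():
    pw = PairWindow(timeout_ms=2000)
    pw.start_throw(throw_id="t1", t_ms=0)

    # t_ms - start == timeout_ms counts as expired.
    events = pw.tick(t_ms=2000)
    assert len([e for e in events if isinstance(e, OffBoardEvent)]) == 1
```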
Prompt kit
Three prompts: failing-test-first, behavior-not-execution review, flaky test triage. Save in your CLAUDE.md.
```
Before writing any implementation code for this feature, write
the failing test first.
Steps:
1. Write the test that asserts the desired behavior. Be specific
about inputs and expected outputs.
2. Run the test. Confirm it FAILS, and that the failure is for
the right reason (missing capability, not a typo).
3. Show me the failure output verbatim.
4. STOP. Wait for me to confirm before writing implementation.
If you cannot make the test fail (because the implementation is
trivial or the test is poorly scoped), say so — that's a signal
the feature is too small for TDD or the test is wrong.
Output:
- The test code
- The failure output
- Confirmation: "Ready to implement. Awaiting go."
```

```
Walk this test file. For each test:
1. What behavior does it assert? Name the specific input → output
relationship being verified.
2. Or does it just verify the function runs without error?
(That's not a test, that's a smoke check.)
3. If the implementation were quietly broken, would this test
catch it?
Output a list of tests that pass the behavior bar and tests that
are smoke checks pretending to be tests. Smoke checks don't count
toward the sprint contract's test-coverage row.
```

```
Test <name> fails ~10% of the time. Walk it.
Possible causes:
- Race condition (multiple threads, async ordering)
- Time-dependent assertion (e.g., asserts on system clock)
- Network call (external API, retry not handled)
- Test pollution (state from another test bleeding in)
- Floating-point comparison without tolerance
Identify the cause. Propose either:
(a) the structural fix (preferred)
(b) a quarantine — mark @pytest.mark.flaky with a tracking
issue, blocked from sprint-contract evidence rows until fixed
Tolerated flakiness is not an option. Pick (a) or (b).
```

Apply this
45-minute exercise. Failing test first on a real feature. Audit the existing suite. Fix or quarantine the flakies.
TDD with an agent that wants to skip it
Each step takes 5–15 minutes.
1. Pick the next feature you're about to build. Write the failing test first. Force yourself to make it fail before you write any implementation. Confirm the failure is for the right reason.
2. Audit your existing test suite. How many tests assert behavior vs how many just check execution? Use the behavior-not-execution prompt. Smoke checks aren't bad — they just don't count as tests.
3. Add a sprint-contract row: "All Done Means rows cite a passing test by name." If a Done Means row can't cite a test, it's not verifiable — and the contract drifts back to vibes.
4. Triage every flaky test. Either fix it or quarantine it. Tolerated flakiness trains the agent (and you) to ignore real failures. Pick fix or quarantine — never tolerate.
5. Add a pre-commit hook that blocks tests/ edits without a matching src/ edit (and vice versa). The hook enforces the failing-test-first rhythm: test files only change with implementation files; implementation files only change with test files. A sketch of such a hook follows below.
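A sketch of the step 5 hook, assuming a plain git pre-commit hook and the src/ plus tests/ layout used in the walkthrough; adapt the path prefixes to your repo.

```python
#!/usr/bin/env python3
# .git/hooks/pre-commit -- block commits that touch src/ without tests/, and vice versa.
import subprocess
import sys

staged = subprocess.run(
    ["git", "diff", "--cached", "--name-only"],
    capture_output=True, text=True, check=True,
).stdout.splitlines()

touches_src = any(p.startswith("src/") for p in staged)
touches_tests = any(p.startswith("tests/") for p in staged)

if touches_src and not touches_tests:
    sys.exit("Blocked: src/ changed without a matching tests/ change.")
if touches_tests and not touches_src:
    sys.exit("Blocked: tests/ changed without a matching src/ change.")
```

Note that git commit --no-verify bypasses pre-commit hooks, so treat a skipped hook as something to catch in review rather than a silent pass.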