Operating · Lesson 15 — Test-first with agents

Test-first with agents

TDD with an LLM that wants to skip it. Three enforcement patterns that work.

18 min read · 45 min apply · prereq: Operating 04 (hooks)

Why agents skip tests

Agents are trained to produce working code. The training reward signal heavily weights “the code runs and does what was asked.” It does not heavily weight “a test was written first that proved the requirement.”

The result: when you ask an agent to ship a feature, it will write the implementation and then maybe write tests that match. The tests pass — they were written against the implementation — but they don’t encode the requirement. They encode whatever the implementation happened to do.

Real example from the QC repo: the dirt-path P1 blocker for Phase 0 retry. The requirement was “off_board events must fire on pair-window expiration.” If an agent had implemented and then written tests, the tests would have asserted whatever the (possibly buggy) implementation did. Writing the failing test FIRST made the requirement load-bearing — the test was the contract, the implementation was what made it pass.

TDD with agents is not a religion. It’s a mechanical enforcement layer for “the test encodes the requirement, not the implementation.” Once you’ve been burned by tests-after-the-fact (everyone has), the failing-test-first rhythm is what fixes it.

Three enforcement patterns

Telling the agent “please write tests first” doesn’t work. Mechanical enforcement does. Three patterns, each catching a different failure mode.

Pre-tool-use hook (lives in .claude/hooks/pre-edit.sh): edits to src/ require a paired edit in tests/ in the same session. Blocks at the tool-call level.

Contract gate (lives in the sprint contract's Done Means rows): every Done Means row cites a test by name. The contract closes only when the named test passes.

Failing-test-first prompt (lives in CLAUDE.md or the session-start prompt): for any new feature, the agent writes the test, runs it, confirms failure, then waits for go. Stops the "skip the failing step" pattern.

Use all three. The hook catches when the agent tries to silently change tests to match implementation. The contract gate catches when the work claims to ship without a named test. The prompt catches the day-to-day rhythm. Skip any one and the discipline degrades.
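
A minimal sketch of the first pattern, written in Python for readability. It assumes a Claude Code-style hook runner that pipes the pending tool call to the hook as JSON on stdin and treats a non-zero exit as "block this call"; the JSON field names and the session ledger path are illustrative, so verify them against your hook runner before relying on this.

#!/usr/bin/env python3
# Pre-tool-use hook sketch: require a tests/ edit before any src/ edit.
# ASSUMPTIONS: the tool call arrives as JSON on stdin with a
# tool_input.file_path field; a non-zero exit blocks the call; the ledger
# file stands in for whatever per-session state your runner gives you.
import json
import sys
from pathlib import Path

LEDGER = Path("/tmp/session-edit-ledger.txt")  # hypothetical session state

call = json.load(sys.stdin)
path = call.get("tool_input", {}).get("file_path", "")

seen = LEDGER.read_text().splitlines() if LEDGER.exists() else []

# Block src/ edits until at least one tests/ edit has landed this session.
if path.startswith("src/") and not any(p.startswith("tests/") for p in seen):
    print("Blocked: write the failing test in tests/ first.", file=sys.stderr)
    sys.exit(2)

# Allowed: record the edit so later calls can see the pairing.
if path:
    with LEDGER.open("a") as f:
        f.write(path + "\n")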

Five failure patterns

01 · Tests written after the fact

Claim looks like: Agent writes the implementation, then writes tests that pass against the implementation.
What's missing: These tests don't catch bugs — they encode them. Whatever the implementation does is what the tests assert. Future regressions slide through because the test was written to match the (possibly broken) behavior.
The move: Failing test first. The test fails BEFORE the implementation exists. The test is what defines the contract; the implementation is what makes it pass.

02 · The skip-the-failing-step

Claim looks like: Agent writes a test. Notices it fails. Updates the test to match what the code currently does. Test passes. Ships.
What's missing: TDD's whole value is the failing-then-passing transition. If you skip the failing step, you have a test that asserts current behavior, not desired behavior. Same disease as pattern #1.
The move: Pre-tool-use hook on file edits in tests/. Block any test edit that would change a failing test to passing without a paired implementation change.

03 · The 100%-coverage theater

Claim looks like: Agent writes a test for every function. Coverage reports 95%. Most tests assert that the function runs without error.
What's missing: Coverage measures lines hit, not behaviors verified. "It returns a value" is not a test. "It returns the correct value for input X" is. Theater coverage hides the fact that most code paths aren't actually validated.
The move: Sprint contract row: "Tests assert behavior, not execution." Code review prompt rejects tests that lack assertions about specific outputs. A contrast sketch follows this card.

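To make the distinction concrete, here is a contrast pair using the PairWindow class from the walkthrough below. The smoke check and the behavior test hit the same lines of code; only one of them would catch a broken tick().

from qc.fsm.perception import PairWindow

# Smoke check: only proves the call doesn't raise. Coverage counts it,
# but a quietly broken tick() still passes.
def test_tick_runs():
    pw = PairWindow(timeout_ms=2000)
    pw.tick(t_ms=0)

# Behavior test: pins a specific input -> output relationship. A tick()
# that fires early (or never) fails this.
def test_tick_emits_nothing_before_timeout():
    pw = PairWindow(timeout_ms=2000)
    pw.start_throw(throw_id="t1", t_ms=0)
    assert pw.tick(t_ms=1999) == []  # one tick shy of the window
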
04 · The integration-only loop

Claim looks like: Agent only writes integration tests. They take 30 seconds each to run. Agent stops running them. Bugs land.
What's missing: Slow tests don't get run. Bugs that unit tests would catch in 10ms hide until the slow integration suite eventually runs.
The move: Two tiers: fast unit tests (sub-second, run on every change) + slower integration tests (run before merge). Sprint contract names which tier each Done Means row points to. A marker sketch follows this card.

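One way to wire the two tiers is a plain pytest marker; the marker name is illustrative, and it needs registering (for example in pytest.ini) so pytest doesn't warn about it.

import pytest

# Fast loop (every change):  pytest -m "not integration"
# Full gate (before merge):  pytest
# Register the marker, e.g. in pytest.ini:
#   [pytest]
#   markers = integration: slow tests, run before merge only

@pytest.mark.integration
def test_full_fsm_pipeline_end_to_end():  # hypothetical slow test
    ...
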
05 · The tolerated flaky test

Claim looks like: Test fails ~10% of the time. Agent re-runs until it passes. Ships.
What's missing: Flaky tests train the agent (and you) to ignore failures. The next real failure also gets ignored.
The move: Either fix the flakiness (race conditions, time-dependent assertions, network calls) or quarantine the test until fixed. No tolerated flakiness.

Pattern #5 (tolerated flakiness) is the meta-failure. Once you tolerate one flaky test, you’ve trained the loop to ignore failures. The next real failure also gets ignored. Quarantine or fix — never tolerate.
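
The quarantine half of that move can be as blunt as a skip with a tracking reference; the issue ID below is hypothetical. An outright skip (rather than a rerun-until-green plugin) is deliberate: the test cannot flicker back to green and re-enter contract evidence until someone actually fixes it.

import pytest

# Quarantined: known-flaky, tracked in QC-142 (hypothetical issue ID).
# Skipped outright so it can't pass by luck and count as evidence.
@pytest.mark.skip(reason="flaky: time-dependent assertion, see QC-142")
def test_scores_settle_after_rapid_ticks():
    ...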

The failing-test-first prompt

Paste this before any new feature. The agent writes the test, runs it, confirms the failure, and waits. The wait is the discipline.

Failing test, then implementation
Before writing any implementation code for this feature, write
the failing test first.

Steps:
1. Write the test that asserts the desired behavior. Be specific
   about inputs and expected outputs.
2. Run the test. Confirm it FAILS, and that the failure is for
   the right reason (missing capability, not a typo).
3. Show me the failure output verbatim.
4. STOP. Wait for me to confirm before writing implementation.

If you cannot make the test fail (because the implementation is
trivial or the test is poorly scoped), say so — that's a signal
the feature is too small for TDD or the test is wrong.

Output:
- The test code
- The failure output
- Confirmation: "Ready to implement. Awaiting go."

The pause between failure and implementation is what makes TDD work.

Notice the third step: “Show me the failure output verbatim.” That’s the proof step. Without it, the agent might claim the test fails without actually running it. The verbatim output is the receipt.

QC dirt-path TDD walkthrough

The actual rhythm on the QC Phase 0 P1 blocker. Failing test → minimum implementation → green → boundary cases → full suite green. Four turns, one passing test that becomes contract evidence.

QC Phase 0 dirt-path P1 — TDD on a real blocker

The test is the contract; the implementation is what makes it pass.

  1. operator

    QC Phase 0 retry. The dirt-path trigger fix is the P1 blocker — off_board outcomes need to fire on pair-window expiration. Before you write any code, write the failing test.

  2. agent

    Reading the FSM specs at 04-Quantum-Caddy/specs/fsm-perception.md and the existing test suite at tests/test_pair_window.py to understand the structure.

    Draft test:

    # tests/test_dirt_path.py
    import pytest
    from qc.fsm.perception import PairWindow
    from qc.events import OffBoardEvent
    
    def test_off_board_fires_on_pair_window_expiration():
        """P1 blocker for Phase 0 retry.
        
        When a bag is detected leaving the throw zone but no
        on-board landing event arrives before pair-window expires,
        an off_board event must fire automatically.
        """
        pw = PairWindow(timeout_ms=2000)
        pw.start_throw(throw_id="t1", t_ms=0)
        
        # No landing event arrives. Advance past timeout.
        events = pw.tick(t_ms=2500)
        
        off_board = [e for e in events if isinstance(e, OffBoardEvent)]
        assert len(off_board) == 1, "Expected exactly one off_board event"
        assert off_board[0].throw_id == "t1"

    Running it now to confirm it fails:

    $ pytest tests/test_dirt_path.py -v
    FAILED tests/test_dirt_path.py::test_off_board_fires_on_pair_window_expiration
    AttributeError: 'PairWindow' object has no attribute 'tick'

    Good — the test fails. The PairWindow class doesn't have a tick method that emits OffBoardEvent on timeout. That's the implementation gap. I'll write the implementation next.

  3. operator

    Show me the implementation diff. Test passes when?

  4. agent

    Implementation in src/qc/fsm/perception/pair_window.py:

    from qc.events import Event, OffBoardEvent  # Event assumed to be the base event type in qc.events

    class PairWindow:
        def __init__(self, timeout_ms: int):
            self.timeout_ms = timeout_ms
            self._open_throws: dict[str, int] = {}  # throw_id -> start_t_ms
        
        def start_throw(self, throw_id: str, t_ms: int) -> None:
            self._open_throws[throw_id] = t_ms
        
        def tick(self, t_ms: int) -> list[Event]:
            events = []
            expired = [
                tid for tid, start in self._open_throws.items()
                if t_ms - start >= self.timeout_ms
            ]
            for tid in expired:
                events.append(OffBoardEvent(throw_id=tid, t_ms=t_ms))
                del self._open_throws[tid]
            return events

    Running the test:

    $ pytest tests/test_dirt_path.py -v
    PASSED tests/test_dirt_path.py::test_off_board_fires_on_pair_window_expiration

    Green. The failing-then-passing transition is the contract being met. Next I'll add tests for the missing cases — bag lands before timeout (no off_board), multiple throws in flight, timeout exactly at the boundary.

  5. operator

    Add the boundary cases. Then run the full test_pair_window.py suite to make sure nothing else broke.

  6. agent

    Added three more tests:
    - test_no_off_board_when_bag_lands_in_window — bag landing event before timeout cancels the off_board
    - test_multiple_throws_independent_timeouts — three throws in flight, each times out independently
    - test_off_board_fires_at_exact_timeout_boundary — t_ms == timeout_ms boundary

    Full suite (test_pair_window.py plus the new test_dirt_path.py):

    $ pytest tests/test_pair_window.py tests/test_dirt_path.py -v
    ============================= 14 passed in 0.42s =============================

    14/14 green. The 10 pre-existing tests still pass, and the 4 new ones pass. The dirt-path P1 blocker for Phase 0 retry is resolved per the sprint contract Done Means row.

    Note: pytest tests/test_dirt_path.py is now in the Phase 0 retry contract's Evidence Required section. Future regressions will trip it.
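
The boundary case from the last turn is worth spelling out. The implementation expires throws with a >= comparison, so t_ms == timeout_ms must fire, not just values past it. A sketch of what that test could look like (exact assertions are illustrative):

from qc.events import OffBoardEvent
from qc.fsm.perception import PairWindow

def test_off_board_fires_at_exact_timeout_boundary():
    # tick() expires throws when t_ms - start >= timeout_ms, so the
    # exact boundary must emit the event.
    pw = PairWindow(timeout_ms=2000)
    pw.start_throw(throw_id="t1", t_ms=0)
    events = pw.tick(t_ms=2000)
    assert any(isinstance(e, OffBoardEvent) for e in events)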

Prompt kit

Three prompts: failing-test-first, behavior-not-execution review, flaky test triage. Save in your CLAUDE.md.

The failing-test-first prompt
Before writing any implementation code for this feature, write
the failing test first.

Steps:
1. Write the test that asserts the desired behavior. Be specific
   about inputs and expected outputs.
2. Run the test. Confirm it FAILS, and that the failure is for
   the right reason (missing capability, not a typo).
3. Show me the failure output verbatim.
4. STOP. Wait for me to confirm before writing implementation.

If you cannot make the test fail (because the implementation is
trivial or the test is poorly scoped), say so — that's a signal
the feature is too small for TDD or the test is wrong.

Output:
- The test code
- The failure output
- Confirmation: "Ready to implement. Awaiting go."
Behavior-not-execution test review
Walk this test file. For each test:

1. What behavior does it assert? Name the specific input → output
   relationship being verified.
2. Or does it just verify the function runs without error?
   (That's not a test, that's a smoke check.)
3. If the implementation were quietly broken, would this test
   catch it?

Output a list of tests that pass the behavior bar and tests that
are smoke checks pretending to be tests. Smoke checks don't count
toward the sprint contract's test-coverage row.
Flaky test triage
Test <name> fails ~10% of the time. Walk it.

Possible causes:
- Race condition (multiple threads, async ordering)
- Time-dependent assertion (e.g., asserts on system clock)
- Network call (external API, retry not handled)
- Test pollution (state from another test bleeding in)
- Floating-point comparison without tolerance

Identify the cause. Propose either:
(a) the structural fix (preferred)
(b) a quarantine — mark @pytest.mark.flaky with a tracking
    issue, blocked from sprint-contract evidence rows until fixed

Tolerated flakiness is not an option. Pick (a) or (b).
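
For the floating-point cause, the structural fix is usually an explicit tolerance instead of exact equality. A minimal sketch with pytest.approx; the estimator is a stand-in for whatever accumulates the float error:

import pytest

def scored_distance() -> float:
    # Stand-in estimator; accumulates classic float error.
    return 0.1 + 0.2  # 0.30000000000000004, not 0.3

def test_scored_distance():
    # Brittle: `assert scored_distance() == 0.3` fails on representation
    # noise. Compare within an explicit tolerance instead.
    assert scored_distance() == pytest.approx(0.3, abs=1e-9)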

Apply this

45-minute exercise. Failing test first on a real feature. Audit the existing suite. Fix or quarantine the flakies.

TDD with an agent that wants to skip it

Each step takes 5–15 minutes.
  1. Pick the next feature you're about to build. Write the failing test first. Force yourself to make it fail before you write any implementation, and confirm the failure is for the right reason.
  2. Audit your existing test suite. How many tests assert behavior vs. how many just check execution? Use the behavior-not-execution prompt. Smoke checks aren't bad — they just don't count as tests.
  3. Add a sprint-contract row: "All Done Means rows cite a passing test by name." If a Done Means row can't cite a test, it's not verifiable — and the contract drifts back to vibes.
  4. Triage every flaky test. Either fix it or quarantine it. Tolerated flakiness trains the agent (and you) to ignore real failures. Pick fix or quarantine — never tolerate.
  5. Add a pre-commit hook that blocks tests/ edits without a matching src/ edit (and vice versa). The hook enforces the pairing half of the failing-test-first rhythm: test files only change with implementation files, and implementation files only change with test files. A sketch follows the list.
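
A minimal sketch of the step-5 hook, written as a Python .git/hooks/pre-commit; the src/ and tests/ prefixes match this lesson's layout and may differ in your repo:

#!/usr/bin/env python3
# .git/hooks/pre-commit sketch: staged src/ and tests/ changes must pair up.
# Coarse by design: any tests/ change pairs with any src/ change.
import subprocess
import sys

staged = subprocess.run(
    ["git", "diff", "--cached", "--name-only"],
    capture_output=True, text=True, check=True,
).stdout.splitlines()

has_src = any(p.startswith("src/") for p in staged)
has_tests = any(p.startswith("tests/") for p in staged)

if has_src != has_tests:  # one side changed without the other
    missing = "tests/" if has_src else "src/"
    print(f"Blocked: commit pairs src/ with tests/; add the {missing} change.")
    sys.exit(1)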