Operating · Lesson 17 — Code review with agents

Code review with agents

The review prompt that catches what humans miss. The escalation rules.

18 min read · 40 min apply · prereq: Operating 12 (branch hygiene)

What review catches that humans don’t

Human review catches taste, structure, naming, the obvious correctness bugs. Agent review, with the right prompt, catches the layer underneath: sprint-contract scope violations, memory-invariant conflicts, missing timeouts, hardcoded paths, tests that assert execution instead of behavior, security regressions.

The reason isn’t that agents are smarter. It’s that agents have read the sprint contract, the memory entries, and the related files into the same context window as the diff. A human reviewer rarely loads all that — they’re usually reviewing the diff in isolation. Context-loaded review finds context-dependent bugs.

The other reason: agents don’t get tired. The 12-file, 380-line diff that a human eye glazes over is the diff where bugs hide. The agent reads every line with the same attention. That’s where the marginal catches come from.

Real example from the QC sensor pivot: the radar driver wire-up commit had two blockers (hardcoded UART path, no read timeout) and a sprint-contract violation (missing tests on calibration.py). A human reading 12 files at midnight would have missed at least one. The agent caught all three because it loaded the contract first.

The 6-axis review

Six axes. Each catches a different class of failure. Agents tend to over-index on style if you don’t name the other five.

| Axis | What it catches | What humans usually miss |
|---|---|---|
| 1. Correctness | Off-by-ones, edge cases, error handling | Edge cases that aren't in the active diff but are in adjacent code that this diff calls. |
| 2. Scope | Files touched outside sprint-contract scope | The "while we're in there" refactor that wasn't in the contract. |
| 3. Security | Secrets, injection, auth gaps, privilege escalation | Tokens accidentally committed to logs or config files. |
| 4. Performance | N+1, blocking IO without timeout, unbounded memory | Missing timeouts on external calls (the QC radar bug). |
| 5. Tests | New behaviors without tests; tests that don't assert behavior | Coverage-theater tests that don't actually catch the bug they pretend to (example below). |
| 6. Taste | Naming, structure, comments, future-maintainer cost | Style is what humans see. Structure is what humans miss. |

Plus two cross-cutting checks: scope vs sprint contract and memory invariant check. The memory invariant check is the differentiator — it’s what catches diffs that violate rules you previously codified but forgot.
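The tests axis is the easiest one to fake, so here is the distinction made concrete. A minimal sketch in Python, borrowing the RadarCalibration class from the walkthrough later in this lesson (names and import path are illustrative):

from qc.sensors.radar.calibration import RadarCalibration  # illustrative path, assumes src/ layout

def test_calibration_runs():
    # Coverage theater: exercises the code path but asserts nothing
    # about the result. Passes as long as record_zero() doesn't crash.
    cal = RadarCalibration()
    cal.record_zero(distance_m=0.5)

def test_calibration_stores_offset():
    # Behavior: pins the outcome a caller depends on. This test fails
    # when record_zero() stores the wrong value; the one above doesn't.
    cal = RadarCalibration()
    cal.record_zero(distance_m=0.5)
    assert cal.offset_m == 0.5

Both tests produce identical coverage numbers. Only the second one is a review pass on axis 5.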

Five failure patterns

01 · The rubber-stamp review

Claim looks like: Operator runs a review prompt. Agent says "looks good." Operator merges.
What's missing: Generic review prompts produce generic answers. Agents are biased toward approval — saying "this is bad" is socially expensive in their training data. Without explicit invitation to find issues, they don't.
The move: Invite criticism explicitly. "Find the three things wrong with this diff. There are always three things." The framing changes the output.
02 · The line-by-line nitpick

Claim looks like: Review focuses on style, naming, indentation. Misses the actual structural bug.
What's missing: Style is the cheapest dimension. The expensive dimensions — correctness, security, scope creep — get less attention because they require thinking about the system, not the syntax.
The move: Multi-axis review. Six axes: correctness, scope, security, performance, tests, taste. Style is one part of taste. The other five axes are where the real bugs live.
03 · The review with no escalation

Claim looks like: Agent finds a bug. Says so. Operator doesn't see the message buried in 800 words of "looks good." Bug ships.
What's missing: Long review outputs hide the actionable signal. The agent-found-a-bug message gets lost. Without escalation, the review is theater.
The move: Structured output with explicit BLOCKERS / CONCERNS / NITS sections. BLOCKERS are merge-blocking. CONCERNS need response before merge. NITS are optional. The structure surfaces the signal.
04 · The review without context

Claim looks like: Reviewing the diff in isolation, no read of related files, no understanding of the sprint contract.
What's missing: A perfectly correct change to a file can violate a sprint-contract Out-of-Scope row. A test that asserts something locally can break a memory-stored invariant. Context-free review only catches local bugs.
The move: Review prompt loads the active sprint contract, the relevant memory entries, and the related files. The diff is reviewed in context, not in isolation.
05 · The review-by-the-author

Claim looks like: Agent that wrote the code reviews its own diff.
What's missing: Author bias is real. The same blind spots that produced the bug also miss the bug in review. The agent's mental model is too aligned with the implementation.
The move: Review by a fresh agent session. Different context, different priors. If you're solo, the fresh session is the closest thing to a second pair of eyes.

Pattern #5 (review-by-the-author) is the most under-recognized. Solo operators run into it constantly because there’s no other reviewer. The fresh-session rule is the workaround: spawn a new agent that hasn’t seen the implementation conversation, give it the diff and the context, get the review.
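Scripting the fresh-session rule is straightforward, because a new process is a new session with no implementation context. A minimal sketch in Python, assuming the Claude Code CLI and its -p headless flag; the contract path is an example, and you should swap in whatever agent CLI you actually run:

# fresh_review.py: spawn a fresh agent session to review the current diff.
import subprocess

diff = subprocess.run(
    ["git", "diff", "main...HEAD"],
    capture_output=True, text=True, check=True,
).stdout

prompt = (
    "Review this diff. Active sprint contract: memory/sprint-contracts/active.md.\n"
    "Run the 6-axis review: correctness, scope, security, performance, tests, taste.\n"
    "Output BLOCKERS / CONCERNS / NITS, with file:line on every blocker.\n\n" + diff
)

# A separate process is a separate session: no conversation history,
# no author bias. `claude -p` is headless print mode; adjust the
# command if your agent CLI differs.
review = subprocess.run(
    ["claude", "-p", prompt],
    capture_output=True, text=True, check=True,
).stdout
print(review)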

The review prompt

Six axes, structured output, escalation tiers. Save in your CLAUDE.md or as a slash command.

6-axis code review
Review this diff. Active sprint contract: <path>.
Related memory entries: <paths>.

Run the 6-axis review:

1. CORRECTNESS — does the code do what it claims? Edge cases?
   Off-by-ones?
2. SCOPE — are all changed files in the sprint contract scope?
   Flag any out-of-scope file.
3. SECURITY — secrets in commits, injection vectors, auth gaps,
   privilege boundaries.
4. PERFORMANCE — obvious perf issues (N+1, blocking IO without
   timeout, unbounded memory).
5. TESTS — does the diff ship with tests for new behaviors? Do
   the tests assert behavior, not execution?
6. TASTE — naming, structure, comments, future-maintainer cost.

Output format:
- BLOCKERS (merge-blocking): numbered list with file:line
- CONCERNS (needs response before merge): numbered list
- NITS (optional): numbered list
- Out-of-scope check: pass/fail
- Memory invariant check: any rules in memory that this diff
  conflicts with?

Be specific. "Looks good" is not a review. There are always three
things worth flagging — find them.

Catches scope creep, memory-invariant violations, and the long-diff bugs human review misses.

The “there are always three things worth flagging — find them” framing matters. Without it, agents default to approval. With it, they look harder.

Real QC commit walkthrough

The XM125 radar driver wire-up commit on QC: 12 files, 380 lines, three real findings split across BLOCKERS and CONCERNS. The review catches what a tired human would have missed.

QC sensor pivot — radar driver review

Two blockers (hardcoded path, no timeout) + one sprint-contract violation (missing tests). All three caught by the prompt.

  1. operator

    I just landed a commit on QC for the sensor pivot — wired the new XM125 radar driver path under src/qc/sensors/radar/. Diff is ~12 files, 380 lines. Run a review before I integrate with the bench-test rig.

  2. agent

    Loading: sprint contract memory/sprint-contracts/active.md, sensor strategy doc 04-Quantum-Caddy/Hardware/SENSOR-STRATEGY-RESET-2026-04-22.md, and the diff. Running 6-axis review.

    ## Code Review — XM125 Radar Driver Wire-Up
    Commit: <hash> · Files: 12 · Lines: +380 / -42
    
    ### BLOCKERS (merge-blocking)
    1. **Hardcoded UART path `/dev/ttyUSB0` in `radar_driver.py:27`.** Bench rig may have multiple USB devices; ttyUSB0 is not deterministic across boots. Pull from config.
    2. **No timeout on radar.read().** If the XM125 fails to respond, the read blocks indefinitely. The CV pipeline at :8642 will hang. Add a 100ms timeout with a structured error.
    
    ### CONCERNS (needs response before merge)
    3. **`src/qc/sensors/radar/calibration.py` lacks tests.** Sprint contract Done Means row "sensor module changes ship with passing tests" not satisfied. Two new public functions, zero new tests in the diff.
    4. **The `RadarFrame.merge()` method silently drops out-of-order frames.** This is probably right (radar is temporally ordered), but no comment explains why. Future maintainer will assume it's a bug and add reordering logic.
    
    ### NITS (optional)
    5. **Inconsistent naming:** `radar_driver.py` vs `tof_sensor.py` (existing). Pick one — _driver or _sensor.
    6. **Magic number `0x4147` in radar protocol header** — name it `XM125_FRAME_MAGIC`.
    
    ### Out-of-scope check
    All 12 files are under `src/qc/sensors/radar/` or `tests/sensors/radar/`. Sprint contract scope: bench-test wire-up. ✅
    
    ### Memory invariant check
    From `memory/feedback-qc-sensor-constraints.md`: "Smart board must ship as self-contained consumer product." The radar driver assumes USB. For the consumer product path, USB is fine (boards ship with the radar pre-attached). For the venue-mount path it'd be a problem. Flagging for awareness, not blocking.
  3. operator

    Fix the two blockers. Skip the NITs. For the test gap (concern #3), draft the missing tests.

  4. agent

    Two blockers fixed in commit <hash+1>:
    - radar_driver.py:27 now reads UART path from config.sensor.radar.port. Default /dev/ttyUSB0 preserved as fallback with deprecation warning.
    - radar.read() now takes timeout_ms: int = 100 and raises RadarReadTimeout on expiry. Added test test_radar_read_times_out_cleanly.

    Drafting tests for calibration.py:

    # tests/sensors/radar/test_calibration.py
    from qc.sensors.radar.calibration import RadarCalibration  # assumes src/ layout on sys.path

    def test_zero_point_calibration_returns_offset():
        cal = RadarCalibration()
        cal.record_zero(distance_m=0.5)
        assert cal.offset_m == 0.5
    
    def test_calibration_clears_on_reset():
        cal = RadarCalibration()
        cal.record_zero(distance_m=0.5)
        cal.reset()
        assert cal.offset_m == 0.0

    Four tests total. Running: 4/4 green. Concern #3 is now closed. Concern #4 is documentation — adding the comment explaining out-of-order drop. NITs left for the next cleanup pass.

    Ready to merge per the merge-etiquette rule (agent prepares, human lands).

  5. operator

    Quick second pass on the timeout fix. Did the agent introduce any test that asserts the timeout fires correctly under load?

  6. agent

    Yes — test_radar_read_times_out_cleanly:

    import time
    import pytest
    from qc.sensors.radar.radar_driver import RadarDriver, RadarReadTimeout

    def test_radar_read_times_out_cleanly():
        """100ms timeout fires when XM125 doesn't respond."""
        driver = RadarDriver(port="/dev/null", timeout_ms=100)
        start = time.monotonic()
        with pytest.raises(RadarReadTimeout):
            driver.read()
        elapsed = time.monotonic() - start
        assert 0.09 < elapsed < 0.15, "Timeout fired outside expected window"

    Two assertions:
    1. The exception fires (not a hang or a different exception)
    2. The timeout fires when expected (elapsed stays between 90ms and 150ms for the 100ms setting)

    The timing assertion is what catches a class of bugs where the timeout is implemented but doesn't actually limit the wait.

Prompt kit

Three prompts: the 6-axis review, the author-bias check, the pre-merge final gate. Save in your CLAUDE.md.

The 6-axis review prompt
Review this diff. Active sprint contract: <path>.
Related memory entries: <paths>.

Run the 6-axis review:

1. CORRECTNESS — does the code do what it claims? Edge cases?
   Off-by-ones?
2. SCOPE — are all changed files in the sprint contract scope?
   Flag any out-of-scope file.
3. SECURITY — secrets in commits, injection vectors, auth gaps,
   privilege boundaries.
4. PERFORMANCE — obvious perf issues (N+1, blocking IO without
   timeout, unbounded memory).
5. TESTS — does the diff ship with tests for new behaviors? Do
   the tests assert behavior, not execution?
6. TASTE — naming, structure, comments, future-maintainer cost.

Output format:
- BLOCKERS (merge-blocking): numbered list with file:line
- CONCERNS (needs response before merge): numbered list
- NITS (optional): numbered list
- Out-of-scope check: pass/fail
- Memory invariant check: any rules in memory that this diff
  conflicts with?

Be specific. "Looks good" is not a review. There are always three
things worth flagging — find them.
Author bias check (run after agent reviews its own code)
The agent that wrote this code also reviewed it. Run a
fresh-eyes review with these biases corrected:

- Re-read the diff WITHOUT the implementation context
- Look for "what would I be confused by if I came to this in 6
  months without the conversation history"
- Look for things the original agent might have justified but
  that don't actually hold up

Output: any new BLOCKERS or CONCERNS the original review missed.
If the original review was sufficient, say so explicitly with
reasoning.
Pre-merge final gate
Final pre-merge gate. Confirm:

1. All BLOCKERS from review are resolved (cite the fix commit)
2. All CONCERNS have a documented response (resolved or
   deferred with tracking issue)
3. Test suite green on the merge target after rebase
4. No new force-push, no history rewrite, no irreversible action
5. Merge note drafted (one paragraph: what shipped, what's
   deferred, known limitations)

Output: GO / NO-GO. If NO-GO, name what's missing.

Do not run the merge. Output the report. I'll run the merge.
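The escalation tiers are mechanical enough to enforce in code. A minimal sketch that parses a structured review and refuses to pass while any BLOCKERS remain, assuming the output keeps the BLOCKERS / CONCERNS / NITS headings from the prompts above:

# gate.py: turn a structured review into a merge decision.
import re
import sys

def parse_tiers(review: str) -> dict[str, list[str]]:
    """Collect numbered findings under each escalation heading."""
    tiers = {"BLOCKERS": [], "CONCERNS": [], "NITS": []}
    current = None
    for line in review.splitlines():
        heading = next((t for t in tiers if t in line.upper()), None)
        if heading:
            current = heading  # deliberately loose heading detection
            continue
        if current and re.match(r"\s*\d+\.", line):
            tiers[current].append(line.strip())
    return tiers

def gate(review: str) -> bool:
    tiers = parse_tiers(review)
    if tiers["BLOCKERS"]:
        print(f"NO-GO: {len(tiers['BLOCKERS'])} blocker(s) unresolved.")
        return False
    if tiers["CONCERNS"]:
        print(f"GO, with {len(tiers['CONCERNS'])} concern(s) to answer before merge.")
    return True

if __name__ == "__main__":
    sys.exit(0 if gate(sys.stdin.read()) else 1)

The exit code is the escalation rule: nonzero means BLOCKERS exist and the merge stops.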

Apply this

40-minute exercise. Run a real review. Calibrate against your historical review pattern.

Add agent review to your merge path

Each step takes 5–10 minutes.
  1. Pick a recent commit (yours or an agent's) and run the 6-axis review prompt against it. Even on shipped code, the review will surface debt; the point is to calibrate what the prompt finds.
  2. Compare the review output to what you reviewed for at the time. Most operators reviewed for correctness and style; the 6-axis prompt usually catches the scope and memory-invariant violations the human review missed.
  3. Add a CLAUDE.md line: "All commits to src/ get a 6-axis review by a fresh agent session before merge." The fresh-session rule is what gets you the second pair of eyes when you're solo.
  4. Set up a pre-merge hook that runs the review prompt automatically. The hook can be as simple as a script that opens a fresh agent session, pastes the diff, and saves the review output to a PR comment or local file (a sketch follows this list).
  5. Establish escalation rules: BLOCKERS block merge, CONCERNS get a documented response, NITS are optional. Write these rules in CLAUDE.md so the agent loads them on every review session; the structure is what makes the review actionable.
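The hook in step 4 can stay small: run the fresh-session review, keep the output as an artifact, then apply the step-5 escalation rules. A minimal sketch chaining the hypothetical fresh_review.py and gate.py scripts from earlier in this lesson:

# pre_merge_review.py: run before landing a branch.
import pathlib
import subprocess
import sys

# Fresh-session review of the current diff (see fresh_review.py above).
review = subprocess.run(
    [sys.executable, "fresh_review.py"],
    capture_output=True, text=True, check=True,
).stdout

# Keep the review as an artifact: a PR comment, or a file per branch.
branch = subprocess.run(
    ["git", "rev-parse", "--abbrev-ref", "HEAD"],
    capture_output=True, text=True, check=True,
).stdout.strip()
out = pathlib.Path("reviews") / f"{branch.replace('/', '-')}.md"
out.parent.mkdir(exist_ok=True)
out.write_text(review)

# Escalation rules: BLOCKERS block the merge (see gate.py above).
result = subprocess.run([sys.executable, "gate.py"], input=review, text=True)
sys.exit(result.returncode)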