Operating · Lesson 17 — Code review with agents

Code review with agents

The review prompt that catches what humans miss. The escalation rules.

18 min read · 40 min apply · prereq: Operating 12 (branch hygiene)

What review catches that humans don’t

Human review catches taste, structure, naming, the obvious correctness bugs. Agent review, with the right prompt, catches the layer underneath: sprint-contract scope violations, memory-invariant conflicts, missing timeouts, hardcoded paths, tests that assert execution instead of behavior, security regressions.

The reason isn’t that agents are smarter. It’s that agents have read the sprint contract, the memory entries, and the related files into the same context window as the diff. A human reviewer rarely loads all that — they’re usually reviewing the diff in isolation. Context-loaded review finds context-dependent bugs.

The other reason: agents don’t get tired. The 12-file, 380-line diff that a human eye glazes over is the diff where bugs hide. The agent reads every line with the same attention. That’s where the marginal catches come from.

Real example from the QC sensor pivot: the radar driver wire-up commit had two blockers (hardcoded UART path, no read timeout) and a sprint-contract violation (missing tests on calibration.py). A human reading 12 files at midnight would have missed at least one. The agent caught all three because it loaded the contract first.

The 6-axis review

Six axes. Each catches a different class of failure. Agents tend to over-index on style if you don’t name the other five.

| Axis | What it catches | What humans usually miss |
|---|---|---|
| 1. Correctness | Off-by-ones, edge cases, error handling | Edge cases that aren't in the active diff but are in adjacent code that this diff calls. |
| 2. Scope | Files touched outside sprint-contract scope | The "while we're in there" refactor that wasn't in the contract. |
| 3. Security | Secrets, injection, auth gaps, privilege escalation | Tokens accidentally committed to logs or config files. |
| 4. Performance | N+1, blocking IO without timeout, unbounded memory | Missing timeouts on external calls (the QC radar bug). |
| 5. Tests | New behaviors without tests; tests that don't assert behavior | Coverage-theater tests that don't actually catch the bug they pretend to (example below). |
| 6. Taste | Naming, structure, comments, future-maintainer cost | Style is what humans see. Structure is what humans miss. |

Plus two cross-cutting checks: scope vs sprint contract and memory invariant check. The memory invariant check is the differentiator — it’s what catches diffs that violate rules you previously codified but forgot.
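The tests axis is the easiest one to fake, so here is the distinction made concrete. A minimal sketch in Python, borrowing the RadarCalibration class from the walkthrough later in this lesson (names and import path are illustrative):

from qc.sensors.radar.calibration import RadarCalibration  # illustrative path, assumes src/ layout

def test_calibration_runs():
    # Coverage theater: exercises the code path but asserts nothing
    # about the result. Passes as long as record_zero() doesn't crash.
    cal = RadarCalibration()
    cal.record_zero(distance_m=0.5)

def test_calibration_stores_offset():
    # Behavior: pins the outcome a caller depends on. This test fails
    # when record_zero() stores the wrong value; the one above doesn't.
    cal = RadarCalibration()
    cal.record_zero(distance_m=0.5)
    assert cal.offset_m == 0.5

Both tests produce identical coverage numbers. Only the second one is a review pass on axis 5.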

Five failure patterns

01 · The rubber-stamp review

Claim looks like: Operator runs a review prompt. Agent says "looks good." Operator merges.
What's missing: Generic review prompts produce generic answers. Agents are biased toward approval — saying "this is bad" is socially expensive in their training data. Without explicit invitation to find issues, they don't.
The move: Invite criticism explicitly. "Find the three things wrong with this diff. There are always three things." The framing changes the output.
02 · The line-by-line nitpick

Claim looks like: Review focuses on style, naming, indentation. Misses the actual structural bug.
What's missing: Style is the cheapest dimension. The expensive dimensions — correctness, security, scope creep — get less attention because they require thinking about the system, not the syntax.
The move: Multi-axis review. Six axes: correctness, scope, security, performance, tests, taste. Style is one part of taste. The other five axes are where the real bugs live.
03 · The review with no escalation

Claim looks like: Agent finds a bug. Says so. Operator doesn't see the message buried in 800 words of "looks good." Bug ships.
What's missing: Long review outputs hide the actionable signal. The agent-found-a-bug message gets lost. Without escalation, the review is theater.
The move: Structured output with explicit BLOCKERS / CONCERNS / NITS sections. BLOCKERS are merge-blocking. CONCERNS need response before merge. NITS are optional. The structure surfaces the signal.
04 · The review without context

Claim looks like: Reviewing the diff in isolation, no read of related files, no understanding of the sprint contract.
What's missing: A perfectly correct change to a file can violate a sprint-contract Out-of-Scope row. A test that asserts something locally can break a memory-stored invariant. Context-free review only catches local bugs.
The move: Review prompt loads the active sprint contract, the relevant memory entries, and the related files. The diff is reviewed in context, not in isolation.
05 · The review-by-the-author

Claim looks like: Agent that wrote the code reviews its own diff.
What's missing: Author bias is real. The same blind spots that produced the bug also miss the bug in review. The agent's mental model is too aligned with the implementation.
The move: Review by a fresh agent session. Different context, different priors. If you're solo, the fresh session is the closest thing to a second pair of eyes.

Pattern #5 (review-by-the-author) is the most under-recognized. Solo operators run into it constantly because there’s no other reviewer. The fresh-session rule is the workaround: spawn a new agent that hasn’t seen the implementation conversation, give it the diff and the context, get the review.
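Scripting the fresh-session rule is straightforward, because a new process is a new session with no implementation context. A minimal sketch in Python, assuming the Claude Code CLI and its -p headless flag; the contract path is an example, and you should swap in whatever agent CLI you actually run:

# fresh_review.py: spawn a fresh agent session to review the current diff.
import subprocess

diff = subprocess.run(
    ["git", "diff", "main...HEAD"],
    capture_output=True, text=True, check=True,
).stdout

prompt = (
    "Review this diff. Active sprint contract: memory/sprint-contracts/active.md.\n"
    "Run the 6-axis review: correctness, scope, security, performance, tests, taste.\n"
    "Output BLOCKERS / CONCERNS / NITS, with file:line on every blocker.\n\n" + diff
)

# A separate process is a separate session: no conversation history,
# no author bias. `claude -p` is headless print mode; adjust the
# command if your agent CLI differs.
review = subprocess.run(
    ["claude", "-p", prompt],
    capture_output=True, text=True, check=True,
).stdout
print(review)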

The review prompt

Six axes, structured output, escalation tiers. Save in your CLAUDE.md or as a slash command.

6-axis code review
Review this diff. Active sprint contract: <path>.
Related memory entries: <paths>.

Run the 6-axis review:

1. CORRECTNESS — does the code do what it claims? Edge cases?
   Off-by-ones?
2. SCOPE — are all changed files in the sprint contract scope?
   Flag any out-of-scope file.
3. SECURITY — secrets in commits, injection vectors, auth gaps,
   privilege boundaries.
4. PERFORMANCE — obvious perf issues (N+1, blocking IO without
   timeout, unbounded memory).
5. TESTS — does the diff ship with tests for new behaviors? Do
   the tests assert behavior, not execution?
6. TASTE — naming, structure, comments, future-maintainer cost.

Output format:
- BLOCKERS (merge-blocking): numbered list with file:line
- CONCERNS (needs response before merge): numbered list
- NITS (optional): numbered list
- Out-of-scope check: pass/fail
- Memory invariant check: any rules in memory that this diff
  conflicts with?

Be specific. "Looks good" is not a review. There are always three
things worth flagging — find them.

Catches scope creep, memory-invariant violations, and the long-diff bugs human review misses.

The “there are always three things worth flagging — find them” framing matters. Without it, agents default to approval. With it, they look harder.

Real QC commit walkthrough

The XM125 radar driver wire-up commit on QC: 12 files, 380 lines, three real findings split across BLOCKERS and CONCERNS. The review catches what a tired human would have missed.

QC sensor pivot — radar driver review

Two blockers (hardcoded path, no timeout) + one sprint-contract violation (missing tests). All three caught by the prompt.

  1. operator

    I just landed a commit on QC for the sensor pivot — wired the new XM125 radar driver path under src/qc/sensors/radar/. Diff is ~12 files, 380 lines. Run a review before I integrate with the bench-test rig.

  2. agent

    Loading: sprint contract memory/sprint-contracts/active.md, sensor strategy doc 04-Quantum-Caddy/Hardware/SENSOR-STRATEGY-RESET-2026-04-22.md, and the diff. Running 6-axis review.

    ## Code Review — XM125 Radar Driver Wire-Up
    Commit: <hash> · Files: 12 · Lines: +380 / -42
    
    ### BLOCKERS (merge-blocking)
    1. **Hardcoded UART path `/dev/ttyUSB0` in `radar_driver.py:27`.** Bench rig may have multiple USB devices; ttyUSB0 is not deterministic across boots. Pull from config.
    2. **No timeout on radar.read().** If the XM125 fails to respond, the read blocks indefinitely. The CV pipeline at :8642 will hang. Add a 100ms timeout with a structured error.
    
    ### CONCERNS (needs response before merge)
    3. **`src/qc/sensors/radar/calibration.py` lacks tests.** Sprint contract Done Means row "sensor module changes ship with passing tests" not satisfied. Two new public functions, zero new tests in the diff.
    4. **The `RadarFrame.merge()` method silently drops out-of-order frames.** This is probably right (radar is temporally ordered), but no comment explains why. Future maintainer will assume it's a bug and add reordering logic.
    
    ### NITS (optional)
    5. **Inconsistent naming:** `radar_driver.py` vs `tof_sensor.py` (existing). Pick one — _driver or _sensor.
    6. **Magic number `0x4147` in radar protocol header** — name it `XM125_FRAME_MAGIC`.
    
    ### Out-of-scope check
    All 12 files are under `src/qc/sensors/radar/` or `tests/sensors/radar/`. Sprint contract scope: bench-test wire-up. ✅
    
    ### Memory invariant check
    From `memory/feedback-qc-sensor-constraints.md`: "Smart board must ship as self-contained consumer product." The radar driver assumes USB. For the consumer product path, USB is fine (boards ship with the radar pre-attached). For the venue-mount path it'd be a problem. Flagging for awareness, not blocking.
  3. operator

    Fix the two blockers. Skip the NITs. For the test gap (concern #3), draft the missing tests.

  4. agent

    Two blockers fixed in commit <hash+1>:
    - radar_driver.py:27 now reads UART path from config.sensor.radar.port. Default /dev/ttyUSB0 preserved as fallback with deprecation warning.
    - radar.read() now takes timeout_ms: int = 100 and raises RadarReadTimeout on expiry. Added test test_radar_read_times_out_cleanly.

    Drafting tests for calibration.py:

    # tests/sensors/radar/test_calibration.py
    from qc.sensors.radar.calibration import RadarCalibration  # assumes src/ layout on sys.path

    def test_zero_point_calibration_returns_offset():
        cal = RadarCalibration()
        cal.record_zero(distance_m=0.5)
        assert cal.offset_m == 0.5
    
    def test_calibration_clears_on_reset():
        cal = RadarCalibration()
        cal.record_zero(distance_m=0.5)
        cal.reset()
        assert cal.offset_m == 0.0

    Four tests total. Running: 4/4 green. Concern #3 is now closed. Concern #4 is documentation — adding the comment explaining out-of-order drop. NITs left for the next cleanup pass.

    Ready to merge per the merge-etiquette rule (agent prepares, human lands).

  5. operator

    Quick second pass on the timeout fix. Did the agent introduce any test that asserts the timeout fires correctly under load?

  6. agent

    Yes — test_radar_read_times_out_cleanly:

    import time
    import pytest
    from qc.sensors.radar.radar_driver import RadarDriver, RadarReadTimeout

    def test_radar_read_times_out_cleanly():
        """100ms timeout fires when XM125 doesn't respond."""
        driver = RadarDriver(port="/dev/null", timeout_ms=100)
        start = time.monotonic()
        with pytest.raises(RadarReadTimeout):
            driver.read()
        elapsed = time.monotonic() - start
        assert 0.09 < elapsed < 0.15, "Timeout fired outside expected window"

    Two assertions:
    1. The exception fires (not a hang or a different exception)
    2. The timeout fires when expected (elapsed stays between 90ms and 150ms for the 100ms setting)

    The timing assertion is what catches a class of bugs where the timeout is implemented but doesn't actually limit the wait.

Prompt kit

Three prompts: the 6-axis review, the author-bias check, the pre-merge final gate. Save in your CLAUDE.md.

The 6-axis review prompt
Review this diff. Active sprint contract: <path>.
Related memory entries: <paths>.

Run the 6-axis review:

1. CORRECTNESS — does the code do what it claims? Edge cases?
   Off-by-ones?
2. SCOPE — are all changed files in the sprint contract scope?
   Flag any out-of-scope file.
3. SECURITY — secrets in commits, injection vectors, auth gaps,
   privilege boundaries.
4. PERFORMANCE — obvious perf issues (N+1, blocking IO without
   timeout, unbounded memory).
5. TESTS — does the diff ship with tests for new behaviors? Do
   the tests assert behavior, not execution?
6. TASTE — naming, structure, comments, future-maintainer cost.

Output format:
- BLOCKERS (merge-blocking): numbered list with file:line
- CONCERNS (needs response before merge): numbered list
- NITS (optional): numbered list
- Out-of-scope check: pass/fail
- Memory invariant check: any rules in memory that this diff
  conflicts with?

Be specific. "Looks good" is not a review. There are always three
things worth flagging — find them.
Author bias check (run after agent reviews its own code)
The agent that wrote this code also reviewed it. Run a
fresh-eyes review with these biases corrected:

- Re-read the diff WITHOUT the implementation context
- Look for "what would I be confused by if I came to this in 6
  months without the conversation history"
- Look for things the original agent might have justified but
  that don't actually hold up

Output: any new BLOCKERS or CONCERNS the original review missed.
If the original review was sufficient, say so explicitly with
reasoning.
Pre-merge final gate
Final pre-merge gate. Confirm:

1. All BLOCKERS from review are resolved (cite the fix commit)
2. All CONCERNS have a documented response (resolved or
   deferred with tracking issue)
3. Test suite green on the merge target after rebase
4. No new force-push, no history rewrite, no irreversible action
5. Merge note drafted (one paragraph: what shipped, what's
   deferred, known limitations)

Output: GO / NO-GO. If NO-GO, name what's missing.

Do not run the merge. Output the report. I'll run the merge.
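The escalation tiers are mechanical enough to enforce in code. A minimal sketch that parses a structured review and refuses to pass while any BLOCKERS remain, assuming the output keeps the BLOCKERS / CONCERNS / NITS headings from the prompts above:

# gate.py: turn a structured review into a merge decision.
import re
import sys

def parse_tiers(review: str) -> dict[str, list[str]]:
    """Collect numbered findings under each escalation heading."""
    tiers = {"BLOCKERS": [], "CONCERNS": [], "NITS": []}
    current = None
    for line in review.splitlines():
        heading = next((t for t in tiers if t in line.upper()), None)
        if heading:
            current = heading  # deliberately loose heading detection
            continue
        if current and re.match(r"\s*\d+\.", line):
            tiers[current].append(line.strip())
    return tiers

def gate(review: str) -> bool:
    tiers = parse_tiers(review)
    if tiers["BLOCKERS"]:
        print(f"NO-GO: {len(tiers['BLOCKERS'])} blocker(s) unresolved.")
        return False
    if tiers["CONCERNS"]:
        print(f"GO, with {len(tiers['CONCERNS'])} concern(s) to answer before merge.")
    return True

if __name__ == "__main__":
    sys.exit(0 if gate(sys.stdin.read()) else 1)

The exit code is the escalation rule: nonzero means BLOCKERS exist and the merge stops.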

Apply this

40-minute exercise. Run a real review. Calibrate against your historical review pattern.

Add agent review to your merge path

Each step takes 5–10 minutes.
  1. Pick a recent commit (yours or an agent's) and run the 6-axis review prompt against it. Even on shipped code, the review will surface debt; the point is to calibrate what the prompt finds.
  2. Compare the review output to what you reviewed for at the time. Most operators reviewed for correctness and style; the 6-axis prompt usually catches the scope and memory-invariant violations the human review missed.
  3. Add a CLAUDE.md line: "All commits to src/ get a 6-axis review by a fresh agent session before merge." The fresh-session rule is what gets you the second pair of eyes when you're solo.
  4. Set up a pre-merge hook that runs the review prompt automatically. The hook can be as simple as a script that opens a fresh agent session, pastes the diff, and saves the review output to a PR comment or local file (a sketch follows this list).
  5. Establish escalation rules: BLOCKERS block merge, CONCERNS get a documented response, NITS are optional. Write these rules in CLAUDE.md so the agent loads them on every review session; the structure is what makes the review actionable.
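The hook in step 4 can stay small: run the fresh-session review, keep the output as an artifact, then apply the step-5 escalation rules. A minimal sketch chaining the hypothetical fresh_review.py and gate.py scripts from earlier in this lesson:

# pre_merge_review.py: run before landing a branch.
import pathlib
import subprocess
import sys

# Fresh-session review of the current diff (see fresh_review.py above).
review = subprocess.run(
    [sys.executable, "fresh_review.py"],
    capture_output=True, text=True, check=True,
).stdout

# Keep the review as an artifact: a PR comment, or a file per branch.
branch = subprocess.run(
    ["git", "rev-parse", "--abbrev-ref", "HEAD"],
    capture_output=True, text=True, check=True,
).stdout.strip()
out = pathlib.Path("reviews") / f"{branch.replace('/', '-')}.md"
out.parent.mkdir(exist_ok=True)
out.write_text(review)

# Escalation rules: BLOCKERS block the merge (see gate.py above).
result = subprocess.run([sys.executable, "gate.py"], input=review, text=True)
sys.exit(result.returncode)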