Foundations · Lesson 14 — Reading a diff before you accept it

Reading a diff before you accept it

The 4-pass review that catches what tests miss.

12 min read · ongoing apply · no prereq · cross-domain

Why diff review beats trust

When an agent produces a diff and the tests pass, the easy move is to ship: tests are green, the agent says done, you move on. That habit is wrong about 20-30% of the time, and that 20-30% includes most of the bugs that bite in production.

Tests cover what somebody thought to test. They don’t cover the global log level that just got set to DEBUG, the fixed buffer that doesn’t bound-check, the watchdog that never gets fed, the dead event that nobody consumes. Those bugs survive the test suite intact. They surface in production, where the cost of debugging is 100x what it would have been at review.

Diff review isn’t a code-quality concern; it’s a cost-of-rework concern. Every minute spent on the 4-pass review buys you 10-100 minutes of avoided field debugging.

The QC ESP32 firmware diff in this lesson’s transcript is real. It passed all 24 tests. The 4-pass review found 6 issues, 3 of which would have hard-faulted boards in the field. Reading the diff took 7 minutes.

The 4-pass review

Four passes, each looking for something different. Run them in order — earlier passes scope the later ones.

  1. Pass 1 — File list. Does the diff touch the files you expected? Anything unexpectedly missing (a Kconfig entry, a migration, a config file)? Anything unexpectedly present (a vendored dependency, a config-file mutation that looks unrelated)? This pass is mechanical and takes 30 seconds.
  2. Pass 2 — Control flow. For each changed function, walk entry → body → exit. Are error returns checked? Are loops bounded? Are watchdogs fed in tight loops? Are exceptions either handled or propagated, not swallowed? This is where most logic bugs live.
  3. Pass 3 — Side effects. What changes in the world? Files written, events emitted, global state set, network calls, flash writes. Are emitted events actually consumed? Are global settings (log level, timezone, locale) being changed quietly? Side effects are where production surprises live.
  4. Pass 4 — Security. New secrets, credentials, env reads. New external endpoints. Auth boundaries crossed. Input bounds and injection points. This pass is fast on most diffs but catches the highest-severity findings when it catches anything.

The 4-pass review takes 5-10 minutes on a normal diff. On firmware, ML training pipelines, or schema migrations, it can take longer. It’s the cheapest defense against shipping the agent’s confidence as your own.

Three diff-review failures

01 · Tests-pass-so-ship

Claim looks like: "All tests pass — merging."
What’s missing: Tests cover what someone thought to test. The diff might add a sleep(2) in a hot loop, swallow an exception, or hardcode a credential. None of those fail tests; all of them fail in production.
The move: Tests are necessary, not sufficient. Read the diff with the 4-pass review even when CI is green. Tests catch broken; review catches wrong.

02 · The fly-by review

Claim looks like: The operator opens the diff, scrolls to the end in 8 seconds, and types "LGTM."
What’s missing: Pattern recognition at speed misses subtle changes — a flipped boolean, a swapped equality operator, a dropped error path. The eye glides past code that looks similar to what was already there.
The move: Force a structural read. The 4-pass review (file list → control flow → side effects → security) takes 5-10 minutes on a normal diff and catches what scrolling misses.

03 · Approving a diff outside your zone

Claim looks like: "I don't fully understand the firmware part, but the agent says it's fine."
What’s missing: The diff has unreviewed sections. The agent might be right; it might also be wrong. Approving without comprehension is shipping the agent's confidence as your own.
The move: Either learn enough to read it or split the review: "I'll approve the React side; route the firmware diff to AXIOM." Don't launder approvals through habit.

The pattern across all three: review is structural work, not vibes work. Skimming for “does this look right” misses the bugs that look exactly like what was already there.

The “what changed” prompt

The 4-pass review as a paste-able prompt:

The 4-pass review — paste with any agent-produced diff
Run a 4-pass review of this diff. For each pass,
list findings with severity (BLOCK / FLAG / NIT).

**Pass 1 — File list**
- Files changed (count + paths)
- Match against the plan / intent of this change?
- Anything unexpectedly missing or unexpectedly present?

**Pass 2 — Control flow**
- For each new or changed function: entry → body → exit
- Are error returns checked? Are loops bounded? Are watchdogs fed?
- Any swallowed exceptions or silently-ignored return values?

**Pass 3 — Side effects**
- What changes in the world? (files written, events emitted, state mutated)
- Are emitted events consumed? Are global settings changed?
- Network, flash, db writes — anything new?

**Pass 4 — Security**
- New secrets, credentials, env reads?
- New external endpoints, auth boundaries crossed?
- Buffer / bounds / injection issues — any unchecked input sizes?

Output as a markdown report. Final line: BLOCK / APPROVE-WITH-NITS / APPROVE.

Forces structural review. Output is a labeled findings report, not a vibes summary.

The trick is the BLOCK / FLAG / NIT severity labels. The agent has to commit to a final-line verdict. That commitment surfaces issues that a vibes-y summary would have hidden in qualifications.
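The shape of the report to expect back looks roughly like this (contents here are illustrative, drawn from the QC firmware example later in the lesson):

```markdown
# 4-pass review — <diff id>

## Pass 1 — File list
- FLAG: plan called for a Kconfig entry; not present in the diff

## Pass 2 — Control flow
- BLOCK: start_streaming() ignores calibration() return value

## Pass 3 — Side effects
- FLAG: two new event types emitted with no subscribers

## Pass 4 — Security
- BLOCK: device-reported length copied into fixed 64-byte buffer

Final: BLOCK
```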

QC ESP32 firmware diff — 4-pass review

A real diff: 161 lines of new sensor stubs for the QC radar/ToF sensor strategy reset. All 24 tests passed. The 4-pass review surfaced 6 issues, 3 blocking.

QC firmware diff: tests green, 4-pass surfaces 6 issues

Pass 1 caught a missing Kconfig. Pass 2 caught uncalibrated streaming and a missing watchdog feed. Pass 3 caught a global log-level mutation. Pass 4 caught a buffer overflow.

  1. operator

    Show me the QC ESP32 firmware diff for the radar/ToF sensor stub.

  2. agent

    Diff summary:
    - 3 files changed: firmware/sensor_radar.c (+82), firmware/sensor_tof.c (+71), firmware/event_bus.h (+8)
    - 161 lines added, 0 removed
    - New: AcconeerA121 init, ST_VL53L8CX init, two new event types on the bus
    - All tests pass: 24/24 (no new tests added)

  3. operator

    4-pass review. File list first.

  4. agent

    Pass 1 — File list:
    - firmware/sensor_radar.c (NEW)
    - firmware/sensor_tof.c (NEW)
    - firmware/event_bus.h (modified)

    Question: Does this match the plan? The plan said 3 files. ✓

    Question: Any file unexpectedly missing? Plan also called for a Kconfig entry to gate the new sensors behind a feature flag. NOT present in this diff. Flag #1.

  5. operator

    Continue. Pass 2.

  6. agent

    Pass 2 — Control flow:

    In sensor_radar.c: init() → calibration() → start_streaming() → emit events on detection.
    - ✓ Returns error codes on init failure
    - ⚠️ start_streaming() does NOT check return value of calibration(). If calibration fails, streaming starts on uncalibrated sensor. Flag #2.
    - ⚠️ The detection loop has a while(1) with no exit condition or watchdog feed. On stall, ESP32 hard-faults. Flag #3.

    In sensor_tof.c: similar shape, similar issues. Same control-flow bugs in both files — copy-paste from a template that was already wrong.

  7. operator

    Pass 3. Side effects.

  8. agent

    Pass 3 — Side effects:

    - New event types added to event_bus.h: SENSOR_RADAR_DETECT, SENSOR_TOF_DETECT
    - ⚠️ These events are NOT consumed anywhere in the diff. They emit into a bus with no subscribers — so they're not wrong, just dead. Flag #4.
    - ⚠️ sensor_radar.c calls esp_log_level_set("radar", ESP_LOG_DEBUG) at init — sets a global log verbosity to DEBUG, which floods serial in production builds. Flag #5.
    - No external network calls, no flash writes, no I2C address conflicts.

    The state mutation that's actually concerning: the global log level. That's the kind of thing nobody catches in code review and nobody catches in tests.

  9. operator

    Pass 4. Security.

  10. agent

    Pass 4 — Security:

    - No new credentials or secrets in the diff
    - No new external HTTP/HTTPS endpoints
    - ⚠️ sensor_tof.c has a fixed-size buffer uint8_t buf[64] and reads up to N bytes from I2C where N comes from the device. If the device returns N>64, buffer overflow. Flag #6 — input bounds-check missing.
    - No new auth boundaries crossed.

    Final flag count: 6. Recommendation: do not merge as-is. Three of the six (uncalibrated streaming, no watchdog, buffer overflow) are blocking. The other three (missing Kconfig, dead events, global log level) are catch-it-now-or-eat-it-later.

Prompt kit

Three prompts: the 4-pass review, function-level deltas, and the “what tests didn’t catch” challenge.

The 4-pass diff review
Run a 4-pass review of this diff. For each pass,
list findings with severity (BLOCK / FLAG / NIT).

**Pass 1 — File list**
- Files changed (count + paths)
- Match against the plan / intent of this change?
- Anything unexpectedly missing or unexpectedly present?

**Pass 2 — Control flow**
- For each new or changed function: entry → body → exit
- Are error returns checked? Are loops bounded? Are watchdogs fed?
- Any swallowed exceptions or silently-ignored return values?

**Pass 3 — Side effects**
- What changes in the world? (files written, events emitted, state mutated)
- Are emitted events consumed? Are global settings changed?
- Network, flash, db writes — anything new?

**Pass 4 — Security**
- New secrets, credentials, env reads?
- New external endpoints, auth boundaries crossed?
- Buffer / bounds / injection issues — any unchecked input sizes?

Output as a markdown report. Final line: BLOCK / APPROVE-WITH-NITS / APPROVE.
What changed at the function level
For the diff at <branch> vs main, walk every changed function and
output:

| function | file | change-type (added/modified/deleted) | one-line behavior delta |

A behavior delta describes what the function does NOW that's different
from what it did BEFORE. If the change is purely cosmetic (rename,
formatting), say "no behavior change."
The 'what tests didn't catch' challenge
For the changes in this diff, list 5 plausible bugs that:
- The tests would NOT catch (because the test mocks them out, doesn't
  exercise the path, or doesn't assert the right thing)
- Could plausibly happen in production
- Would be expensive to debug after deploy

For each, suggest the smallest test or assertion that would catch it.

Apply this — install the 4-pass habit

Ongoing exercise. Habit installs in ~10 reviews. Aim for at least 5 this week.

Install the 4-pass diff review reflex

Each step takes 5-10 minutes per diff.

  1. On your next agent-produced diff, run the 4-pass review prompt instead of skimming. Even if it's a small diff — especially if it's a small diff; those are where rushed reviews miss the most.
  2. Catch at least one issue you'd otherwise have approved. You will. Document the catch in your decision log so the pattern doesn't reset.
  3. Add a CLAUDE.md rule: "No diff merges without a 4-pass review on changes >50 lines." The rule installs the discipline. The exemption (small diffs) is honest — small diffs need the speed.
  4. On any diff outside your domain, route to a specialist agent (AXIOM for QC firmware, etc.) instead of launder-approving. Don't approve what you can't read. Route or learn.
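The step-3 rule, written out as a CLAUDE.md fragment (wording illustrative — adapt the threshold to your repo):

```markdown
## Review discipline

- No diff merges without a 4-pass review (file list → control flow →
  side effects → security) on changes larger than 50 lines.
- Diffs of 50 lines or fewer may take a faster structural read, but
  still get read — never a scroll-and-LGTM.
```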