Operating · Lesson 02 — Postmortem discipline — when something breaks anyway

Postmortem discipline

When something breaks anyway, what's the structural fix?

15 min read · 45 min apply · prereq: Operating 01 · companion bundle →

The 4-day story

On Monday a long-running cloud GPU job got destroyed at hour 4. The launcher process exited unexpectedly — a network blip dropped the SSH heartbeat — and the launcher’s atexit cleanup hook called pod.destroy(). The pod went down with the launcher. Four hours of work, gone.
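The failure shape is worth seeing in code. Here is a minimal sketch of the bug, under stated assumptions: `Pod` is a hypothetical stand-in for whatever handle your cloud provider gives you, and the names are illustrative, not the actual launcher's API.

```python
import atexit

class Pod:
    """Hypothetical stand-in for a cloud GPU pod handle."""
    def __init__(self, name):
        self.name = name
        self.alive = True

    def destroy(self):
        self.alive = False

pod = Pod("train-run")

# The bug: pod lifetime is tied to launcher process lifetime.
# ANY launcher exit (clean finish, crash, or an SSH heartbeat drop
# killing the process) fires this hook and destroys the pod,
# taking an in-flight training run down with it.
atexit.register(pod.destroy)
```

The hook is correct behavior for a five-minute canary and catastrophic for a fourteen-hour training run; the code above has no way to tell the two apart.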

We wrote the postmortem that afternoon. It was a good postmortem. Structural causes named: pod lifecycle was tied to launcher lifecycle, no threshold for short-vs-long runs, heartbeat retry too aggressive. Action items listed: add auto_destroy=False for long runs, raise heartbeat retry, document the SOP. Owners assigned. Filed. Closed.

Four days later, the exact same thing happened.

Different time of day, slightly different network condition, same class of failure. The pod died with the launcher. Fourteen hours of training, gone.

The first postmortem hadn’t been wrong. The action items hadn’t been wrong. The action items just hadn’t shipped. The fix was scheduled for the following Monday. The launcher we used overnight was still the buggy one.

The lesson from incident #2 was bigger than the lesson from incident #1. The first incident taught us a structural bug. The second incident taught us that postmortem action items have to ship before the next run in the same domain. Otherwise the postmortem is just a writing exercise.

That’s the rule the planning gate now enforces. It’s also why this lesson exists.

Three ways postmortems fail

Most teams fall into one of three patterns when something breaks. Each looks reasonable in the moment. Each is exactly how the same incident class recurs.

01 · Skip-it-and-move-on

The claim looks like: "It's fixed, let's move on. We don't have time for a full writeup."
What's missing: The lesson never gets extracted. Six weeks later the same class of failure happens again because nothing structural was learned.
The move: Set a 30-minute floor for any incident that cost time, money, trust, or shipped a wrong thing. Time-box the writeup, but don't skip it.

02 · Person-blame narrative

The claim looks like: "Bob shouldn't have run the migration without testing."
What's missing: Bob is now defensive. Bob's coworkers stop reporting incidents. Useful detail gets removed from the writeup. Six weeks later the same class of failure happens and Bob is gone.
The move: Apply the substitution test. If it would have happened with anyone else in Bob's seat, the cause is structural — not Bob.

03 · Action-items-on-paper

The claim looks like: "Postmortem filed. Action items listed. Owner assigned. Move on."
What's missing: The fix never ships. The next incident in the same domain hits — same class, sometimes the same week — because the structural change was scheduled, not delivered.
The move: Block all new work in a previously-failed domain until the prior postmortem's action items are verified done.

The third pattern (action-items-on-paper) is the most insidious because it looks like discipline. The team wrote the doc; the team listed the items; the team meant well. The discipline is the close-out, not the writeup.
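The close-out gate can be mechanical rather than aspirational. Here is a sketch of the blocking rule, assuming a hypothetical postmortem record format (the field names `domain`, `action_items`, and `verified` are illustrative, not taken from any real tooling in this lesson):

```python
def gate_new_work(domain, postmortems):
    """Refuse new work in a domain while any prior postmortem in that
    domain still has unverified action items.

    postmortems: list of dicts shaped like
      {"domain": "cloud-training",
       "action_items": [{"id": "AI-1", "verified": False}, ...]}
    Returns (allowed, blocking_ids).
    """
    blocking = [
        item["id"]
        for pm in postmortems
        if pm["domain"] == domain
        for item in pm["action_items"]
        if not item["verified"]
    ]
    return (not blocking, blocking)

history = [
    {"domain": "cloud-training",
     "action_items": [{"id": "AI-1", "verified": False},
                      {"id": "AI-2", "verified": True}]},
]
allowed, blocking = gate_new_work("cloud-training", history)
# New cloud-training runs stay blocked until AI-1 is verified shipped.
```

The point of the sketch is the shape, not the schema: the gate reads the prior postmortem's close-out status before any new run starts, so "scheduled, not delivered" becomes a hard stop instead of a footnote.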

The substitution test

When you’re writing a postmortem and trying to identify causes, paste this prompt:

The substitution test
Walk every step that contributed to this incident.
For each step, ask: would this step have failed if a different
person was at the keyboard? The steps that say YES are structural
causes. The steps that say NO are personal — and the next person
in that role will trip the same wire.

Output a numbered list of structural causes only. Skip the personal ones.

Separates structural causes from personal blame in 30 seconds.

The test is binary and fast. For every step that contributed to the incident, ask: would the step have failed with someone else at the keyboard?

  • YES → structural. The system allowed the failure. The next person in that role will trip the same wire. Fix the system.
  • NO → personal. A specific person made a specific choice that wouldn’t have happened otherwise. This is rare in real incidents and extremely common in narratives that blame.

When agentic AI is involved — which is most of your work now — the structural framing becomes even more correct. There is nothing morally meaningful about “blaming Claude.” The agent did exactly what was asked. What was asked was the wrong thing. That’s a system gap. Fix the system.

The structural causes you find this way always live in one of these layers:

  • The agent’s prompt or rules file (CLAUDE.md, soul file)
  • A pre/post-tool-use hook that wasn’t firing
  • A pre-flight check that loaded the wrong context
  • A test that didn’t cover the failure mode
  • A sprint contract criterion that wasn’t specific enough
  • A planning gate that didn’t enforce a prior lesson

Action items that target one of those layers can ship and stay shipped. Action items that target “the agent should remember to” cannot.

Annotated transcript (real, anonymized)

The conversation that produced the structural fix after incident #2, annotated turn by turn.

Real session, INC-031 morning — anonymized

Operator + agent diagnosing a recurring failure. The third structural cause turned out to be the postmortem process itself.

  1. operator

    I left a long training run going overnight. Just checked — the GPU pod is gone. The training never finished.

  2. agent

    Confirmed. The pod was destroyed at 18:14, about 4 hours into the planned 14-hour run. Looking at the launcher logs: a network blip dropped the SSH heartbeat, the launcher's `atexit` cleanup hook fired, and the hook called `pod.destroy()`. The pod went down with the launcher.

  3. operator

    We had this exact thing happen four days ago. INC-030. Didn't we postmortem it?

  4. agent

    Yes — postmortem was filed on April 21. Action items listed: "add `auto_destroy=False` flag for runs > 240 minutes" and "raise heartbeat retry threshold for long runs." Both items had owners but no verified-shipped status. They were scheduled, not delivered.

  5. operator

    OK. Walk me through what the postmortem for today should say. Don't rehash INC-030 — what did INC-030 miss?

  6. agent

    Three structural causes are visible. First: pod lifecycle is tied to launcher lifecycle. The atexit hook treats launcher death as "abort the work," which is correct for a 5-min canary and catastrophic for a 14-hour train. There's no threshold. Second: heartbeat retry was 3 retries over 90 seconds. A real network blip can last minutes. The launcher decided the pod was dead when only the connection was disrupted. Third — and this is the new one — INC-030's action items existed but the planning gate didn't enforce that they shipped before the next run. That third one means the discipline itself failed. The first two are bugs in the launcher; the third is a gap in how postmortems get closed.

  7. operator

    Categorize each action item. Rule, hook, test, contract criterion, memory entry, or other.

  8. agent

    AI-1: explicit `auto_destroy` parameter on launcher with 240-min threshold. → **Rule** + code change. The rule is documented in the cloud-training SOP; the code change implements it.
    AI-2: heartbeat retry tuned per run length. → **Rule** + code change. Same shape.
    AI-3: independent watchdog process for runs > 4 hours. → **Hook** (a separate cron-like process). Not the same as the launcher; cannot die when the launcher dies.
    AI-4: Helm refuses new work in a previously-failed domain until prior postmortem action items are verified done. → **Rule** + planning-gate enforcement. This is the new one. It's the structural fix that prevents INC-031 from recurring.
    AI-5: postmortem template gets a "close-out verification" section that requires evidence (commit hash, test name, hook filename) for each action item. → **Schema change** to the template.

  9. operator

    Owner sentiment? I want this in the writeup.

  10. agent

    From the chat earlier today, lightly cleaned up: "I'm fine with paying for compute. I'm not fine with paying for compute that gets thrown away because the laptop sneezed."

Action item taxonomy

Every action item in a postmortem must be one of five types. If you can’t categorize an action item, it’s too vague to ship — rewrite it specifically enough that a label fits.

| Type | Lives in | Example |
| --- | --- | --- |
| Rule | `.claude/rules/<domain>.md` | "Pre-checks must be read-only — they cannot mutate config files." |
| Hook | `.claude/hooks/*.sh` | Pre-commit hook that blocks any commit touching `.camera-mapping.json` without `human-verified: true` frontmatter. |
| Test | `tests/test_<thing>.py` | `tests/test_split_integrity.py` — verifies zero filename overlap between holdout and training set on every eval. |
| Criterion | Sprint contract template + every contract using it | "Holdout overlap with training: 0 (verified by `tests/test_split_integrity.py`)" |
| Memory | `memory/feedback/*.md` | "F1 > 0.99 on a non-trivial set is suspicious — re-verify the split before treating as success." |

The taxonomy is small because every type has a verifiable shipping artifact. A rule has a file path. A hook has a commit hash. A test has a name and a passing run. A criterion has a contract row. A memory entry has a last-modified timestamp.

The postmortem is closed when every action item has its artifact. Not when the doc is written. Not when the items are listed. When the artifacts are verifiably in place.
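The close-out check itself can be scripted against the taxonomy. A sketch, assuming the artifact paths below are illustrative (the action-item IDs and file names are hypothetical, modeled on the reference structure later in this lesson):

```python
from pathlib import Path

# Illustrative mapping: each action item points at the artifact
# that proves it shipped. Paths follow the taxonomy table.
artifacts = {
    "AI-1": Path(".claude/rules/cloud-training.md"),   # rule
    "AI-3": Path(".claude/hooks/pod-watchdog.sh"),     # hook
    "AI-5": Path("memory/postmortems/_TEMPLATE.md"),   # schema change
}

def incomplete_items(artifacts):
    """Return the action items whose shipping artifact is missing on disk."""
    return [ai for ai, path in sorted(artifacts.items()) if not path.exists()]

missing = incomplete_items(artifacts)
if missing:
    print("Postmortem NOT closed. Missing artifacts:", missing)
```

Existence on disk is the weakest useful check; a stricter version would also require the commit hash and a passing test run, as the close-out verification prompt below specifies.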

Three diagrams

[Diagram: The postmortem loop (incident → close-out)]

[Diagram: Where each action-item type lives]

[Diagram: The close-out gate (postmortem isn't done until artifacts ship)]

Prompt kit

Three prompts for writing postmortems that produce shipped fixes, not writing exercises. Save them in your CLAUDE.md or a personal snippets file.

The substitution test

Walk every step that contributed to this incident.
For each step, ask: would this step have failed if a different
person was at the keyboard? The steps that say YES are structural
causes. The steps that say NO are personal — and the next person
in that role will trip the same wire.

Output a numbered list of structural causes only. Skip the personal ones.

Categorize every action item

For each action item in the postmortem, label it as one of:
- rule        (documented in a rules file)
- hook        (mechanical pre/post-tool-use enforcement)
- test        (automated test that catches the failure)
- criterion   (sprint contract row that gates the next run)
- memory      (entry that loads at session start)

If you cannot categorize an action item, it's too vague to ship.
Rewrite it specifically enough that one of the labels fits.

Close-out verification

Walk every action item in this postmortem.
For each, give me the evidence that it has shipped:
- Rule  → file path of the rules entry that contains it
- Hook  → file path of the hook script + last commit that added it
- Test  → test name and commit that added it (must be currently passing)
- Criterion → contract template path with the new criterion in it
- Memory → memory file path and last-modified timestamp

Mark any action item with no evidence as INCOMPLETE.
The postmortem is not closed until every action item has evidence.

Apply this to your project

45-minute exercise. Pick a real incident from the last 30 days. Even a small one.

Write your first postmortem

Each step takes 5–15 minutes.

  1. Pick a recent incident — anything that cost time, money, trust, or shipped wrong. It doesn't have to be P0. A 30-minute mistake counts. The discipline is the practice.
  2. Run the substitution test on every step that contributed. Would the step have failed with someone else at the keyboard? Yes → structural. No → personal. List only the structural ones.
  3. Write action items, one per cause, and categorize each (rule / hook / test / criterion / memory). If you can't categorize an action item, it's too vague. Rewrite it.
  4. Add the owner-sentiment section. Use a real quote — yours or whoever paid the cost. Don't soften it. The honest cost belongs in the document.
  5. Close-out: verify each action item shipped before any new work in the same domain. This is the rule that prevents the same incident class from recurring four days later.

Reference structure (drop into your project today)

your-project/
├── CLAUDE.md                          ← reference the substitution test prompt here
├── memory/
│   └── postmortems/
│       ├── _TEMPLATE.md               ← from the bundle
│       └── YYYY-MM-DD-<incident>.md   ← one per incident
├── .claude/
│   ├── rules/                         ← rule-type action items live here
│   └── hooks/                         ← hook-type action items live here
├── tests/                             ← test-type action items
└── ...

The blameless postmortem template is in the postmortem-discipline bundle. Four worked examples (camera mapping disaster, data contamination, planning failure, cloud pod loss) show the discipline applied across different incident classes.


“I’m fine with paying for compute. I’m not fine with paying for compute that gets thrown away because the laptop sneezed.”

— operator quote · INC-031 day-of · anonymized