
Postmortem discipline
When something breaks anyway, what's the structural fix?
The 4-day story
On Monday a long-running cloud GPU job got destroyed at hour 4. The launcher process exited unexpectedly — a network blip dropped the SSH heartbeat — and the launcher’s atexit cleanup hook called pod.destroy(). The pod went down with the launcher. Four hours of work, gone.
We wrote the postmortem that afternoon. It was a good postmortem. Structural causes named: pod lifecycle was tied to launcher lifecycle, no threshold for short-vs-long runs, heartbeat retry too aggressive. Action items listed: add auto_destroy=False for long runs, raise heartbeat retry, document the SOP. Owners assigned. Filed. Closed.
Four days later, the exact same thing happened.
Different time of day, slightly different network condition, same class of failure. The pod died with the launcher. Fourteen hours of training, gone.
The first postmortem hadn’t been wrong. The action items hadn’t been wrong. The action items just hadn’t shipped. The fix was scheduled for the following Monday. The launcher we used overnight was still the buggy one.
The lesson from incident #2 was bigger than the lesson from incident #1. The first incident exposed a structural bug. The second taught us that postmortem action items have to ship before the next run in the same domain. Otherwise the postmortem is just a writing exercise.
That’s the rule the planning gate now enforces. It’s also why this lesson exists.
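The shape of the bug, and of the fix, is small enough to sketch. What follows is a minimal Python illustration with stand-in names (`Pod`, `launch`), not the actual launcher: the buggy version registered `pod.destroy()` as an unconditional atexit hook, while the fix keys the hook off the planned run length, per the 240-minute threshold named in the action items.

```python
import atexit
from dataclasses import dataclass

AUTO_DESTROY_THRESHOLD_MIN = 240  # runs longer than this must outlive the launcher

@dataclass
class Pod:
    name: str

    def destroy(self) -> None:
        print(f"destroying pod {self.name}")  # stand-in for the real API call

def launch(job_name: str, planned_minutes: int, auto_destroy: bool | None = None) -> Pod:
    pod = Pod(job_name)  # stand-in for the real provisioning call

    # Derive the default from the planned run length instead of hardcoding cleanup.
    if auto_destroy is None:
        auto_destroy = planned_minutes <= AUTO_DESTROY_THRESHOLD_MIN

    if auto_destroy:
        # Short canary run: tying pod lifetime to launcher lifetime is correct.
        atexit.register(pod.destroy)
    # Long run: the pod must survive launcher death (SSH blip, laptop sleep).
    # Cleanup belongs to an independent watchdog, not an atexit hook.
    return pod
```

The point of the sketch is the default, not the API: destruction is a decision made from the run's properties, never a side effect of the launcher's process lifetime.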
Three ways postmortems fail
Most teams fall into one of three patterns when something breaks. Each looks reasonable in the moment. Each is exactly how the same incident class recurs.
- Skip-it-and-move-on: no postmortem at all, so nothing about the system changes.
- Person-blame narrative: the "cause" is a person, so the wire stays live for the next person in that role.
- Action-items-on-paper: the doc gets written, the items get listed, nothing ships.
The third pattern (action-items-on-paper) is the most insidious because it looks like discipline. The team wrote the doc; the team listed the items; the team meant well. The discipline is the close-out, not the writeup.
The substitution test
When you’re writing a postmortem and trying to identify causes, paste this prompt:
Walk every step that contributed to this incident.
For each step, ask: would this step have failed if a different
person was at the keyboard? The steps that say YES are structural
causes. The steps that say NO are personal — and the next person
in that role will trip the same wire.
Output a numbered list of structural causes only. Skip the personal ones.

This prompt separates structural causes from personal blame in 30 seconds.
The test is binary and fast. For every step that contributed to the incident, ask: would the step have failed with someone else at the keyboard?
- YES → structural. The system allowed the failure. The next person in that role will trip the same wire. Fix the system.
- NO → personal. A specific person made a specific choice that wouldn’t have happened otherwise. This is rare in real incidents and extremely common in narratives that blame.
When agentic AI is involved — which is most of your work now — the structural framing becomes even more correct. There is nothing morally meaningful about “blaming Claude.” The agent did exactly what was asked. What was asked was the wrong thing. That’s a system gap. Fix the system.
The structural causes you find this way always live in one of these layers:
- The agent’s prompt or rules file (CLAUDE.md, soul file)
- A pre/post-tool-use hook that wasn’t firing
- A pre-flight check that loaded the wrong context
- A test that didn’t cover the failure mode
- A sprint contract criterion that wasn’t specific enough
- A planning gate that didn’t enforce a prior lesson
Action items that target one of those layers can ship and stay shipped. Action items that target “the agent should remember to” cannot.
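To see what "ships and stays shipped" looks like at the hook layer, here is a minimal sketch of a pre-tool-use hook in Python, a variant of the commit-blocking example in the taxonomy table below. It assumes a hook runner that passes the pending tool call as JSON on stdin and treats a nonzero exit as a block; both the JSON shape and the exit-code contract are assumptions to be matched against your actual runner.

```python
#!/usr/bin/env python3
"""Block edits to protected files unless a human has signed off.

Sketch of a pre-tool-use hook: reads the pending tool call as JSON from
stdin and exits nonzero to block it. The JSON shape and the exit-code
contract are assumptions; match them to whatever hook runner you use.
"""
import json
import sys

PROTECTED = {".camera-mapping.json"}  # files no agent may touch unreviewed

call = json.load(sys.stdin)
path = call.get("tool_input", {}).get("file_path", "")

if any(path.endswith(p) for p in PROTECTED):
    print(f"BLOCKED: {path} requires human-verified sign-off", file=sys.stderr)
    sys.exit(2)  # nonzero exit = block the tool call

sys.exit(0)
```

A hook like this is mechanical enforcement: it fires whether or not anyone remembers the rule, which is exactly what "the agent should remember to" can never guarantee.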
Annotated transcript (real, anonymized)
The conversation that produced the structural fix after incident #2.
Operator + agent diagnosing a recurring failure. The third structural cause turned out to be the postmortem process itself.
- operator
I left a long training run going overnight. Just checked — the GPU pod is gone. The training never finished.
- agent
Confirmed. The pod was destroyed at 18:14, about 4 hours into the planned 14-hour run. Looking at the launcher logs: a network blip dropped the SSH heartbeat, the launcher's `atexit` cleanup hook fired, and the hook called `pod.destroy()`. The pod went down with the launcher.
- operator
We had this exact thing happen four days ago. INC-030. Didn't we postmortem it?
- agent
Yes — postmortem was filed on April 21. Action items listed: "add `auto_destroy=False` flag for runs > 240 minutes" and "raise heartbeat retry threshold for long runs." Both items had owners but no verified-shipped status. They were scheduled, not delivered.
- operator
OK. Walk me through what the postmortem for today should say. Don't rehash INC-030 — what did INC-030 miss?
- agent
Three structural causes are visible. First: pod lifecycle is tied to launcher lifecycle. The atexit hook treats launcher death as "abort the work," which is correct for a 5-min canary and catastrophic for a 14-hour train. There's no threshold. Second: heartbeat retry was 3 retries over 90 seconds. A real network blip can last minutes. The launcher decided the pod was dead when only the connection was disrupted. Third — and this is the new one — INC-030's action items existed but the planning gate didn't enforce that they shipped before the next run. That third one means the discipline itself failed. The first two are bugs in the launcher; the third is a gap in how postmortems get closed.
- operator
Categorize each action item. Rule, hook, test, contract criterion, memory entry, or other.
- agent
AI-1: explicit `auto_destroy` parameter on launcher with 240-min threshold. → **Rule** + code change. The rule is documented in the cloud-training SOP; the code change implements it.
AI-2: heartbeat retry tuned per run length. → **Rule** + code change. Same shape.
AI-3: independent watchdog process for runs > 4 hours. → **Hook** (a separate cron-like process). Not the same process as the launcher; cannot die when the launcher dies.
AI-4: Helm refuses new work in a previously-failed domain until prior postmortem action items are verified done. → **Rule** + planning-gate enforcement. This is the new one. It's the structural fix that prevents INC-031 from recurring.
AI-5: postmortem template gets a "close-out verification" section that requires evidence (commit hash, test name, hook filename) for each action item. → **Schema change** to the template.
- operator
Owner sentiment? I want this in the writeup.
- agent
From the chat earlier today, lightly cleaned up: "I'm fine with paying for compute. I'm not fine with paying for compute that gets thrown away because the laptop sneezed."
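AI-3 is the one action item whose shape isn't obvious from its name, so here is a minimal sketch of it: a watchdog that lives entirely outside the launcher process. The heartbeat-file convention, the `cloudctl` CLI, and the pod name are all illustrative assumptions, not the actual implementation.

```python
#!/usr/bin/env python3
"""Sketch of AI-3: an independent watchdog, run from cron so it cannot die
with the launcher. The training loop touches a heartbeat file every few
minutes; the pod is reclaimed only when that heartbeat goes stale, meaning
the job itself died, not just the SSH session. `cloudctl` and the paths
are placeholders for your provider's CLI and your own layout."""
import subprocess
import sys
import time
from pathlib import Path

HEARTBEAT = Path("/var/run/train-job.heartbeat")  # touched by the training loop
STALE_AFTER_SEC = 30 * 60  # tolerate network blips far longer than 90 seconds

def main() -> int:
    if not HEARTBEAT.exists():
        return 0  # no run in flight; nothing to guard
    age = time.time() - HEARTBEAT.stat().st_mtime
    if age < STALE_AFTER_SEC:
        return 0  # job is alive; leave the pod alone
    # Only now is it safe to reclaim the pod: the job is dead, not the launcher.
    subprocess.run(["cloudctl", "pod", "destroy", "train-pod-1"], check=True)
    return 1

if __name__ == "__main__":
    sys.exit(main())
```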
Action item taxonomy
Every action item in a postmortem must be one of five types. If you can’t categorize an action item, it’s too vague to ship — rewrite it specifically enough that a label fits.
| Type | Lives in | Example |
|---|---|---|
| Rule | .claude/rules/<domain>.md | “Pre-checks must be read-only — they cannot mutate config files.” |
| Hook | .claude/hooks/*.sh | Pre-commit hook that blocks any commit touching .camera-mapping.json without human-verified: true frontmatter. |
| Test | tests/test_<thing>.py | tests/test_split_integrity.py — verifies zero filename overlap between holdout and training set on every eval. |
| Criterion | Sprint contract template + every contract using it | “Holdout overlap with training: 0 (verified by tests/test_split_integrity.py)” |
| Memory | memory/feedback/*.md | “F1 > 0.99 on a non-trivial set is suspicious — re-verify the split before treating as success.” |
The reason the taxonomy is small is that every type has a verifiable shipping artifact. A rule has a file path. A hook has a commit hash. A test has a name and a passing run. A criterion has a contract row. A memory entry has a last-modified timestamp.
The postmortem is closed when every action item has its artifact. Not when the doc is written. Not when the items are listed. When the artifacts are verifiably in place.
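As a concrete instance of a test-type artifact, here is a minimal pytest sketch of the split-integrity example from the table. The directory layout is an assumption; adapt the loading to however your splits are actually stored.

```python
# tests/test_split_integrity.py -- sketch; the data/ layout is assumed.
from pathlib import Path

HOLDOUT_DIR = Path("data/holdout")
TRAIN_DIR = Path("data/train")

def test_no_holdout_train_overlap():
    """Zero filename overlap between the holdout and training sets."""
    holdout = {p.name for p in HOLDOUT_DIR.glob("*")}
    train = {p.name for p in TRAIN_DIR.glob("*")}
    overlap = holdout & train
    assert not overlap, f"holdout files leaked into training: {sorted(overlap)}"
```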
Prompt kit
Three prompts for writing postmortems that produce shipped fixes, not writing exercises. Save them in your CLAUDE.md or a personal snippets file.

Prompt 1: the substitution test
Walk every step that contributed to this incident.
For each step, ask: would this step have failed if a different
person was at the keyboard? The steps that say YES are structural
causes. The steps that say NO are personal — and the next person
in that role will trip the same wire.
Output a numbered list of structural causes only. Skip the personal ones.

Prompt 2: the action-item labeler

For each action item in the postmortem, label it as one of:
- rule (documented in a rules file)
- hook (mechanical pre/post-tool-use enforcement)
- test (automated test that catches the failure)
- criterion (sprint contract row that gates the next run)
- memory (entry that loads at session start)
If you cannot categorize an action item, it's too vague to ship.
Rewrite it specifically enough that one of the labels fits.

Prompt 3: the close-out verifier

Walk every action item in this postmortem.
For each, give me the evidence that it has shipped:
- Rule → file path of the rules entry that contains it
- Hook → file path of the hook script + last commit that added it
- Test → test name and commit that added it (must be currently passing)
- Criterion → contract template path with the new criterion in it
- Memory → memory file path and last-modified timestamp
Mark any action item with no evidence as INCOMPLETE.
The postmortem is not closed until every action item has evidence.

Apply this to your project
45-minute exercise. Pick a real incident from the last 30 days. Even a small one.
Write your first postmortem
Each step takes 5–15 minutes.
- 01 Pick a recent incident — anything that cost time, money, trust, or shipped wrong. Doesn't have to be P0. A 30-minute mistake counts. The discipline is the practice.
- 02 Run the substitution test on every step that contributed. Would the step have failed with someone else at the keyboard? Yes → structural. No → personal. List only the structural ones.
- 03 Write action items, one per cause. Categorize each (rule / hook / test / criterion / memory). If you can't categorize an action item, it's too vague. Rewrite it.
- 04 Add the owner-sentiment section. Use a real quote — yours or whoever paid the cost. Don't soften it. The honest cost belongs in the document.
- 05 Close out: verify each action item shipped before any new work in the same domain. This is the rule that prevents the same incident class from recurring four days later. A verification sketch follows this list.
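Step 05 is mechanical enough to script. Here is a sketch of a close-out checker: every action item must name an on-disk artifact, and test-type items must also currently pass. The (type, path) item format and the example paths are assumed conventions, not a standard.

```python
#!/usr/bin/env python3
"""Close-out check sketch: a postmortem is closed only when every action
item points at an artifact that verifiably exists (and, for tests, passes).
The (type, path) item format and the example paths are assumed conventions."""
import subprocess
import sys
from pathlib import Path

def verify(kind: str, path: str) -> bool:
    if kind == "test":
        # A test artifact must exist AND currently pass.
        return Path(path).exists() and subprocess.run(
            [sys.executable, "-m", "pytest", path, "-q"]
        ).returncode == 0
    # rule / hook / criterion / memory: the artifact is a file that exists.
    return Path(path).exists()

def main(items: list[tuple[str, str]]) -> int:
    incomplete = [(k, p) for k, p in items if not verify(k, p)]
    for kind, path in incomplete:
        print(f"INCOMPLETE: {kind} -> {path}")
    return 1 if incomplete else 0

if __name__ == "__main__":
    sys.exit(main([
        ("rule", ".claude/rules/cloud-training.md"),   # illustrative paths
        ("hook", ".claude/hooks/pod_watchdog.py"),
        ("test", "tests/test_launcher_lifecycle.py"),
    ]))
```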
Reference structure (drop into your project today)
your-project/
├── CLAUDE.md ← reference the substitution test prompt here
├── memory/
│ └── postmortems/
│ ├── _TEMPLATE.md ← from the bundle
│ └── YYYY-MM-DD-<incident>.md ← one per incident
├── .claude/
│ ├── rules/ ← rule-type action items live here
│ └── hooks/ ← hook-type action items live here
├── tests/ ← test-type action items
└── ...

The blameless postmortem template is in the postmortem-discipline bundle. Four worked examples (camera mapping disaster, data contamination, planning failure, cloud pod loss) show the discipline applied across different incident classes.