
Postmortem on rails
The 7-section template + the prompt kit that fills it.
On rails — what that means
Operating 02 (Postmortem discipline) covered the philosophy: substitution test, action item taxonomy, close-out protocol. This lesson is the operational follow-up — the actual drop-in template plus the prompt that fills it.
On rails means the template structurally enforces the discipline. You don’t have to remember to apply the substitution test — section 2 forces it. You don’t have to remember to add owner sentiment — section 4 demands a quote. You don’t have to remember to verify action items shipped — section 7 blocks new work until they do.
The discipline is in the structure, not in the operator’s memory. That’s the whole point. INC-030 → INC-031 happened because the discipline was on the operator (and the operator’s memory failed). On-rails means the next operator (or the next agent) gets the same discipline by filling the same sections.
The 7-section template — drop this in
Save at memory/postmortems/_TEMPLATE.md. Every new postmortem copies it.
# INC-XXX — <Incident Title>
**Date:** YYYY-MM-DD
**Severity:** P0 / P1 / P2
**Status:** Open / Closed pending close-out / Closed
## 1. What Happened
Timeline of events with timestamps. Facts only.
- HH:MM — event
- HH:MM — event
- HH:MM — postmortem started
## 2. Structural Causes
Apply the substitution test (Operating 02). For each step that
contributed: would it have failed with a different person at the
keyboard? Yes = structural. List only structural causes.
- 2a. <Cause> — STRUCTURAL because <reasoning>
- 2b. <Cause> — STRUCTURAL because <reasoning>
## 3. What We Got Right
Process steps that worked. Brief.
- <Item>
- <Item>
## 4. Owner Sentiment
> "<Quote from whoever paid the cost. Real language. Don't soften.>"
— <name or role>, <when>, <anonymized if needed>
## 5. Action Items
Each categorized: rule / hook / test / criterion / memory.
Each has a file path, an owner, a due date.
- AI-1. <Action> → <CATEGORY>. File: <path>. Owner: <name>. Due: <date>.
- AI-2. <Action> → <CATEGORY>. File: <path>. Owner: <name>. Due: <date>.
## 6. Lessons
One-liners for memory. Generalizable to other incidents in this
class.
- L1: <Lesson>
- L2: <Lesson>
## 7. Close-out Verification
The postmortem is NOT closed until every AI has shipped artifact
evidence. Mark all as INCOMPLETE initially. Re-walk and verify
before declaring closed.
- AI-1: <evidence required — commit hash + test name>. Status: INCOMPLETE
- AI-2: <evidence required>. Status: INCOMPLETE
**Gate:** New work in the <domain> domain is BLOCKED until all AIs verify.
Seven sections. Each forces a discipline. To leave a section out, the agent (or you) has to skip it actively; the structure is what prevents passive drift.
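One way to make a skipped section visible is a small structural check. The sketch below is not part of the lesson's tooling; `missing_sections` is a hypothetical helper, and it assumes a filled postmortem keeps the template's `## N. Title` headings (a suffix such as "(timeline)" is tolerated).

```python
import re

# Section titles copied from the template in this lesson.
REQUIRED_SECTIONS = [
    "What Happened",
    "Structural Causes",
    "What We Got Right",
    "Owner Sentiment",
    "Action Items",
    "Lessons",
    "Close-out Verification",
]


def missing_sections(markdown_text: str) -> list[str]:
    """Return the required sections that have no '## N. Title' heading."""
    found = set()
    for line in markdown_text.splitlines():
        match = re.match(r"^##\s+(\d+)\.\s+(.+?)\s*$", line)
        if match:
            found.add((int(match.group(1)), match.group(2)))

    missing = []
    for number, title in enumerate(REQUIRED_SECTIONS, start=1):
        # startswith() tolerates suffixes such as "(timeline)" after the title.
        if not any(n == number and t.startswith(title) for n, t in found):
            missing.append(f"{number}. {title}")
    return missing
```

An empty result means all seven headings are present. It says nothing about whether the content under each heading is any good; that still needs human review.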
Five template failures
The narrative postmortem
The blame postmortem
The action-items-on-paper postmortem
The postmortem with no owner sentiment
The postmortem the agent doesn't read
Pattern #5 is the meta-failure. A template that doesn’t load is a template that doesn’t change behavior. The MEMORY.md index entry is what makes the postmortem load-bearing across sessions.
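A rough way to catch pattern #5 is to compare the incident IDs you have postmortems for against what the index mentions. This is a sketch under an assumption the lesson does not spell out: that a MEMORY.md entry references a postmortem by its `INC-NNN` ID. The helper name `unindexed_incidents` is hypothetical.

```python
import re


def unindexed_incidents(index_text: str, incident_ids: list[str]) -> list[str]:
    """Return the incident IDs that the index text never mentions."""
    mentioned = set(re.findall(r"INC-\d+", index_text))
    return [incident for incident in incident_ids if incident not in mentioned]


# Illustrative index content, not a real MEMORY.md.
example_index = "- cloud-training: see INC-031 lessons L1 + L2\n"
not_loaded = unindexed_incidents(example_index, ["INC-030", "INC-031"])
```

Anything returned here is a postmortem whose lessons are not reachable from the index, so they will not load on future sessions.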
The fill-the-template prompt
Run this when an incident occurs. Agent reads the template, fills each section, refuses to skip any. Output is a complete postmortem ready for human review.
Read the template at memory/postmortems/_TEMPLATE.md.
Fill it for this incident: <describe>
For each section:
1. WHAT HAPPENED — timeline of events with timestamps. Just facts.
2. STRUCTURAL CAUSES — apply the substitution test. List only
structural causes. Personal/individual causes don't go here.
3. WHAT WE GOT RIGHT — process steps that worked. Brief.
4. OWNER SENTIMENT — quote from whoever paid the cost. Real
language, not softened.
5. ACTION ITEMS — each categorized as rule / hook / test /
criterion / memory. Each has a file path, an owner, and a due date.
6. LESSONS — one-liners for memory. Generalizable to other
incidents in the same class.
7. CLOSE-OUT VERIFICATION — list each action item with the
evidence required for it to be considered shipped (commit
hash, file path, test name). Mark all as INCOMPLETE initially.
Output the filled template. Do not skip sections.
The structure forces the discipline. Agent fills sections 1-7; human reviews and signs off section 4 (sentiment) + section 7 (close-out gate).
The prompt explicitly forbids skipping sections. The most common skip is section 4 (owner sentiment) — agents are uncomfortable writing as the operator. Prompt them: paste the actual quote from chat, anonymize if needed, do not paraphrase.
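Because section 4 is the most common skip, it can help to check mechanically that the quote line was filled in. A minimal sketch, assuming the quote stays a markdown blockquote as in the template; `has_real_quote` is a hypothetical helper. It can tell a placeholder from a filled line, but it cannot tell a real quote from a paraphrase.

```python
def has_real_quote(section_text: str) -> bool:
    """True if the text has a non-empty blockquote that is not the template placeholder."""
    for line in section_text.splitlines():
        stripped = line.strip()
        if not stripped.startswith(">"):
            continue
        quote = stripped.lstrip(">").strip().strip('"').strip()
        # The template placeholder is wrapped in angle brackets: "<Quote from ...>"
        if quote and not (quote.startswith("<") and quote.endswith(">")):
            return True
    return False
```

The human sign-off on section 4 still matters: only the person who paid the cost can confirm the words are theirs.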
INC-031 walkthrough
Real incident: second cloud GPU pod loss in four days. The postmortem is filled on rails, the close-out gate stays closed until 5/5 action items verify, then the gate flips and new work resumes.
Five action items. Three categories (rule + hook + criterion). The close-out gate blocks new GPU work until all five are verified.
- operator
INC-031 — second cloud GPU pod loss in four days. Same root cause class as INC-030. Write the postmortem on rails using the template.
- agent
Reading INC-030 first to understand what was promised but didn't ship. Loading the template at memory/postmortems/_TEMPLATE.md. Filling section by section.
# INC-031 — Cloud GPU Pod Loss (recurring)
**Date:** 2026-MM-DD
**Severity:** P1 (4-hour productive run lost)
**Status:** Closed pending action item ship
## 1. What Happened (timeline)
- 14:00 — Long training run launched (planned 14h)
- 18:14 — SSH heartbeat dropped; launcher's atexit cleanup fired; pod destroyed
- 18:20 — Operator notices job gone
- 18:35 — Postmortem started (this document)
## 2. Structural Causes (substitution test applied)
Walking each step. Any step that would have failed with someone else at the keyboard is structural.
2a. Pod lifecycle tied to launcher lifecycle. Anyone using this launcher loses pods on disconnect. STRUCTURAL.
2b. Heartbeat retry was 3 retries / 90 sec. Real network blips can last minutes. Anyone running this launcher has the same window. STRUCTURAL.
2c. INC-030 action items existed but planning gate did not enforce that they shipped before the next run. The gate process itself failed. STRUCTURAL META-CAUSE.
## 3. What We Got Right
- Postmortem started within 30 min of incident
- INC-030 was correctly diagnosed (action items were sound)
- Substitution test applied without finger-pointing
## 4. Owner Sentiment
> "I'm fine with paying for compute. I'm not fine with paying for compute that gets thrown away because the laptop sneezed."
— operator, day-of, anonymized
## 5. Action Items
Each categorized (rule / hook / test / criterion / memory).
AI-1. Add `auto_destroy=False` parameter to launcher with 240-min threshold. → RULE + code change. File: `cloud/launcher.py`. Owner: Mike. Due: this week.
AI-2. Heartbeat retry tuned per run length (5 retries / 5 min for runs > 4h). → RULE + code change. Same file. Same owner.
AI-3. Independent watchdog process for runs > 4 hours. → HOOK. New file: `cloud/watchdog.sh`. Owner: Mike.
AI-4. Helm planning gate refuses new GPU work in the cloud-training domain until prior postmortem AIs are verified done. → RULE + planning-gate enforcement. File: `.claude/rules/cloud-training.md`. Owner: Mike.
AI-5. Postmortem template gets a "close-out verification" section that requires evidence (commit hash, hook filename, test name) for each AI. → SCHEMA. File: `memory/postmortems/_TEMPLATE.md`. Owner: Mike.
## 6. Lessons (one-liners for memory)
- L1: Postmortem AIs must ship before the next run in a previously-failed domain. Otherwise the postmortem is a writing exercise.
- L2: Pod lifecycle != launcher lifecycle for runs > a threshold. Decouple explicitly.
## 7. Close-out Verification
The postmortem is NOT closed until each AI has artifact evidence:
- AI-1: commit hash + test name (currently passing)
- AI-2: commit hash + test name
- AI-3: hook filename + last commit that added it
- AI-4: rules file path + planning-gate commit
- AI-5: template file path + last-modified timestamp
Status: 0/5 verified. NEW WORK IN cloud-training DOMAIN BLOCKED until 5/5.
This is the postmortem. The close-out gate at section 7 is what makes it on-rails.
- operator
Two days later. AI-1 and AI-3 shipped per commits. AI-2, AI-4, AI-5 still open. Can I launch a new GPU run?
- agent
Walking the close-out gate at section 7. Status: 2/5 verified.
AI-1: commit `abc123` adds `auto_destroy` param. Test `test_auto_destroy_threshold` passing. ✅
AI-3: `cloud/watchdog.sh` exists, last commit `def456`. ✅
AI-2: heartbeat retry tuning — NOT shipped. Code still uses 3 retries / 90 sec.
AI-4: planning gate rule — NOT shipped. No `.claude/rules/cloud-training.md` exists.
AI-5: template close-out section — NOT shipped. Current template at `memory/postmortems/_TEMPLATE.md` does not contain section 7.
The gate says: NO. Three of five action items have not shipped. New work in cloud-training is blocked until 5/5.
The correct next step is to ship AI-2, AI-4, AI-5 — not to launch a new run that would re-expose the same failure surface.
- operator
Ship AI-2, AI-4, AI-5. When all five verify, close the postmortem.
- agent
Done:
- AI-2 commit `ghi789` — heartbeat retry now 5 retries / 5 min for runs > 4h. Test `test_heartbeat_long_run` passing.
- AI-4 commit `jkl012` — `.claude/rules/cloud-training.md` adds the planning-gate clause. Helm reads on session start.
- AI-5 commit `mno345` — template now contains section 7 close-out. Future postmortems inherit the structure.
Close-out gate: 5/5 verified. INC-031 closed.
Memory entry added to `memory/MEMORY.md` index pointing to INC-031 lessons L1 + L2. Loads on every session in the cloud-training domain.
New GPU work is now unblocked.
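The gate decision in this walkthrough (2/5 blocks, 5/5 unblocks) is simple enough to sketch in code. The version below assumes section 7 keeps the template's per-item `Status: VERIFIED` or `Status: INCOMPLETE` suffix on each `- AI-n:` line; `gate_state` is a hypothetical helper, not something the lesson's tooling is known to provide.

```python
import re

# Matches template-style lines such as "- AI-1: <evidence>. Status: INCOMPLETE".
STATUS_LINE = re.compile(r"^-\s*(AI-\d+):.*Status:\s*(VERIFIED|INCOMPLETE)\s*$")


def gate_state(section7_text: str) -> tuple[int, int, bool]:
    """Return (verified, total, blocked) for the action items listed in section 7."""
    statuses = {}
    for line in section7_text.splitlines():
        match = STATUS_LINE.match(line.strip())
        if match:
            statuses[match.group(1)] = match.group(2)

    verified = sum(1 for status in statuses.values() if status == "VERIFIED")
    total = len(statuses)
    # Blocked unless there is at least one action item and every one is verified.
    blocked = total == 0 or verified < total
    return verified, total, blocked
```

Treating an empty section 7 as blocked is a deliberate choice in this sketch: a postmortem with no listed action items should not open the gate by accident. The statuses themselves still come from someone (or some agent) checking real artifacts; this only does the counting.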
Prompt kit
Three prompts: fill the template, close-out verification, weekly index walk. Save in your CLAUDE.md.
Read the template at memory/postmortems/_TEMPLATE.md.
Fill it for this incident: <describe>
For each section:
1. WHAT HAPPENED — timeline of events with timestamps. Just facts.
2. STRUCTURAL CAUSES — apply the substitution test. List only
structural causes. Personal/individual causes don't go here.
3. WHAT WE GOT RIGHT — process steps that worked. Brief.
4. OWNER SENTIMENT — quote from whoever paid the cost. Real
language, not softened.
5. ACTION ITEMS — each categorized as rule / hook / test /
criterion / memory. Each has a file path, an owner, and a due date.
6. LESSONS — one-liners for memory. Generalizable to other
incidents in the same class.
7. CLOSE-OUT VERIFICATION — list each action item with the
evidence required for it to be considered shipped (commit
hash, file path, test name). Mark all as INCOMPLETE initially.
Output the filled template. Do not skip sections.

Read postmortem at <path>. Walk section 7 (Close-out
Verification). For each action item:
- Find the artifact (commit, file, test)
- Confirm it exists at the path named
- Confirm tests pass / hooks fire / rules load
- Mark VERIFIED or INCOMPLETE
If any action item is INCOMPLETE, the close-out gate stays
closed. New work in the failed domain remains blocked.
Output: status of each AI + gate state (CLOSED / OPEN).

Walk every postmortem in memory/postmortems/. For each:
- Title + incident date
- Status (open / closed)
- Action items shipped / total
- Days since last close-out check
Flag any open postmortem older than 30 days — those are zombie
postmortems where action items never shipped.
Flag any closed postmortem whose lessons (section 6) are not
referenced from MEMORY.md — those lessons aren't loading on
future sessions.
Output: triaged list, top 3 to act on this week.
Apply this
45-minute exercise. Drop in the template. Write one postmortem on rails. Verify the close-out gate.
Put postmortems on rails
Each step takes 5–15 minutes.
- 01. Drop the 7-section template into memory/postmortems/_TEMPLATE.md. Use the structure from this lesson: timeline, structural causes, what-we-got-right, owner-sentiment, action items, lessons, close-out verification.
- 02. Write one postmortem from a recent incident using the template. Run the fill-the-template prompt. Don't skip sections — the discipline is the structure.
- 03. Verify each action item with a real artifact (commit hash, test name, file path). Section 7 is the gate. Every AI needs evidence. "Will do tomorrow" is not evidence.
- 04. Add a planning-gate rule: no new work in the failed domain until close-out is 100%. This is the rule that prevents INC-030 → INC-031. The gate has teeth because the agent reads it on every relevant session.
- 05. Index lessons (section 6) from MEMORY.md so they load on every session. Lessons that don't load don't change behavior. The memory index is what makes the postmortem load-bearing across sessions.
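Once a few postmortems exist, the weekly index walk from the prompt kit flags open postmortems older than 30 days. A minimal sketch of that one check follows. `PostmortemSummary` and `zombies` are hypothetical names, and the sketch assumes you have already extracted each postmortem's date and status from its header.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class PostmortemSummary:
    incident_id: str
    opened: date
    status: str  # "open" or "closed"


def zombies(summaries, today, max_age_days=30):
    """Return open postmortems older than max_age_days, oldest first."""
    stale = [
        summary
        for summary in summaries
        if summary.status == "open" and (today - summary.opened).days > max_age_days
    ]
    return sorted(stale, key=lambda summary: summary.opened)
```

Sorting oldest first puts the longest-neglected postmortem at the top of the triage list. Picking the "top 3 to act on this week" from that list remains a judgment call.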