Operating · Lesson 18 — Postmortem-on-rails — the template

Postmortem on rails

The 7-section template + the prompt kit that fills it.

20 min read · 45 min apply · prereq: Operating 02 (postmortem discipline)

On rails — what that means

Operating 02 (Postmortem discipline) covered the philosophy: substitution test, action item taxonomy, close-out protocol. This lesson is the operational follow-up — the actual drop-in template plus the prompt that fills it.

On rails means the template structurally enforces the discipline. You don’t have to remember to apply the substitution test. Section 2 forces it. You don’t have to remember to add owner sentiment. Section 4 demands a quote. You don’t have to remember to verify action items shipped. Section 7 blocks new work until they do.

The discipline is in the structure, not in the operator’s memory. That’s the whole point. INC-030 → INC-031 happened because the discipline was on the operator (and the operator’s memory failed). On-rails means the next operator (or the next agent) gets the same discipline by filling the same sections.

The 7-section template — drop this in

Save at memory/postmortems/_TEMPLATE.md. Every new postmortem copies it.

# INC-XXX — <Incident Title>
**Date:** YYYY-MM-DD
**Severity:** P0 / P1 / P2
**Status:** Open / Closed pending close-out / Closed

## 1. What Happened
Timeline of events with timestamps. Facts only.

- HH:MM — event
- HH:MM — event
- HH:MM — postmortem started

## 2. Structural Causes
Apply the substitution test (Operating 02). For each step that
contributed: would it have failed with a different person at the
keyboard? Yes = structural. List only structural causes.

- 2a. <Cause> — STRUCTURAL because <reasoning>
- 2b. <Cause> — STRUCTURAL because <reasoning>

## 3. What We Got Right
Process steps that worked. Brief.

- <Item>
- <Item>

## 4. Owner Sentiment
> "<Quote from whoever paid the cost. Real language. Don't soften.>"
— <name or role>, <when>, <anonymized if needed>

## 5. Action Items
Each categorized: rule / hook / test / criterion / memory.
Each has a file path, an owner, a due date.

- AI-1. <Action> → <CATEGORY>. File: <path>. Owner: <name>. Due: <date>.
- AI-2. <Action> → <CATEGORY>. File: <path>. Owner: <name>. Due: <date>.

## 6. Lessons
One-liners for memory. Generalizable to other incidents in this
class.

- L1: <Lesson>
- L2: <Lesson>

## 7. Close-out Verification
The postmortem is NOT closed until every AI has shipped artifact
evidence. Mark all as INCOMPLETE initially. Re-walk and verify
before declaring closed.

- AI-1: <evidence required — commit hash + test name>. Status: INCOMPLETE
- AI-2: <evidence required>. Status: INCOMPLETE

**Gate:** New work in the <domain> domain is BLOCKED until all AIs verify.

Seven sections. Each forces a discipline. Skipping a section has to be a deliberate act, by the agent or by you; the structure is what prevents passive drift.
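One way to back the template with a mechanical check is to confirm that a filled postmortem still carries all seven numbered headings. The Python below is a minimal sketch, not part of the lesson's tooling: the function name `missing_sections` and the prefix-matching rule are assumptions, and the section titles are copied from the template above.

```python
import re

# Section titles copied from the 7-section template in this lesson.
REQUIRED_SECTIONS = [
    "What Happened",
    "Structural Causes",
    "What We Got Right",
    "Owner Sentiment",
    "Action Items",
    "Lessons",
    "Close-out Verification",
]

# Matches headings such as "## 2. Structural Causes (substitution test applied)".
HEADING = re.compile(r"^##\s+(\d+)\.\s+(.+?)\s*$", re.MULTILINE)


def missing_sections(postmortem_text: str) -> list[str]:
    """Return the required section titles that have no matching numbered heading."""
    found = {int(number): title for number, title in HEADING.findall(postmortem_text)}
    missing = []
    for index, required in enumerate(REQUIRED_SECTIONS, start=1):
        title = found.get(index, "")
        # Prefix match (an assumption) tolerates suffixes like "(timeline)".
        if not title.lower().startswith(required.lower()):
            missing.append(required)
    return missing
```

A non-empty result means a section was dropped or renumbered, which is the passive drift the template is meant to prevent.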

Five template failures

№ 01

The narrative postmortem

claim looks like: “Three pages of prose explaining what happened. No structure.”

what’s missing: Future-you can't extract action items. Agents can't parse "what was the structural cause." The format hides the load-bearing content.

the move: Seven sections, each with a defined purpose. Prose lives inside the sections, not as the whole document.

№ 02

The blame postmortem

claim looks like: “Section: "What went wrong: the agent ignored the rule."”

what’s missing: Agents do exactly what they're trained to do. Blaming the agent is blaming the substrate. The structural cause (the missing hook, the unclear rule, the gap in the contract) gets ignored.

the move: Substitution test (Operating 02). Section 2 hard-codes the test into the template, so only structural causes get listed.

№ 03

The action-items-on-paper postmortem

claim looks like: “Action items listed. Owners assigned. Postmortem closed. Action items never ship.”

what’s missing: INC-031 happened because INC-030's action items were scheduled, not delivered. The template needs a close-out clause that blocks new work in the failed domain until action items are verified done.

the move: Section 7 of the template is the close-out verification. The postmortem isn't closed when written; it's closed when every action item has shipped artifact evidence.

№ 04

The postmortem with no owner sentiment

claim looks like: “Cost section says "4 hours of compute." No human voice in the document.”

what’s missing: Owner sentiment is what gets the postmortem read by future-you. Without it, the document is dry; with it, the cost is real.

the move: Section 4 of the template requires a quote, yours or whoever paid the cost. Real language, not softened. The cost was real; the writeup should sound real.

№ 05

The postmortem the agent doesn't read

claim looks like: “Postmortem written. Filed in `memory/postmortems/`. Agent never loads it on the next session.”

what’s missing: Postmortems that don't load are postmortems that don't change behavior. The same incident class recurs.

the move: Memory entry: "recent postmortems" loads on session start. Or: planning gate references the postmortem folder before approving any new work in a previously-failed domain.

Pattern #5 is the meta-failure. A postmortem that doesn’t load is a postmortem that doesn’t change behavior. The MEMORY.md index entry is what makes the postmortem load-bearing across sessions.
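Whether a postmortem is reachable from the memory index can be checked mechanically. The sketch below is an illustration under stated assumptions: it supposes each postmortem file is named after its incident ID (for example `INC-031.md`) and that the index mentions that ID somewhere in its text. Neither convention is specified in this lesson, and `unreferenced_postmortems` is a hypothetical helper name.

```python
import re
from pathlib import PurePosixPath


def unreferenced_postmortems(postmortem_paths, memory_index_text):
    """Return postmortem paths whose incident ID never appears in the index text.

    Assumes (hypothetically) that the file stem is the incident ID.
    """
    missing = []
    for path in postmortem_paths:
        stem = PurePosixPath(path).stem
        if stem.startswith("_"):
            continue  # skip _TEMPLATE.md and similar non-incident files
        # Word boundaries keep "INC-03" from matching inside "INC-031".
        if not re.search(rf"\b{re.escape(stem)}\b", memory_index_text):
            missing.append(path)
    return missing
```

Anything this returns is a candidate for Pattern #5: written, filed, and never loaded.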

The fill-the-template prompt

Run this when an incident occurs. Agent reads the template, fills each section, refuses to skip any. Output is a complete postmortem ready for human review.

Fill the postmortem template

Read the template at memory/postmortems/_TEMPLATE.md.
Fill it for this incident: <describe>

For each section:

1. WHAT HAPPENED — timeline of events with timestamps. Just facts.
2. STRUCTURAL CAUSES — apply the substitution test. List only
   structural causes. Personal/individual causes don't go here.
3. WHAT WE GOT RIGHT — process steps that worked. Brief.
4. OWNER SENTIMENT — quote from whoever paid the cost. Real
   language, not softened.
5. ACTION ITEMS — each categorized as rule / hook / test /
   criterion / memory. Each has a file path, an owner, and a
   due date.
6. LESSONS — one-liners for memory. Generalizable to other
   incidents in the same class.
7. CLOSE-OUT VERIFICATION — list each action item with the
   evidence required for it to be considered shipped (commit
   hash, file path, test name). Mark all as INCOMPLETE initially.

Output the filled template. Do not skip sections.

The structure forces the discipline. The agent fills sections 1-7; a human reviews and signs off on section 4 (sentiment) and section 7 (close-out gate).

The prompt explicitly forbids skipping sections. The most common skip is section 4 (owner sentiment) — agents are uncomfortable writing as the operator. Prompt them: paste the actual quote from chat, anonymize if needed, do not paraphrase.
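Section 5 is regular enough to lint before human review. The sketch below checks a single action-item line against the template's format (`AI-N. <Action> → <CATEGORY>. File: … Owner: … Due: …`) and against the five-category taxonomy. The regular expression and the name `check_action_item` are assumptions made for illustration. It accepts a single category word only, so combined labels would need a looser pattern.

```python
import re

# Taxonomy from section 5 of the template.
CATEGORIES = {"rule", "hook", "test", "criterion", "memory"}

# Mirrors the template line:
# "- AI-1. <Action> → <CATEGORY>. File: <path>. Owner: <name>. Due: <date>."
ITEM = re.compile(
    r"^-?\s*(AI-\d+)\.\s+(?P<action>.+?)\s+→\s+(?P<category>\w+)\.\s+"
    r"File:\s+(?P<file>.+?)\.\s+Owner:\s+(?P<owner>.+?)\.\s+Due:\s+(?P<due>.+?)\.?$"
)


def check_action_item(line):
    """Return a list of problems for one action-item line (empty list = OK)."""
    match = ITEM.match(line.strip())
    if match is None:
        return ["line does not match the template format (category, File, Owner, Due)"]
    problems = []
    if match["category"].lower() not in CATEGORIES:
        problems.append(f"unknown category {match['category']!r}")
    return problems
```

A line with no `Due:` field fails the format check outright, which is how the missing due date gets caught before review.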

INC-031 walkthrough

Real incident: second cloud GPU pod loss in four days. The postmortem is filled on rails, the close-out gate stays closed until 5/5 action items verify, then the gate flips and new work resumes.

INC-031 — postmortem-on-rails in action

Five action items. Three categories (rule + hook + criterion). The close-out gate blocks new GPU work until all five are verified.

operator
INC-031 — second cloud GPU pod loss in four days. Same root cause class as INC-030. Write the postmortem on rails using the template.

agent

Reading INC-030 first to understand what was promised but didn't ship. Loading the template at memory/postmortems/_TEMPLATE.md. Filling section by section.

# INC-031 — Cloud GPU Pod Loss (recurring)
**Date:** 2026-MM-DD
**Severity:** P1 (4-hour productive run lost)
**Status:** Closed pending close-out

## 1. What Happened (timeline)
- 14:00 — Long training run launched (planned 14h)
- 18:14 — SSH heartbeat dropped; launcher's atexit cleanup fired; pod destroyed
- 18:20 — Operator notices job gone
- 18:35 — Postmortem started (this document)

## 2. Structural Causes (substitution test applied)
Walking each step. Any step that would have failed with someone else at the keyboard is structural.

2a. Pod lifecycle tied to launcher lifecycle. Anyone using this launcher loses pods on disconnect. STRUCTURAL.
2b. Heartbeat retry was 3 retries / 90 sec. Real network blips can last minutes. Anyone running this launcher has the same window. STRUCTURAL.
2c. INC-030 action items existed but planning gate did not enforce that they shipped before the next run. The gate process itself failed. STRUCTURAL META-CAUSE.

## 3. What We Got Right
- Postmortem started within 30 min of incident
- INC-030 was correctly diagnosed (action items were sound)
- Substitution test applied without finger-pointing

## 4. Owner Sentiment
> "I'm fine with paying for compute. I'm not fine with paying for compute that gets thrown away because the laptop sneezed."
— operator, day-of, anonymized

## 5. Action Items
Each categorized (rule / hook / test / criterion / memory).

AI-1. Add `auto_destroy=False` parameter to launcher with 240-min threshold. → RULE + code change. File: `cloud/launcher.py`. Owner: Mike. Due: this week.
AI-2. Heartbeat retry tuned per run length (5 retries / 5 min for runs > 4h). → RULE + code change. Same file. Same owner.
AI-3. Independent watchdog process for runs > 4 hours. → HOOK. New file: `cloud/watchdog.sh`. Owner: Mike.
AI-4. Helm planning gate refuses new GPU work in the cloud-training domain until prior postmortem AIs are verified done. → RULE + planning-gate enforcement. File: `.claude/rules/cloud-training.md`. Owner: Mike.
AI-5. Postmortem template gets a "close-out verification" section that requires evidence (commit hash, hook filename, test name) for each AI. → CRITERION. File: `memory/postmortems/_TEMPLATE.md`. Owner: Mike.

## 6. Lessons (one-liners for memory)
- L1: Postmortem AIs must ship before the next run in a previously-failed domain. Otherwise the postmortem is a writing exercise.
- L2: Pod lifecycle != launcher lifecycle for runs > a threshold. Decouple explicitly.

## 7. Close-out Verification
The postmortem is NOT closed until each AI has artifact evidence:
- AI-1: commit hash + test name (currently passing)
- AI-2: commit hash + test name
- AI-3: hook filename + last commit that added it
- AI-4: rules file path + planning-gate commit
- AI-5: template file path + last-modified timestamp

Status: 0/5 verified. NEW WORK IN cloud-training DOMAIN BLOCKED until 5/5.

This is the postmortem. The close-out gate at section 7 is what makes it on-rails.

operator
Two days later. AI-1 and AI-3 shipped per commits. AI-2, AI-4, AI-5 still open. Can I launch a new GPU run?
agent
Walking the close-out gate at section 7. Status: 2/5 verified.
AI-1: commit abc123 adds auto_destroy param. Test test_auto_destroy_threshold passing. ✅
AI-3: cloud/watchdog.sh exists, last commit def456. ✅
AI-2: heartbeat retry tuning — NOT shipped. Code still uses 3 retries / 90 sec.
AI-4: planning gate rule — NOT shipped. No .claude/rules/cloud-training.md exists.
AI-5: template close-out section — NOT shipped. Current template at memory/postmortems/_TEMPLATE.md does not contain section 7.
The gate says: NO. Three of five action items have not shipped. New work in cloud-training is blocked until 5/5.
The correct next step is to ship AI-2, AI-4, AI-5 — not to launch a new run that would re-expose the same failure surface.
operator
Ship AI-2, AI-4, AI-5. When all five verify, close the postmortem.
agent
Done:
- AI-2 commit ghi789 — heartbeat retry now 5 retries / 5 min for runs > 4h. Test test_heartbeat_long_run passing.
- AI-4 commit jkl012 — .claude/rules/cloud-training.md adds the planning-gate clause. Helm reads it on session start.
- AI-5 commit mno345 — template now contains section 7 close-out. Future postmortems inherit the structure.
Close-out gate: 5/5 verified. INC-031 closed.
Memory entry added to memory/MEMORY.md index pointing to INC-031 lessons L1 + L2. Loads on every session in the cloud-training domain.
New GPU work is now unblocked.
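AI-2 in the walkthrough amounts to choosing a heartbeat policy from the planned run length. The Python below is a hedged sketch of that rule, not the contents of `cloud/launcher.py`: the names `HeartbeatPolicy` and `heartbeat_policy` are hypothetical, and reading "N retries / T" as N retries spread over a window of T is an assumption about what the postmortem means.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeartbeatPolicy:
    retries: int
    window_seconds: int  # assumed meaning: total time the retries may span


# Values quoted in the INC-031 postmortem.
DEFAULT_POLICY = HeartbeatPolicy(retries=3, window_seconds=90)
LONG_RUN_POLICY = HeartbeatPolicy(retries=5, window_seconds=5 * 60)
LONG_RUN_THRESHOLD_HOURS = 4


def heartbeat_policy(planned_hours: float) -> HeartbeatPolicy:
    """Pick the more tolerant policy for runs longer than the threshold."""
    if planned_hours > LONG_RUN_THRESHOLD_HOURS:
        return LONG_RUN_POLICY
    return DEFAULT_POLICY
```

Under this sketch the 14-hour run from the timeline would get the 5-retry, 5-minute policy, while a run of exactly 4 hours would keep the default.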

Prompt kit

Three prompts: fill the template, close-out verification, weekly index walk. Save in your CLAUDE.md.

Fill the template (run when an incident occurs)

Read the template at memory/postmortems/_TEMPLATE.md.
Fill it for this incident: <describe>

For each section:

1. WHAT HAPPENED — timeline of events with timestamps. Just facts.
2. STRUCTURAL CAUSES — apply the substitution test. List only
   structural causes. Personal/individual causes don't go here.
3. WHAT WE GOT RIGHT — process steps that worked. Brief.
4. OWNER SENTIMENT — quote from whoever paid the cost. Real
   language, not softened.
5. ACTION ITEMS — each categorized as rule / hook / test /
   criterion / memory. Each has a file path, an owner, and a
   due date.
6. LESSONS — one-liners for memory. Generalizable to other
   incidents in the same class.
7. CLOSE-OUT VERIFICATION — list each action item with the
   evidence required for it to be considered shipped (commit
   hash, file path, test name). Mark all as INCOMPLETE initially.

Output the filled template. Do not skip sections.

Close-out verification (run before unblocking related work)

Read postmortem at <path>. Walk section 7 (Close-out
Verification). For each action item:

- Find the artifact (commit, file, test)
- Confirm it exists at the path named
- Confirm tests pass / hooks fire / rules load
- Mark VERIFIED or INCOMPLETE

If any action item is INCOMPLETE, the close-out gate stays
closed. New work in the failed domain remains blocked.

Output: status of each AI + gate state (CLOSED / OPEN).
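The gate logic in this prompt reduces to one rule: every action item must be VERIFIED, otherwise the gate stays closed. A minimal sketch of that rule follows, assuming section 7 lines keep the template's `- AI-N: <evidence>. Status: …` shape. The function name `gate_state` and the returned tuple are illustrative choices, not something the lesson defines.

```python
import re

# Matches template lines such as
# "- AI-1: <evidence required>. Status: INCOMPLETE".
STATUS = re.compile(
    r"^-\s*(AI-\d+):.*Status:\s*(VERIFIED|INCOMPLETE)\s*$", re.MULTILINE
)


def gate_state(section7_text):
    """Return (gate, verified_count, total) for a section 7 block.

    gate is "OPEN" only when every action item is VERIFIED.
    """
    statuses = dict(STATUS.findall(section7_text))
    if not statuses:
        # No parseable items: refuse to guess rather than open the gate.
        raise ValueError("no action-item status lines found")
    verified = sum(1 for status in statuses.values() if status == "VERIFIED")
    gate = "OPEN" if verified == len(statuses) else "CLOSED"
    return gate, verified, len(statuses)
```

Applied to the two-days-later state in the walkthrough (two items VERIFIED, three INCOMPLETE), this rule reports the gate as CLOSED. Raising on an empty section is a deliberate choice in the sketch: a postmortem with no parseable action items should not silently unblock work.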

Postmortem index (run weekly)

Walk every postmortem in memory/postmortems/. For each:

- Title + incident date
- Status (open / closed)
- Action items shipped / total
- Days since last close-out check

Flag any open postmortem older than 30 days — those are zombie
postmortems where action items never shipped.

Flag any closed postmortem whose lessons (section 6) are not
referenced from MEMORY.md — those lessons aren't loading on
future sessions.

Output: triaged list, top 3 to act on this week.
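The zombie rule in this prompt is simple date arithmetic. The sketch below assumes each postmortem has already been parsed into a small record and that age is measured from the incident date. The `PostmortemRecord` fields and the `zombies` helper are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class PostmortemRecord:
    incident_id: str
    incident_date: date
    status: str   # "open" or "closed"
    shipped: int  # action items with verified evidence
    total: int    # action items listed in section 5


def zombies(records, today, max_age_days=30):
    """Return IDs of open postmortems older than max_age_days."""
    return [
        record.incident_id
        for record in records
        if record.status == "open"
        and (today - record.incident_date).days > max_age_days
    ]
```

Passing `today` in explicitly, rather than calling `date.today()` inside the function, keeps the check reproducible when the weekly walk is re-run or tested.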

Apply this

45-minute exercise. Drop in the template. Write one postmortem on rails. Verify the close-out gate.

Put postmortems on rails

Each step takes 5–15 minutes.

01 Drop the 7-section template into memory/postmortems/_TEMPLATE.md. Use the structure from this lesson: timeline, structural causes, what-we-got-right, owner-sentiment, action items, lessons, close-out verification.
02 Write one postmortem from a recent incident using the template. Run the fill-the-template prompt. Don't skip sections — the discipline is the structure.
03 Verify each action item with a real artifact (commit hash, test name, file path). Section 7 is the gate. Every AI needs evidence. "Will do tomorrow" is not evidence.
04 Add a planning-gate rule: no new work in the failed domain until close-out is 100%. This is the rule that prevents INC-030 → INC-031. The gate has teeth because the agent reads it on every relevant session.
05 Index lessons (section 6) from MEMORY.md so they load on every session. Lessons that don't load don't change behavior. The memory index is what makes the postmortem load-bearing across sessions.
