Companion · Field Notes № 0003

Postmortem Discipline

Blameless postmortems with structural action items. The discipline that stops the same mistake from happening twice.

DISCIPLINE
  • 26+ postmortems shipped across QC and Parley
  • Owner-sentiment section is required, not optional

A pre-flight checklist that was meant to *verify* the camera setup instead mutated the camera config. The Brio production camera got remapped to the MacBook's built-in webcam; a single auto-detect call destroyed the previous day's verified configuration. I lost 4.5 hours of test time on March 20.

Two weeks later it almost happened again. Same class of failure, different code path. The reason it almost happened again — and the reason it didn't — is the entire subject of this bundle.

Why most postmortems don't help

Most teams do one of two things when something breaks:

Skip the postmortem entirely. "It's fixed, let's move on." Six weeks later the same class of failure happens again because nothing structural was learned.

Write a postmortem that names a person. "Bob shouldn't have run the migration without testing." Bob is now defensive. Nothing structural was learned. Six weeks later the same class of failure happens again *and* Bob is gone.

The discipline that's worked for me — adapted from aviation, healthcare, and SRE practice — does neither. It treats every incident as evidence of a structural gap, and the output of every postmortem is a structural change, not a person to blame.

When the contributor at the keyboard is sometimes a Claude agent, the case for blamelessness gets stronger, not weaker. There's nothing morally meaningful about "blaming Claude." But the practical case — that blame hides incidents, drives defensive narratives, and produces no learning — applies just as much to agentic work. The agent will obediently say "I won't do that next time" and you have absolutely no way to enforce that promise.

The mechanics

The template has these sections:

Headline — 2-3 sentences, the if-you-only-read-one-paragraph version.

Timeline — observable events with timestamps, no editorializing.

Impact — time, money, work destroyed, trust cost.

Root causes — structural, numbered, *not personal*.

Contributing factors — things that amplified the problem.

What went well — yes, even bad incidents have things that went right.

What went wrong — process failures, missing tools, missing context.

Structural changes — table of action items with owners and target dates.

Owner sentiment — direct quote from whoever paid the cost.

Lessons codified — bullets that get copied verbatim into the relevant rules file.

Status — open until action items are verified done.
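Laid out as a file, that skeleton looks roughly like this. (The actual TEMPLATE.md ships in the bundle; the headings below follow the section list above, but the exact wording and table columns here are illustrative.)

```markdown
# Postmortem: INC-NNN — one-line title

## Headline
2–3 sentences. The if-you-only-read-one-paragraph version.

## Timeline
- 2025-03-20 09:14 — observable event, no editorializing

## Impact
Time, money, work destroyed, trust cost.

## Root causes
1. Structural cause, numbered, not personal.

## Contributing factors
Things that amplified the problem.

## What went well
## What went wrong

## Structural changes
| # | Action item | Owner | Target date | Status |
|---|-------------|-------|-------------|--------|

## Owner sentiment
> "Direct quote from whoever paid the cost."

## Lessons codified
- Bullets copied verbatim into the relevant rules file.

## Status
OPEN — closes only when all action items are verified done.
```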

There's a rule embedded in the template that's worth pulling out: the postmortem is not closed when the document is written. It's closed when the action items are done.

In the bundle, the fourth example (cloud pod loss) is a real instance of two postmortems with near-identical structural causes — INC-030 and INC-031, four days apart. INC-030 was written, the action items were listed, the fix was scheduled. The fix had not shipped before INC-031 hit.

The lesson from the pair is stronger than the one INC-030 recorded alone: postmortem action items have to ship *before the next run in the same domain.* No exceptions. In my system, the planning gate now refuses new work in a previously-failed domain until the relevant action items are verified done.
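A minimal sketch of such a gate. The names here (`Postmortem`, `ActionItem`, `planning_gate`) are hypothetical, not from the bundle; the point is only the rule: a postmortem stays open until its action items are verified done, and an open postmortem blocks new work in its domain.

```python
from dataclasses import dataclass, field

@dataclass
class ActionItem:
    description: str
    verified_done: bool = False

@dataclass
class Postmortem:
    incident_id: str
    domain: str  # e.g. "cloud-pods", "camera-setup"
    action_items: list[ActionItem] = field(default_factory=list)

    @property
    def closed(self) -> bool:
        # Closed means every action item is verified done,
        # not that the document has been written.
        return all(item.verified_done for item in self.action_items)

def planning_gate(domain: str, postmortems: list[Postmortem]) -> bool:
    """Return True only if no postmortem in this domain is still open."""
    blockers = [pm for pm in postmortems
                if pm.domain == domain and not pm.closed]
    return not blockers

# INC-030 is written but its fix hasn't shipped, so new cloud-pod
# work is refused; it is allowed once the item is verified done.
inc_030 = Postmortem("INC-030", "cloud-pods",
                     [ActionItem("Persist checkpoints off-pod")])
assert planning_gate("cloud-pods", [inc_030]) is False
inc_030.action_items[0].verified_done = True
assert planning_gate("cloud-pods", [inc_030]) is True
```

Under this rule, INC-031 could not have happened the way it did: the scheduler would have refused the second cloud-pod run until INC-030's fix shipped.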

The owner-sentiment section

This section is unusual. Most postmortem templates skip it.

The reason it's there: agentic work tends to drift toward narratives that are easy on the writer. *"The system performed within expected bounds."* *"We identified an opportunity for improvement."* This is the language of consultancies, not of engineers.

When the postmortem includes a direct quote from the person who paid the cost — a customer, a partner, the operator — it grounds the document. The cloud-pod-loss postmortem in the bundle ends with: *"I'm fine with paying for compute. I'm not fine with paying for compute that gets thrown away because the laptop sneezed."*

That quote does more work than five paragraphs of impact analysis. It's also a decent test of whether the postmortem is serving its purpose: if the quote section is missing or sanitized, the postmortem is performing accountability instead of doing it.
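That test can even be mechanical. A sketch of a check that fails a postmortem whose owner-sentiment section is missing, empty, or quote-free — the function name and the assumption that the template uses a markdown `## Owner sentiment` heading are mine, not the bundle's:

```python
import re

def has_owner_sentiment(postmortem_text: str) -> bool:
    """True if the owner-sentiment section exists and contains
    an actual quoted sentence, not a placeholder."""
    match = re.search(r"## Owner sentiment\n(.*?)(?:\n## |\Z)",
                      postmortem_text, flags=re.S)
    if not match:
        return False
    body = match.group(1).strip()
    # Require a direct quote; reject empty or sanitized bodies.
    return '"' in body and body not in ("", "N/A", "TBD")

doc = ('## Owner sentiment\n'
       '> "I\'m not fine with paying for compute that gets thrown away."\n'
       '## Status\nOPEN')
assert has_owner_sentiment(doc) is True
assert has_owner_sentiment("## Owner sentiment\nTBD\n## Status\nOPEN") is False
```

A check this crude can't tell a real quote from an invented one, but it does catch the most common failure mode: the section quietly disappearing.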

What's in the bundle

The TEMPLATE.md, four worked examples drawn from real production incidents (camera mapping disaster, data contamination, Phase 0 gate failure, cloud pod loss), the framing essay, and an MIT license.

The first three postmortems you write will feel slow. By postmortem 10-15, the structure feels natural and you'll be writing them faster than you can write rationalizations.

After about 25 postmortems, your repository will have accumulated something most teams never have: a structural memory of what it has learned, embedded in the system itself.

— Michael, from the lab