
Production incident response with agents
The on-call playbook: paging, evidence collection, triage ladder, and postmortem-on-rails. For agentic systems that break in production.
When agentic systems break in production
Production incidents are different when agents are in the loop. A human developer who causes a production bug is a single failure point. An agent given broad authority to “fix the scoring system” during an active incident can generate a compound incident — multiple changes, multiple new bugs, no clear record of what was done or in what order.
The asymmetry that matters: agents act faster than humans and log less by default. Under incident pressure, an operator who gives an agent broad authority and asks it to move quickly will often get exactly that — a fast sequence of undocumented changes that resolves one problem and introduces two more. The incident response protocol below is designed to use agent speed correctly while preventing agent-amplified incidents.
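One concrete way to close the logging gap is to wrap every tool the agent can call so each invocation is recorded before control returns. This is a minimal sketch, not a prescribed implementation: the `audited` wrapper and the `incident_audit.jsonl` path are hypothetical names, and a real system would ship these records somewhere durable rather than a local file.

```python
import json
import time
from typing import Any, Callable

AUDIT_LOG = "incident_audit.jsonl"  # hypothetical path; route to durable storage in practice


def audited(tool: Callable[..., Any], name: str) -> Callable[..., Any]:
    """Wrap an agent tool so every call is appended to the audit log,
    whether it succeeds or raises. The record is written in `finally`,
    so even a crashing tool leaves evidence of what was attempted."""
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        entry = {"ts": time.time(), "tool": name,
                 "args": repr(args), "kwargs": repr(kwargs)}
        try:
            result = tool(*args, **kwargs)
            entry["status"] = "ok"
            return result
        except Exception as exc:
            entry["status"] = f"error: {exc}"
            raise
        finally:
            with open(AUDIT_LOG, "a") as f:
                f.write(json.dumps(entry) + "\n")
    return wrapper
```

With every tool wrapped this way, the postmortem has an ordered record of exactly which changes the agent made and in what sequence — the evidence the narrative above says is missing by default.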
The QC scoring system runs with triple-layer sensor fusion — RT-DETR-S CoreML on the Jetson Orin Nano, HX711 ADC pressure sensors, and IR break beam — served by a Python FastAPI backend (CV on :8642, LLM on :8643) with a Next.js Mission Control dashboard and Supabase backend (project szafqxabhkkhzuvebkaj). Each layer can produce incidents. The protocol below applies to all of them — and to any production agentic system with multiple layers and a customer-visible output.
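Because each layer can fail independently, the first minute of an incident is usually a sweep of every service's health. The sketch below assumes conventional `/health` endpoints on the two FastAPI ports named above — the endpoint paths are an assumption, not a documented part of the system — and deliberately never raises, so one dead layer doesn't abort the sweep of the rest.

```python
import urllib.error
import urllib.request

# Hypothetical health endpoints; real paths depend on each service's routes.
SERVICES = {
    "cv_backend": "http://localhost:8642/health",
    "llm_backend": "http://localhost:8643/health",
}


def probe(url: str, timeout: float = 2.0) -> dict:
    """Check one service. Returns a status dict instead of raising, so a
    dead layer can't stop the sweep of the remaining layers."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return {"url": url, "up": resp.status == 200}
    except (urllib.error.URLError, OSError) as exc:
        return {"url": url, "up": False, "error": str(exc)}


def sweep(services: dict) -> dict:
    """Probe every layer and return a name -> status map."""
    return {name: probe(url) for name, url in services.items()}
```

The sensor layers (RT-DETR-S, HX711, IR break beam) would need their own probes on the Jetson; the pattern is the same — per-layer checks that degrade to a status report rather than an exception.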
The five-rung triage ladder
The triage ladder structures the investigation from fastest-to-check to most-invasive. Each rung eliminates a class of root cause before moving to the next. Don’t skip rungs — “I think it’s X” is not triage. Run the ladder.
| Rung | Check | Eliminates |
|---|---|---|
| 1 — Claim | Does the system say it completed a task it didn’t actually complete? Does the output match the claim? | False completion signals; agent claiming done when not done |
| 2 — Context drift | Is this a long-running session? Has the session context accumulated enough turns that the agent may have lost accuracy? | Session-length accuracy degradation; restart may resolve |
| 3 — Tool failure | Did any tool call fail silently? (API timeout, permission error, network partition not surfaced as an error) | Silent tool failures masquerading as logic errors |
| 4 — Schema mismatch | Did the data shape change upstream without the consuming layer being updated? | Upstream schema changes breaking downstream consumers |
| 5 — Model regression | Did an upstream model (LLM, CV model, scoring FSM) change behavior without a release? | Undocumented model behavior changes (e.g., CV confidence distribution shift) |
Rung 1 is the fastest and catches the most common class of agentic system failure. Rung 5 is the hardest to diagnose and requires the most evidence. Never jump to Rung 5 without eliminating Rungs 1–4.
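The ladder above can be encoded as an ordered list of checks that stops at the first rung that explains the incident. This is a sketch under stated assumptions: the evidence dictionary keys and the threshold values (100 turns, 0.1 confidence shift) are hypothetical placeholders for whatever real diagnostics each rung runs.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rung:
    number: int
    name: str
    check: Callable[[dict], bool]  # truthy if this rung explains the incident


# Hypothetical evidence keys and thresholds; substitute real diagnostics.
LADDER = [
    Rung(1, "claim",
         lambda ev: ev.get("claimed_done") and not ev.get("actually_done", True)),
    Rung(2, "context_drift",
         lambda ev: ev.get("session_turns", 0) > 100),
    Rung(3, "tool_failure",
         lambda ev: any(ev.get("silent_tool_errors", []))),
    Rung(4, "schema_mismatch",
         lambda ev: ev.get("upstream_schema") != ev.get("consumer_schema")),
    Rung(5, "model_regression",
         lambda ev: ev.get("confidence_shift", 0.0) > 0.1),
]


def run_ladder(evidence: dict) -> Optional[Rung]:
    """Walk the rungs in order; return the first that fires, or None.
    Encoding the order in data makes 'never jump to Rung 5' mechanical."""
    for rung in LADDER:
        if rung.check(evidence):
            return rung
    return None
```

The point of the data-driven version is that the ordering is enforced by the structure: an operator (or an agent) cannot reach the model-regression check without the four cheaper rungs having returned clean first.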