Expert · Lesson 07 — Cloud training discipline — postmortem-driven SOPs
E07Expert
Expert · Lesson 07● live

Cloud training discipline

Pod lifecycle decoupling, exit-path audits, watchdog cadences — don’t lose a 14-hour run.

20 min read · 90 min applyprereq: Operating 02 (postmortem discipline)

Why training runs vanish

Lost training runs are the single most expensive class of incident in agentic ML work. A 14-hour run that vanishes is 14 hours of compute paid for, 14 hours of wall-clock waiting, and an unknown amount of operator confidence destroyed. The same incident class repeats in different forms — laptop slept, network dropped, terminal closed, atexit fired — and each time the symptom is the same: the pod is gone, the artifacts are gone, the run has to start over.

The structural cause is almost always the same. Pod lifecycle is coupled to a process on the operator’s machine. Anything that kills that process kills the pod. The cloud platform is not at fault — the launcher is.

The discipline below decouples pod lifecycle from launcher process. After the discipline lands, none of the laptop-side failures can kill a run. The pod runs to completion or to a paid-for time threshold; the operator knows within 15 minutes if anything’s wrong; artifacts are pulled by independent infrastructure.

Decoupling pod lifecycle from launcher

The single biggest leverage. Three properties to enforce, in order.

PropertyWhy it mattersImplementation
Auto-destroy threshold on podWorst-case bound on lifetime if everything else failsSet at pod creation, e.g., auto_destroy_after=24h. Generous but finite.
Launcher exits after creationRemoves the launcher process as a failure pointPrint pod ID + dashboard URL; exit cleanly. No wait_for_completion().
Cleanup conditional on positive completionPod destroys only when training succeeds, not on signalsTraining script writes a “done” signal; pod self-destroys on the signal. No atexit.

The auto-destroy threshold is the safety net. If the positive-completion signal never arrives (training crashed silently, network partition, anything), the pod runs to threshold and then dies. You pay for the full threshold, but the pod is bounded — not running for a week racking up charges.

Subscribers only · continued

The rest of Expert · Lesson 07 is for subscribers.

Cloud training discipline — postmortem-driven SOPs

  • Every Expert-tier lesson — diagnostic prompts, transcripts, prompt kits, full homework
  • Every research paper — methodology, figures, tables, reproducibility appendices
  • New Expert lessons + papers as they ship (quarterly cadence)
  • Foundations + Operating lessons stay free; bundles on GitHub stay free; this tier is the deep stuff

Free while the early catalog ships. Paid tier comes later — subscribe now and you’re grandfathered in.