Cloud training discipline

Pod lifecycle decoupling, exit-path audits, watchdog cadences. So you don’t lose a 14-hour run.

20 min read · 90 min applyprereq: Operating 02 (postmortem discipline)

Where training runs vanish

Lost training runs are the single most expensive class of incident in agentic ML work I’ve hit. A 14-hour run that vanishes is 14 hours of compute paid for, 14 hours of wall-clock waiting, and an unknown amount of operator confidence destroyed. The same incident class repeats in different forms (laptop slept, network dropped, terminal closed, atexit fired) and each time the symptom is the same. The pod is gone, the artifacts are gone, the run has to start over.

Structural cause is almost always the same. Pod lifecycle is coupled to a process on the operator’s machine. Anything that kills that process kills the pod. The cloud platform isn’t at fault. The launcher is.

The discipline below decouples pod lifecycle from launcher process. After it lands, none of the laptop-side failures can kill a run. The pod runs to completion or to a paid-for time threshold. The operator knows within 15 minutes if anything’s wrong. Artifacts get pulled by independent infrastructure.

Decoupling pod lifecycle from launcher

Single biggest move. Three properties to enforce, in order.

Property	Why it matters	Implementation
Auto-destroy threshold on pod	Worst-case bound on lifetime if everything else fails	Set at pod creation, e.g., `auto_destroy_after=24h`. Generous but finite.
Launcher exits after creation	Removes the launcher process as a failure point	Print pod ID + dashboard URL; exit cleanly. No `wait_for_completion()`.
Cleanup conditional on positive completion	Pod destroys only when training succeeds, not on signals	Training script writes a “done” signal; pod self-destroys on the signal. No atexit.

Auto-destroy threshold is the safety net. If the positive-completion signal never arrives (training crashed silently, network partition, anything), the pod runs to threshold and then dies. You pay for the full threshold but the pod is bounded. Not running for a week racking up charges.

Subscribers only · continued

The rest of Expert · Lesson 07 is for subscribers.

Cloud training discipline — postmortem-driven SOPs

Every Expert-tier lesson — diagnostic prompts, transcripts, prompt kits, full homework
Every research paper — methodology, figures, tables, reproducibility appendices
New Expert lessons + papers as they ship (quarterly cadence)
Foundations + Operating lessons stay free; bundles on GitHub stay free; this tier is the deep stuff

Become a subscriber — free →Already a subscriber? Sign in

Free while the early catalog ships. Paid tier comes later — subscribe now and you’re grandfathered in.