
Production debugging playbook
When an agent misbehaves in prod: the diagnostic ladder. Five rungs, from “claim says done but isn’t” through context drift, tool failures, and schema mismatches to upstream model regression.
The broken agent problem
Something is wrong. An agent that was working last week is producing bad outputs today. Your first instinct is to ask why — and the instinct immediately generates hypotheses: the model API changed, the data is different, the prompt is drifting. Each hypothesis feels plausible. Each one points to a different investigation. Without a systematic approach, you’re picking among them by intuition, and intuition in production debugging is expensive.
The core insight: production failures in agent systems form a natural ordering by diagnosis cost. The cheapest failure to detect (claim/artifact mismatch) is also the most common. The most expensive (upstream model regression) is the rarest. Operators who work from the top of the cost ordering to the bottom spend hours on unlikely causes before reaching the probable one. The diagnostic ladder inverts this: start cheap, climb only when the cheaper rungs are genuinely clear.
The ladder has five rungs. Clearing a rung takes between two and ten minutes (see the table below), so a complete five-rung climb costs under half an hour. In most incidents the ladder stops before rung 5, often before rung 3. The full climb is the worst case, not the median.
| Rung | Failure class | Time to clear | Prevalence |
|---|---|---|---|
| 1 | Claim verification — agent says done but artifact doesn’t match | ~2 min | Most common |
| 2 | Context drift — long session has decayed constraints | ~5 min | Common in sessions >2 hrs |
| 3 | Tool failures — a dependency is silently returning wrong data | ~5 min | Common when tools touch external APIs |
| 4 | Schema mismatches — agent’s data model inconsistent with prod | ~5 min | Common after data model changes |
| 5 | Upstream model regression — model itself has changed | ~10 min | Rare; never the first hypothesis |
The ladder is not a guarantee of a fast resolution. Some failures are genuinely hard. Rung 5 can consume significant investigation time when it is the actual cause. What the ladder prevents is spending rung-5 time on rung-1 problems — which, in practice, is the most common way production debugging goes wrong.
The diagnostic ladder
Rung 1: Claim verification
The agent says it finished. Verify the artifact directly. Do not read the agent’s summary of what it produced — open the file, query the record, load the URL, run the test. Does the artifact match the claim?
This is the most common failure mode in production agent systems. An agent can believe with complete internal consistency that a write succeeded when the write never landed. It can summarize a document it hallucinated rather than fetched. It can report a test passed without running the test. None of these failures involve the agent lying — it genuinely believes the claim. The only check that catches them is direct artifact inspection.
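As a concrete example, here is a minimal sketch of a rung-1 spot check in Python, assuming the artifact is a file; the path and the expected marker are invented for illustration, and the same shape applies to querying a record, loading a URL, or running a test.

```python
from pathlib import Path

def verify_file_claim(path: str, must_contain: str | None = None) -> list[str]:
    """Rung-1 spot check: inspect the artifact itself, never the agent's summary."""
    problems = []
    artifact = Path(path)
    if not artifact.exists():
        return [f"claimed artifact does not exist: {path}"]
    if artifact.stat().st_size == 0:
        problems.append(f"artifact exists but is empty: {path}")
    elif must_contain is not None and must_contain not in artifact.read_text():
        problems.append(f"artifact lacks expected content {must_contain!r}: {path}")
    return problems

# The agent claims the report is written; check the file, not the claim.
issues = verify_file_claim("reports/q3_summary.md", must_contain="## Findings")
print("rung 1 clear" if not issues else issues)
```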
Operating L01 covers the full claim-verification diagnostic. The habit it builds — spot-checking one agent claim per session against the actual artifact — catches this failure class before it reaches production. If that habit isn’t yet in place, most rung-1 failures become production incidents by default.
Time to clear: 2 minutes. If the artifact matches the claim, rung 1 is clear. Climb to rung 2.
Rung 2: Context drift
Long sessions degrade. A session running for several hours on a complex task accumulates context that can crowd out or override the original constraints. The agent doesn’t forget constraints deliberately — they simply become less salient as the session depth increases. Behavioral drift is the symptom: outputs that would have been correct at session start are subtly wrong by hour four.
The anchor check is the rung-2 diagnostic: ask the agent to restate its current constraints, then compare the restatement against the original brief or CLAUDE.md. Any divergence — a narrowed constraint, a missing rule, a reworded requirement that subtly changes behavior — is context drift. Expert L08 covers the detection patterns in depth. TPL-2026-007 provides the quantitative analysis: anchor-check detection rates by drift category, with constraint-forget (0.87 detection rate) and scope-drift (0.71) as the two categories it catches most reliably.
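A minimal sketch of the comparison step, assuming the brief can be treated as one constraint per line; the example constraints are invented. An exact-line diff catches dropped and invented constraints cleanly, while a subtly reworded requirement shows up as one of each and still needs a human read.

```python
def anchor_check(brief: str, restatement: str) -> dict[str, set[str]]:
    """Diff the agent's restated constraints against the original brief.

    Treats each non-empty line as one constraint; real briefs need
    smarter parsing, but the diff logic is the point.
    """
    def constraints(text: str) -> set[str]:
        return {line.strip().lower() for line in text.splitlines() if line.strip()}

    original, current = constraints(brief), constraints(restatement)
    return {
        "dropped": original - current,    # constraint-forget
        "invented": current - original,   # scope-drift
    }

brief = """Only modify files under src/
Never call external APIs from tests
Keep responses under 500 tokens"""

restatement = """Only modify files under src/
Keep responses under 500 tokens
Prefer concise answers"""

drift = anchor_check(brief, restatement)
if drift["dropped"] or drift["invented"]:
    print(f"context drift detected: {drift}")  # restart with the original brief
```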
Fix: restart hygiene. If drift is detected, restart the session with the original brief and the artifact from where the session left off. Do not try to patch a drifted session in place — the drift is in the context window, not in any single output. Only a new session with a clean context baseline resolves it.
Time to clear: 5 minutes. If the anchor check shows no divergence, rung 2 is clear. Climb to rung 3.
Rung 3: Tool failures
A tool the agent depends on is silently returning wrong data. The failure is invisible from the agent’s perspective: the tool call succeeds (returns a response, no exception), but the response is incorrect — an HTTP 4xx absorbed as success, a stale cache returned instead of live data, an empty payload treated as a valid result. The agent reports success because its tool reported success. The tool reported success because the wrapper has no status-code validation.
The diagnostic is to check the tool’s own logs, not the agent’s description of the tool call. Every tool that calls an external endpoint should produce its own execution record — HTTP status codes, response sizes, latency. If the tool is a vendor API, check the vendor’s request log. If it’s an internal endpoint, check the endpoint’s access log. Compare what actually happened to what the agent reported.
Common rung-3 patterns: a rotated API key producing 401s that the wrapper swallows; a rate-limited endpoint returning 429s that the wrapper treats as empty success; a database endpoint returning a partial result set due to a timeout that the wrapper does not surface. In all cases, the agent is not at fault — the fault is in the reliability contract between the tool layer and the agent.
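A sketch of what the fixed reliability contract looks like, using Python’s requests library; the URL, logger name, and error class are illustrative. The wrapper produces its own execution record (status, size, latency, per the diagnostic above) and raises on anything that is not a genuine success, instead of absorbing it.

```python
import logging
import requests

log = logging.getLogger("tools.fetch_json")

class ToolCallError(RuntimeError):
    """Raised instead of letting a bad response masquerade as success."""

def fetch_json(url: str, timeout: float = 10.0) -> dict:
    resp = requests.get(url, timeout=timeout)
    # Produce the tool's own execution record: status code, size, latency.
    log.info("GET %s -> %d, %d bytes, %.0f ms", url, resp.status_code,
             len(resp.content), resp.elapsed.total_seconds() * 1000)
    if resp.status_code != 200:
        # A swallowed 401 (rotated key) or 429 (rate limit) is exactly how
        # rung-3 failures become invisible; surface them instead.
        raise ToolCallError(f"{url} returned HTTP {resp.status_code}")
    payload = resp.json()
    if not payload:
        raise ToolCallError(f"{url} returned an empty payload")
    return payload
```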
Time to clear: 5 minutes for tool-log inspection. Fixing a silent-failure tool is a separate task, but identifying it as the cause takes minutes, not hours.
Rung 4: Schema mismatches
The agent’s model of the data is internally consistent but inconsistent with the actual production data shape. This failure is invisible in development because dev data matches the schema the agent was designed around. It surfaces in production when the data has evolved: a field was renamed, a previously required field became nullable, a new field was added that the agent doesn’t know to look for.
The pattern is distinctive: the agent works correctly on development or sample data, fails on production data, and fails in a structured way — not randomly. The outputs are wrong in a consistent direction that traces to a specific field or type assumption.
The diagnostic is a direct comparison: dump a sample of actual production data and compare it field by field against the agent’s expected schema (documented in the agent’s brief, CLAUDE.md, or the tool definitions). TPL-2026-001 covers eval/train overlap as one specific form of schema mismatch — where the evaluation set shares characteristics with training data that don’t hold in production. The same diagnostic logic applies: the mismatch between what the agent was calibrated against and what it now receives.
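A sketch of the field-level comparison; the expected schema and the sampled record are hypothetical. Note how a renamed field surfaces as one missing field plus one unknown field, the distinctive rung-4 signature.

```python
def schema_diff(expected: dict[str, type], record: dict) -> list[str]:
    """Compare one production record against the agent's expected schema."""
    findings = []
    for field, expected_type in expected.items():
        if field not in record:
            findings.append(f"missing field: {field}")
        elif record[field] is None:
            findings.append(f"field is null in prod: {field}")
        elif not isinstance(record[field], expected_type):
            findings.append(f"type changed: {field} is "
                            f"{type(record[field]).__name__}, "
                            f"expected {expected_type.__name__}")
    for field in record.keys() - expected.keys():
        findings.append(f"new field the agent does not know about: {field}")
    return findings

# Hypothetical expected schema vs. one sampled production record; the rename
# ('amount_cents' -> 'amount') shows up as one missing plus one new field.
expected = {"user_id": str, "amount_cents": int, "created_at": str}
sample = {"user_id": "u_123", "amount": 1999, "created_at": None, "currency": "USD"}
for finding in schema_diff(expected, sample):
    print(finding)
```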
Time to clear: 5 minutes for a field-level comparison. Fixing the mismatch — updating the agent’s schema assumptions or the data pipeline — is a separate task.
Rung 5: Upstream model regression
The model itself has changed. A vendor shipped a silent update; a model version that was stable has been deprecated and replaced. The same prompts produce structurally different outputs. This is the rarest failure class and the hardest to diagnose because it requires a historical baseline to detect — you need to know what the outputs looked like before to see that they’ve changed now.
The diagnostic has two steps. First, pin the model version: if you are not specifying the exact model version in your API call, you may be consuming whatever the vendor’s “latest” alias points to, which changes without notice. Pinning to a specific version ID stops the bleeding. Second, compare outputs: take a representative prompt from the pre-failure period and run it against the current model. Compare the response structure, length, and key output fields against a stored historical example.
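A sketch of both steps under stated assumptions: call_model is a stand-in for your vendor SDK, not a real API, and the pinned version ID, baseline path, and file shape are invented. The fingerprint compares structure rather than wording, since benign run-to-run variation makes exact text comparison useless.

```python
import json

MODEL_VERSION = "vendor-model-2026-01-15"  # pin an exact ID, never a "latest" alias

def call_model(prompt: str, model: str) -> str:
    """Stand-in for your vendor SDK; wire the real client in here."""
    raise NotImplementedError

def fingerprint(response: str) -> dict:
    """Coarse structural features; pick whatever your agent consumes downstream."""
    def parses_as_json(text: str) -> bool:
        try:
            json.loads(text)
            return True
        except ValueError:
            return False
    return {"length": len(response),
            "lines": response.count("\n") + 1,
            "is_json": parses_as_json(response)}

def drift_report(baseline: dict, current: dict, tol: float = 0.25) -> list[str]:
    issues = []
    if baseline["is_json"] != current["is_json"]:
        issues.append("output format changed (JSON vs non-JSON)")
    for key in ("length", "lines"):
        if abs(current[key] - baseline[key]) > tol * max(baseline[key], 1):
            issues.append(f"{key} shifted: {baseline[key]} -> {current[key]}")
    return issues

# Replay a representative pre-failure prompt against the pinned version and
# compare it to the stored baseline fingerprint.
with open("baselines/summarize_prompt.json") as f:
    baseline = json.load(f)
current = fingerprint(call_model(baseline["prompt"], model=MODEL_VERSION))
issues = drift_report(baseline["fingerprint"], current)
print("rung 5 clear" if not issues else issues)
```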
Rung 5 is the correct explanation for a small number of incidents. It is the hypothesis most commonly invoked at the start of debugging and the least often confirmed. The asymmetry is worth internalizing: when a production agent misbehaves, model regression is the last thing to check, not the first.
Time to clear: 10 minutes with a stored baseline. Without a stored baseline, rung 5 clearance requires rebuilding one from available context — which is a reason to maintain baselines proactively.