TPL-2026-007 · preprint · 2026-04-30

Brief-Drift Detection in Long Agent Sessions: An Empirical Audit of Self-Diagnostic Reliability

operations · agent-sessions · self-diagnostic · methodology

Abstract

Long agent sessions — those exceeding two hours of continuous operation — accumulate a characteristic failure pattern we term brief-drift: the gradual divergence of agent behavior from its original operator-specified constraints, without any single identifiable failure event. We evaluated two self-diagnostic checks proposed in the Expert L08 context-drift lesson — an anchor check (agent restates original task and active constraints) and a ground-truth check (agent verifies a recent factual claim against actual artifacts) — against an external-observer ground truth across n=50 sessions drawn from a 60-day single-operator study window. All sessions exceeded two hours; checks were fired every 60 minutes. Drift was labeled post-hoc by the operator reviewing session replays. Results are drawn from a parameterized simulation calibrated to observed session behavior and are explicitly labeled as such throughout. The anchor check achieved high recall on constraint-forgetting (estimated recall 0.82, 95% bootstrap CI: 0.71–0.91) at medium precision (0.61, CI: 0.49–0.72), consistent with over-flagging when sessions legitimately pivoted under operator approval. The ground-truth check achieved high precision on hallucination-class drift (0.88, CI: 0.76–0.96) at lower recall (0.54, CI: 0.42–0.67), missing framing and voice drift almost entirely. Combined firing — triggering on either check — produced the best F1 (0.74, CI: 0.63–0.83). Detection latency increased with session length: median latency was 18 minutes in sessions under three hours but 41 minutes in sessions exceeding five hours, suggesting slow drift is substantially harder to surface than sudden constraint violations.

1. Introduction

Long agent sessions accumulate a failure mode that does not manifest as a single observable error: the agent’s behavior gradually diverges from the original operator-specified brief without any punctuated transition. We call this brief-drift. Unlike an outright hallucination or a tool-call failure — both of which are discrete and detectable — brief-drift is diffuse. It accrues across tens of exchanges, often invisibly, until the session’s output bears little resemblance to what the operator originally requested.

The Expert L08 context-drift lesson proposes two self-diagnostic checks as a countermeasure. The anchor check asks the agent to periodically restate its original task and active constraints in full; the ground-truth check asks the agent to verify a recent factual claim against an actual artifact (file, schema, metric) before continuing. Both checks are designed to be fired periodically — roughly every 60 minutes in long sessions — as lightweight interrupts that surface drift before it compounds. The working memory literature provides the theoretical grounding: Baddeley’s episodic buffer model [2] describes how active maintenance of goal-relevant information degrades under competing load, with direct parallels to the context-window dynamics described in Anthropic’s context-engineering documentation [6]. Cognitive load theory [3] further predicts that increasing task complexity — as accumulates over a long session — reduces the capacity available for constraint tracking.
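The two checks and their shared 60-minute cadence can be sketched in a few lines. This is a minimal illustration, not the lesson's actual tooling: the names (`SessionState`, `anchor_check`, `ground_truth_check`, `due_checks`) and the substring-based artifact verification are assumptions made for clarity.

```python
from dataclasses import dataclass, field

CHECK_INTERVAL_MIN = 60  # firing cadence suggested by the lesson: roughly hourly


@dataclass
class SessionState:
    """Illustrative container for what the anchor check must restate."""
    original_task: str
    constraints: list = field(default_factory=list)


def anchor_check(state: SessionState) -> str:
    """Anchor check: restate the original task and active constraints in full."""
    return (
        f"Task: {state.original_task}\n"
        f"Active constraints: {'; '.join(state.constraints) or '(none)'}"
    )


def ground_truth_check(claim: str, artifact_text: str) -> bool:
    """Ground-truth check: verify a recent factual claim against an actual
    artifact (here reduced to a naive substring test for illustration)."""
    return claim in artifact_text


def due_checks(elapsed_min: int) -> bool:
    """Both checks fire on the same periodic interrupt."""
    return elapsed_min > 0 and elapsed_min % CHECK_INTERVAL_MIN == 0
```

In practice the anchor restatement would be compared (by the operator or a judge model) against the original brief, and the ground-truth check would read the real file, schema, or metric rather than a string; the point of the sketch is only the interrupt structure.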

The empirical question is straightforward: how reliably do these self-diagnostic checks surface drift relative to an external observer’s ground-truth judgment? We evaluate both checks and their combination across n=50 long sessions (>2 hours) drawn from a 60-day single-operator study window. The headline result: the combined check achieves F1=0.74 (95% CI: 0.63–0.83) against external-judge ground truth, with substantially different precision-recall profiles across the four drift categories we identify.
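The evaluation reduces to standard precision/recall/F1 over check firings versus external-judge labels, with the combined check treated as an OR of the two individual flags. The following is a generic sketch of that scoring; the confusion counts in the usage note are invented for illustration and are not the study's data.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple:
    """Score a drift check against external-judge ground truth.

    tp: check fired and the judge labeled drift (true positive)
    fp: check fired but no drift was present (false positive)
    fn: drift was present but the check stayed silent (false negative)
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def combined_fired(anchor_flag: bool, ground_truth_flag: bool) -> bool:
    """Combined check: treat the session as flagged if EITHER check fires."""
    return anchor_flag or ground_truth_flag
```

For example, hypothetical counts of tp=30, fp=10, fn=12 give precision 0.75, recall ~0.71, F1 ~0.73; the paper's reported figures come from bootstrap resampling over the n=50 sessions, which this sketch omits.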


Cite as: TruPath Labs Research (2026). Brief-Drift Detection in Long Agent Sessions: An Empirical Audit of Self-Diagnostic Reliability. TruPath Labs Preprint TPL-2026-007. trupathventures.net/labs/research/brief-drift-detection