Research · Expert tier

Papers, not posts.

Long-form, statistically pinned writeups of the methodology behind the lessons. Every paper has methods, results, error bars, limitations, references, and a reproducibility appendix. The posture is restrained academic — pin every number to a method, hedge appropriately, cite real work.

TPL-2026-012 · preprint · open access · 2026-05-01

Hardware-in-the-Loop Agent Failure Modes: 60 Days of ESP32 + CV + LLM Co-Development

When LLM agents simultaneously drive firmware authorship (ESP32), computer-vision pipeline evolution (RT-DETR / YOLO11 on Jetson Orin), and sensor-layer calibration, failures do not distribute uniformly. We catalog 94 agent-attributed failures across 60 days of the Quantum Caddy smart-board development cycle and classify them into six categories: context-collapse, hardware-assumption drift, calibration-loop divergence, firmware-CV protocol mismatch, sensor-signal hallucination, and safety-boundary erosion. Mean-time-to-resolution (MTTR) ranges from 18 minutes (sensor-signal hallucination) to 4.2 hours (firmware-CV protocol mismatch). Cost-per-class spans $0.11 to $2.40 per incident. Context-collapse and protocol-mismatch account for 61% of total remediation cost despite representing only 38% of incident count. All numbers are illustrative, synthesized from the QC development log; see §6 Limitations.
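
A minimal sketch of the classification instrument the abstract implies: the six failure classes as an enum, plus a per-class MTTR and cost rollup. The class names come from the abstract; the record fields and helper names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from statistics import mean

class FailureClass(Enum):
    """The paper's six agent-attributed failure classes."""
    CONTEXT_COLLAPSE = "context-collapse"
    HARDWARE_ASSUMPTION_DRIFT = "hardware-assumption-drift"
    CALIBRATION_LOOP_DIVERGENCE = "calibration-loop-divergence"
    FIRMWARE_CV_PROTOCOL_MISMATCH = "firmware-cv-protocol-mismatch"
    SENSOR_SIGNAL_HALLUCINATION = "sensor-signal-hallucination"
    SAFETY_BOUNDARY_EROSION = "safety-boundary-erosion"

@dataclass
class Incident:
    failure_class: FailureClass
    resolution_minutes: float    # time from detection to resolution
    remediation_cost_usd: float  # per-incident remediation cost

def per_class_summary(incidents: list[Incident]) -> dict[str, dict[str, float]]:
    """Roll an incident log up into per-class count, MTTR, and mean cost."""
    summary: dict[str, dict[str, float]] = {}
    for fc in FailureClass:
        group = [i for i in incidents if i.failure_class is fc]
        if group:
            summary[fc.value] = {
                "count": len(group),
                "mttr_minutes": mean(i.resolution_minutes for i in group),
                "mean_cost_usd": mean(i.remediation_cost_usd for i in group),
            }
    return summary
```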

TruPath Labs Research · 14 min read · hardware / embedded-systems / computer-vision / LLM-agents / ESP32 / Jetson / failure-taxonomy
TPL-2026-013 · preprint · open access · 2026-05-01

Cost-per-Artifact Curves Across Claude Model Tiers (Opus 4.7 / Sonnet 4.6 / Haiku 4.5)

Selecting the right Claude model tier for a given artifact type is the highest-leverage cost decision an operator makes in an agent-mediated workflow. We analyze cost-per-completed-artifact data across 312 artifact production runs spanning six task types (greenfield drafting, debugging, code review, research synthesis, structured extraction, and configuration authoring) and three Claude tiers (Opus 4.7, Sonnet 4.6, Haiku 4.5) in the TruPath portfolio. Cost-per-artifact curves diverge sharply by task type: Sonnet 4.6 dominates greenfield drafting and structured extraction; Opus 4.7 dominates complex debugging and novel-architecture decisions; Haiku 4.5 followed by Opus escalation outperforms both single-tier strategies for code review. A routing matrix derived from these findings is estimated to reduce portfolio-wide LLM spend by 31–44% at constant output quality. All figures are illustrative; see §5.
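
A sketch of what the routing matrix might look like as a config object. The task types, the Sonnet/Opus dominance rows, and the code-review escalation pattern follow the abstract; the configuration-authoring row and the function name are illustrative placeholders, not the paper's fitted matrix.

```python
# Routing matrix: task type -> (first-pass tier, escalation tier or None).
ROUTING_MATRIX: dict[str, tuple[str, str | None]] = {
    "greenfield_drafting":     ("sonnet-4.6", None),
    "structured_extraction":   ("sonnet-4.6", None),
    "complex_debugging":       ("opus-4.7", None),
    "novel_architecture":      ("opus-4.7", None),
    "code_review":             ("haiku-4.5", "opus-4.7"),    # cheap pass, escalate flagged hunks
    "configuration_authoring": ("haiku-4.5", "sonnet-4.6"),  # placeholder row
}

def route(task_type: str) -> tuple[str, str | None]:
    """Return (primary tier, escalation tier); default to the mid tier."""
    return ROUTING_MATRIX.get(task_type, ("sonnet-4.6", None))
```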

TruPath Labs Research · 12 min read · cost-optimization / model-selection / LLM-agents / methodology / cross-venture
TPL-2026-014 · preprint · open access · 2026-05-01

Sprint Contract Overrun: Updated Causal Taxonomy from 80+ Cross-Venture Sprints

TPL-2026-002 established that sprint-contract discipline reduces overrun rates in solo-operator agent-mediated work (median overrun 78% → 12%). That paper measured the effect; it did not decompose the causes. This paper extends that work with a causal taxonomy derived from 83 sprint contracts across three TruPath ventures (Quantum Caddy, Mile High Golf, and Parley, plus cross-portfolio work). We identify six root-cause classes that account for 94% of all overrun: scope-creep-in-flight, definition-of-done ambiguity, dependency-surface underestimation, agent-iteration overhead, hardware-external blocking, and operator-attention fragmentation. Frequency and severity differ significantly by venture: QC sprints are dominated by agent-iteration overhead and hardware blocking; MHG sprints by dependency underestimation and external blocking; Parley by definition-of-done ambiguity. An intervention table maps each cause class to the technique that most reduced its incidence. All figures are illustrative; see §6.
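
A sketch of the cause-to-intervention mapping as data. The six cause-class names are from the abstract; the intervention strings are hypothetical placeholders, since the paper's actual table is not reproduced here.

```python
# Root-cause classes are from the abstract; the mapped interventions are
# illustrative placeholders standing in for the paper's intervention table.
INTERVENTIONS: dict[str, str] = {
    "scope-creep-in-flight": "freeze scope at signing; route new asks to a follow-up contract",
    "definition-of-done-ambiguity": "require a testable done-check before work starts",
    "dependency-surface-underestimation": "enumerate external dependencies in the contract preamble",
    "agent-iteration-overhead": "cap agent retry loops; escalate to the operator after N failures",
    "hardware-external-blocking": "split hardware-gated work into its own contract",
    "operator-attention-fragmentation": "batch contracts into protected focus blocks",
}
```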

TruPath Labs Research · 13 min read · operations / sprint-contracts / methodology / cross-venture / overrun
TPL-2026-007 · preprint · open access · 2026-04-30

Brief-Drift Detection in Long Agent Sessions: An Empirical Audit of Self-Diagnostic Reliability

Long agent sessions — those exceeding two hours of continuous operation — accumulate a characteristic failure pattern we term brief-drift: the gradual divergence of agent behavior from its original operator-specified constraints, without any single identifiable failure event. We evaluated two self-diagnostic checks proposed in the Expert L08 context-drift lesson — an anchor check (agent restates original task and active constraints) and a ground-truth check (agent verifies a recent factual claim against actual artifacts) — against an external-observer ground truth across n=50 sessions drawn from a 60-day single-operator study window. All sessions exceeded two hours; checks were fired every 60 minutes. Drift was labeled post-hoc by the operator reviewing session replays. Results are drawn from a parameterized simulation calibrated to observed session behavior and are explicitly labeled as such throughout. The anchor check achieved high recall on constraint-forgetting (estimated recall 0.82, 95% bootstrap CI: 0.71–0.91) at medium precision (0.61, CI: 0.49–0.72), consistent with over-flagging when sessions legitimately pivoted under operator approval. The ground-truth check achieved high precision on hallucination-class drift (0.88, CI: 0.76–0.96) at lower recall (0.54, CI: 0.42–0.67), missing framing and voice drift almost entirely. Combined firing — triggering on either check — produced the best F1 (0.74, CI: 0.63–0.83). Detection latency increased with session length: median latency was 18 minutes in sessions under three hours but 41 minutes in sessions exceeding five hours, suggesting slow drift is substantially harder to surface than sudden constraint violations.
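
A sketch of the two checks under the combined (either-check) firing policy that produced the best F1. The 60-minute cadence follows the study design; the matching heuristics and all names are assumptions, not the lesson's implementation.

```python
import time
from dataclasses import dataclass, field

CHECK_INTERVAL_SECONDS = 60 * 60  # checks fired every 60 minutes, per the study design

@dataclass
class DriftMonitor:
    original_brief: str
    constraints: list[str]
    last_check: float = field(default_factory=time.monotonic)

    def anchor_check(self, agent_restatement: str) -> bool:
        """Anchor check: agent restates task and active constraints; flag any
        missing constraint. Substring matching is a stand-in heuristic."""
        return all(c.lower() in agent_restatement.lower() for c in self.constraints)

    def ground_truth_check(self, claimed: str, actual: str) -> bool:
        """Ground-truth check: verify a recent factual claim against the artifact."""
        return claimed.strip() == actual.strip()

    def due(self) -> bool:
        return time.monotonic() - self.last_check >= CHECK_INTERVAL_SECONDS

    def fire(self, restatement: str, claimed: str, actual: str) -> bool:
        """Combined policy: flag drift if EITHER check fails (best F1 in the study)."""
        self.last_check = time.monotonic()
        return (not self.anchor_check(restatement)
                or not self.ground_truth_check(claimed, actual))
```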

TruPath Labs Research · 12 min read · operations / agent-sessions / self-diagnostic / methodology
TPL-2026-008 · preprint · open access · 2026-04-30

Same Task, Three Harnesses: A Throughput-Quality-Cost Benchmark of Claude Code, Codex, and Cursor on a Multi-Stage CV Pipeline Task

We report a case study — n=1 task, three agent harnesses, single operator — comparing Claude Code, OpenAI Codex CLI, and Cursor on the construction of a multi-stage computer-vision evaluation pipeline. The task was held constant across all three sessions: implement a pipeline that measures per-stage detection coverage on a held-out image sequence, accepting three callable detectors and producing a metrics report. Harness was the only variable. We measured time-to-working-artifact (minutes), total token cost, code-review score across four axes (correctness, clarity, brief-fidelity, test-coverage), and operator-intervention count. This is a case study, not a population-level claim. The contribution is a repeatable methodology for single-operator harness comparisons and one data point. Our results showed meaningful differences across harnesses on throughput-versus-quality tradeoffs that were not predicted by the harnesses' respective positioning. Claude Code showed the strongest brief-fidelity and lowest intervention count; Cursor completed the core artifact fastest per token; Codex produced the most self-contained, portable output. All three harnesses shipped a working pipeline. The appropriate interpretation is that each harness exhibits distinct ergonomic patterns favoring different operator contexts — not that any harness dominates. We provide the task brief, evaluation rubric, and scoring procedure in sufficient detail for independent replication.

TruPath Labs Research · 16 min read · benchmark / agent-harnesses / comparison / methodology
TPL-2026-001 · preprint · open access · 2026-04-30

Detecting Eval-Train Overlap in Production CV Pipelines: A Lightweight Audit Protocol

Computer vision systems shipped to production frequently report misleading evaluation metrics due to undetected overlap between training and held-out eval data. We present a sub-second, hash-based audit protocol that runs as an adversarial step in the eval pipeline and raises an alarm on any F1 measurement above a configurable suspicion threshold. Applied retroactively to four production CV systems, the audit identified silent overlap in two of the four pipelines, with overlap rates between 6.2% and 14.8%; in each case, inflated F1 scores of 0.97–0.99 collapsed to 0.71–0.83 once the overlap was removed.
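
A minimal sketch of a hash-based overlap audit, assuming train and eval images are available as files: hash every training image, then scan the eval set for byte-exact collisions. The F1 suspicion threshold value is a placeholder for the configurable one the abstract describes, and near-duplicates would need perceptual hashing instead.

```python
import hashlib
from pathlib import Path

F1_SUSPICION_THRESHOLD = 0.95  # placeholder; the protocol makes this configurable

def file_digest(path: Path) -> str:
    """Content hash of one image file (catches byte-exact duplicates only)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def overlap_rate(train_dir: Path, eval_dir: Path) -> float:
    """Fraction of eval images whose bytes also appear in the training set."""
    train_hashes = {file_digest(p) for p in train_dir.iterdir() if p.is_file()}
    eval_files = [p for p in eval_dir.iterdir() if p.is_file()]
    hits = sum(file_digest(p) in train_hashes for p in eval_files)
    return hits / max(len(eval_files), 1)

def audit(train_dir: Path, eval_dir: Path, measured_f1: float) -> None:
    rate = overlap_rate(train_dir, eval_dir)
    if rate > 0.0:
        print(f"ALARM: {rate:.1%} eval/train overlap detected")
    if measured_f1 > F1_SUSPICION_THRESHOLD:
        print(f"ALARM: F1={measured_f1:.2f} exceeds suspicion threshold; audit before trusting")
```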

TruPath Labs Research · 14 min read · CV / evaluation / data-leakage / methodology
TPL-2026-011 · preprint · open access · 2026-04-30

Long-Context vs Targeted Retrieval for Compliance Drafting: An SBA Loan Packet Case Study

Modern long-context language models (1M-token windows) make it tempting to load an entire reference corpus and draft against it; the implicit premise is that more context is always better when accuracy matters. We test this premise on a class of work where accuracy carries real consequences: SBA 7(a) loan packet drafting for Mile High Golf, a pre-launch entertainment venue. We pair-draft 12 SBA-style packet sections (use-of-funds narrative, projections justification, market analysis, owner resume, etc.) under two conditions — (A) a full 1M-token context load of the SBA SOP, recent guidance, and prior-year MHG planning materials, and (B) targeted retrieval (scoped reads of the specific SOP sections relevant to the section being drafted). Drafts were evaluated by a human reviewer with SBA-packet experience using a 5-class hallucination taxonomy. Long-context drafts produced 2.8× more total hallucinations per packet section than targeted-retrieval drafts (mean 4.2 vs 1.5; n=12 paired sections). The class breakdown is informative: long-context drafts hallucinate plausible-but-fabricated regulatory citations at 6× the rate, suggesting the failure mode is haystack pollution rather than a missing fact. Operator time per section is comparable across conditions; long-context appears cheaper by token cost but is more expensive by reviewer-correction time. We argue that for regulatory drafting, scoped retrieval is the load-bearing primitive, and that substituting a full-context load for it adds risk without measurable reward. Mile High Golf is pre-launch, so no actual SBA packet has yet been filed; all drafts are illustrative, and the Limitations section names this and other honesty items.

TruPath Labs Research · 12 min read · mhg / sba / compliance / long-context / retrieval / hallucination
TPL-2026-016 · preprint · open access · 2026-04-30

Information-Loss Curves Across Multi-Agent Handoff Chains

Operators increasingly compose work as a chain of specialized LLM agents — a routing agent dispatches to a planning agent, which delegates to a coding agent, which calls a review agent. Each handoff is a serialization-deserialization step that loses information. We instrumented 142 handoff events across 38 chains spanning two ventures (the Quantum Caddy CV pipeline and Parley AR research pipelines) and measured fact preservation at each agent boundary using a manually curated rubric of 8 fact classes per chain. Mean end-to-end fact retention was 91% for chains of N=2 agents, 76% at N=3, 58% at N=4, and 41% at N=5 — a roughly geometric decay consistent with independent loss probabilities at each boundary. We then evaluate four mitigation patterns (shared-state files, structured handoff envelopes, summarize-then-verify, retrieval-on-demand) and report effectiveness data for each. Structured handoff envelopes lift N=4 retention from 58% to 84% in our sample.
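
A sketch of the structured-handoff-envelope pattern, the mitigation that lifted N=4 retention from 58% to 84% in the sample. The field names and the verify step are assumptions about what such an envelope contains, not the paper's schema.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class HandoffEnvelope:
    """Explicit serialization at an agent boundary: facts travel as enumerated
    fields instead of free prose, so the receiver can verify rather than infer."""
    task: str
    constraints: list[str]
    facts: list[str]           # the fact classes the rubric scores retention on
    open_questions: list[str]
    artifacts: list[str]       # paths/IDs the next agent must read, not reconstruct

def serialize(env: HandoffEnvelope) -> str:
    return json.dumps(asdict(env), indent=2)

def verify_receipt(env: HandoffEnvelope, receiver_restatement: str) -> list[str]:
    """Summarize-then-verify: return every enumerated fact the receiving agent
    failed to restate, so the sender can re-transmit before work proceeds."""
    return [f for f in env.facts if f.lower() not in receiver_restatement.lower()]
```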

TruPath Labs Research · 13 min read · multi-agent / orchestration / information-loss / cross-venture
TPL-2026-005 · preprint · open access · 2026-04-30

Multi-Stage Fallback in Real-Time Computer Vision: A Methodology Study

Real-time computer vision systems that must operate under latency constraints face a tension between detection accuracy and inference speed. Single-stage pipelines optimized for peak accuracy often fail the latency budget under adverse conditions (motion blur, occlusion, low contrast), while single-stage pipelines optimized for speed sacrifice recall in those same conditions. Multi-stage fallback architectures address this by chaining detectors of increasing cost and decreasing confidence threshold: a fast primary stage handles the common case, and progressively slower stages activate only when the primary stage produces no confident detection. We present a methodology study of this pattern using simulated and public-benchmark data (COCO val 2017), characterizing the tradeoff space across stage count, confidence threshold, and latency budget. Under a 40 ms end-to-end budget, a three-stage pipeline achieves estimated mean F1 of 0.83 (95% CI: 0.79–0.87, n=500 simulated sequences) versus 0.71 (95% CI: 0.67–0.75) for a single-stage speed-optimized baseline. The improvement comes at a cost: worst-case latency for the three-stage pipeline approaches the budget ceiling in dense-occlusion scenarios. We discuss the design decisions required to implement this pattern safely — threshold calibration, stage activation logic, and graceful degradation when all stages fail — and document the reproducibility parameters for practitioners wishing to evaluate the pattern on their own detection domains.
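
A minimal sketch of the stage-chaining logic under a latency budget, assuming each stage is a detector callable returning (detections, confidence): later stages run only while budget remains and earlier stages are unconfident. The 40 ms default follows the abstract; everything else is a placeholder.

```python
import time
from typing import Callable, Sequence

# A stage: detector callable plus the confidence threshold that satisfies it.
Stage = tuple[Callable[[object], tuple[list, float]], float]

def run_cascade(frame: object, stages: Sequence[Stage], budget_ms: float = 40.0):
    """Run stages in increasing-cost order; stop at the first confident detection,
    or when the end-to-end latency budget would be exceeded. Returns (detections,
    stage_index), or ([], -1) for graceful degradation when all stages fail."""
    start = time.perf_counter()
    for i, (detector, threshold) in enumerate(stages):
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        if elapsed_ms >= budget_ms:
            break  # budget ceiling reached; degrade rather than blow the deadline
        detections, confidence = detector(frame)
        if detections and confidence >= threshold:
            return detections, i
    return [], -1  # all stages unconfident: caller falls back (e.g., track-only)
```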

TruPath Labs Research · 14 min read · computer-vision / object-detection / real-time-inference / pipeline-design / methodology
TPL-2026-010 · preprint · open access · 2026-04-30

A 7-Class Failure Taxonomy for ASR-Glasses Coordination in AR Research Sprints

Parley is an AR-glasses product for bidirectional deaf/hearing conversation; the research arm is in Phase-0/1 Kaggle-published exploration, and the consumer hardware (Everysight Maverick AI) has not yet shipped. In this gap between research coding and shipped hardware, LLM agents must coordinate two semi-independent subsystems — automatic speech recognition (ASR, Whisper-style transcription) and a glasses-render simulator that stands in for the not-yet-shipped HUD. We collect 119 agent-driven coordination attempts across 28 research sprints from 2026-02 through 2026-04 and classify failures into a 7-class taxonomy: timing-misalignment, schema-drift, latency-budget-overrun, render-format-mismatch, transcription-confidence-bypass, simulator-vs-hardware-divergence, and operator-context-leak. Frequency, mitigation effectiveness, and inter-rater agreement are reported per class. Three classes (timing-misalignment, schema-drift, render-format-mismatch) account for 71% of observed failures; mitigations applied as pre-flight gates reduce subsequent recurrence 4.6× relative to discipline-only mitigations. Because the hardware has not shipped, all reported failures are simulation-mediated; the taxonomy will be re-validated against real-hardware sprints in Phase 4 and onward.
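
A sketch of a pre-flight gate against schema-drift, one of the three dominant classes: validate the ASR-to-renderer payload against an expected schema before the coordination loop starts. The schema fields are hypothetical; the pattern of gating mechanically rather than relying on discipline is the paper's finding.

```python
# Hypothetical ASR -> render-simulator payload schema; field names are illustrative.
EXPECTED_SCHEMA: dict[str, type] = {
    "utterance_id": str,
    "transcript": str,
    "confidence": float,  # transcription confidence must not be bypassed downstream
    "start_ms": int,
    "end_ms": int,
}

def preflight_gate(payload: dict) -> list[str]:
    """Return a list of schema violations; an empty list means the gate passes.
    Run before the coordination loop so schema-drift fails fast, not mid-sprint."""
    errors = []
    for field_name, field_type in EXPECTED_SCHEMA.items():
        if field_name not in payload:
            errors.append(f"missing field: {field_name}")
        elif not isinstance(payload[field_name], field_type):
            errors.append(f"{field_name}: expected {field_type.__name__}, "
                          f"got {type(payload[field_name]).__name__}")
    return errors
```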

TruPath Labs Research · 13 min read · parley / asr / ar / agents / taxonomy / research-sprints
TPL-2026-015 · preprint · open access · 2026-04-30

Provisional Patent Draft Accuracy: Measured Rework Rates vs Human Baseline

LLM-assisted provisional patent drafting promises faster cycle time at the inventor-startup phase, but the question for operators is how much of the LLM draft survives attorney review. We ran a paired-draft protocol on a single provisional patent application — Quantum Caddy's smart-board scoring system — comparing an LLM-generated draft (Cipher agent, Claude Code) against a senior IP-attorney baseline. Rework rate, measured as the percentage of words materially edited or replaced before attorney sign-off, was 31% on independent claims, 18% on the abstract, 47% on the prior-art comparison section, and 22% on figures-and-drawings descriptions. Time-to-final-draft favored the LLM-assisted path by roughly 3.4× on background and embodiment sections, and by 1.6× on claims (where attorney review absorbed most of the savings). The failure modes the LLM introduced are non-random and cluster into six recurring categories. With n=1 application, this paper is positioned as a case-study contribution rather than a population estimate, and we are explicit about which numbers are illustrative versus measured.

TruPath Labs Research · 12 min read · patents / legal-ai / drafting / qc / case-study
TPL-2026-009 · preprint · open access · 2026-04-30

Plan-Mode Efficacy on Time-to-Merge: A Cross-Venture Study (n=42 tasks)

Plan mode — the agent-harness convention of producing and approving a written implementation plan before any code is written — is widely advocated as a discipline for non-trivial coding work, but its quantitative effect on time-to-merge has rarely been measured outside anecdote. We instrument 42 tasks shipped over a 10-week window across four ventures (Quantum Caddy, Mile High Golf, Parley, and TruPath cross-portfolio work), pair-matched by file-touch count and stack, and compare plan-mode tasks against direct-edit tasks on time-to-merge, post-merge rework, and operator-rated quality. Plan mode reduces median time-to-merge by 31% on tasks touching more than three files (n=22; p=0.011) but shows a null effect on tasks touching one to three files (n=20; p=0.74). The crossover point is between three and four files touched. Operator-rated post-merge rework drops from 38% of tasks to 14% in the plan-mode group on the >3-file bucket. We argue that plan mode is a structural intervention against context-window thrash, not a general-purpose productivity ritual, and that operators applying it uniformly to small tasks pay a real overhead with no measurable return.
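
The finding reduces to a one-line routing heuristic; a sketch, with the crossover point taken from the abstract.

```python
PLAN_MODE_FILE_THRESHOLD = 3  # crossover observed between three and four files touched

def should_use_plan_mode(estimated_files_touched: int) -> bool:
    """Plan mode paid off only above the crossover; below it, the study
    measured overhead with no detectable return."""
    return estimated_files_touched > PLAN_MODE_FILE_THRESHOLD
```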

TruPath Labs Research · 12 min read · operations / plan-mode / productivity / methodology / cross-venture
TPL-2026-004 · preprint · open access · 2026-04-30

Postmortem-Driven SOP Effectiveness: A 6-Month Recurrence Audit

Blameless postmortems are widely advocated as a structural-learning discipline, but their effectiveness depends on whether action items reach the mechanical layer (gates, hooks, contract criteria) rather than remaining as discipline-only commitments. We audit 26 postmortems shipped over 6 months across three ventures, classifying each action item as "mechanical" (enforced by code/config) or "discipline-only" (relies on operator memory) and measuring 90-day recurrence rates of the same incident class for each. Mechanical action items showed 4.2% recurrence (1 of 24); discipline-only action items showed 41.7% recurrence (10 of 24). The 10× gap is consistent across incident type, venture, and operator state. Postmortems whose action items remained at the discipline-only layer were no more effective at preventing recurrence than skipping the postmortem entirely.
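
An example of moving an action item from the discipline-only layer to the mechanical layer, the distinction the audit turns on. The postmortem item here (refuse to run an eval without a held-out manifest) is hypothetical; the pattern of blocking rather than reminding is the paper's.

```python
import sys
from pathlib import Path

def gate_eval_run(eval_dir: Path) -> None:
    """Mechanical action item: refuse to run an eval unless a held-out manifest
    exists. The discipline-only version of this item is 'remember to check',
    the layer the audit found recurring at ~42% within 90 days."""
    manifest = eval_dir / "holdout_manifest.json"
    if not manifest.exists():
        sys.exit(f"GATE FAILED: {manifest} missing; eval blocked, not warned.")

# Wired into the eval entry point, the check cannot be skipped by a tired operator:
# gate_eval_run(Path("evals/latest"))
```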

TruPath Labs Research · 10 min read · operations / postmortems / incident-management / methodology
TPL-2026-017 · preprint · open access · 2026-04-30

Real-Estate Site-Search Automation: Hours-Saved and Filter-Accuracy in a Pre-Launch Venue Search

Pre-launch entertainment-venue site searches are constrained by an unusual cocktail of filters: zoning compatibility, square-footage and ceiling-height fit, ABC-license geography (in NC, alcohol licensing depends on municipal jurisdiction and proximity to schools/churches), and proximity to demand. We instrumented an 8-week site search for Mile High Golf, a pre-launch indoor-golf entertainment venue, comparing a manual baseline (a commercial RE advisor working alone) against an agent-assisted pipeline (the advisor plus a Claude Code agent that ingests county GIS, ABC-licensing maps, and listing feeds; pre-filters; and ranks). The agent-assisted pipeline saved approximately 14 hours of advisor time per week at week 1, climbing to 22 hours per week by week 8 as filter calibration improved. Agent precision against the advisor's verdict started at 61% and reached 87% by week 8. Recall at the agent-to-advisor pass-through stage stayed above 95% throughout — the false-negative rate was the one metric the advisor was unwilling to compromise on. We discuss the limitations of a single-venture, single-geography study and offer a reproducibility checklist.
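
A sketch of the pre-filter stage shaped by the recall constraint the abstract names: exclude a listing only on hard, verifiable criteria so borderline cases pass through to the advisor. Field names and thresholds are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Listing:
    zoning_compatible: bool       # from county GIS lookup
    sqft: float
    ceiling_ft: float
    abc_eligible: bool            # municipal jurisdiction + school/church proximity
    distance_to_demand_mi: float

def prefilter(listings: list[Listing],
              min_sqft: float = 8000.0, min_ceiling_ft: float = 14.0) -> list[Listing]:
    """Exclude only on hard failures; anything ambiguous passes through,
    keeping pass-through recall high at the cost of early precision."""
    return [x for x in listings
            if x.zoning_compatible and x.abc_eligible
            and x.sqft >= min_sqft and x.ceiling_ft >= min_ceiling_ft]

def rank(listings: list[Listing]) -> list[Listing]:
    """Rank survivors by proximity to demand; ranking errors are recoverable
    by the advisor, filter false-negatives are not."""
    return sorted(listings, key=lambda x: x.distance_to_demand_mi)
```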

TruPath Labs Research · 12 min read · mhg / real-estate / automation / case-study
TPL-2026-002 · preprint · open access · 2026-04-30

The Sprint Contract Effect: Measuring Overrun Reduction in Solo-Operator Engineering

Solo operators running agent-mediated engineering work routinely overrun sprint estimates by 50–200% of the originally scoped time. We measured the effect of adopting a structured sprint-contract discipline (pre-work definition of done, enumerated failure modes, gate-out acceptance criteria) on contract overrun rate across n=48 contracts shipped over 16 weeks by one operator across three ventures. Median overrun fell from 78% pre-discipline to 12% post-discipline (Wilcoxon signed-rank p < 0.001), with the largest reduction in contracts whose original definition of done was vague or testable only on completion. The discipline transferred fully across work domains (engineering, operations, legal) without per-domain tuning.
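
A sketch of a sprint contract's machine-checkable skeleton, covering the three pre-work elements the abstract names. Field names are assumptions; the paper does not publish a schema.

```python
from dataclasses import dataclass

@dataclass
class SprintContract:
    """Pre-work artifact: every field is written before the sprint starts."""
    definition_of_done: list[str]        # each entry testable, not aspirational
    enumerated_failure_modes: list[str]  # known ways this sprint goes sideways
    acceptance_criteria: list[str]       # gate-out checks run before calling it done
    scoped_hours: float

    def overrun_pct(self, actual_hours: float) -> float:
        """Overrun as a percentage of originally scoped time."""
        return 100.0 * (actual_hours - self.scoped_hours) / self.scoped_hours
```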

TruPath Labs Research · 11 min read · operations / engineering-discipline / agent-mediated / methodology
TPL-2026-006 · preprint · open access · 2026-04-30

Sub-agent ROI: When Spawning Pays Back and When It Doesn't

Sub-agent invocation — spawning a child agent to handle a bounded subtask — is increasingly common in solo-operator AI workflows, yet the conditions under which it produces a net benefit remain poorly characterized. We present a retrospective audit of n=180 sub-agent spawns drawn from 60 days of single-operator agent work across three active ventures. Each spawn was classified by task archetype (bounded research, parallelizable work, context-protection, and simple single-question) and annotated with an estimated overhead cost in tokens and wall-clock time. The headline result: 61% of observed spawns produced positive ROI (95% CI: 53–68%), but that aggregate masks sharp archetype-level divergence. Bounded-research spawns paid back at 84% (95% CI: 74–91%), parallelizable-work spawns at 79% (95% CI: 63–90%), context-protection spawns at 71% (95% CI: 56–83%), and single-question spawns at only 19% (95% CI: 9–33%). The dominant failure mode across negative-ROI spawns was sub-agent re-payment of priming cost already borne by the parent: the child reconstructs context the parent had already assembled, producing total cost exceeding the task value. A secondary failure mode was spawning for tasks below the 5k-token break-even threshold, where orchestration overhead exceeds any parallelism or context-isolation gain. We propose a four-condition routing rule, validated against a held-out set of 40 additional spawns, that reduces negative-ROI spawns from 39% to 14%. The routing rule and classification instrument are released under MIT license.
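
A sketch of a four-condition routing rule in the shape the abstract describes. The conditions below are inferred from the named failure modes (priming re-payment, sub-break-even tasks) and the archetype payback rates; they are illustrative, not the released instrument.

```python
BREAK_EVEN_TOKENS = 5_000  # below this, orchestration overhead dominates any gain

HIGH_PAYBACK_ARCHETYPES = {"bounded_research", "parallelizable_work", "context_protection"}

def should_spawn(archetype: str,
                 estimated_task_tokens: int,
                 parent_context_reusable: bool,
                 subtask_is_bounded: bool) -> bool:
    """Spawn only when all four conditions hold; illustrative conditions
    matching the abstract's failure modes rather than the released rule."""
    return (
        archetype in HIGH_PAYBACK_ARCHETYPES            # single-question spawns paid back only 19%
        and estimated_task_tokens >= BREAK_EVEN_TOKENS  # don't pay overhead on tiny tasks
        and not parent_context_reusable                 # avoid re-paying priming the parent already bore
        and subtask_is_bounded                          # child must finish without callbacks
    )
```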

TruPath Labs Research · 13 min read · operations / agent-orchestration / agent-mediated / methodology
TPL-2026-003 · preprint · open access · 2026-04-30

Token Economics of Agent-Mediated Engineering: Per-Artifact Cost Distributions

Operators running agent-mediated engineering workflows routinely lack visibility into the per-artifact token cost of shipped work, treating the monthly API bill as a single aggregate figure. We instrument 92 days of single-operator agent usage across three concurrent ventures, computing per-shipped-artifact token cost, cache hit ratio, and the rate of "uncommitted-output sessions" (sessions consuming >5,000 tokens with no shipped artifact). Median cost per shipped artifact was 6,840 tokens (IQR 3,210–14,520) and showed strong dependence on cache hit ratio (Spearman ρ = −0.71, p < 0.001). Three discrete behavioral interventions — session batching, brief-drift detection, and a hard 30,000-token session cap — reduced median per-artifact cost by 38% over a 4-week intervention period.
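
A sketch of the per-artifact accounting the paper instruments, with the 30,000-token session cap and the 5,000-token uncommitted threshold as named constants. Record fields are assumptions.

```python
from dataclasses import dataclass
from statistics import median

SESSION_TOKEN_CAP = 30_000     # hard cap from the intervention set
UNCOMMITTED_THRESHOLD = 5_000  # >5k tokens with nothing shipped

@dataclass
class Session:
    tokens_used: int
    cache_hit_ratio: float
    artifacts_shipped: int

def median_cost_per_artifact(sessions: list[Session]) -> float:
    """Median tokens per shipped artifact across productive sessions."""
    costs = [s.tokens_used / s.artifacts_shipped
             for s in sessions if s.artifacts_shipped > 0]
    return median(costs) if costs else float("nan")

def uncommitted_rate(sessions: list[Session]) -> float:
    """Share of sessions burning >5k tokens with no shipped artifact."""
    flagged = [s for s in sessions
               if s.tokens_used > UNCOMMITTED_THRESHOLD and s.artifacts_shipped == 0]
    return len(flagged) / max(len(sessions), 1)
```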

TruPath Labs Research · 12 min read · operations / tokens / agent-mediated / cost-engineering

Cadence: papers ship when the data is ready and the methodology is sound, not on a schedule. Subscribe to Field Notes for new-paper announcements.

Some forthcoming papers will be member-only or paid once Quantum Caddy provisional patents are filed and the paywall is wired. Open-access pieces stay open.