Hardware-in-Loop Agent Failure Modes: 60 Days of ESP32 + CV + LLM Co-Development
Abstract
When LLM agents simultaneously drive firmware authorship (ESP32), computer-vision pipeline evolution (RT-DETR / YOLO11 on Jetson Orin), and sensor-layer calibration, failures do not distribute uniformly. We catalog 94 agent-attributed failures across 60 days of the Quantum Caddy smart-board development cycle and classify them into six categories: context-collapse, hardware-assumption drift, calibration-loop divergence, firmware-CV protocol mismatch, sensor-signal hallucination, and safety-boundary erosion. Mean-time-to-resolution (MTTR) ranges from 18 minutes (sensor-signal hallucination) to 4.2 hours (firmware-CV protocol mismatch). Cost-per-class spans $0.11 to $2.40 per incident. Context-collapse and protocol-mismatch account for 61% of total remediation cost despite representing only 38% of incident count. All numbers are illustrative, synthesized from the QC development log; see §6 Limitations.
1. Introduction
Hardware-in-loop (HIL) testing has a long history in automotive and aerospace control systems [1]. In that tradition, a physical or simulated plant is connected to the controller under test so that timing, signal integrity, and failure modes can be exercised before the controller is deployed to the real world. The discipline assumes a well-defined interface between controller and plant, a stable hardware model, and a human engineer who can reason about both sides of the boundary at once.
LLM-agent-mediated development introduces a third party into that loop — an agent that is simultaneously authoring the controller firmware, tuning the plant model (sensor calibration), and evolving the CV pipeline that interprets the plant’s physical outputs. The interface between these layers is no longer static; it is a moving target that the same agent is modifying in all three directions at once. The human engineer is still in the loop, but the agent’s context window, not the engineer’s working memory, is now the binding constraint on coherence.
The Quantum Caddy (QC) smart-board project provided an unusually clean case study for observing this failure surface. The hardware stack comprises an ESP32-WROOM-32D running custom firmware that speaks a WebSocket protocol on port 81 [11], four 50 kg load cells read through an HX711 ADC in a Wheatstone bridge configuration [10], and an NVIDIA Jetson Orin Nano [5] running RT-DETR [3] for ball and bag detection. The sensor strategy pivoted mid-development from FSR/load-cell sensing to millimeter-wave radar and IR time-of-flight after an embedded-systems design review surfaced physics assumptions that had never been bench-tested [12]. That pivot created a natural experiment: the same agent-mediated dev loop continued over a new hardware layer, and the failure taxonomy shifted in measurable ways.
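To make the load-cell layer concrete: the HX711 emits signed 24-bit counts, and converting those to mass is a linear calibration (subtract a zero-load tare, divide by a counts-per-kilogram scale found with a known reference mass). The QC firmware's actual calibration routine is not shown in this excerpt; the function names and constants below are illustrative assumptions, not the project's code.

```python
def counts_to_kg(raw_counts: int, tare_offset: int, counts_per_kg: float) -> float:
    """Convert one raw HX711 reading to kilograms.

    The HX711 returns a signed 24-bit count. Subtracting the zero-load
    tare offset and dividing by a scale factor (derived by placing a
    known mass on the platform) yields a linear mass estimate.
    """
    return (raw_counts - tare_offset) / counts_per_kg


def total_load_kg(readings, tares, counts_per_kg: float) -> float:
    """Sum the calibrated output of the four corner load cells."""
    return sum(counts_to_kg(r, t, counts_per_kg) for r, t in zip(readings, tares))


# Illustrative numbers only: four corner readings against a shared scale factor.
example = total_load_kg(
    readings=[120500, 118200, 121000, 119300],
    tares=[100000, 100000, 100000, 100000],
    counts_per_kg=10000.0,
)
```

This is exactly the kind of per-cell tare and scale state that an agent can silently drift out of sync with the firmware, which is why calibration-loop divergence appears as its own failure class below.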
We catalog 94 agent-attributed incidents over 60 days and classify them into six categories. We measure frequency, mean-time-to-resolution (MTTR), and estimated cost per class. We identify the mitigations that produced the largest incident-rate reductions and propose a generalized HIL discipline for operators running LLM agents over multi-layer hardware stacks.
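The per-class metrics reduce to a simple aggregation over the incident log: count incidents per class, average their resolution times for MTTR, and sum their costs. The actual log schema is not shown in this excerpt; the field names `cls`, `mttr_min`, and `cost_usd` below are assumptions for illustration.

```python
from collections import defaultdict


def summarize(incidents):
    """Aggregate incident records into per-class count, mean MTTR, and total cost.

    Each incident is a dict with a failure class ('cls'), a resolution
    time in minutes ('mttr_min'), and an estimated cost ('cost_usd').
    """
    by_cls = defaultdict(list)
    for inc in incidents:
        by_cls[inc["cls"]].append(inc)
    return {
        cls: {
            "count": len(rows),
            "mttr_min": sum(r["mttr_min"] for r in rows) / len(rows),
            "cost_usd": sum(r["cost_usd"] for r in rows),
        }
        for cls, rows in by_cls.items()
    }
```

A usage sketch with invented records: `summarize([{"cls": "context-collapse", "mttr_min": 120, "cost_usd": 2.0}, {"cls": "context-collapse", "mttr_min": 60, "cost_usd": 1.0}])` reports a count of 2, a mean MTTR of 90 minutes, and $3.00 total cost for that class. The same reduction, run over all 94 incidents, produces the frequency, MTTR, and cost figures reported per class in the abstract.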