Teardown · 2026-05-17 · 8 min

We put a language model inside a hardware device. Here's every decision we made.

LLMs in real-time hardware aren't ChatGPT. The latency budget is the constraint that changes everything.

Quantum Caddy is a hardware product. Smart board with embedded sensors. Two cameras. Real-time computer vision pipeline. The product narrates your cornhole game as you play it — score, zone analysis, coaching — in under 200 milliseconds from the moment a bag settles.

That last requirement is the one that broke six consecutive training runs and forced us to rearchitect the entire LLM integration from scratch.

The original design used a language model for everything. Bag lands. Vision pipeline detects it. LLM gets a prompt with the world state and produces commentary. It sounded clean. It didn't work.

The latency problem

The problem is latency arithmetic. A decent local model on our Mac Studio M4 takes 600 to 1000 milliseconds just to start producing output. Then Kokoro TTS needs another 200 milliseconds to play the first audio chunk. You're already past a second before the player hears anything. For a casual coaching question, that's fine. For a play-by-play call the moment a bag hits the board, it's not. The throw has been over for a full second. The player is already walking to the board. The moment has passed.
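
Written out, the arithmetic looks like this. It is a minimal sketch that uses only the figures quoted above; the constant and function names are illustrative, not the actual code.

```python
# Latency figures quoted in the post, in milliseconds, as (low, high) ranges.
LLM_FIRST_TOKEN_MS = (600, 1000)   # local model, time to first output
TTS_FIRST_CHUNK_MS = (200, 200)    # Kokoro TTS, time to first audio chunk
THROW_BUDGET_MS = 200              # play-by-play budget after a bag settles


def total_range(*stages):
    """Sum (low, high) latency ranges for stages that run back to back."""
    low = sum(stage[0] for stage in stages)
    high = sum(stage[1] for stage in stages)
    return low, high


def fits_budget(latency_range, budget_ms):
    """A path fits only if its worst case is inside the budget."""
    return latency_range[1] <= budget_ms


llm_path = total_range(LLM_FIRST_TOKEN_MS, TTS_FIRST_CHUNK_MS)
print(llm_path)                                # (800, 1200)
print(fits_budget(llm_path, THROW_BUDGET_MS))  # False
```

Even the best case is four times the 200 millisecond budget, so no amount of prompt tuning closes the gap.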

The deeper problem was hallucination. During LLM training cycles v4 through v6, the model consistently failed to ground bag zone information correctly. It would say "nice shot, zone 3" when the bag landed in zone 7. The model was producing reasonable-sounding commentary that didn't match the actual board state. Six days of training, six different model versions, the same fundamental error. The model would learn the zone vocabulary but not the discipline to stay grounded in the live JSON it was receiving.

The architectural fix

The fix was architectural. We split the speech path in two.

Throw path: no LLM. A bag lands, the vision pipeline hands off to the PolicyEngine, the PolicyEngine selects a template from the CommentaryTemplateBank, Kokoro TTS fires. Total latency under 200 milliseconds. Zero hallucination. Zone accuracy 7/7 in the test battery. The model never touches this path.
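
A minimal sketch of what a template-only throw path can look like. `ThrowEvent`, `TEMPLATES`, `select_template`, and `narrate` are hypothetical stand-ins for the PolicyEngine and CommentaryTemplateBank, and the template wording is invented for illustration. The point is that every slot is filled from the vision pipeline's event, so the spoken zone cannot drift from the detected one.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ThrowEvent:
    """Hypothetical vision-pipeline output for one settled bag."""
    zone: int      # zone id reported by the vision pipeline
    points: int    # points scored by this bag
    team: str


# Hypothetical template bank: slots are filled from the event, never generated.
TEMPLATES = {
    "score": ["{team} lands in zone {zone} for {points}.",
              "Zone {zone}, {points} for {team}."],
    "miss": ["{team} slides off. No points.",
             "Nothing for {team} on that one."],
}


def select_template(event, rng):
    """Pick a template family from the event, then vary the wording."""
    key = "score" if event.points > 0 else "miss"
    return rng.choice(TEMPLATES[key])


def narrate(event, rng=None):
    """Return the line to hand to TTS; no model call anywhere on this path."""
    rng = rng or random.Random()
    template = select_template(event, rng)
    return template.format(zone=event.zone, points=event.points, team=event.team)


print(narrate(ThrowEvent(zone=7, points=3, team="Red")))
```

Variation comes from having several templates per situation, not from generation, which is why this path costs a dictionary lookup and a string format rather than a forward pass.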

Conversation path: LLM only. A player says something or the UI sends a message to the `/message` endpoint. That goes to qcaddy_brain.py, which loads Gemma v7 with the current WorldState JSON and ChromaDB retrieval context, and generates a response. Latency here is 600 to 1000 milliseconds for first token, then streaming audio. That's acceptable for conversation. It's not acceptable for play-by-play.
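
For the conversation path, prompt assembly might look roughly like the sketch below. `build_prompt` and the prompt wording are assumptions made for illustration, not what qcaddy_brain.py actually contains. The idea it shows is putting the live WorldState JSON and the retrieved notes ahead of the player's message so the model has something concrete to ground on.

```python
import json


def build_prompt(message, world_state, retrieved):
    """Assemble a grounded prompt: live state, then retrieved notes, then the question."""
    state_block = json.dumps(world_state, indent=2, sort_keys=True)
    context_block = "\n".join(f"- {doc}" for doc in retrieved) or "- (no context retrieved)"
    return (
        "You are a cornhole coach. Use only the board state and notes below.\n\n"
        f"BOARD STATE (JSON):\n{state_block}\n\n"
        f"NOTES:\n{context_block}\n\n"
        f"PLAYER: {message}\n"
    )


prompt = build_prompt(
    "How am I doing?",
    {"score": {"red": 5, "blue": 2}},          # hypothetical WorldState shape
    ["Aim for a flat spin so the bag slides."],  # hypothetical retrieved note
)
print(prompt)
```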

This is the April 10 template breakthrough. It sounds obvious in retrospect. It took six weeks to arrive at.

Why Gemma, why MLX, why Mac Studio

The choice of Gemma 4 E2B at 2.3 billion parameters was a latency decision, not a capability decision. A bigger model would produce better responses. A bigger model running on the same hardware would also blow the latency budget and thermal envelope. We're using MLX for 4-bit quantized inference, which keeps the model weights at about 1.5 gigabytes in unified memory. That leaves enough headroom for the vision pipeline and PerceptionRuntime to run in parallel without thermal throttling. On Apple Silicon there's no GPU/CPU memory boundary — the M4's unified memory architecture means the model and the CV pipeline share the same pool without copy overhead. That's why Mac Studio was the right runtime and CUDA was never on the table.
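
A back-of-envelope check on the weight footprint. This is a sketch under stated assumptions: group-wise quantization with a 16-bit scale and a 16-bit offset per group of 64 weights. Neither the group size nor which tensors stay at higher precision is given above, so treat the output as a ballpark.

```python
def quantized_weight_gb(n_params, bits=4, group_size=64, overhead_bits_per_group=32):
    """Estimate weight storage for group-wise quantization, in decimal gigabytes.

    group_size and overhead_bits_per_group are assumptions, not measured values.
    """
    bits_per_weight = bits + overhead_bits_per_group / group_size
    return n_params * bits_per_weight / 8 / 1e9


# 2.3 billion parameters, with and without per-group scale/offset overhead.
print(round(quantized_weight_gb(2.3e9), 2))                             # 1.29
print(round(quantized_weight_gb(2.3e9, overhead_bits_per_group=0), 2))  # 1.15
```

The figure of about 1.5 gigabytes is in the same range. The sketch does not try to account for the difference.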

Structured output over free prose

The structured output decision mattered more than I expected. Every prompt to the LLM requests JSON back: a score string, a zone label, a coaching fragment — not free prose. The enforced structure prevented the model from hallucinating bag counts into flowing paragraphs. When the output has a schema, it fails loudly instead of quietly. A zone label field returning "zone 12" when there are only 8 zones is detectable. A bag-count field returning 7 when the board shows 3 can be validated against the vision pipeline state. Structured output gave us an error surface. Free prose gave us plausible lies.
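
A minimal sketch of the kind of check a schema makes possible. The field names, the integer zone encoding, and the `validate_reply` helper are assumptions made for illustration; the two failure cases are the ones described above.

```python
import json

NUM_ZONES = 8  # the post mentions 8 zones; numbering them 1..8 is an assumption


def validate_reply(raw, board_state):
    """Return a list of problems; an empty list means the reply passed every check."""
    try:
        reply = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(reply, dict):
        return ["reply is not a JSON object"]

    problems = []
    for field in ("score", "zone", "bag_count", "coaching"):
        if field not in reply:
            problems.append(f"missing field: {field}")

    # Range check: a zone that does not exist on the board is rejected outright.
    zone = reply.get("zone")
    if zone is not None and zone not in range(1, NUM_ZONES + 1):
        problems.append(f"zone {zone} out of range 1..{NUM_ZONES}")

    # Cross-check: the model's bag count must match what the vision pipeline sees.
    count = reply.get("bag_count")
    if count is not None and count != board_state.get("bag_count"):
        problems.append(f"bag_count {count} != board {board_state.get('bag_count')}")
    return problems


bad = '{"score": "3-0", "zone": 12, "bag_count": 7, "coaching": "Aim lower."}'
for problem in validate_reply(bad, {"bag_count": 3}):
    print(problem)
```

A reply that fails any check can be dropped or retried before it reaches TTS, which is not possible when the same errors are buried in a paragraph of prose.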

The RAG setup was essential for the coaching path. Seven ChromaDB collections hold the cornhole knowledge base, including rules, throw technique, tournament strategy, and session history. The model doesn't know what it knows until retrieval routes the right context into the prompt. Without it, Gemma v7 produces coaching that sounds reasonable but gets cornhole specifics wrong. With it, the responses are grounded.

What I'd do differently

The voice pipeline deserves more attention than we gave it initially. Faster-Whisper with Silero VAD for input, XTTS v2 streaming for output: both good choices. But we treated the ASR and TTS as plumbing for months before we measured the real latency contribution. Faster-Whisper adds 300 milliseconds including silence detection. Kokoro TTS is another 200 milliseconds. Those aren't free. Add the 600 to 1000 milliseconds to first token and the conversation path runs roughly 1.1 to 1.5 seconds from the player finishing a sentence to first audio. That's acceptable, barely. On slower hardware it wouldn't be.

The other thing I'd change is the training evaluation harness. We trained nine model versions before we discovered that our eval script was reading live board state instead of injecting synthetic test state. So v9 scoring 2/11 on the eval battery wasn't a model failure — it was an eval failure. Seven of the eleven questions required game state that the board didn't have. The production model would have scored the same. We wasted a 15-hour RunPod run and $0.45 because nobody ran the eval against the current production model before training.

Run the eval against prod first. If prod also fails, the eval is broken. Never train until the eval is honest.
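
That rule can be made mechanical. Below is a minimal sketch of such a gate; the case format, the `run_eval` and `eval_is_honest` helpers, and the 0.8 threshold are all hypothetical. Each case carries its own synthetic board state, so the score never depends on what the live board happens to show.

```python
# Hypothetical eval cases: the board state is injected with the question.
EVAL_CASES = [
    {"state": {"red": 3, "blue": 0}, "question": "Who is ahead?", "expect": "red"},
    {"state": {"red": 1, "blue": 4}, "question": "Who is ahead?", "expect": "blue"},
]


def run_eval(model, cases=EVAL_CASES):
    """Score a model (any callable taking state and question) on synthetic cases."""
    passed = sum(
        case["expect"] in model(case["state"], case["question"]).lower()
        for case in cases
    )
    return passed / len(cases)


def eval_is_honest(prod_model, min_prod_score=0.8):
    """Training gate: trust the eval only if the production model passes it."""
    return run_eval(prod_model) >= min_prod_score


def stub_prod_model(state, question):
    """Stand-in for the production model; it simply reads the injected state."""
    leader = max(state, key=state.get)
    return f"{leader} is ahead"


if eval_is_honest(stub_prod_model):
    print("eval looks honest, safe to start training")
else:
    print("prod fails the eval, fix the eval before training")
```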

---

*Filed from the QC lab. The model is the smallest part of the problem; the latency budget is the whole architecture.*