Same Task, Three Harnesses: A Throughput-Quality-Cost Benchmark of Claude Code, Codex, and Cursor on a Multi-Stage CV Pipeline Task
Abstract
We report a case study (n=1 task, three agent harnesses, single operator) comparing Claude Code, OpenAI Codex CLI, and Cursor on the construction of a multi-stage computer-vision evaluation pipeline. The task was held constant across all three sessions: implement a pipeline that measures per-stage detection coverage on a held-out image sequence, accepting three callable detectors and producing a metrics report. The harness was the only variable. We measured time-to-working-artifact (minutes), total token cost, code-review score across four axes (correctness, clarity, brief-fidelity, test-coverage), and operator-intervention count. This is a case study, not a population-level claim; the contribution is a repeatable methodology for single-operator harness comparisons, plus one data point produced with it. The results show meaningful throughput-versus-quality tradeoffs across harnesses that were not predicted by the harnesses' respective positioning. Claude Code showed the strongest brief-fidelity and the lowest intervention count; Cursor completed the core artifact with the best throughput per token; Codex produced the most self-contained, portable output. All three harnesses shipped a working pipeline. The appropriate interpretation is that each harness exhibits distinct ergonomic patterns suited to different operator contexts, not that any harness dominates. We provide the task brief, evaluation rubric, and scoring procedure in sufficient detail for independent replication.
1. Introduction
Practitioners building real systems with AI coding assistants now have a proliferating menu of harnesses to choose from. Claude Code [4], OpenAI’s Codex CLI [5], and Cursor [6] each offer a different model for how an operator interacts with an AI agent during an engineering session. The harnesses differ in their interaction modalities (terminal vs. IDE), their context models (file-system-aware vs. editor-buffer-aware), and their default behaviors around tool use, file creation, and clarification-seeking.
The academic literature on code-generation evaluation (HumanEval [1], SWE-bench [2]) focuses on model capabilities in isolation — a fixed prompt, a measurable output, a pass/fail gate. These benchmarks are valuable for model comparison but do not capture the ergonomics of working within a harness across a multi-step session: how often the operator needs to redirect, how faithfully the final artifact matches the original brief, and how much operator time is consumed by the interaction protocol itself.
This study attempts a narrow but reproducible comparison across those ergonomic dimensions. We gave all three harnesses the same engineering task — build a multi-stage CV evaluation pipeline — under controlled conditions and measured four outcomes: time-to-working-artifact, total token consumption, code quality across four axes, and operator-intervention count.
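To make the shared brief concrete, the sketch below shows one plausible shape for the deliverable: a per-stage coverage function that takes three callable detectors and a held-out, ground-truth-annotated image sequence and returns a metrics report. The names (`run_pipeline`, `Detector`, `iou`) and the matching rule (a ground-truth box counts as covered if any prediction overlaps it at IoU ≥ 0.5) are illustrative assumptions, not the brief's actual API; the full brief ships with the replication materials.

```python
"""Illustrative sketch only; names and the IoU-threshold matching rule are
assumptions, not the task brief's wording."""
from typing import Callable, Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]    # (x1, y1, x2, y2) in pixels
Detector = Callable[[object], List[Box]]   # image -> predicted boxes


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def run_pipeline(
    stages: Dict[str, Detector],            # e.g. three detectors, one per stage
    images: Sequence[object],               # held-out image sequence
    ground_truth: Sequence[List[Box]],      # GT boxes aligned with `images`
    iou_threshold: float = 0.5,
) -> Dict[str, float]:
    """Return per-stage coverage: fraction of GT boxes matched by a detection."""
    report: Dict[str, float] = {}
    for name, detect in stages.items():
        matched, total = 0, 0
        for image, gt_boxes in zip(images, ground_truth):
            preds = detect(image)
            total += len(gt_boxes)
            matched += sum(
                any(iou(gt, p) >= iou_threshold for p in preds) for gt in gt_boxes
            )
        report[name] = matched / total if total else 0.0
    return report
```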
This is a case study, not a benchmark in the population-statistics sense. n=1 task, 3 sub-task replications per harness, single operator. The appropriate contribution claim is: here is a methodology for this kind of comparison, and here is one data point produced with it. Practitioners should weight this data point lightly and replicate on tasks and operators more representative of their own context.
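As a minimal sketch of what one data point in this methodology looks like, the record below captures the four measured outcomes for a single harness session. The field names and the per-axis integer scale are assumptions made for illustration; the actual rubric and scoring procedure accompany the paper.

```python
"""Minimal sketch, assuming one record per harness session and an integer
score per review axis; field names are illustrative, not the paper's schema."""
from dataclasses import dataclass, field
from typing import Dict

REVIEW_AXES = ("correctness", "clarity", "brief-fidelity", "test-coverage")


@dataclass
class HarnessSession:
    harness: str                        # e.g. "claude-code", "codex-cli", "cursor"
    time_to_artifact_min: float         # wall-clock minutes to a working pipeline
    total_tokens: int                   # tokens consumed over the whole session
    review_scores: Dict[str, int] = field(default_factory=dict)  # keyed by axis
    operator_interventions: int = 0     # redirects/corrections issued mid-session

    def mean_review_score(self) -> float:
        """Average across whichever of the four rubric axes were scored."""
        if not self.review_scores:
            return 0.0
        return sum(self.review_scores.values()) / len(self.review_scores)
```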