Harness racing

The first time you race two harnesses against the same brief, you’ll do it because you couldn’t decide whether Codex or Claude Code is better at your codebase. The second time, you’ll do it because Aider was 40% faster and the diff was cleaner — and you want to know which harness wins on what kind of task. By the tenth time, you’re racing because it’s cheaper than guessing.

A harness in SprintLoop is the agent runtime that executes a brief: Claude Code, Codex CLI, Cursor Compose, Aider, Devin, or any harness you connect via the harness adapter API. Racing means dispatching the same brief to two or more harnesses in parallel inside sibling lanes, letting them all complete (or get killed by the first sign-off), and picking a winner via a diff comparator instead of vibes.

Why race at all

Three reasons that survive a serious cost analysis:

Quality. Different harnesses make different mistakes. Claude tends to over-explain in commit messages and under-test edge cases; Codex tends to under-comment and over-refactor; Aider is fast and surgical but stalls on ambiguous briefs. The winner picker doesn’t care about those stylistic differences — it scores on diff size, test coverage delta, lint passes, and Review Committee verdicts. You get the best of the field on the runs where it matters.
Speed. A racing dispatch finishes when the first harness produces a diff that passes the floor (CI green, no blocking reviewer verdict). On parallel work that’s typically 30–50% faster than running the harnesses sequentially and falling back when one fails.
Cost containment. This is the counterintuitive one. You’d think racing costs N times more. In practice, the losing harnesses get killed before they consume most of their budget: the median canceled lane in our telemetry uses 18% of the winner’s tokens. So a 3-way race tends to cost 1.4–1.6× a single dispatch, not 3×.
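The cost argument above can be checked with a little arithmetic. The sketch below is only a back-of-the-envelope model, not SprintLoop's billing logic: it assumes every canceled lane spends the same fraction of the winner's cost, and it uses the 18% median as if it were that per-lane average. The function name and parameters are hypothetical.

```python
def race_cost_multiple(n_harnesses: int, loser_fraction: float = 0.18) -> float:
    """Estimate race cost as a multiple of the winner's single-dispatch cost.

    Assumption: each canceled lane spends `loser_fraction` of the winner's
    cost. The 0.18 default is the median quoted in the text, used here as a
    stand-in for the per-lane average.
    """
    if n_harnesses < 1:
        raise ValueError("a dispatch needs at least one harness")
    if not 0.0 <= loser_fraction <= 1.0:
        raise ValueError("loser_fraction must be between 0 and 1")
    # One winner at full cost, plus (n - 1) canceled lanes at a fraction of it.
    return 1.0 + (n_harnesses - 1) * loser_fraction


for n in (1, 2, 3):
    print(n, round(race_cost_multiple(n), 2))
```

For a 3-way race this gives roughly 1.36×, just under the 1.4–1.6× range quoted above. That gap would be consistent with some canceled lanes spending well above the median, though the text does not break the figure down further.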

The case against racing is that it’s wasteful when the brief is trivial (“rename this variable”) or when you trust one harness for a specific kind of work (“Aider always wins on tiny refactors in this repo”). That’s fine — racing is opt-in per dispatch, and the workspace remembers which briefs you raced and which you didn’t.

How the comparator picks a winner

Each finished lane in a race exits with a score. Each score is a weighted sum of the following signals:

Signal	Weight	Source
CI green	0.30	GitHub / GitLab status checks
Reviewer verdict	0.25	Review Committee (Architect, Security, QA)
Diff size delta	0.15	LOC added/removed vs. brief estimate
Test coverage delta	0.15	Coverage tool output (jest, pytest, go test)
Lint clean	0.08	Whatever linter the repo declares
Wall time	0.07	First-to-finish bonus (decays after 90s)

Weights are configurable per workspace under Settings → Racing. The defaults above are tuned on six months of internal telemetry across roughly 14,000 races; teams that ship to regulated industries usually push the reviewer verdict weight up and the diff-size weight down.
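To make the weighted sum concrete, here is a minimal sketch using the default weights from the table. It assumes each signal has already been normalized to a value between 0 and 1 (the text does not say how signals are normalized), and the names `DEFAULT_WEIGHTS` and `lane_score` are illustrative rather than part of any SprintLoop API.

```python
from typing import Mapping, Optional

# Default weights copied from the table above; the key names are hypothetical.
DEFAULT_WEIGHTS = {
    "ci_green": 0.30,
    "reviewer_verdict": 0.25,
    "diff_size_delta": 0.15,
    "test_coverage_delta": 0.15,
    "lint_clean": 0.08,
    "wall_time": 0.07,
}


def lane_score(
    signals: Mapping[str, float],
    weights: Optional[Mapping[str, float]] = None,
) -> float:
    """Weighted sum of signals, each assumed to be normalized to [0, 1]."""
    weights = DEFAULT_WEIGHTS if weights is None else weights
    missing = set(weights) - set(signals)
    if missing:
        raise ValueError(f"missing signals: {sorted(missing)}")
    for name in weights:
        if not 0.0 <= signals[name] <= 1.0:
            raise ValueError(f"signal {name!r} must be in [0, 1]")
    return sum(weights[name] * signals[name] for name in weights)


# Example lane: CI green, mostly positive review, middling diff and coverage.
signals = {
    "ci_green": 1.0,
    "reviewer_verdict": 0.8,
    "diff_size_delta": 0.6,
    "test_coverage_delta": 0.5,
    "lint_clean": 1.0,
    "wall_time": 0.4,
}
print(round(lane_score(signals), 3))

# A regulated-industry style override: reviewer weight up, diff-size weight down.
strict = dict(DEFAULT_WEIGHTS, reviewer_verdict=0.35, diff_size_delta=0.05)
print(round(lane_score(signals, strict), 3))
```

The defaults sum to 1.0, so a lane that maxes out every signal scores 1.0. Whether custom weights must also sum to 1.0 is not stated in the text; this sketch does not enforce it.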

The comparator runs as soon as the first lane in the race signals “ready for review.” Ties (scores within 5% of each other) surface a picker: the human dispatcher gets a side-by-side diff view and chooses. If the top score falls outside the tie band, the comparator auto-promotes the winner and cancels the losing siblings.
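The tie-band decision can be sketched as a small function. This is an illustration under stated assumptions, not the comparator's actual code: it reads “within 5%” as relative to the top score (an absolute 0.05 band is an equally plausible reading), it assumes a score is available for every lane being compared, and `resolve_race` and the lane names are hypothetical.

```python
from typing import List, Mapping, Tuple

TIE_BAND = 0.05  # assumption: 5% measured relative to the top score


def resolve_race(
    scores: Mapping[str, float], tie_band: float = TIE_BAND
) -> Tuple[str, List[str]]:
    """Return ("auto", [winner]) or ("picker", [lanes inside the tie band])."""
    if not scores:
        raise ValueError("a race needs at least one scored lane")
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    top_lane, top_score = ranked[0]
    # Every lane whose score is within the band of the leader counts as tied.
    tied = [lane for lane, s in ranked if top_score - s <= tie_band * top_score]
    if len(tied) > 1:
        return "picker", tied  # a human chooses from a side-by-side diff view
    return "auto", [top_lane]  # winner is promoted, siblings are canceled


print(resolve_race({"claude-code": 0.82, "codex": 0.80, "aider": 0.61}))
print(resolve_race({"claude-code": 0.82, "codex": 0.70}))
```

In the first call the top two lanes fall inside the band and go to the picker; in the second the gap is large enough for auto-promotion.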

What gets canceled, what gets kept

When the comparator picks a winner, sibling lanes transition to canceled and their scope claims release immediately. Their tool logs and signed entries stay on the audit ledger forever — racing without a paper trail would be the worst kind of waste — and you can pull up a canceled lane’s diff to inspect what the loser tried.

Two things to know about the canceled siblings:

Their commits are not pushed. A canceled lane never opens a PR. The branch exists locally on SprintLoop’s worker, gets archived, and is purged after 30 days unless you tag it.
Their tool calls and model spend stay on your bill. Cancellation stops new token consumption; it does not refund what has already been spent. This is why the comparator’s first-to-finish bonus matters: the sooner a winner is promoted, the less the losing lanes have spent by the time they are canceled.
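The retention rule for canceled branches is simple enough to express directly. The sketch below assumes the 30-day clock starts when the branch is archived (the text says only “after 30 days”) and that tagging exempts a branch indefinitely; `should_purge` and its parameters are hypothetical names.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=30)  # purge window quoted in the text


def should_purge(archived_at: datetime, tagged: bool, now: datetime) -> bool:
    """True if a canceled lane's archived branch is due for purging.

    Assumption: the retention clock starts at archive time, and a tagged
    branch is never purged.
    """
    if tagged:
        return False
    return now - archived_at >= RETENTION


archived = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(should_purge(archived, tagged=False, now=archived + timedelta(days=10)))
print(should_purge(archived, tagged=False, now=archived + timedelta(days=31)))
print(should_purge(archived, tagged=True, now=archived + timedelta(days=31)))
```

Only the branch is subject to this window; per the text, tool logs and signed entries remain on the audit ledger regardless.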

Common races and when to use them

The set of races we see most often:

Claude Code vs. Codex on a feature implementation. Default for non-trivial work. The two harnesses have meaningfully different strengths — Claude on architecture, Codex on tight, idiomatic implementations — and the comparator usually breaks the tie cleanly.
Aider vs. Codex on a small refactor. When you know the work is mechanical, Aider’s smaller surface area often wins on speed; Codex provides a sanity check that the refactor didn’t drift.
Three-way for a hairy bug. Claude Code, Codex, Cursor Compose. Used sparingly — usually after a single-harness dispatch already failed.
One harness, two prompts. Same harness, different briefs. Mostly used for prompt-engineering experiments — useful, but you should turn it off once you’ve picked the winning prompt.

Costs you’ll actually see

A 3-way race on a 200-line feature typically lands in the $0.40–$1.20 range with current model pricing — your mileage depends on which harnesses you race and how aggressive your maxTokens setting is. The losing lanes are killed by the comparator within a few seconds of the winner’s CI signal, so the long tail of token spend is bounded.

Workspace settings include a per-day racing budget. When you exceed it, further races that day degrade to single-harness dispatch and surface a banner. This is a soft control (you can override it inline), but it has saved a few teams from a runaway loop where every brief got raced because someone left the default on.
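The soft budget control can be pictured as a check that runs before each racing dispatch. This is a sketch of the behavior described above, not SprintLoop's implementation. In particular, the text does not say which harness a degraded race falls back to; the example assumes the first one selected. `plan_dispatch` and its parameters are hypothetical.

```python
from typing import List, Optional, Tuple


def plan_dispatch(
    harnesses: List[str],
    spent_today: float,
    daily_budget: Optional[float],
    override: bool = False,
) -> Tuple[List[str], Optional[str]]:
    """Return (harnesses to run, banner text or None).

    Assumption: once the daily racing budget is exceeded, a race degrades to
    the first selected harness unless the dispatcher overrides inline.
    """
    if not harnesses:
        raise ValueError("select at least one harness")
    over_budget = daily_budget is not None and spent_today > daily_budget
    if over_budget and not override and len(harnesses) > 1:
        banner = "Daily racing budget exceeded: dispatching a single harness."
        return [harnesses[0]], banner
    return list(harnesses), None


print(plan_dispatch(["claude-code", "codex"], spent_today=12.0, daily_budget=10.0))
print(plan_dispatch(["claude-code", "codex"], spent_today=12.0, daily_budget=10.0, override=True))
```

Because the control is soft, the override path runs the full race even when the budget is already spent; spend from that race still lands on the bill.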

Setting up your first race

Press Cmd+N to open the dispatch dialog. Below the harness picker, toggle Race. Pick two or three harnesses (you need a key for each; see Get started). Set the budget cap if your workspace has one, then dispatch.

The dispatch returns N sibling lane IDs. Each renders as its own card in the Lanes panel; the cards are visually grouped under a “Race” header with a live winner indicator. Click any card to drop into that lane’s tool log and diff.

Once a winner is picked, the race header collapses into the winner’s lane card and the rest archive into a “Canceled siblings” disclosure. Click through to inspect any of them after the fact.