A determinism budget — how we decide which tests to replay

tl;dr

Replaying every captured request against every PR is a multiplicative problem: cost grows as captured traffic times PR rate. We took a different tack: pick a fixed per-PR replay budget (initially 90 seconds), let the budget shape the system, and treat which tests get replayed as a learning problem.

Marcus Chen

Staff Eng · codingassist.bot · @mchen_eng

February 22, 2026 · 9 min read · 1.9k reads · Intermediate

The naive version of replay is straightforward: capture production traffic for a week, then for every PR, replay the whole bundle against the candidate diff and the base branch. Compare. Done.

It works for the first ten engineers and zero PRs per hour. Then it stops working.

The shape of the problem

Each captured request is on the order of a few KB; we keep ~10⁵ per service. Replaying one request takes single-digit seconds because we instrument the runtime, not the network. So a brute-force replay of every PR against every captured request costs n × m replays, where n is the number of captured requests (your traffic-week) and m is your PR rate.

For a healthy mid-sized service:

n =  100,000 requests
m =       40 PRs / day
total =  4,000,000 replays / day
        ≈ 46 replays / second

That works on a beefy host until your service grows into seven services, and then you're priced out.
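The arithmetic above is easy to rerun for other service counts. The following is a minimal sketch, not our capacity model: the function name is hypothetical, and the seven-service case assumes every service has the same n and m as the example, which real fleets will not.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def brute_force_load(requests_per_service, prs_per_day, services=1):
    """Return (replays per day, replays per second) for replay-everything.

    Assumes every service captures the same number of requests and sees
    the same PR rate, which is a simplification for illustration.
    """
    replays_per_day = requests_per_service * prs_per_day * services
    return replays_per_day, replays_per_day / SECONDS_PER_DAY


one = brute_force_load(100_000, 40)
seven = brute_force_load(100_000, 40, services=7)

print(f"1 service:  {one[0]:,} replays/day, {one[1]:.0f}/s")    # 4,000,000 and 46/s
print(f"7 services: {seven[0]:,} replays/day, {seven[1]:.0f}/s")  # 28,000,000 and 324/s
```

Note that this counts one replay per request per PR; replaying against both the candidate diff and the base branch would double it.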

Pick a budget

We swapped "replay everything" for "replay until you're confident." Each PR gets a 90-second wall-clock budget for trace replay. The system picks the most informative subset of requests to fit inside.

The ranking model has three inputs:

Diff-touched code paths — find call sites in the diff, weight requests that exercise them.
Historical bug correlation — requests that have surfaced bugs before are more valuable.
Diversity — explore corners of the request space we don't normally see.
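One way to picture how the three inputs combine is a weighted score with a greedy pick. This is an illustrative sketch only: the `CapturedRequest` fields, the weights, and the cluster-based novelty term are all assumptions made for the example, not the production ranking model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapturedRequest:
    request_id: str
    code_paths: frozenset  # code paths the request exercised when captured
    past_bug_hits: int     # times this request has surfaced a bug before
    cluster: str           # coarse bucket in request space (hypothetical)


def score(request, diff_paths, max_hits, seen_clusters, weights):
    """Weighted sum of diff overlap, bug history, and novelty."""
    w_diff, w_bugs, w_div = weights
    overlap = len(request.code_paths & diff_paths) / len(diff_paths) if diff_paths else 0.0
    history = request.past_bug_hits / max_hits
    novelty = 0.0 if request.cluster in seen_clusters else 1.0
    return w_diff * overlap + w_bugs * history + w_div * novelty


def rank(requests, diff_paths, weights=(0.6, 0.3, 0.1)):
    """Greedily pick the best-scoring request, then re-score the rest.

    Re-scoring each round lets the novelty term favour clusters that
    have not been picked yet. This is O(n^2), fine for a toy example.
    """
    remaining = list(requests)
    max_hits = max((r.past_bug_hits for r in remaining), default=0) or 1
    seen_clusters = set()
    ranked = []
    while remaining:
        best = max(remaining, key=lambda r: score(r, diff_paths, max_hits, seen_clusters, weights))
        ranked.append(best)
        seen_clusters.add(best.cluster)
        remaining.remove(best)
    return ranked


requests = [
    CapturedRequest("a", frozenset({"billing.charge"}), 0, "checkout"),
    CapturedRequest("b", frozenset({"auth.login"}), 3, "login"),
    CapturedRequest("c", frozenset({"billing.charge", "billing.refund"}), 1, "checkout"),
]
ranked = rank(requests, diff_paths={"billing.charge", "billing.refund"})
print([r.request_id for r in ranked])  # ['c', 'b', 'a']
```

Request `c` wins on diff overlap; `b` then outranks `a` because it has bug history and comes from a cluster not yet picked.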

The output is a ranked list. The replay scheduler walks the list, runs each request, and stops when the budget is exhausted or the diff's behavioral signature stabilises (no new behavioral diff in the last N replays).
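The stopping rule can be sketched as a single loop. This is a simplified illustration under stated assumptions: `replay` is a hypothetical callable that returns a hashable behavioral diff (or `None` when candidate and base agree), and `stable_after` stands in for the N mentioned above. It checks the clock only between replays, so a real scheduler would also need a per-replay timeout.

```python
import time


def replay_within_budget(ranked, replay, budget_seconds=90.0,
                         stable_after=20, clock=time.monotonic):
    """Walk the ranked list until the budget is spent or the diff stabilises.

    Returns (distinct behavioral diffs seen, replays run, stop reason).
    """
    deadline = clock() + budget_seconds
    seen_diffs = set()
    since_new = 0   # replays since the last previously unseen diff
    replayed = 0
    for request in ranked:
        if clock() >= deadline:
            return seen_diffs, replayed, "budget_exhausted"
        diff = replay(request)  # None: candidate and base behaved the same
        replayed += 1
        if diff is not None and diff not in seen_diffs:
            seen_diffs.add(diff)
            since_new = 0
        else:
            since_new += 1
        if since_new >= stable_after:
            return seen_diffs, replayed, "stabilised"
    return seen_diffs, replayed, "list_exhausted"


# Toy run: only the first request changes behavior, so the signature
# stabilises after three quiet replays.
result = replay_within_budget(
    ranked=range(50),
    replay=lambda r: "status 200 -> 500" if r == 0 else None,
    stable_after=3,
)
print(result)  # ({'status 200 -> 500'}, 4, 'stabilised')
```

Passing the clock in as a parameter keeps the loop testable without waiting on real time.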

What the constraint bought us

Three things, none of which we set out to design:

Predictable PR cycle time. Reviewers know within 90 seconds whether a trace flag is going to fire. The codingassist.bot comment lands consistently after the first CI check.
A clear knob for cost vs. confidence. Tenants who want more thoroughness pay for a bigger budget. Open-source teams use the free tier with a 30-second budget and accept that obscure paths take longer to flag.
A real problem to optimise. "Pick the best 90 seconds" is a learning problem. We can A/B-test ranking strategies against historical incident data and have a real metric.

The brute-force version had no knob. Our cost was your traffic; your knob was your wallet. That ends badly for both sides.

What it cost

It cost us some coverage, mathematically. A bug that only surfaces in the 91st second of replay never fires. We tried to measure this: on six months of historical incident data, the budgeted replay caught 96% of what brute-force replay would have caught.
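A coverage figure like this is a recall measurement against the brute-force baseline. The sketch below shows the shape of that calculation under assumptions: incidents are identified by hypothetical IDs, and the sets are made-up toy data, not our incident history.

```python
def budget_recall(caught_by_brute_force, caught_by_budget):
    """Fraction of brute-force-detectable incidents the budgeted replay also caught."""
    if not caught_by_brute_force:
        raise ValueError("no baseline incidents to compare against")
    both = caught_by_brute_force & caught_by_budget
    return len(both) / len(caught_by_brute_force)


# Toy data only: five incidents a full replay would have flagged,
# four of which a budgeted replay also flagged.
brute_force = {"INC-1", "INC-2", "INC-3", "INC-4", "INC-5"}
budgeted = {"INC-1", "INC-2", "INC-4", "INC-5"}

print(f"recall vs. brute force: {budget_recall(brute_force, budgeted):.0%}")  # 80%
```

The same function can compare two ranking strategies: run each against the same historical incidents and keep the one with the higher recall at a fixed budget.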

If you're building anything similar, the lesson is: pick the budget early. Letting the system run without a budget for two years and then trying to retrofit one is a much harder problem than designing for it from the start.
