The naive version of replay is straightforward: capture production traffic for a week, then for every PR, replay the whole bundle against the candidate diff and the base branch. Compare. Done.
It works for the first ten engineers and zero PRs per hour. Then it stops working.
The shape of the problem
Each captured request is on the order of a few KB; we keep ~10⁵ per service. A full replay is single-digit seconds per request because we instrument the runtime, not the network. So a brute-force replay of every PR against every captured request costs n × m replays per day, where n is the number of requests captured in your traffic-week and m is your PR rate per day.
For a healthy mid-sized service:
n = 100,000 requests
m = 40 PRs / day
total = 4,000,000 replays / day
≈ 46 replays / second
That works on a beefy host until your service grows into seven services, and then you're priced out.
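To make the scaling concrete, here is the same arithmetic as a small sketch. The function name is ours, and the seven-service figure assumes every service has the same capture size and PR rate as the example above, which real fleets will not.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def replays_per_second(requests_per_service: int, prs_per_day: int, services: int = 1) -> float:
    """Sustained replay rate needed to replay every captured request for every PR."""
    total_per_day = requests_per_service * prs_per_day * services
    return total_per_day / SECONDS_PER_DAY


# Numbers from the example above; the seven-service case assumes identical services.
one_service = replays_per_second(100_000, 40)
seven_services = replays_per_second(100_000, 40, services=7)

print(f"1 service:  {one_service:.0f} replays/s")
print(f"7 services: {seven_services:.0f} replays/s")
```

The cost grows linearly in each factor, so adding services or engineers multiplies the bill with no way to cap it.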
Pick a budget
We swapped "replay everything" for "replay until you're confident." Each PR gets a 90-second wall-clock budget for trace replay. The system picks the most informative subset of requests to fit inside.
The ranking model has three inputs:
- Diff-touched code paths — find call sites in the diff, weight requests that exercise them.
- Historical bug correlation — requests that have surfaced bugs before are more valuable.
- Diversity — explore corners of the request space we don't normally see.
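One way to combine the three inputs is a weighted score per request. This is a minimal sketch, not our production model: the `CapturedRequest` fields, the weights, the cap on historical hits, and the use of cluster rarity as a stand-in for diversity are all assumptions made for illustration.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CapturedRequest:
    request_id: str
    code_paths: frozenset   # call sites the request exercised when it was captured
    past_bug_hits: int      # times this request has surfaced a bug before
    cluster: str            # coarse bucket of the request space (hypothetical)


# Hypothetical weights; in practice these would be tuned against incident data.
W_DIFF, W_HISTORY, W_DIVERSITY = 0.6, 0.3, 0.1
MAX_HITS = 5  # cap so one notorious request cannot dominate the history term


def score(req: CapturedRequest, touched_paths: frozenset, cluster_sizes: Counter) -> float:
    # Diff-touched code paths: fraction of the diff's call sites this request exercises.
    overlap = len(req.code_paths & touched_paths) / len(touched_paths) if touched_paths else 0.0
    # Historical bug correlation: capped and normalised to [0, 1].
    history = min(req.past_bug_hits, MAX_HITS) / MAX_HITS
    # Diversity: requests from sparsely populated clusters score higher.
    rarity = 1.0 / cluster_sizes[req.cluster]
    return W_DIFF * overlap + W_HISTORY * history + W_DIVERSITY * rarity


def rank(requests: list, touched_paths: frozenset) -> list:
    """Return requests ordered from most to least informative for this diff."""
    cluster_sizes = Counter(r.cluster for r in requests)
    return sorted(requests, key=lambda r: score(r, touched_paths, cluster_sizes), reverse=True)


captured = [
    CapturedRequest("a", frozenset({"billing.charge"}), 0, "checkout"),
    CapturedRequest("b", frozenset({"auth.login"}), 3, "login"),
    CapturedRequest("c", frozenset({"auth.login"}), 0, "login"),
    CapturedRequest("d", frozenset({"export.csv"}), 0, "export"),
]
ranked = rank(captured, frozenset({"billing.charge"}))
print([r.request_id for r in ranked])
```

With a diff that touches only `billing.charge`, request `a` ranks first on path overlap, `b` follows on bug history, and the rare `export` request `d` edges out `c`.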
The output is a ranked list. The replay scheduler walks the list, runs each request, and stops when the budget is exhausted or the diff's behavioral signature stabilises (no new behavioral diff in the last N replays).
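The scheduler loop can be sketched as follows. The `replay` callable is hypothetical: we assume it runs one request against the base and the candidate and returns a hashable fingerprint of the behavioral difference, or `None` when the two behave identically. The default of 50 for N is a placeholder, not the value we run with.

```python
import time
from typing import Callable, Hashable, Iterable, NamedTuple, Optional


class ReplayResult(NamedTuple):
    diffs: set          # distinct behavioral diffs observed
    replayed: int       # number of requests actually replayed
    stop_reason: str    # "budget", "stable", or "exhausted"


def run_replay(
    ranked: Iterable,
    replay: Callable[[object], Optional[Hashable]],
    budget_seconds: float = 90.0,
    stable_after: int = 50,
    clock: Callable[[], float] = time.monotonic,
) -> ReplayResult:
    """Walk the ranked list until the budget runs out or no new diff appears."""
    deadline = clock() + budget_seconds
    seen: set = set()
    since_new = 0   # replays since the last previously unseen diff
    replayed = 0

    for request in ranked:
        # The budget is checked before each replay, so an in-flight replay
        # may overrun the deadline slightly.
        if clock() >= deadline:
            return ReplayResult(seen, replayed, "budget")

        diff = replay(request)
        replayed += 1

        if diff is not None and diff not in seen:
            seen.add(diff)
            since_new = 0
        else:
            since_new += 1
            if since_new >= stable_after:
                return ReplayResult(seen, replayed, "stable")

    return ReplayResult(seen, replayed, "exhausted")


# Usage with a fake replay: only request 0 changes behavior, so the run
# stops once five further replays show nothing new.
result = run_replay(range(100), lambda r: "status-changed" if r == 0 else None, stable_after=5)
print(result.replayed, result.stop_reason)
```

Injecting `clock` keeps the loop testable without sleeping; a real scheduler would also need to bound or cancel a single slow replay.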
What the constraint bought us
Three things, none of which we set out to design:
- Predictable PR cycle time. Reviewers know within 90 seconds whether a trace flag is going to fire. The codingassist.bot comment lands consistently after the first CI check.
- A clear knob for cost vs. confidence. Tenants who want more thoroughness pay for a bigger budget. Open-source teams use the free tier with a 30-second budget and accept that obscure paths take longer to flag.
- A real problem to optimise. "Pick the best 90 seconds" is a learning problem. We can A/B-test ranking strategies against historical incident data and have a real metric.
The brute-force version had no knob. Our cost was your traffic; your only knob was your wallet. That ends badly for both sides.
What it cost
It cost us some coverage, mathematically. A bug that only surfaces in the 91st second of replay never fires. We tried to measure this; on six months of historical incident data, the budget caught 96% of what brute-force would have caught.
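A coverage figure like this can be computed by replaying history offline. The sketch below assumes that, for each past incident, we know which captured requests expose the bug under full replay and what ranking the system would have produced for the offending PR. It also approximates the wall-clock budget by a fixed number of replays. The `Incident` type and its fields are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Incident:
    ranked_request_ids: list     # ranking produced for the offending PR
    revealing_request_ids: set   # requests that expose the bug under full replay


def budget_coverage(incidents: list, replays_in_budget: int) -> float:
    """Fraction of brute-force-detectable incidents caught within the budget."""
    # Incidents that no captured request reveals are invisible to brute force
    # too, so they are excluded from the denominator.
    detectable = [i for i in incidents if i.revealing_request_ids]
    if not detectable:
        raise ValueError("no incident is detectable by replay")

    caught = sum(
        1
        for i in detectable
        if i.revealing_request_ids & set(i.ranked_request_ids[:replays_in_budget])
    )
    return caught / len(detectable)


history = [
    Incident(["r1", "r2", "r3"], {"r2"}),   # revealed by the 2nd replay
    Incident(["r4", "r5", "r6"], {"r6"}),   # revealed only by the 3rd replay
    Incident(["r7", "r8", "r9"], set()),    # not detectable by replay at all
]
print(budget_coverage(history, replays_in_budget=2))
print(budget_coverage(history, replays_in_budget=3))
```

The same function doubles as the metric for comparing ranking strategies: hold the budget fixed, swap the ranking, and see which one catches more.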
If you're building anything similar, the lesson is: pick the budget early. Letting the system run without a budget for two years and then trying to retrofit one is a much harder problem than designing for it from the start.