What 200 incident-PRs taught us about review fatigue


In 2024 we collected 200 pull requests across 31 teams that each shipped a real production incident. For every PR we walked back through the reviewer comments, the CI logs, the postmortem, and — where we could — interviewed the reviewer.

Three things came out of that data. The first two were what we expected. The third surprised us.

What we expected

1. Reviewer attention drops fast within a session. On the first PR of a session, a reviewer catches 1.8× more issues than on the third PR of the same session. By the fifth PR the catch rate has plateaued. This is well documented; we just confirmed it.

2. Larger diffs attract less feedback per line. Past 200 lines changed, comment density collapses: reviewers skim, hand-wave, or sign off and move on. Also expected.
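Comment density is easy to measure on your own repositories. Here is a minimal Python sketch, assuming you can export the changed-line count and review-comment count for each PR. The `ReviewedPR` record, its field names, and the sample numbers are hypothetical and only there to exercise the functions; the 200-line threshold is the one mentioned above.

```python
from dataclasses import dataclass


@dataclass
class ReviewedPR:
    """Hypothetical record; field names are illustrative, not from the dataset."""
    lines_changed: int
    review_comments: int


def comments_per_100_lines(pr: ReviewedPR) -> float:
    """Review comments per 100 changed lines; 0.0 for an empty diff."""
    if pr.lines_changed <= 0:
        return 0.0
    return 100.0 * pr.review_comments / pr.lines_changed


def density_by_size(prs: list[ReviewedPR], threshold: int = 200) -> dict[str, float]:
    """Pooled comment density for diffs up to `threshold` lines vs. larger diffs."""
    buckets = {"small": [0, 0], "large": [0, 0]}  # bucket -> [comments, lines]
    for pr in prs:
        key = "small" if pr.lines_changed <= threshold else "large"
        buckets[key][0] += pr.review_comments
        buckets[key][1] += pr.lines_changed
    return {
        key: (100.0 * comments / lines if lines else 0.0)
        for key, (comments, lines) in buckets.items()
    }


# Made-up numbers, purely to show the shape of the calculation.
sample = [
    ReviewedPR(lines_changed=50, review_comments=5),
    ReviewedPR(lines_changed=150, review_comments=9),
    ReviewedPR(lines_changed=400, review_comments=4),
    ReviewedPR(lines_changed=600, review_comments=6),
]
print(density_by_size(sample))
```

Pooling comments and lines per bucket, rather than averaging per-PR densities, keeps a handful of tiny diffs from dominating the result. Either choice is defensible, as long as it is applied consistently.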

What we didn't expect

3. Time of day is the strongest predictor of reviewer miss-rate. PRs reviewed between 11:00 and 14:00 local time have a 2.4× lower incident rate than PRs reviewed between 15:00 and 18:00, even after controlling for reviewer experience, diff size, and team. The afternoon is the bug-shaped hole in your review process.

This was surprising enough that we went back to the dataset to see what was different about post-lunch reviews. The diffs themselves weren't bigger. The reviewers weren't junior. The PR descriptions weren't worse. The thing that was different was comment specificity: morning reviewers asked questions about behavior; afternoon reviewers commented on style.
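If you want to check the time-of-day effect on your own review history, the raw comparison can be sketched in a few lines of Python. This is an illustration, not the analysis pipeline behind the numbers above. It assumes the two windows are half-open, [11:00, 14:00) and [15:00, 18:00), and that you have reviews that did not lead to an incident as well as ones that did, since a rate needs both. The function names and sample data are hypothetical. It also does not control for reviewer experience, diff size, or team; that would need stratification or a regression on top.

```python
from collections import defaultdict
from datetime import datetime
from typing import Optional


def review_window(reviewed_at: datetime) -> Optional[str]:
    """Map a local review time to one of the two windows being compared.

    Assumption: windows are half-open, [11:00, 14:00) and [15:00, 18:00).
    """
    hour = reviewed_at.hour
    if 11 <= hour < 14:
        return "midday"
    if 15 <= hour < 18:
        return "afternoon"
    return None  # outside both windows


def incident_rates(reviews):
    """Incident rate per window from (reviewed_at, caused_incident) pairs."""
    counts = defaultdict(lambda: [0, 0])  # window -> [incidents, total reviews]
    for reviewed_at, caused_incident in reviews:
        window = review_window(reviewed_at)
        if window is None:
            continue
        counts[window][0] += int(caused_incident)
        counts[window][1] += 1
    return {w: incidents / total for w, (incidents, total) in counts.items()}


# Made-up reviews; they are not meant to reproduce the 2.4x figure.
sample_reviews = [
    (datetime(2024, 5, 1, 11, 5), False),
    (datetime(2024, 5, 1, 12, 30), False),
    (datetime(2024, 5, 1, 13, 45), True),
    (datetime(2024, 5, 2, 11, 50), False),
    (datetime(2024, 5, 1, 15, 10), True),
    (datetime(2024, 5, 1, 16, 20), False),
    (datetime(2024, 5, 1, 17, 55), True),
    (datetime(2024, 5, 2, 15, 0), False),
    (datetime(2024, 5, 2, 14, 30), True),  # between windows, ignored
]

rates = incident_rates(sample_reviews)
print(rates, rates["afternoon"] / rates["midday"])
```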

Why this matters for tooling

Most automated review tools optimise for "comprehensiveness": surface every issue, every time. The data suggests this is the wrong target. Reviewers don't have unlimited attention; surfacing 14 issues at 4 p.m. doesn't get you 14 fixed issues. It gets you 2 fixed issues and 12 ignored ones.

The lesson we took into codingassist.bot: a tool that produces three high-confidence, behavior-relevant signals is worth more than one that produces fourteen mixed-quality findings. The comment density isn't the metric — the fix-rate is. Optimising for that means giving the reviewer a small number of things they cannot easily ignore, especially after lunch.
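To make the idea concrete, here is a minimal Python sketch of "few, high-confidence, behavior-relevant" selection, with fix-rate as the metric. It is an illustration under stated assumptions, not how codingassist.bot is implemented. The `Finding` type, the 0.0 to 1.0 confidence scale, the 0.8 cut-off, and the sample findings are all hypothetical; only the limit of three and the 14-surfaced, 2-fixed example come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Hypothetical finding emitted by a review tool."""
    message: str
    confidence: float       # 0.0 to 1.0, assumed scoring scale
    affects_behavior: bool  # True for logic/behavior findings, False for style


def select_signals(findings, limit=3, min_confidence=0.8):
    """Keep at most `limit` behavior-relevant findings, highest confidence first."""
    candidates = [
        f for f in findings
        if f.affects_behavior and f.confidence >= min_confidence
    ]
    return sorted(candidates, key=lambda f: f.confidence, reverse=True)[:limit]


def fix_rate(surfaced: int, fixed: int) -> float:
    """Share of surfaced findings that were actually fixed."""
    if surfaced <= 0:
        return 0.0
    return fixed / surfaced


# Made-up findings for illustration.
findings = [
    Finding("Null check removed before dereference", 0.95, True),
    Finding("Retry loop has no upper bound", 0.90, True),
    Finding("Inconsistent naming", 0.99, False),
    Finding("Possible off-by-one in pagination", 0.85, True),
    Finding("Timeout may be too short", 0.60, True),
    Finding("Cache key ignores tenant id", 0.88, True),
]

for finding in select_signals(findings):
    print(f"{finding.confidence:.2f}  {finding.message}")

# The 4 p.m. example from the text: 14 surfaced, 2 fixed.
print(round(fix_rate(surfaced=14, fixed=2), 2))
```

The style finding is dropped even though it has the highest confidence. That is the point of filtering on behavior relevance before ranking, rather than ranking on confidence alone.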

Methodology, briefly

200 PRs, 31 teams, 9 industries, 14 months of postmortem data
Every PR had a published incident retro within 30 days of merge
Reviewer interviews ran 25–40 minutes; we recorded them but did not transcribe them
The full dataset (anonymised) is available to academic partners on request

tl;dr

We collected 200 PRs that shipped a real incident and walked back through every reviewer comment. Three things came out of it, including how predictable the bugs that get through are once you control for time of day.

Yuki Tanaka

Research · codingassist.bot · @yukitr

March 4, 2026 · 11 min read · 2.4k reads · Intermediate
