Engineering · Deep dive

Why most code-review tools miss the bug they were hired to catch.

tl;dr

Static analyzers see syntax. Linters see style. LLM reviewers see vibes. The bugs that ship are the ones that need *all three* — plus an execution trace, plus a model of what your team considers acceptable risk. We've spent 18 months building exactly that, and along the way learned why the obvious approaches don't compose.

Elena Markova

Co-founder · codingassist.bot · @elenamrk

March 12, 2026 · 14 min read · 8.2k reads · Intermediate

A senior engineer at a payments company shipped a one-line refactor last December. The diff passed three reviewers, two CI checks, and a battery of static analysis tools. It also caused an 18-minute incident that re-charged 14,000 customers. The fix, when it landed, was three characters long.

None of the tools in their stack were broken. Each one did exactly what it said on the tin. The bug just lived in a place none of them was looking — the kind of place where bugs increasingly live. This essay is about what that place is, why we keep building tools that don't reach it, and what it takes to actually catch the bug before production does.

Three tools, three blind spots

Walk into any well-instrumented codebase in 2025 and you'll find roughly the same three layers of automated review: a syntax-level linter, a semantic analyzer, and an LLM reviewer wired into pull requests. They feel comprehensive. They're not.

Syntax-level checks

The cheapest, fastest, and most-deployed layer. Tools like eslint, ruff, and gofmt operate on the AST. They catch unused imports, misplaced semicolons, and ordering violations. They do not understand what your code does, only how it's spelled.

checkout.ts · TypeScript

async function chargeCard(order: Order) {
  const result = await stripe.charge({
    amount: order.total,
    idempotency_key: order.id,
  });
  return result.id;
}

A linter's eye sweeps over this and sees nothing. Every token is in the right place. The function compiles, runs, and even returns the expected shape. The bug is invisible at this layer because linters don't have a model of what an idempotency key is for.
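To make concrete what an idempotency key is for, here is a minimal sketch. It assumes a hypothetical in-memory `FakeGateway` that deduplicates on the key; it is not the Stripe API, and it does not reproduce the bug from the incident. It only shows that two calls which look equally well-formed to a linter can differ in how many charges they create on a retry.

```typescript
// Hypothetical in-memory stand-in for a payment gateway (not the Stripe API).
type ChargeRequest = { amount: number; idempotencyKey: string };
type ChargeResult = { id: string; amount: number };

class FakeGateway {
  private seen = new Map<string, ChargeResult>();
  private nextId = 1;
  chargesCreated = 0;

  charge(req: ChargeRequest): ChargeResult {
    // A repeated key returns the original result instead of charging again.
    const prior = this.seen.get(req.idempotencyKey);
    if (prior !== undefined) return prior;

    const result: ChargeResult = { id: `ch_${this.nextId++}`, amount: req.amount };
    this.seen.set(req.idempotencyKey, result);
    this.chargesCreated += 1;
    return result;
  }
}

// Simulates a client that sends the same logical charge twice,
// for example because the first response was lost to a timeout.
function chargeWithRetry(
  gateway: FakeGateway,
  amount: number,
  keyForAttempt: (attempt: number) => string,
): ChargeResult {
  gateway.charge({ amount, idempotencyKey: keyForAttempt(1) }); // first attempt
  return gateway.charge({ amount, idempotencyKey: keyForAttempt(2) }); // retry
}

// Same key on both attempts: the retry is deduplicated.
const stable = new FakeGateway();
chargeWithRetry(stable, 4200, () => "order-123");

// A different key per attempt: the gateway sees two unrelated charges.
const perAttempt = new FakeGateway();
chargeWithRetry(perAttempt, 4200, (attempt) => `order-123-${attempt}`);
```

After running this, `stable.chargesCreated` is 1 and `perAttempt.chargesCreated` is 2. Both call sites are syntactically clean, which is why this layer has nothing to say about either.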

Semantic linters

The next layer up (tools like semgrep, codeql, and the type-aware passes inside tsc --strict) does understand some semantics. These tools can spot SQL injection, unhandled promise rejections, and taint flowing from an HTTP body into a shell command. This is where most "shift left" tools live.

And they catch a lot. CodeQL famously found a class of buffer overflows nobody had spotted in two decades of OpenSSL. Semgrep rules eliminated entire categories of secret leakage at Slack. The problem isn't that semantic linters are bad. The problem is that their world ends at the type system. Anything that depends on runtime state — a config value, a feature flag, the contents of a cache, the order in which two services reply — is past the horizon.

A type checker can prove the absence of a class of bugs. It cannot prove the presence of correct behavior. — Pierce, Types and Programming Languages, 2002
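A small sketch of what "past the horizon" can look like. The config shape and the flag name `retryEnabled` are invented for illustration. The code type-checks for every possible config value, yet the number of external calls it makes depends entirely on what the config holds at runtime.

```typescript
// Hypothetical config shape; field names are assumptions for this example.
type RuntimeConfig = { retryEnabled: boolean; maxAttempts: number };

// Well-typed for any config. The result still depends on runtime state.
function attemptsFor(config: RuntimeConfig): number {
  return config.retryEnabled ? config.maxAttempts : 1;
}

// Calls `send` until it succeeds or the attempt budget runs out,
// and returns how many external calls were made.
function callExternal(config: RuntimeConfig, send: () => boolean): number {
  let calls = 0;
  for (let i = 0; i < attemptsFor(config); i++) {
    calls += 1;
    if (send()) break; // stop on first success
  }
  return calls;
}

const alwaysFails = () => false;
const callsWithFlagOff = callExternal({ retryEnabled: false, maxAttempts: 3 }, alwaysFails);
const callsWithFlagOn = callExternal({ retryEnabled: true, maxAttempts: 3 }, alwaysFails);
```

Here `callsWithFlagOff` is 1 and `callsWithFlagOn` is 3. A static pass sees one function with one type signature; which of the two behaviors ships depends on a value it never observes.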

LLM reviewers

The new entrant. Drop a PR into Cursor / GitHub Copilot Review / CodeRabbit and you'll get back paragraphs of plausible-sounding feedback. Some of it is genuinely useful. A lot of it is pattern-matched off training data — the model has seen ten thousand useEffect hooks and confidently tells you about the eleventh.

The deeper problem isn't accuracy, it's steerability. An LLM reviewer doesn't know that your team treats idempotency-key collisions as a P1 incident, that your payment service has been migrated to gRPC twice and the third migration is happening this quarter, that the change you just made touches code your CTO wrote eighteen months ago and considers sacred. It guesses at norms instead of executing against them.

The composition problem

Here's the part nobody talks about. Each of these layers produces output in a different shape:

Layer	Output	Confidence	Actionable?
Syntax	Discrete violations	100%	Always
Semantic	Path-sensitive findings	~85%	Usually
LLM	Prose	Unknown	Sometimes
Trace-aware	Counterfactuals	Empirical	Always

Three of these can co-exist in the same PR comment thread. They cannot collaborate. A linter rule firing on line 4 doesn't suppress the LLM's confident-but-wrong claim about line 12. The semantic analyzer's warning about a possible null deref doesn't get upgraded by the LLM noticing that a teammate hit the same bug in October. There's no joint inference.
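One way to picture what joint inference would need is a shared shape for findings. The sketch below is an assumption about how that could look, not a description of any existing tool: every layer emits a `Finding` (a hypothetical type), and `combine` groups findings by location so that agreement between layers becomes visible.

```typescript
type Layer = "syntax" | "semantic" | "llm" | "trace";

// Hypothetical common shape that every layer would emit.
interface Finding {
  layer: Layer;
  file: string;
  line: number;
  claim: string;
}

interface JointFinding {
  file: string;
  line: number;
  layers: Layer[];
  claims: string[];
  corroborated: boolean; // true when two or more distinct layers flag the same location
}

// Groups findings by file and line, preserving first-seen order.
function combine(findings: Finding[]): JointFinding[] {
  const byLocation = new Map<string, JointFinding>();
  for (const f of findings) {
    const key = `${f.file}:${f.line}`;
    let joint = byLocation.get(key);
    if (joint === undefined) {
      joint = { file: f.file, line: f.line, layers: [], claims: [], corroborated: false };
      byLocation.set(key, joint);
    }
    if (joint.layers.indexOf(f.layer) === -1) joint.layers.push(f.layer);
    joint.claims.push(f.claim);
    joint.corroborated = joint.layers.length >= 2;
  }
  return Array.from(byLocation.values());
}

const merged = combine([
  { layer: "semantic", file: "checkout.ts", line: 12, claim: "possible null deref" },
  { layer: "llm", file: "checkout.ts", line: 12, claim: "a similar null bug was reported before" },
  { layer: "llm", file: "checkout.ts", line: 30, claim: "consider renaming this variable" },
]);
```

In this toy input the line 12 entry comes back corroborated by two layers, while the line 30 entry stays a lone LLM remark. Grouping by location is the simplest possible rule; it is here only to show that no merging is possible until the outputs share a shape.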

Why we added an execution trace

Halfway through 2024 we ran an experiment: take 200 PRs that shipped a real production incident, and feed each one through every tool we could find. The headline number — none of them caught more than 31% — wasn't surprising. The interesting result was which 31% they each caught:

[Figure: Coverage Venn, 200 incident PRs (N = 200, Q3 2024)]

The overlap was tiny. Each tool was finding a distinct set of bugs. Stacking them got us to ~58% combined coverage — better than any individual tool, still far short of where teams assume they are.
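The arithmetic behind "combined coverage" is a set union. The sketch below uses made-up PR IDs and a toy total of 10. It is not the experiment's data and only illustrates how the per-tool numbers relate to the stacked one.

```typescript
// Toy data: made-up IDs of incident PRs flagged by each layer.
// Illustrative only; these are not the numbers from the experiment.
const caught = [
  { tool: "linter", prs: [1, 2, 3] },
  { tool: "semantic", prs: [3, 4, 5] },
  { tool: "llm", prs: [5, 6] },
];
const totalIncidentPrs = 10;

// Per-tool coverage: share of incident PRs each tool flagged on its own.
const perTool = caught.map(({ tool, prs }) => ({
  tool,
  coverage: new Set(prs).size / totalIncidentPrs,
}));

// Combined coverage is the size of the union, so overlapping catches count once.
const union = new Set<number>();
for (const { prs } of caught) {
  for (const id of prs) union.add(id);
}
const combined = union.size / totalIncidentPrs; // 6 distinct PRs out of 10
const escaped = 1 - combined; // the share no layer flagged
```

With these toy sets the individual coverages sum to 0.8, but the union covers only 6 of 10 PRs because PRs 3 and 5 are each caught twice. The remainder is the part that no amount of stacking reaches.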

The 42% that escaped had one thing in common: they were state-dependent. A retry storm hits this code path, the idempotency key collides, the database returns a row that didn't exist when we ran the type checker. None of the static layers could see this because none of them were running the code.

So we ran the code. Or more precisely — we instrumented the candidate diff against a frozen production-like trace, replayed ten thousand real requests against both versions, and diffed the behavior, not the source.

~/payments-svc · codingassist replay

codingassist replay --diff HEAD~1 --trace prod-2024-12-08.bin
▸ Loading 12,847 traced requests… [████████████] 100%
▸ Replaying against base [████████████] 100%
▸ Replaying against HEAD [████████████] 100%
▸ Diffing observable behavior…
⚠ 14 requests now produce different external calls
⚠ stripe.charge invoked twice on retry path (was: once)
✗ idempotency key collision rate: 0.11% → 100% on retry
✓ behavior diff written to .codingassist/diff-87fc.json

That's the bug from the opening anecdote. No linter caught it. No type system caught it. Two LLM reviewers said "looks good." The trace caught it on the first replay because the trace knows what retries look like.
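The replay-and-diff idea can be sketched in a few lines. Everything below is hypothetical: the `TracedRequest` and `ExternalCall` types, the two handlers, and the regression in `headHandler` are invented for illustration and say nothing about how codingassist is implemented or what the actual incident diff contained. The point is the technique: run both versions over the same recorded requests, capture their external calls, and compare those rather than the source.

```typescript
// All names here are assumptions made for this sketch.
type TracedRequest = { id: string; orderId: string; isRetry: boolean };
type ExternalCall = { target: string; idempotencyKey: string };
type Handler = (req: TracedRequest, record: (call: ExternalCall) => void) => void;

// Runs one handler version over the whole trace, recording external calls per request.
function replay(trace: TracedRequest[], handler: Handler): Map<string, ExternalCall[]> {
  const calls = new Map<string, ExternalCall[]>();
  for (const req of trace) {
    const recorded: ExternalCall[] = [];
    handler(req, (call) => recorded.push(call));
    calls.set(req.id, recorded);
  }
  return calls;
}

// Returns the IDs of requests whose recorded external calls differ between versions.
// JSON comparison is enough here because both handlers build calls with the same key order.
function diffBehavior(
  base: Map<string, ExternalCall[]>,
  head: Map<string, ExternalCall[]>,
): string[] {
  const changed: string[] = [];
  base.forEach((baseCalls, id) => {
    const headCalls = head.get(id) ?? [];
    if (JSON.stringify(baseCalls) !== JSON.stringify(headCalls)) changed.push(id);
  });
  return changed;
}

const baseHandler: Handler = (req, record) => {
  record({ target: "stripe.charge", idempotencyKey: req.orderId });
};

// Invented regression: the retry path derives a different key.
const headHandler: Handler = (req, record) => {
  const key = req.isRetry ? `${req.orderId}-retry` : req.orderId;
  record({ target: "stripe.charge", idempotencyKey: key });
};

const sampleTrace: TracedRequest[] = [
  { id: "r1", orderId: "o1", isRetry: false },
  { id: "r2", orderId: "o1", isRetry: true },
  { id: "r3", orderId: "o2", isRetry: false },
];

const changed = diffBehavior(replay(sampleTrace, baseHandler), replay(sampleTrace, headHandler));
```

On this toy trace `changed` contains only `"r2"`, the retried request. If the trace had no retries in it, the two versions would be indistinguishable, which is why the recorded traffic matters as much as the replay harness.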

Modeling team-specific risk

Catching the bug is half the job. The other half is knowing whether to block the PR, comment, or stay quiet — and that depends entirely on context the tool can't infer from the code alone.

We let teams describe their tolerance as a small declarative policy. It's checked into the repo, reviewed like any other config:

.codingassist/policy.yaml

 rules:
-  - name: "idempotency"
-    severity: warn
-    scope: 
+  - name: "idempotency-on-retry-path"
+    severity: block
+    scope: 
+    condition: trace.has_retry && diff.calls_external_charge
+    message: |
+      External charge invoked on retry path. We had a P1 here in Oct.
+      See: incident-2024-10-14.

A policy is just an (input → severity, message) function. The inputs are everything we know about the diff — AST, types, semgrep findings, LLM observations, and the trace replay. The output drives a comment, a label, a block. It's legible. It's reviewable. It's yours.
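Read as code, the "(input → severity, message) function" could look roughly like the sketch below. The types are assumptions: `PolicyInput` models only the two fields that appear in the rule's condition, although the real input described above is much wider, and `applyPolicy` is a hypothetical evaluator, not the product's engine.

```typescript
// Hypothetical shapes; only the two fields used by the example condition are modeled.
interface PolicyInput {
  trace: { hasRetry: boolean };
  diff: { callsExternalCharge: boolean };
}

interface Verdict {
  severity: "warn" | "block";
  message: string;
}

// A rule maps the input to a verdict, or to null when it has nothing to say.
type Rule = { name: string; evaluate: (input: PolicyInput) => Verdict | null };

const idempotencyOnRetryPath: Rule = {
  name: "idempotency-on-retry-path",
  evaluate: (input) =>
    input.trace.hasRetry && input.diff.callsExternalCharge
      ? { severity: "block", message: "External charge invoked on retry path." }
      : null,
};

// Evaluates every rule and keeps the verdicts that fired, tagged with the rule name.
function applyPolicy(rules: Rule[], input: PolicyInput): Array<Verdict & { rule: string }> {
  const verdicts: Array<Verdict & { rule: string }> = [];
  for (const rule of rules) {
    const verdict = rule.evaluate(input);
    if (verdict !== null) verdicts.push({ rule: rule.name, ...verdict });
  }
  return verdicts;
}

const fired = applyPolicy([idempotencyOnRetryPath], {
  trace: { hasRetry: true },
  diff: { callsExternalCharge: true },
});
const quiet = applyPolicy([idempotencyOnRetryPath], {
  trace: { hasRetry: false },
  diff: { callsExternalCharge: true },
});
```

`fired` holds one blocking verdict and `quiet` is empty. Keeping rules as pure functions of the input is what makes them reviewable: the same diff and trace always produce the same verdict.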

A walk-through: real PR, real bug

Here's a sandboxed version of a PR that came in last week. The diff looks innocuous, but each layer reads it differently.

[Interactive demo: per-layer views of the PR diff]

The interesting move isn't any single layer. It's the composition — the trace flips the LLM's hedge ("maybe add jitter") into a hard signal ("this rethrow drops retries on the failure mode that actually happens"), and the policy turns that signal into a blocking comment with a link to the October incident.

Six months of production data

We rolled this out to 31 design partner teams between September 2024 and February 2025. The numbers, edited for confidentiality:

From X

Marcus Chen @mchen_eng

Staff Eng, payments infra

3 months on Coding Assist: caught 4 things our existing stack missed (incl. one that would've been a P0). Caught 0 things the existing stack already caught. Net new signal, no overlap. The composition really matters.

Mar 4, 2025

Across the cohort, teams caught 2.1× as many incident-causing bugs as the baseline (linter + semgrep + LLM review), at 0.4× the false-positive rate of the LLM-only setup. The full report (methodology, dataset, every caveat) is in the supplementary write-up.

Takeaways

Static layers don't compose. Stacking three tools that don't talk to each other gets you a stack of three blind spots.
The bugs that survive are state-dependent. If you can't replay against real traffic, you can't see them.
Risk is team-specific. A blocking rule for one team is noise for another. Tools that can't accept policy can't be trusted to block.
Postmortems are policy material. The single best signal of where to invest review effort is "what broke last quarter."

None of this is a silver bullet. We still miss bugs. Some of them we catch only after the fact, by replaying production traffic against a fix and watching the policy fire post-hoc. But the shape of what gets through has changed — it's no longer the obvious things, the things a grown-up review process should have surfaced.

If you're curious how this looks in your own codebase, we run pilots with teams of 5–50. The setup takes a day; the trace fills out over a week of normal traffic; the first useful policies usually emerge after the first incident retro. Reach out at elena@codingassist.bot if you'd like a deeper look.

