A senior engineer at a payments company shipped a one-line refactor last December. The diff passed three reviewers, two CI checks, and a battery of static analysis tools. It also caused an 18-minute incident that re-charged 14,000 customers. The fix, when it landed, was three characters long.
None of the tools in their stack were broken. Each one did exactly what it said on the tin. The bug just lived in a place none of them was looking — the kind of place where bugs increasingly live. This essay is about what that place is, why we keep building tools that don't reach it, and what it takes to actually catch the bug before production does.
Three tools, three blind spots
Walk into any well-instrumented codebase in 2025 and you'll find roughly the same three layers of automated review: a syntax-level linter, a semantic analyzer, and an LLM reviewer wired into pull requests. They feel comprehensive. They're not.
Syntax-level checks
The cheapest, fastest, and most-deployed layer. Tools like eslint, ruff, and gofmt operate on the AST. They catch unused imports and misplaced semicolons, and they enforce ordering conventions. They do not understand what your code does, only how it's spelled.
async function chargeCard(order: Order) {
  const result = await stripe.charge({
    amount: order.total,
    idempotency_key: order.id,
  });
  return result.id;
}
A linter's eye sweeps over this and sees nothing. Every token is in the right place. The function compiles, runs, and even returns the expected shape. The bug is invisible at this layer because linters don't have a model of what an idempotency key is for.
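To make concrete what the linter has no model of: an idempotency key lets a payment processor recognize a repeated request and return the original result instead of charging again. The following is a minimal in-memory sketch. `FakeProcessor` and its `charge` method are hypothetical stand-ins invented for illustration, not Stripe's actual API.

```typescript
// Hypothetical in-memory payment processor, for illustration only.
// Names and shapes are assumptions; this is not the Stripe API.
interface ChargeRequest {
  amount: number;
  idempotencyKey: string;
}

interface ChargeResult {
  id: string;
  amount: number;
}

class FakeProcessor {
  private seen = new Map<string, ChargeResult>();
  private nextId = 1;
  public chargesMade = 0;

  charge(req: ChargeRequest): ChargeResult {
    // A repeated key returns the stored result without charging again.
    const prior = this.seen.get(req.idempotencyKey);
    if (prior !== undefined) return prior;

    this.chargesMade += 1;
    const result: ChargeResult = { id: `ch_${this.nextId++}`, amount: req.amount };
    this.seen.set(req.idempotencyKey, result);
    return result;
  }
}

const processor = new FakeProcessor();
const first = processor.charge({ amount: 500, idempotencyKey: "order-1" });
// Same key on the retry: deduplicated, no second charge.
const retry = processor.charge({ amount: 500, idempotencyKey: "order-1" });
// Different key for the same purchase: charged again.
const other = processor.charge({ amount: 500, idempotencyKey: "order-1-attempt-2" });

console.log(first.id === retry.id, processor.chargesMade); // prints: true 2
```

All three calls are spelled the same way. Whether the key stays stable across retries is a property of runtime behavior, which is why a syntax-level tool has nothing to flag.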
Semantic linters
The next layer up (tools like semgrep, codeql, and the type-aware passes inside tsc --strict) does understand a little semantics. These tools can spot SQL injection, unhandled promise rejections, and taint flowing from an HTTP body into a shell command. This is where most "shift left" tools live.
And they catch a lot. CodeQL famously found a class of buffer overflows nobody had spotted in two decades of OpenSSL. Semgrep rules eliminated entire categories of secret leakage at Slack. The problem isn't that semantic linters are bad. The problem is that their world ends at the type system. Anything that depends on runtime state — a config value, a feature flag, the contents of a cache, the order in which two services reply — is past the horizon.
A type checker can prove the absence of a class of bugs. It cannot prove the presence of correct behavior. — Pierce, Types and Programming Languages, 2002
LLM reviewers
The new entrant. Drop a PR into Cursor / GitHub Copilot Review / CodeRabbit and you'll get back paragraphs of plausible-sounding feedback. Some of it is genuinely useful. A lot of it is pattern-matched off training data — the model has seen ten thousand useEffect hooks and confidently tells you about the eleventh.
The deeper problem isn't accuracy, it's steerability. An LLM reviewer doesn't know that your team treats idempotency-key collisions as a P1 incident, that your payment service has been migrated to gRPC twice and the third migration is happening this quarter, that the change you just made touches code your CTO wrote eighteen months ago and considers sacred. It guesses at norms instead of executing against them.
The composition problem
Here's the part nobody talks about. Each of these layers produces output in a different shape:
| Layer | Output | Confidence | Actionable? |
|---|---|---|---|
| Syntax | Discrete violations | 100% | Always |
| Semantic | Path-sensitive findings | ~85% | Usually |
| LLM | Prose | Unknown | Sometimes |
| Trace-aware | Counterfactuals | Empirical | Always |
The first three can co-exist in the same PR comment thread. They cannot collaborate. A linter rule firing on line 4 doesn't suppress the LLM's confident-but-wrong claim about line 12. The semantic analyzer's warning about a possible null deref doesn't get upgraded by the LLM noticing that a teammate hit the same bug in October. There's no joint inference.
Why we added an execution trace
Halfway through 2024 we ran an experiment: take 200 PRs that each shipped a real production incident, and feed each one through every tool we could find. The headline number (no tool caught more than 31%) wasn't surprising. The interesting result was which incidents each tool caught.
The overlap was tiny. Each tool was finding a distinct set of bugs. Stacking them got us to ~58% combined coverage — better than any individual tool, still far short of where teams assume they are.
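The arithmetic behind "stacking" is a set union: combined coverage is the share of incidents caught by at least one tool, so it approaches the sum of the individual rates only when the overlap is small. Here is a toy sketch. The incident IDs and tool results are invented for illustration and are not the 200-PR dataset.

```typescript
// Toy data, invented for illustration; not the experiment's dataset.
const totalIncidents = 10;
const caughtBy: Record<string, number[]> = {
  linter: [1, 2],
  semantic: [3, 4, 5],
  llm: [5, 6], // overlaps with the semantic analyzer on incident 5
};

// Combined coverage is the size of the union, not the sum of the parts.
function combinedCoverage(caught: Record<string, number[]>, total: number): number {
  const union = new Set<number>();
  for (const ids of Object.values(caught)) {
    for (const id of ids) union.add(id);
  }
  return union.size / total;
}

console.log(combinedCoverage(caughtBy, totalIncidents)); // prints: 0.6
```

In this toy case the individual rates sum to 0.7, but one shared incident brings the combined figure down to 0.6, and the remaining 0.4 is caught by nothing.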
The 42% that escaped had one thing in common: they were state-dependent. A retry storm hits this code path, the idempotency key collides, the database returns a row that didn't exist when we ran the type checker. None of the static layers could see this because none of them were running the code.
So we ran the code. Or more precisely — we instrumented the candidate diff against a frozen production-like trace, replayed ten thousand real requests against both versions, and diffed the behavior, not the source.
That's the bug from the opening anecdote. No linter caught it. No type system caught it. Two LLM reviewers said "looks good." The trace caught it on the first replay because the trace knows what retries look like.
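The replay step described above (run recorded requests against both versions and diff the behavior, not the source) can be sketched roughly as follows. Every name here is hypothetical: `RecordedRequest`, `Handler`, and the two toy handlers are assumptions made for illustration, and the handlers do not reproduce the actual incident.

```typescript
// Hypothetical shape of one request captured in the frozen trace.
interface RecordedRequest {
  id: string;
  orderId: string;
  attempt: number; // 1 for the first try, 2+ for retries
}

// A version of the code under test, reduced to its observable outcome.
type Handler = (req: RecordedRequest) => string;

interface BehaviorDiff {
  requestId: string;
  baseline: string;
  candidate: string;
}

// Replays every recorded request against both versions and keeps the
// requests whose observable outcome differs.
function replayAndDiff(
  trace: RecordedRequest[],
  baseline: Handler,
  candidate: Handler,
): BehaviorDiff[] {
  const diffs: BehaviorDiff[] = [];
  for (const req of trace) {
    const b = baseline(req);
    const c = candidate(req);
    if (b !== c) diffs.push({ requestId: req.id, baseline: b, candidate: c });
  }
  return diffs;
}

// Toy handlers (assumed behavior): the baseline deduplicates retries,
// the candidate charges on every attempt.
const baselineHandler: Handler = (req) => (req.attempt > 1 ? "deduplicated" : "charged");
const candidateHandler: Handler = () => "charged";

const trace: RecordedRequest[] = [
  { id: "r1", orderId: "o1", attempt: 1 },
  { id: "r2", orderId: "o1", attempt: 2 }, // a retry, which real traffic contains
  { id: "r3", orderId: "o2", attempt: 1 },
];

const diffs = replayAndDiff(trace, baselineHandler, candidateHandler);
console.log(diffs.map((d) => d.requestId)); // prints: [ 'r2' ]
```

The two versions agree on every first attempt and diverge only on the retry, which is the kind of request a static layer never sees and a recorded trace contains by default.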
Modeling team-specific risk
Catching the bug is half the job. The other half is knowing whether to block the PR, comment, or stay quiet — and that depends entirely on context the tool can't infer from the code alone.
We let teams describe their tolerance as a small declarative policy. It's checked into the repo, reviewed like any other config:
  rules:
-   - name: "idempotency"
-     severity: warn
-     scope:
+   - name: "idempotency-on-retry-path"
+     severity: block
+     scope:
+       condition: trace.has_retry && diff.calls_external_charge
+     message: |
+       External charge invoked on retry path. We had a P1 here in Oct.
+       See: incident-2024-10-14.
A policy is just an (input → severity, message) function. The inputs are everything we know about the diff — AST, types, semgrep findings, LLM observations, and the trace replay. The output drives a comment, a label, a block. It's legible. It's reviewable. It's yours.
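Read as code, that function shape might look like the sketch below. The types are assumptions made for illustration: `PolicyInput` models only the two fields the retry-path rule's condition mentions, and `evaluate` is a hypothetical helper, not the product's actual engine.

```typescript
type Severity = "info" | "warn" | "block";

// Hypothetical subset of the inputs; field names mirror the rule's
// `trace.has_retry && diff.calls_external_charge` condition.
interface PolicyInput {
  trace: { hasRetry: boolean };
  diff: { callsExternalCharge: boolean };
}

interface Verdict {
  severity: Severity;
  message: string;
}

// A policy maps the inputs to a verdict, or null when it does not fire.
type Policy = (input: PolicyInput) => Verdict | null;

const idempotencyOnRetryPath: Policy = (input) =>
  input.trace.hasRetry && input.diff.callsExternalCharge
    ? { severity: "block", message: "External charge invoked on retry path." }
    : null;

// Runs every policy and keeps the verdicts of the ones that fired.
function evaluate(policies: Policy[], input: PolicyInput): Verdict[] {
  return policies
    .map((policy) => policy(input))
    .filter((verdict): verdict is Verdict => verdict !== null);
}

const verdicts = evaluate([idempotencyOnRetryPath], {
  trace: { hasRetry: true },
  diff: { callsExternalCharge: true },
});
console.log(verdicts[0]?.severity); // prints: block
```

Keeping each policy a pure function of its inputs is what makes it reviewable: the same diff and the same trace always produce the same verdict.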
A walk-through: real PR, real bug
Here's a sandboxed version of a PR that came in last week. The diff looks innocuous.
The interesting move isn't any single layer. It's the composition — the trace flips the LLM's hedge ("maybe add jitter") into a hard signal ("this rethrow drops retries on the failure mode that actually happens"), and the policy turns that signal into a blocking comment with a link to the October incident.
Six months of production data
We rolled this out to 31 design partner teams between September 2024 and February 2025. The numbers, edited for confidentiality:
3 months on Coding Assist: caught 4 things our existing stack missed (incl. one that would've been a P0). Caught 0 things the existing stack already caught. Net new signal, no overlap. The composition really matters.
Across the cohort: 2.1× as many incident-causing bugs caught as the baseline (linter + semgrep + LLM review), at 0.4× the false-positive rate of the LLM-only setup. The full report, with methodology, dataset, and every caveat, is in the supplementary write-up.
Takeaways
- Static layers don't compose. Stacking three tools that don't talk to each other gets you a stack of three blind spots.
- The bugs that survive are state-dependent. If you can't replay against real traffic, you can't see them.
- Risk is team-specific. A blocking rule for one team is noise for another. Tools that can't accept policy can't be trusted to block.
- Postmortems are policy material. The single best signal of where to invest review effort is "what broke last quarter."
None of this is a silver bullet. We still miss bugs. Some of them we catch only after the fact, by replaying production traffic against a fix and watching the policy fire post-hoc. But the shape of what gets through has changed — it's no longer the obvious things, the things a grown-up review process should have surfaced.
If you're curious how this looks in your own codebase, we run pilots with teams of 5–50. The setup takes a day; the trace fills out over a week of normal traffic; the first useful policies usually emerge after the first incident retro. Reach out at elena@codingassist.bot if you'd like a deeper look.