post · May 18, 2026 · 3 min read · Review volume is the wrong bottleneck

Review volume is the wrong bottleneck

PatchRelay can keep agents busy, and that quickly makes manual review feel impossible. The review-quill repair loop absorbs much of the code-review work. The harder question — whether the agents are still steering the product in the right direction — is still open.

Once you've got a few agent runs in flight, code-review volume looks like the next bottleneck to optimize. I don't think it is.

I don't want to become a faster reviewer. I want to stop being the person every generated PR has to wait for. If the system needs me reading every diff line by line, the agents didn't remove the work — they shoved it sideways into a new queue with my name on it.

The split I keep coming back to is review versus validation. Review is "does the diff hold up": tests, invariants, edge cases, the docs that quietly went stale. Validation is "is the product still going in the right direction." A PR can clear the first bar and fail the second without anyone noticing, because the code is fine and the product has drifted half a step sideways.

Automating the first one is mostly tractable. review-quill reviews the PR; PatchRelay (or plain Codex/Claude) repairs what it flags. What lands isn't perfect software, but it's been through a few adversarial passes before I see it. Since early April, my local review-quill database shows 3,913 review attempts across 1,309 PRs: 1,579 approved and 2,131 sent back for changes. PatchRelay has logged 2,710 runs across 733 issues; review_fix is the biggest non-implementation category, with 987 runs. Most of the time the system argues with itself, and I don't have to join the argument.
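
To put those counts in proportion, here is a small back-of-the-envelope sketch. The numbers are the ones quoted above; the variable names are mine, not fields from the review-quill or PatchRelay databases, and the leftover bucket is just whatever the two named outcomes don't cover (the figures above don't say what those attempts were).

```python
# Counts quoted in the post. Variable names are illustrative, not real schema fields.
review_attempts = 3913
prs_reviewed = 1309
approved = 1579
changes_requested = 2131

patchrelay_runs = 2710
issues = 733
review_fix_runs = 987

# Average number of review passes each PR went through.
attempts_per_pr = review_attempts / prs_reviewed

# Share of review attempts ending in each named outcome.
approved_share = approved / review_attempts
changes_share = changes_requested / review_attempts

# Attempts that were neither approved nor sent back (meaning not stated in the post).
other_attempts = review_attempts - approved - changes_requested

# Average runs per issue, and the share of runs spent fixing review findings.
runs_per_issue = patchrelay_runs / issues
review_fix_share = review_fix_runs / patchrelay_runs

print(f"review attempts per PR:   {attempts_per_pr:.2f}")
print(f"approved share:           {approved_share:.1%}")
print(f"changes-requested share:  {changes_share:.1%}")
print(f"other attempts:           {other_attempts}")
print(f"runs per issue:           {runs_per_issue:.2f}")
print(f"review_fix share of runs: {review_fix_share:.1%}")
```

Read that way, a PR goes through roughly three review attempts on average, and a bit over a third of all PatchRelay runs are the system repairing what its own reviewer flagged.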

Direction is the harder problem. It's easy to run many agents in parallel, and easy for them to confidently steer the product somewhere subtly wrong. I can't reliably catch that by reading code, because the code looks plausible and the drift only shows up in the product.

So the question I want to be working on is shifting from "how do I review all this code" to "how do I tell whether the product still looks, feels, and behaves right." I don't expect one perfect eval to solve it. I expect layers: automated checks, screenshots, flow tests, human taste, user feedback, and enough captured intent that the agents are optimizing for the actual product instead of producing plausible diffs. With usertold.ai still in beta, I'm not there yet.

Better PR-review tooling would still help in the meantime. GitHub is a rough surface for large diffs — laggy, file-by-file, eager to hide parts of the change. At agent scale, what I need is help seeing what matters, not another way to page through what changed.