AI coding agents are very good at producing a lot of code quickly. They are much less good at guaranteeing that every changed line preserves the assumptions your system quietly depends on.
That is why the review problem has changed. The bottleneck is no longer writing the first draft. It is deciding which machine-generated pull requests are safe to merge, which need human redesign, and which should be sent back with tighter constraints.
The right answer is not to review every AI PR like a handwritten masterpiece, and it is definitely not to rubber-stamp anything that passes tests. The practical answer is to build a review loop that narrows the diff, checks the invariants that matter, and pushes routine validation into automation.
Why AI PRs feel harder to review
Most AI-generated pull requests fail in familiar ways. They make the reviewer reconstruct intent from the diff instead of stating it up front, and the same patterns repeat:
- they touch more files than the task really required
- they make style and structure changes alongside behavior changes
- they preserve surface behavior while quietly weakening edge-case guarantees
- they fix the visible bug while smuggling in unrelated refactors
- they introduce plausible-looking abstractions that nobody asked for
The first rule: force the PR to stay narrow
If an agent can edit twenty files, it often will. Review gets dramatically easier when the task definition limits what the agent is allowed to touch. A good task definition spells out:
- the exact bug or outcome to change
- the preferred files or modules to modify
- constraints on tests, migrations, and public interfaces
- a rule against opportunistic cleanup
- a requirement to explain the risky parts in the PR summary
When the prompt is narrow, the review can be narrow too.
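One way to make those constraints enforceable rather than aspirational is to encode them as data that both the agent prompt and CI can read. A minimal sketch, with illustrative task fields and paths (none of these names come from a real tool):

```python
# Hypothetical task spec: the same allowed-path list can be injected into
# the agent prompt and checked mechanically against the resulting diff.
TASK = {
    "goal": "Fix off-by-one in invoice pagination (page returns 51 items)",
    "allowed_paths": ["app/billing/pagination.py", "tests/billing/"],
    "forbidden": ["schema migrations", "public API changes", "opportunistic cleanup"],
    "must_explain": ["any change to page-size defaults"],
}

def out_of_scope(changed_files, allowed_prefixes):
    """Return the changed files that fall outside the allowed paths."""
    return [
        path for path in changed_files
        if not any(path.startswith(prefix) for prefix in allowed_prefixes)
    ]
```

A diff that touches `app/auth/session.py` under this task spec would be flagged immediately, before any human reads a line of the change.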
Review the invariant, not just the implementation
Human reviewers should spend less time asking whether the code looks smart and more time asking whether key guarantees still hold.
For each AI PR, define the invariants that must remain true. Examples include authentication remaining fail-closed, retries staying idempotent, write paths validating server-side, pagination order remaining stable, and caching never leaking across tenants.
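Invariants are most useful when they are written as tests rather than review comments. A sketch of a fail-closed authorization invariant, using a hypothetical `check_access` stand-in for whatever permission check your system actually has:

```python
# Illustrative stand-in for a real permission check. The invariant under
# test: unknown or missing roles must never grant access (fail closed).
def check_access(user_roles, required_role):
    if user_roles is None:
        return False
    return required_role in user_roles

def test_auth_fails_closed():
    assert check_access(None, "admin") is False      # missing roles: denied
    assert check_access([], "admin") is False        # empty roles: denied
    assert check_access(["viewer"], "admin") is False
    assert check_access(["admin"], "admin") is True  # only explicit grant passes
```

A test like this survives refactors: an agent can rewrite the implementation freely, but it cannot silently flip the default from deny to allow without the suite going red.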
Use a lightweight risk rubric
Not every AI-generated PR deserves the same level of suspicion. A cheap triage rubric helps reviewers move at the right speed.
Low risk
- copy changes
- UI text or spacing
- isolated tests
- small logging improvements
Medium risk
- controller logic
- query changes
- validation paths
- retry behavior
- feature flags
High risk
- auth and permissions
- payments or billing
- schema and migration changes
- concurrency and locking
- caching, queues, and background job semantics
- security-sensitive parsing or deserialization
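The rubric above can be approximated mechanically from the changed paths, so triage happens before a human opens the diff. A rough sketch, with path prefixes that are assumptions about repo layout rather than a standard:

```python
# Illustrative mapping from the risk rubric to path prefixes.
# str.startswith accepts a tuple, so each tier is one membership check.
HIGH_RISK = ("app/auth/", "app/billing/", "db/migrations/")
MEDIUM_RISK = ("app/controllers/", "app/validation/", "app/flags/")

def classify_risk(changed_files):
    """Return the highest risk tier that any changed file falls into."""
    if any(path.startswith(HIGH_RISK) for path in changed_files):
        return "high"
    if any(path.startswith(MEDIUM_RISK) for path in changed_files):
        return "medium"
    return "low"
```

The point is not precision; it is that a copy change and a migration should never arrive in the review queue with the same default level of scrutiny.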
A PR template that works better with agents
Make the agent fill in structured review context so the human reviewer starts with intent, constraints, and reviewer focus instead of guessing.
```markdown
## What changed
- ...

## Why this approach
- ...

## Files touched
- ...

## Invariants checked
- auth remains fail-closed
- API response shape unchanged
- migration compatibility preserved

## Tests run
- ...

## Reviewer focus
- Please verify retry behavior in webhook delivery
```

Diff slicing beats scrolling
Large AI PRs often look harmless until you read them in chunks. Review public API changes, validation and authorization, state changes, tests, and cleanup separately. If the PR mixes all of them, send it back for a narrower retry.
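That chunked reading order can be scripted so the slices are handed to the reviewer instead of rediscovered per PR. A sketch that buckets changed files by path prefix; the bucket names and prefixes are illustrative:

```python
# Hypothetical review buckets, ordered the way they should be read.
BUCKETS = [
    ("api", ("app/api/",)),
    ("auth", ("app/auth/", "app/permissions/")),
    ("state", ("app/models/", "db/")),
    ("tests", ("tests/",)),
]

def slice_diff(changed_files):
    """Group changed files so each concern can be reviewed on its own."""
    slices = {name: [] for name, _ in BUCKETS}
    slices["other"] = []
    for path in changed_files:
        for name, prefixes in BUCKETS:
            if path.startswith(prefixes):
                slices[name].append(path)
                break
        else:  # no bucket matched
            slices["other"].append(path)
    return slices
```

When a single PR populates most buckets at once, that is itself the review finding: the change mixes concerns and should come back narrower.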
Push routine skepticism into CI
Humans should not spend time manually repeating checks a machine can run more reliably. For AI PRs, the CI pipeline should do as much of the boring suspicion as possible.
```yaml
name: ai-pr-guardrails

on:
  pull_request:
    branches: [master]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0 # full history, so the scope check can diff against origin/master
      - uses: actions/setup-node@v4
        with:
          node-version: 22
      - run: npm ci
      - run: npm run lint
      - run: npm run typecheck
      - run: npm test -- --runInBand
      - run: python3 scripts/check-pr-scope.py
```

Add one scope check that agents cannot sweet-talk
AI-generated PRs often violate scope in subtle ways. A tiny script that compares changed files against a restricted-path list catches a lot of nonsense early.
```python
import subprocess
import sys

# Files changed on this branch relative to the merge base with master.
changed = subprocess.check_output(
    ['git', 'diff', '--name-only', 'origin/master...HEAD'],
    text=True,
).splitlines()

# Paths an agent may not touch without explicit human sign-off.
restricted = {
    'app/auth/',
    'infra/terraform/',
    'db/migrations/',
}

violations = [
    changed_path for changed_path in changed
    if any(changed_path.startswith(prefix) for prefix in restricted)
]

if violations:
    print('Restricted paths require explicit human approval:')
    for changed_path in violations:
        print(f'- {changed_path}')
    sys.exit(1)
```

Review tests like contracts, not decoration
Agents are very willing to add tests that merely ratify their implementation. Good review asks whether the tests fail for the old bug, prove the intended contract, and cover the edge case that could silently regress.
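The difference shows up concretely with retry logic. A contract-style test pins down the behavior that matters (bounded attempts, last error propagated) rather than ratifying whatever the implementation happens to do; everything below is an illustrative sketch, not code from a real library:

```python
# Minimal retry helper, assumed for illustration. The contract:
# at most max_attempts calls, and the final failure propagates.
def retry(func, max_attempts=3):
    last_exc = None
    for _ in range(max_attempts):
        try:
            return func()
        except ValueError as exc:
            last_exc = exc
    raise last_exc

def test_retry_contract():
    calls = []

    def always_fails():
        calls.append(1)
        raise ValueError("boom")

    try:
        retry(always_fails, max_attempts=3)
        raise AssertionError("expected the last error to propagate")
    except ValueError:
        pass
    # The contract, not the implementation: exactly bounded attempts.
    assert len(calls) == 3
```

An agent that "fixes" the retry loop by making it unbounded, or by swallowing the final error, fails this test; a test that merely called `retry` once and checked the return value would pass either way.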
Watch for the five most common AI review smells
1. Too much confidence around unclear behavior
If the code introduces a new branch, fallback, or default without explaining why, pause.
2. Cosmetic churn around a real fix
Unrelated renames and formatting changes make review slower and hide risk.
3. Dead abstractions
New helper layers that save three lines today can cost three hours later.
4. Missing negative paths
Happy-path tests pass while permission failures, null inputs, or retry loops stay untested.
5. Correct local change, broken system behavior
The function is cleaner, but cache invalidation, ordering, or retries changed downstream.
Use CODEOWNERS and branch protections like a grown-up
The easiest way to normalize safe AI PR review is to stop relying on memory and put the policy in the repo: require CODEOWNERS review on sensitive paths, require status checks, block force-pushes, require conversation resolution, and prefer squash merge for noisy agent iterations.
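Branch protection lives in repository settings, but path ownership can live in the repo itself. A minimal CODEOWNERS sketch, reusing the restricted paths from the scope check above with hypothetical team handles:

```
# Sensitive paths require review from a named owning team.
/app/auth/          @security-team
/db/migrations/     @data-platform
/infra/terraform/   @platform-ops
```

With "require review from code owners" enabled on the protected branch, an agent-authored PR touching these paths physically cannot merge on a green build alone.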
A practical human review loop
- read the PR summary before the diff
- classify risk level
- inspect touched files for scope creep
- read tests before implementation on risky changes
- verify invariants and side effects
- confirm CI evidence, not just green badges
- request a narrower retry if the change mixes concerns
The real goal is faster trust, not blind trust
AI coding tools are not making code review obsolete. They are making review system design more important.
If you want fast merges without becoming the bottleneck, do three things well: keep the PR narrow, make invariants explicit, and automate repetitive suspicion. Once those are in place, the human reviewer can spend attention on judgment instead of diff archaeology.