Skip to main content

26 posts tagged with "code-review"

View all tags

The AI Told Me So Defense: When Code Review Quietly Stops Pushing Back

· 11 min read
Tian Pan
Software Engineer

The single most expensive sentence in a 2026 code review thread is "the agent wrote it this way." Not because it's wrong — sometimes it isn't — but because it ends a conversation that used to start one. The reviewer types a question, the author quotes the model's reasoning back at them, and the thread resolves before anyone has actually argued about the change. The social cost of disagreeing with a confident, well-spoken model has quietly become higher than the cost of merging a subtle bug, and most teams won't see the trade in their metrics for another two quarters.

This is not a story about whether AI writes good code. It writes code, some of it good. This is a story about what happens to a quality gate when the friction at composition time collapses. Review velocity rises, defect rate rises in lockstep, and the correlation isn't obvious because nobody is tracking review-time-to-defect with the author class attached. The senior engineer who used to be the gravity well of taste in the codebase becomes the lone holdout in a culture quietly recalibrating around model deference.

The Prompt Author Identity Problem: Three Roles Editing the Same File

· 13 min read
Tian Pan
Software Engineer

Pull up the git blame on any year-old production system prompt and you will find something the engineering team is not ready to admit: the file has three authors, none of whom share a definition of what a "change" is. The engineer who refactored the instruction blocks last month logged the commit as "no functional change, just reordering for clarity." The product manager who reads the file once a quarter would describe the same diff as "you rewrote the voice — customers will notice." The ML engineer running the regression suite would call it "you broke few-shot example three, and the eval has been red ever since."

All three are right. The prompt is simultaneously code, spec, and hyperparameter, and every team that ships an AI feature long enough discovers that the file's commit history is a slow-motion three-way authorship dispute that CODEOWNERS does not capture and the diff viewer does not surface.

The Mixed PR Queue: Reviewer Throughput Is Now the Binding Constraint

· 10 min read
Tian Pan
Software Engineer

For the last twenty years, the Theory of Constraints answer in software delivery was the same: the bottleneck is producing code. We tooled around that assumption — pair programming, IDE autocompletion, faster CI, smaller services, all designed to push more code through a fixed-width review pipe. Then coding agents arrived, the production side of the pipe got 5–10x wider, and the review pipe stayed exactly the same width. A senior engineer who used to open three PRs a week now supervises a fleet that opens thirty in an afternoon. The team's velocity is no longer set by how fast anyone writes code. It's set by how fast a human can read it.

This is not a future problem. Median PR review time has been measured at +441% year over year in some samples, and 31% more PRs are merging with zero review — not by policy, but because reviewers gave up trying to keep pace. Stripe is shipping over a thousand agent-produced PRs per week. Feature-branch throughput grew 59% YoY in one benchmark while main-branch throughput fell 7% — code is being written, but it's not getting promoted, because it's stuck in review.

Spec, Code, Tests, One Author: The Independence You Quietly Lost

· 11 min read
Tian Pan
Software Engineer

When the same model writes the requirements, implements them, and authors the assertions that say it is correct, "all tests pass" is no longer evidence the feature works. It is evidence the model is internally consistent. Those are different things, and the difference is the entire point of having tests in the first place.

The standard story we tell about test suites is that they are a second opinion. The author wrote the code with one mental model of the requirement, the test author wrote the assertions with a slightly different mental model, and the points where the two models disagree are where the bugs live. That story depends on the test author having a different cognitive vantage point than the code author. Strip out the difference in vantage points and the test suite stops carrying any independent information about correctness — it only carries information about consistency.

Prompt Edits Aren't Wording Changes: A Code Review Discipline for Prompts as Software

· 11 min read
Tian Pan
Software Engineer

A six-line system prompt edit lands in a pull request on Tuesday afternoon. The diff is in plain English. Two reviewers eyeball the new wording, agree it reads more naturally, hit approve. The PR merges in under a minute. By Friday, support is fielding tickets about an agent that suddenly refuses to summarize documents over a certain length, won't quote sources, and inexplicably starts every reply with "Certainly!" — a behavior nobody asked for and the diff didn't predict.

This is what happens when a team that has spent a decade learning to review code regresses to first-week behavior the moment the artifact is a prompt. The diff looks harmless because it reads like English, and English is what humans review with their eyes. The discipline that makes code review work — running the tests, examining the blast radius, treating "small changes" with appropriate skepticism — quietly does not transfer. The wording got better; the behavior got worse; nobody noticed until users did.

The AI Code Review Inversion: What to Focus on When the Author Is a Machine

· 9 min read
Tian Pan
Software Engineer

Your code review is optimizing for the wrong thing. When AI agents contribute the majority of your commits, reviewing for local correctness — does this function do what it says? — is like grading a math test by checking the handwriting. The machine already passed your linter, ran your test suite, and formatted the output to spec. The bugs it ships are not the bugs line-by-line review catches.

A large-scale study of GitHub pull requests found that AI-co-authored PRs contain 1.7x more issues than human-only PRs — including 75% more logic and correctness issues, 2.74x more security vulnerabilities, and 3x more readability problems. Not because the code looks wrong. Because it does the wrong thing, in the wrong place, with the wrong assumptions about the rest of the system. Those are precisely the failure modes that traditional code review, optimized for catching typos and style violations, is not designed to find.

The Invisible Author Problem: Git Blame When AI Writes Most of Your Code

· 8 min read
Tian Pan
Software Engineer

When something breaks in production, the first thing engineers reach for is git blame. The commit hash links to a PR. The PR links to an author. The author links to context — a Slack thread, a design doc, a brain that remembers why the code was written that way. This chain is how teams debug incidents, conduct security audits, and accumulate institutional knowledge. It assumes that every line of code has a human author who understood what they were doing.

AI has quietly broken that assumption. Roughly 46% of code is now AI-generated, with Java shops pushing that figure past 60%. Most of that code carries no meaningful provenance metadata. The git blame chain still runs — it just now terminates at a developer who accepted a suggestion they may not have fully understood, with no record of the prompt, the model version, or the alternatives the AI rejected.

AI Writes Code in Seconds. Your Team Reviews It for Hours. The Math Isn't Working.

· 8 min read
Tian Pan
Software Engineer

The ROI pitch for AI coding tools is irresistible on paper: developers complete tasks 55% faster in controlled experiments, ship 98% more pull requests, and report saving 3.6 hours per week. But when organizations look at their actual delivery metrics — bug rates, release cycle times, incident frequency — the numbers barely move. Something is absorbing all those gained hours, and it's not hard to find.

AI generates code in seconds. Engineers still review it at the same pace they always have.

Code Ownership Decay: What Happens to Team Knowledge When AI Writes Most Commits

· 9 min read
Tian Pan
Software Engineer

When a bug surfaces in production, the first ritual is the same: open git blame, find who wrote the line, ask them why. That ritual assumes the author had a reason — a constraint they knew, an edge case they handled deliberately, a business rule they'd internalized from three quarters of postmortems. For most of software history, git blame answered a question about intent.

Now, for a growing share of commits, git blame points to a human who merged the code and an AI that generated it. The human may have spent 90 seconds reading the diff. The AI had no context beyond the prompt. The "why" — the institutional knowledge that made git blame useful — was never written down anywhere.

This is code ownership decay. It doesn't announce itself. No single commit breaks the system. Instead, understanding slowly hollows out until the team reaches a decision point — a refactor, an incident, a new hire ramping up — and discovers that nobody can explain the system from the inside anymore.

LLM Code Review in Production: Building a Diff Pipeline That Engineers Actually Trust

· 9 min read
Tian Pan
Software Engineer

Most teams that deploy an LLM code reviewer discover the same failure mode within two weeks: the model produces 10–20 comments per pull request, 80% of which are noise. After the third PR where a developer dismisses every comment without reading them, the tool is effectively dead — notifications routed to a channel no one watches, the bot still spending compute on every push.

The problem isn't the model. It's that the teams shipped a comment generator and called it a reviewer.

Your Coding Agent Is a Junior Engineer Who Never Reads the Tests

· 10 min read
Tian Pan
Software Engineer

The benchmark numbers tell a strange story. On SWE-bench Verified, multiple agent products running the same underlying model — Auggie, Cursor, Claude Code, all on Opus 4.5 — produced wildly different results. Auggie solved 17 more problems out of 731 than its closest peer despite the identical brain. The gap was scaffolding: how the agent was prompted, what context it was given, which tools it could call, and what the harness did when it got confused. The model is a commodity. The scaffolding around it is the product.

This is the same realization mature engineering teams reached about junior engineers a decade ago. A bright graduate doesn't ship value because the model is good. They ship value because the README is current, the test suite is fast, the code review rubric catches the same six mistakes every time, and someone wrote a CONTRIBUTING.md that names the constraints. Strip that scaffolding away and the same person produces locally coherent, globally wrong code that breaks production invariants the team didn't know to write down.

Eval as a Pull Request Comment, Not a Job: Embedding LLM Quality Gates in Code Review

· 11 min read
Tian Pan
Software Engineer

Most teams that say "we have evals" mean: there is a dashboard, somebody runs the suite weekly, and the numbers get pasted into a Slack channel that nobody reads. Reviewers approve a prompt change without ever seeing whether it moved the suite, and the regression shows up two weeks later in a customer ticket. The eval exists; the eval is not in the loop.

The fix is structural, not motivational. Evals only gate quality when they live where the change lives — in the pull request comment, next to the diff, with a per-PR delta and a regression callout that the reviewer cannot scroll past. Anywhere else, they are a performative artifact: real work was done to build them, and they catch nothing.