31 posts tagged with "code-review"

The PR Description Your Coding Agent Generated That Humans Stopped Reading

· 11 min read
Tian Pan
Software Engineer

A year ago your team adopted a PR description template. It had a ## Summary, a ## Changes, a ## Test plan, and a row of checkboxes. Reviewers loved it: every PR had context, every PR had a test plan, every PR had structure. Six months later the coding agent learned to fill it in. Now every PR has a ## Summary, a ## Changes, a ## Test plan, and a row of checkboxes — and reviewers no longer read past the title. The format that once focused attention now signals that there is nothing worth focusing on. The structure outlived the signal it carried.

This is not a code-quality problem. The code in those PRs is often fine. The problem is that the act of writing a description has been amputated from the act of thinking about the change, and the description is the artifact reviewers used to triage what to spend their finite attention on. When that artifact becomes uniformly formatted, plausibly worded, and indistinguishable from every other PR, the reviewer's attention triage breaks. The system that used to surface the unusual now flattens everything into the same shape.

The Idiom Your Coding Agent Wrote Around Instead Of Using

· 11 min read
Tian Pan
Software Engineer

A senior engineer on a payments team I work with told me a story that I think every team running coding agents will eventually live through. Their codebase has a Result<T, E> wrapper — homegrown, sits in a single core/result.ts file, used in roughly two hundred call sites across the service. New code is expected to thread Result through every function that can fail; throwing is reserved for genuinely unexpected states. It's not enforced by a lint rule. It is the dialect.

Six months into shipping with a coding agent, they audited the diffs the agent had merged. About a third of the new functions ignored Result entirely. The agent had reached for try/catch, returned T | null, thrown Error subclasses with descriptive messages — every one of those choices is correct in some imagined codebase. None of them was correct in this one. The code typechecked. The tests passed. Reviewers approved it because nothing in it looked wrong line by line. But the file the agent touched no longer fit the file it lived next to, and the team had quietly grown a second dialect inside their own service.
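To make the two dialects concrete, here is a minimal sketch. The post does not show the contents of core/result.ts, so the `Result` shape, the `ok`/`err` helpers, the `ParseError` type, and both `parseAmount` functions are hypothetical stand-ins, not the team's actual code.

```typescript
// Hypothetical reconstruction of a homegrown Result wrapper.
// The real core/result.ts is not shown in the post.
type Result<T, E> =
  | { ok: true; value: T }
  | { ok: false; error: E };

const ok = <T>(value: T): Result<T, never> => ({ ok: true, value });
const err = <E>(error: E): Result<never, E> => ({ ok: false, error });

// Illustrative error type, assumed for this example only.
type ParseError = { kind: "not_a_number"; input: string };

// House dialect: an expected failure travels in the return type,
// so every caller is forced to handle it.
function parseAmount(input: string): Result<number, ParseError> {
  const n = Number(input);
  if (input.trim() === "" || Number.isNaN(n)) {
    return err<ParseError>({ kind: "not_a_number", input });
  }
  return ok(n);
}

// Drifted dialect: typechecks and can pass the same tests, but the
// failure is now an exception that Result-threading callers never see.
function parseAmountDrifted(input: string): number {
  const n = Number(input);
  if (input.trim() === "" || Number.isNaN(n)) {
    throw new Error(`Invalid amount: ${input}`);
  }
  return n;
}
```

Both functions are defensible in isolation. The drift only becomes visible when you compare the second signature against the two hundred call sites that expect the first.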

This is the failure mode I want to talk about: not bugs, not hallucinations, not lint violations — idiomatic drift. The agent ships code that compiles, runs, and passes tests, in a style your codebase does not speak. Over enough merges, the codebase bifurcates into agent-style zones and human-style zones, and the cost shows up in places no dashboard is watching.

The PR-Bot That Never Sleeps: When Your Reviewers Become the Rate Limiter

· 11 min read
Tian Pan
Software Engineer

For two decades the bottleneck in software engineering was writing code. We optimized IDEs, autocompletion, refactoring tools, and frameworks to make typing cheaper. We won. Now the bottleneck has moved one step downstream: writing is cheap, and reading is expensive. The PR-bot can spin up ten implementation attempts in parallel and open ten pull requests against your repo before you finish your morning coffee. Your reviewers cannot.

The rate limiter for AI-assisted software delivery is no longer the model's tokens per second. It is the number of human eyes you can put on a diff per day. And when those eyes get overwhelmed, you do not get a graceful degradation — you get rubber stamps. Code merges with LGTM 🚀 on top of code that nobody actually read. A senior engineer approves an AI-written patch that another AI tool already reviewed, and three weeks later a data-inconsistency bug eats forty hours of someone's life. Surface correctness is not systemic correctness, and a green pipeline is not understanding.

The PR Description Your Coding Agent Cannot Write

· 10 min read
Tian Pan
Software Engineer

Your coding agent finished the task. The diff is small, the tests are green, the lint is clean, and the PR body says, in its entirety, "Fixes the bug in module X." A reviewer six time zones away opens the page, reads the diff in isolation, sees nothing wrong with it, and approves a technically correct change that solves the wrong problem. The change ships. Two days later a customer asks why the workaround they had been relying on stopped working, and you discover that the bug your agent fixed was not the bug the ticket was about.

The code was fine. The reviewer was conscientious. The agent did exactly what it was asked. The artifact between them — the pull request — was empty of everything that would have caught the mistake.

A Prompt Diff Hides Its Own Blast Radius

· 9 min read
Tian Pan
Software Engineer

A pull request lands in your review queue. The diff shows three words changed inside a system prompt: Output strictly valid JSON became Always respond using clean, parseable JSON. It reads like a copy edit. You skim it, the CI checkmark is green, and you click approve. Total time: ninety seconds.

Six hours later, the downstream parser starts rejecting responses with trailing commas and missing fields. The structured-output error rate climbs from near-zero to double digits, and a revenue-generating workflow stalls. Nothing in the diff predicted this. Nothing in the diff could have predicted this, because the diff measured the wrong thing.

This is the central problem with reviewing prompt changes: the size of a prompt diff tells you nothing about the size of its effect. A three-word change and a three-paragraph rewrite are both just text, and a text diff renders them with the same visual weight as any other edit. But a prompt is not text that describes behavior — it is text that causes behavior, and the causal blast radius of an edit is invisible in the artifact you are reviewing.
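One way to measure the effect rather than the text is to run the edited prompt against a fixed set of inputs and score the outputs the way the downstream parser would. The sketch below covers only the scoring half. The function name `structuredOutputErrorRate` and the idea of a required-field list are assumptions for illustration, and collecting the sample outputs from your model is left out.

```typescript
// Returns the fraction of sample outputs that either fail strict JSON
// parsing or lack one of the required top-level fields.
// Hypothetical helper; not part of any library.
function structuredOutputErrorRate(
  outputs: string[],
  requiredFields: string[],
): number {
  if (outputs.length === 0) return 0;

  let failures = 0;
  for (const raw of outputs) {
    try {
      // JSON.parse is strict: trailing commas raise a SyntaxError.
      const parsed: unknown = JSON.parse(raw);
      const isObject =
        typeof parsed === "object" && parsed !== null && !Array.isArray(parsed);
      const hasAllFields =
        isObject && requiredFields.every((field) => field in (parsed as object));
      if (!hasAllFields) failures += 1;
    } catch {
      failures += 1;
    }
  }
  return failures / outputs.length;
}
```

Comparing this rate before and after the edit, on the same inputs, gives the reviewer a number that the three-word diff cannot. The threshold at which the change should be blocked is a team decision, not something the sketch prescribes.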

The AI Told Me So Defense: When Code Review Quietly Stops Pushing Back

· 11 min read
Tian Pan
Software Engineer

The single most expensive sentence in a 2026 code review thread is "the agent wrote it this way." Not because it's wrong — sometimes it isn't — but because it ends a conversation that used to start one. The reviewer types a question, the author quotes the model's reasoning back at them, and the thread resolves before anyone has actually argued about the change. The social cost of disagreeing with a confident, well-spoken model has quietly become higher than the cost of merging a subtle bug, and most teams won't see the trade in their metrics for another two quarters.

This is not a story about whether AI writes good code. It writes code, some of it good. This is a story about what happens to a quality gate when the friction at composition time collapses. Review velocity rises, defect rate rises in lockstep, and the correlation isn't obvious because nobody is tracking review-time-to-defect with the author class attached. The senior engineer who used to be the gravity well of taste in the codebase becomes the lone holdout in a culture quietly recalibrating around model deference.

The Prompt Author Identity Problem: Three Roles Editing the Same File

· 13 min read
Tian Pan
Software Engineer

Pull up the git blame on any year-old production system prompt and you will find something the engineering team is not ready to admit: the file has three authors, none of whom share a definition of what a "change" is. The engineer who refactored the instruction blocks last month logged the commit as "no functional change, just reordering for clarity." The product manager who reads the file once a quarter would describe the same diff as "you rewrote the voice — customers will notice." The ML engineer running the regression suite would call it "you broke few-shot example three, and the eval has been red ever since."

All three are right. The prompt is simultaneously code, spec, and hyperparameter, and every team that ships an AI feature long enough discovers that the file's commit history is a slow-motion three-way authorship dispute that CODEOWNERS does not capture and the diff viewer does not surface.

The Mixed PR Queue: Reviewer Throughput Is Now the Binding Constraint

· 10 min read
Tian Pan
Software Engineer

For the last twenty years, the Theory of Constraints answer in software delivery was the same: the bottleneck is producing code. We tooled around that assumption — pair programming, IDE autocompletion, faster CI, smaller services, all designed to push more code through a fixed-width review pipe. Then coding agents arrived, the production side of the pipe got 5–10x wider, and the review pipe stayed exactly the same width. A senior engineer who used to open three PRs a week now supervises a fleet that opens thirty in an afternoon. The team's velocity is no longer set by how fast anyone writes code. It's set by how fast a human can read it.

This is not a future problem. In some samples, median PR review time has risen 441% year over year, and 31% more PRs are merging with zero review, not by policy but because reviewers gave up trying to keep pace. Stripe is shipping over a thousand agent-produced PRs per week. Feature-branch throughput grew 59% YoY in one benchmark while main-branch throughput fell 7%: code is being written, but it's not getting promoted, because it's stuck in review.

Spec, Code, Tests, One Author: The Independence You Quietly Lost

· 11 min read
Tian Pan
Software Engineer

When the same model writes the requirements, implements them, and authors the assertions that say it is correct, "all tests pass" is no longer evidence the feature works. It is evidence the model is internally consistent. Those are different things, and the difference is the entire point of having tests in the first place.

The standard story we tell about test suites is that they are a second opinion. The author wrote the code with one mental model of the requirement, the test author wrote the assertions with a slightly different mental model, and the points where the two models disagree are where the bugs live. That story depends on the test author having a different cognitive vantage point than the code author. Strip out the difference in vantage points and the test suite stops carrying any independent information about correctness — it only carries information about consistency.

Prompt Edits Aren't Wording Changes: A Code Review Discipline for Prompts as Software

· 11 min read
Tian Pan
Software Engineer

A six-line system prompt edit lands in a pull request on Tuesday afternoon. The diff is in plain English. Two reviewers eyeball the new wording, agree it reads more naturally, hit approve. The PR merges in under a minute. By Friday, support is fielding tickets about an agent that suddenly refuses to summarize documents over a certain length, won't quote sources, and inexplicably starts every reply with "Certainly!" — a behavior nobody asked for and the diff didn't predict.

This is what happens when a team that has spent a decade learning to review code regresses to first-week behavior the moment the artifact is a prompt. The diff looks harmless because it reads like English, and English is what humans review with their eyes. The discipline that makes code review work — running the tests, examining the blast radius, treating "small changes" with appropriate skepticism — quietly does not transfer. The wording got better; the behavior got worse; nobody noticed until users did.

The AI Code Review Inversion: What to Focus on When the Author Is a Machine

· 9 min read
Tian Pan
Software Engineer

Your code review is optimizing for the wrong thing. When AI agents contribute the majority of your commits, reviewing for local correctness — does this function do what it says? — is like grading a math test by checking the handwriting. The machine already passed your linter, ran your test suite, and formatted the output to spec. The bugs it ships are not the bugs line-by-line review catches.

A large-scale study of GitHub pull requests found that AI-co-authored PRs contain 1.7x more issues than human-only PRs — including 75% more logic and correctness issues, 2.74x more security vulnerabilities, and 3x more readability problems. Not because the code looks wrong. Because it does the wrong thing, in the wrong place, with the wrong assumptions about the rest of the system. Those are precisely the failure modes that traditional code review, optimized for catching typos and style violations, is not designed to find.

The Invisible Author Problem: Git Blame When AI Writes Most of Your Code

· 8 min read
Tian Pan
Software Engineer

When something breaks in production, the first thing engineers reach for is git blame. The commit hash links to a PR. The PR links to an author. The author links to context — a Slack thread, a design doc, a brain that remembers why the code was written that way. This chain is how teams debug incidents, conduct security audits, and accumulate institutional knowledge. It assumes that every line of code has a human author who understood what they were doing.

AI has quietly broken that assumption. Roughly 46% of code is now AI-generated, with Java shops pushing that figure past 60%. Most of that code carries no meaningful provenance metadata. The git blame chain still runs — it just now terminates at a developer who accepted a suggestion they may not have fully understood, with no record of the prompt, the model version, or the alternatives the AI rejected.
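As one possible shape for that missing record, a team could attach provenance as trailers at the end of the commit message, the same "Key: value" convention Git already uses for lines like Signed-off-by. The sketch below reads such trailers back out. The trailer names `AI-Model` and `AI-Prompt-Hash` are hypothetical, chosen for this example, and do not refer to any existing standard.

```typescript
// Hypothetical trailer keys; no existing provenance standard is implied.
const PROVENANCE_KEYS = ["AI-Model", "AI-Prompt-Hash"] as const;

type ProvenanceKey = (typeof PROVENANCE_KEYS)[number];
type Provenance = Partial<Record<ProvenanceKey, string>>;

// Reads "Key: value" trailers from the last paragraph of a commit
// message and keeps only the provenance keys listed above.
function readProvenance(commitMessage: string): Provenance {
  const paragraphs = commitMessage.trim().split(/\n\s*\n/);
  const lastParagraph = paragraphs[paragraphs.length - 1] ?? "";

  const found: Provenance = {};
  for (const line of lastParagraph.split("\n")) {
    const match = /^([A-Za-z-]+):\s*(.+)$/.exec(line.trim());
    if (!match) continue;
    const [, rawKey, value] = match;
    const key = PROVENANCE_KEYS.find((k) => k === rawKey);
    if (key && value !== undefined) found[key] = value;
  }
  return found;
}
```

An empty result from `readProvenance` marks a commit where the blame chain ends at a person with no recorded model context, which is the case the paragraph above describes.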