
33 posts tagged with "code-review"


The Codebase Index Your Coding Agent Rebuilt From a Checkout Three Weeks Behind Main

· 10 min read
Tian Pan
Software Engineer

A coding agent on your team opens a pull request that calls parseUserToken() four times across two files. The function does not exist in the repository, has not existed for nineteen days, and was replaced by decodeSessionClaim() in a commit your engineers all remember reviewing. The agent did not invent the name. It read the name from its semantic index — a vector store rebuilt from a working copy that was twenty-one days behind main. The agent's edit step, by contrast, ran git pull at session start and operated on fresh code. Two views of the same repository, three weeks apart, and the agent confidently bridged them with code that does not compile against anything real.

This is the failure mode that doesn't announce itself. The agent ran. The tests appeared to pass. The PR landed. The first reviewer noticed only because a stubbed-out function shared a name with an unrelated helper and tripped the linter. By then the agent had spent a full sprint writing against a phantom version of the codebase, and no one on the team — including the agent — had any signal that something was wrong.
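
One way to make the mismatch announce itself is to have the indexer record which commit it was built from, and to compare that against the checkout before the agent trusts a lookup. The following is a minimal sketch under that assumption. `IndexManifest`, `checkIndexFreshness`, and the idea that the index stores a commit SHA are all hypothetical; nothing in the scenario above says the index exposes this metadata.

```typescript
// Hypothetical metadata an indexer could write next to the vector store.
interface IndexManifest {
  commitSha: string; // commit the index was built from
  builtAt: Date;     // when the index was built
}

interface FreshnessReport {
  fresh: boolean;
  reason: string;
}

// Compare the commit behind the index with the commit the edit step sees.
// The index counts as fresh only when both views point at the same commit.
function checkIndexFreshness(
  manifest: IndexManifest,
  headSha: string,
  now: Date,
): FreshnessReport {
  if (manifest.commitSha === headSha) {
    return { fresh: true, reason: "index matches checkout" };
  }
  const msPerDay = 24 * 60 * 60 * 1000;
  const ageDays = Math.floor(
    (now.getTime() - manifest.builtAt.getTime()) / msPerDay,
  );
  return {
    fresh: false,
    reason:
      `index built from ${manifest.commitSha}, checkout is at ${headSha} ` +
      `(index is ${ageDays} days old)`,
  };
}
```

An agent harness could call this at session start, right after the `git pull`, and refuse to answer symbol lookups from the index until it has been rebuilt. The SHA comparison is deliberately strict: an index that is one commit behind can already contain a name that no longer exists.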

The Pull Request Your Coding Agent Opened That Closed a Real One

· 11 min read
Tian Pan
Software Engineer

Your coding agent opened a pull request at 3:14 on a Tuesday afternoon. The PR description was clean, the diff was small, the CI was green. It got squash-merged twenty minutes later. The teammate who came back from lunch at 1:20 the next day saw a notification: "PR #1247 was closed." Not merged. Closed. The branch was gone. The seventy-two review comments she'd left on it the previous week were gone too — collapsed under an "outdated" label on a PR that no longer existed in any active list. A senior engineer's design decisions, two rounds of back-and-forth with the security reviewer, and a careful migration plan that took a week to negotiate, all vanished into a footnote on a different PR that nobody had read closely. The squash commit's only trace of what happened was a one-line tag at the bottom: Closed by #1893.

This is the failure mode of trusting a coding agent to write its own pull request metadata. Not the code — the metadata. The diff was fine. The agent did good work. What it could not do was distinguish a fresh discussion from a stale one, and GitHub's auto-close machinery treats every closing keyword the agent writes as a load-bearing instruction. Your agent reads the comments to gather context, infers from a six-month-old reply that its work supersedes an older PR, writes Closes #1247 in the description it generates, and the merge does the rest — silently, mechanically, irrevocably from the perspective of anyone who wasn't watching the diff at the moment of squash.
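
A cheap guard is to scan the agent-generated description for closing keywords before it is posted, and to hold back any closure a human has not approved. The sketch below assumes the keyword list GitHub documents for linking pull requests to issues (close, closes, closed, fix, fixes, fixed, resolve, resolves, resolved) and handles only the same-repository `#123` form. The function names and the approval list are hypothetical.

```typescript
interface ClosingReference {
  keyword: string;
  issueNumber: number;
}

// Find "Closes #123"-style references in a PR body.
// Only the same-repository "#<number>" form is handled in this sketch.
function findClosingReferences(prBody: string): ClosingReference[] {
  const pattern = /\b(close[sd]?|fix(?:e[sd])?|resolve[sd]?)\b:?\s+#(\d+)/gi;
  const found: ClosingReference[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(prBody)) !== null) {
    found.push({
      keyword: match[1].toLowerCase(),
      issueNumber: Number(match[2]),
    });
  }
  return found;
}

// Return the closures that nobody explicitly approved.
function unapprovedClosures(
  prBody: string,
  approvedNumbers: readonly number[],
): ClosingReference[] {
  return findClosingReferences(prBody).filter(
    (ref) => approvedNumbers.indexOf(ref.issueNumber) < 0,
  );
}
```

With an empty approval list, a body ending in `Closes #1247` yields one unapproved closure, and the harness can strip the keyword or ask a human before opening the PR. Prose such as `Closed by #1893` is not flagged, because the keyword is not directly followed by the reference.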

The PR Description Your Coding Agent Generated That Humans Stopped Reading

· 11 min read
Tian Pan
Software Engineer

A year ago your team adopted a PR description template. It had a ## Summary, a ## Changes, a ## Test plan, and a row of checkboxes. Reviewers loved it: every PR had context, every PR had a test plan, every PR had structure. Six months later the coding agent learned to fill it in. Now every PR has a ## Summary, a ## Changes, a ## Test plan, and a row of checkboxes — and reviewers no longer read past the title. The format that once focused attention now signals that there is nothing worth focusing on. The structure outlived the signal it carried.

This is not a code-quality problem. The code in those PRs is often fine. The problem is that the act of writing a description has been amputated from the act of thinking about the change, and the description is the artifact reviewers used to triage what to spend their finite attention on. When that artifact becomes uniformly formatted, plausibly worded, and indistinguishable from every other PR, the reviewer's attention triage breaks. The system that used to surface the unusual now flattens everything into the same shape.

The Idiom Your Coding Agent Wrote Around Instead Of Using

· 11 min read
Tian Pan
Software Engineer

A senior engineer on a payments team I work with told me a story that I think every team running coding agents will eventually live through. Their codebase has a Result<T, E> wrapper — homegrown, sits in a single core/result.ts file, used in roughly two hundred call sites across the service. New code is expected to thread Result through every function that can fail; throwing is reserved for genuinely unexpected states. It's not enforced by a lint rule. It is the dialect.

Six months into shipping with a coding agent, they audited the diffs the agent had merged. About a third of the new functions ignored Result entirely. The agent had reached for try/catch, returned T | null, thrown Error subclasses with descriptive messages — every one of those choices is correct in some imagined codebase. None of them was correct in this one. The code typechecked. The tests passed. Reviewers approved it because nothing in it looked wrong line by line. But the file the agent touched no longer fit the file it lived next to, and the team had quietly grown a second dialect inside their own service.

This is the failure mode I want to talk about: not bugs, not hallucinations, not lint violations — idiomatic drift. The agent ships code that compiles, runs, and passes tests, in a style your codebase does not speak. Over enough merges, the codebase bifurcates into agent-style zones and human-style zones, and the cost shows up in places no dashboard is watching.
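
To make the two dialects concrete, here is a small sketch of the same function written both ways. The real `core/result.ts` is not shown, so the `Result` shape, the `ok` and `err` helpers, and `parseAmount` are assumptions for illustration only.

```typescript
// Assumed shape of a homegrown Result type; the team's actual definition may differ.
type Result<T, E> = { ok: true; value: T } | { ok: false; error: E };

function ok<T>(value: T): Result<T, never> {
  return { ok: true, value };
}

function err<E>(error: E): Result<never, E> {
  return { ok: false, error };
}

// House dialect: failure is part of the return type, so callers must handle it.
function parseAmount(input: string): Result<number, string> {
  const n = Number(input);
  if (input.trim() === "" || !isFinite(n)) {
    return err(`not a number: ${input}`);
  }
  return ok(n);
}

// Drifted dialect: typechecks and passes tests, but drops the reason for the
// failure and forces every caller into a different error-handling style.
function parseAmountOrNull(input: string): number | null {
  const n = Number(input);
  return input.trim() === "" || !isFinite(n) ? null : n;
}
```

Both functions are defensible in isolation, which is why line-by-line review approves either one. The difference only shows up at the call site, where one file checks `result.ok` and the file next to it checks for `null`.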

The PR-Bot That Never Sleeps: When Your Reviewers Become the Rate Limiter

· 11 min read
Tian Pan
Software Engineer

For two decades the bottleneck in software engineering was writing code. We optimized IDEs, autocompletion, refactoring tools, and frameworks to make typing cheaper. We won. Now the bottleneck has moved one step downstream: writing is cheap, and reading is expensive. The PR-bot can spin up ten implementation attempts in parallel and open ten pull requests against your repo before you finish your morning coffee. Your reviewers cannot.

The rate limiter for AI-assisted software delivery is no longer the model's tokens per second. It is the number of human eyes you can put on a diff per day. And when those eyes get overwhelmed, you do not get a graceful degradation — you get rubber stamps. Code merges with LGTM 🚀 on top of code that nobody actually read. A senior engineer approves an AI-written patch that another AI tool already reviewed, and three weeks later a data-inconsistency bug eats forty hours of someone's life. Surface correctness is not systemic correctness, and a green pipeline is not understanding.

The PR Description Your Coding Agent Cannot Write

· 10 min read
Tian Pan
Software Engineer

Your coding agent finished the task. The diff is small, the tests are green, the lint is clean, and the PR body says, in its entirety, "Fixes the bug in module X." A reviewer six time zones away opens the page, reads the diff in isolation, sees nothing wrong with it, and approves a technically correct change that solves the wrong problem. The change ships. Two days later a customer asks why the workaround they had been relying on stopped working, and you discover that the bug your agent fixed was not the bug the ticket was about.

The code was fine. The reviewer was conscientious. The agent did exactly what it was asked. The artifact between them — the pull request — was empty of everything that would have caught the mistake.

A Prompt Diff Hides Its Own Blast Radius

· 9 min read
Tian Pan
Software Engineer

A pull request lands in your review queue. The diff shows three words changed inside a system prompt: "Output strictly valid JSON" became "Always respond using clean, parseable JSON." It reads like a copy edit. You skim it, the CI checkmark is green, and you click approve. Total time: ninety seconds.

Six hours later, the downstream parser starts rejecting responses with trailing commas and missing fields. The structured-output error rate climbs from near-zero to double digits, and a revenue-generating workflow stalls. Nothing in the diff predicted this. Nothing in the diff could have predicted this, because the diff measured the wrong thing.

This is the central problem with reviewing prompt changes: the size of a prompt diff tells you nothing about the size of its effect. A three-word change and a three-paragraph rewrite are both just text, and a text diff renders them with the same visual weight as any other edit. But a prompt is not text that describes behavior — it is text that causes behavior, and the causal blast radius of an edit is invisible in the artifact you are reviewing.
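
If the diff cannot show the effect, the review has to measure outputs instead. A minimal sketch of that idea: collect sample responses produced under the old and new prompt, then compare how many fail the same strict check the downstream parser applies. The function names and the required-field list are hypothetical, and producing the samples (running both prompt versions against the model) is outside this sketch.

```typescript
// True when the raw model output parses as a JSON object with every required field.
function isValidStructuredOutput(
  raw: string,
  requiredFields: readonly string[],
): boolean {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw); // strict JSON: a trailing comma throws here
  } catch (_err) {
    return false;
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return false;
  }
  const record = parsed as Record<string, unknown>;
  return requiredFields.every((field) => field in record);
}

// Fraction of sampled outputs the downstream parser would reject.
function structuredOutputErrorRate(
  outputs: readonly string[],
  requiredFields: readonly string[],
): number {
  if (outputs.length === 0) {
    return 0;
  }
  const failures = outputs.filter(
    (raw) => !isValidStructuredOutput(raw, requiredFields),
  ).length;
  return failures / outputs.length;
}
```

A review gate could compute this rate for both prompt versions on the same inputs and block the merge when the new rate exceeds the old one by more than an agreed margin. That attaches a number to the prompt change that the text diff cannot provide.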

The AI Told Me So Defense: When Code Review Quietly Stops Pushing Back

· 11 min read
Tian Pan
Software Engineer

The single most expensive sentence in a 2026 code review thread is "the agent wrote it this way." Not because it's wrong — sometimes it isn't — but because it ends a conversation that used to start one. The reviewer types a question, the author quotes the model's reasoning back at them, and the thread resolves before anyone has actually argued about the change. The social cost of disagreeing with a confident, well-spoken model has quietly become higher than the cost of merging a subtle bug, and most teams won't see the trade in their metrics for another two quarters.

This is not a story about whether AI writes good code. It writes code, some of it good. This is a story about what happens to a quality gate when the friction at composition time collapses. Review velocity rises, defect rate rises in lockstep, and the correlation isn't obvious because nobody is tracking review-time-to-defect with the author class attached. The senior engineer who used to be the gravity well of taste in the codebase becomes the lone holdout in a culture quietly recalibrating around model deference.

The Prompt Author Identity Problem: Three Roles Editing the Same File

· 13 min read
Tian Pan
Software Engineer

Pull up the git blame on any year-old production system prompt and you will find something the engineering team is not ready to admit: the file has three authors, none of whom share a definition of what a "change" is. The engineer who refactored the instruction blocks last month logged the commit as "no functional change, just reordering for clarity." The product manager who reads the file once a quarter would describe the same diff as "you rewrote the voice — customers will notice." The ML engineer running the regression suite would call it "you broke few-shot example three, and the eval has been red ever since."

All three are right. The prompt is simultaneously code, spec, and hyperparameter, and every team that ships an AI feature long enough discovers that the file's commit history is a slow-motion three-way authorship dispute that CODEOWNERS does not capture and the diff viewer does not surface.

The Mixed PR Queue: Reviewer Throughput Is Now the Binding Constraint

· 10 min read
Tian Pan
Software Engineer

For the last twenty years, the Theory of Constraints answer in software delivery was the same: the bottleneck is producing code. We tooled around that assumption — pair programming, IDE autocompletion, faster CI, smaller services, all designed to push more code through a fixed-width review pipe. Then coding agents arrived, the production side of the pipe got 5–10x wider, and the review pipe stayed exactly the same width. A senior engineer who used to open three PRs a week now supervises a fleet that opens thirty in an afternoon. The team's velocity is no longer set by how fast anyone writes code. It's set by how fast a human can read it.

This is not a future problem. In some samples, median PR review time has risen 441% year over year, and 31% more PRs are merging with zero review, not by policy but because reviewers gave up trying to keep pace. Stripe is shipping over a thousand agent-produced PRs per week. Feature-branch throughput grew 59% YoY in one benchmark while main-branch throughput fell 7%: code is being written, but it is not getting promoted, because it is stuck in review.

Spec, Code, Tests, One Author: The Independence You Quietly Lost

· 11 min read
Tian Pan
Software Engineer

When the same model writes the requirements, implements them, and authors the assertions that say it is correct, "all tests pass" is no longer evidence the feature works. It is evidence the model is internally consistent. Those are different things, and the difference is the entire point of having tests in the first place.

The standard story we tell about test suites is that they are a second opinion. The author wrote the code with one mental model of the requirement, the test author wrote the assertions with a slightly different mental model, and the points where the two models disagree are where the bugs live. That story depends on the test author having a different cognitive vantage point than the code author. Strip out the difference in vantage points and the test suite stops carrying any independent information about correctness — it only carries information about consistency.

Prompt Edits Aren't Wording Changes: A Code Review Discipline for Prompts as Software

· 11 min read
Tian Pan
Software Engineer

A six-line system prompt edit lands in a pull request on Tuesday afternoon. The diff is in plain English. Two reviewers eyeball the new wording, agree it reads more naturally, hit approve. The PR merges in under a minute. By Friday, support is fielding tickets about an agent that suddenly refuses to summarize documents over a certain length, won't quote sources, and inexplicably starts every reply with "Certainly!" — a behavior nobody asked for and the diff didn't predict.

This is what happens when a team that has spent a decade learning to review code regresses to first-week behavior the moment the artifact is a prompt. The diff looks harmless because it reads like English, and English is what humans review with their eyes. The discipline that makes code review work — running the tests, examining the blast radius, treating "small changes" with appropriate skepticism — quietly does not transfer. The wording got better; the behavior got worse; nobody noticed until users did.