13 posts tagged with "coding-agents"

The Mixed PR Queue: Reviewer Throughput Is Now the Binding Constraint

May 10, 2026 · 10 min read

Tian Pan

Software Engineer

For the last twenty years, the Theory of Constraints answer in software delivery was the same: the bottleneck is producing code. We tooled around that assumption — pair programming, IDE autocompletion, faster CI, smaller services, all designed to push more code through a fixed-width review pipe. Then coding agents arrived, the production side of the pipe got 5–10x wider, and the review pipe stayed exactly the same width. A senior engineer who used to open three PRs a week now supervises a fleet that opens thirty in an afternoon. The team's velocity is no longer set by how fast anyone writes code. It's set by how fast a human can read it.

This is not a future problem. In some samples, median PR review time is up 441% year over year, and 31% more PRs are merging with zero review, not by policy but because reviewers gave up trying to keep pace. Stripe is shipping over a thousand agent-produced PRs per week. In one benchmark, feature-branch throughput grew 59% year over year while main-branch throughput fell 7%: code is being written, but it's not getting promoted, because it's stuck in review.

The AI Code Review Inversion: What to Focus on When the Author Is a Machine

May 7, 2026 · 9 min read

Tian Pan

Software Engineer

Your code review is optimizing for the wrong thing. When AI agents contribute the majority of your commits, reviewing for local correctness — does this function do what it says? — is like grading a math test by checking the handwriting. The machine already passed your linter, ran your test suite, and formatted the output to spec. The bugs it ships are not the bugs line-by-line review catches.

A large-scale study of GitHub pull requests found that AI-co-authored PRs contain 1.7x more issues than human-only PRs — including 75% more logic and correctness issues, 2.74x more security vulnerabilities, and 3x more readability problems. Not because the code looks wrong. Because it does the wrong thing, in the wrong place, with the wrong assumptions about the rest of the system. Those are precisely the failure modes that traditional code review, optimized for catching typos and style violations, is not designed to find.

The Coding Agent Autonomy Curve: Reading Is Free, Merging Is Incident-Class

April 27, 2026 · 11 min read

Tian Pan

Software Engineer

The discourse on coding agents keeps collapsing to a binary: autonomous or supervised, YOLO mode or hands-on-the-wheel, --dangerously-skip-permissions or "approve every keystroke." That framing is a category error. A coding agent does not perform "an action." It performs a sequence of actions whose costs span at least seven orders of magnitude, from reading a file (free, undoable, no side effect) to merging to main (irreversible without a revert PR) to rolling out a binary to a fleet (six-figure incident-class). Treating that range with one autonomy switch is like setting a single speed limit for both a parking lot and a freeway.

The team that ships "the agent can do everything" without mapping each action to its blast radius is one prompt-injection-bearing GitHub comment away from a postmortem — and we already have public examples of that exact failure mode. Anthropic's Claude Code Security Review, Google's Gemini CLI Action, and GitHub Copilot Agent were all confirmed in 2026 to be hijackable through specially crafted PR titles and issue bodies, in an attack pattern the researchers named "Comment and Control." The agents weren't broken in some abstract sense. They executed a high-tier action — pushing code, opening a PR — on the basis of a low-trust input the autonomy tier had silently flattened into "all the same."

What follows is the discipline that has to land: a per-action curve, gates that scale with the tier, rollback velocity matched to blast class, and an eval program that tests for tool-composition escalation rather than single-action failure.
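To make the per-action curve concrete, here is a minimal sketch of a tiered gate. Everything in it is an assumption for illustration: the tier names, the action names, and the three-outcome policy are hypothetical, not taken from any particular agent framework. The one idea it tries to capture is that the gate looks at both the action's tier and the trust level of the input that triggered it, so a low-trust PR comment cannot drive a high-tier action.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Hypothetical blast-radius tiers, ordered from cheapest to costliest."""
    READ = 0        # free, undoable, no side effect
    LOCAL_EDIT = 1  # confined to the working tree
    PUSH = 2        # pushing code, opening a PR
    MERGE = 3       # irreversible without a revert PR
    ROLLOUT = 4     # fleet rollout, incident-class


# Hypothetical action names; a real agent would map its own tool calls here.
ACTION_TIERS = {
    "read_file": Tier.READ,
    "edit_file": Tier.LOCAL_EDIT,
    "open_pr": Tier.PUSH,
    "merge_to_main": Tier.MERGE,
    "deploy_fleet": Tier.ROLLOUT,
}


def gate(action: str, triggered_by_untrusted_input: bool) -> str:
    """Return 'allow', 'require_approval', or 'deny' for one proposed action."""
    # Unknown actions fall into the strictest tier rather than the loosest.
    tier = ACTION_TIERS.get(action, Tier.ROLLOUT)
    if tier <= Tier.LOCAL_EDIT:
        return "allow"
    if triggered_by_untrusted_input:
        # Low-trust input (issue body, PR title) never drives a high-tier action.
        return "deny"
    return "require_approval"
```

With this policy, gate("read_file", True) is allowed, gate("open_pr", True) is denied, and gate("merge_to_main", False) waits for a human. A real deployment would likely want more outcomes than three, but the shape of the lookup stays the same.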

The Unmergeable Agentic Refactor: Why Multi-File Diffs Break at the Seam

April 23, 2026 · 9 min read

Tian Pan

Software Engineer

A 40-file refactor from a coding agent lands on your desk. You open the PR, scroll through the diff, and every hunk looks fine. The rename is consistent, the imports are tidy, the tests compile in isolation. You merge. Forty minutes later, CI on main goes red because two call sites in a sibling package still pass three arguments to a function that now takes four, and the type checker that would have caught it was never part of the agent's inner loop.

This is the most common failure mode in agent-authored refactors today, and it has almost nothing to do with the quality of the individual edits. Each file, reviewed on its own, looks like something a careful human would have written. The bug lives at the seams — the boundaries where edits from different files have to agree. File-level review hides seam-level correctness, and most review workflows were designed around files.
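A seam-level check can be surprisingly small. The sketch below is one hypothetical way to catch the arity mismatch described above for Python code, using only the standard-library ast module: it scans every file for calls to a given function name and reports the ones whose argument count differs from the new signature. The function name find_arity_mismatches and the file paths in the usage note are made up for illustration, and the check is deliberately naive (it matches by name only and does not understand defaults, *args, or **kwargs), so treat it as a stand-in for a real type checker in the agent's loop, not a replacement.

```python
import ast


def find_arity_mismatches(sources, func_name, expected_args):
    """Return (filename, line, arg_count) for each call to func_name whose
    number of arguments differs from expected_args.

    sources maps a filename to that file's source code.
    """
    mismatches = []
    for filename, code in sources.items():
        for node in ast.walk(ast.parse(code, filename=filename)):
            if not isinstance(node, ast.Call):
                continue
            callee = node.func
            # Handle both plain calls f(...) and attribute calls mod.f(...).
            if isinstance(callee, ast.Name):
                name = callee.id
            else:
                name = getattr(callee, "attr", None)
            if name != func_name:
                continue
            count = len(node.args) + len(node.keywords)
            if count != expected_args:
                mismatches.append((filename, node.lineno, count))
    return mismatches
```

Run over a whole repository rather than only the files in the diff, a check like this flags the sibling-package call sites that file-by-file review never shows. For example, given one file calling charge(user, amount, currency, idempotency_key) and another still calling charge(user, amount, currency), asking for expected_args=4 reports only the second.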

AI Coding Agents on Legacy Codebases: What Works and What Backfires

April 19, 2026 · 10 min read

Tian Pan

Software Engineer

Most AI coding demos show an agent building a greenfield Todo app or implementing a clean API from scratch. Your codebase, however, is a fifteen-year-old monolith with undocumented implicit contracts, deprecated dependencies that three teams depend on in ways nobody fully understands, and a service layer that started as a single class and now spans forty files. The gap between demo and reality is not just a size problem — it's a structural one, and understanding it before you hand your agents the keys prevents a specific category of subtle, expensive failures.

AI coding agents genuinely help with legacy systems, but only within certain task boundaries. Outside those boundaries, they don't just fail noisily — they produce plausible-looking, syntactically valid, semantically wrong changes that slip through code review and surface in production.

The AI-Generated Code Maintenance Trap: What Teams Discover Six Months Too Late

April 17, 2026 · 11 min read

Tian Pan

Software Engineer

The pattern is almost universal across teams that adopted coding agents in 2023 and 2024. In month one, velocity doubles. In month three, management holds up the productivity metrics as evidence that AI investment is paying off. By month twelve, the engineering team can't explain half the codebase to new hires, refactoring has become prohibitively expensive, and engineers spend more time debugging AI-generated code than they would have spent writing it by hand.

This isn't a story about AI code being secretly bad. It's a story about how the quality characteristics of AI-generated code systematically defeat the organizational practices teams already had in place — and how those practices need to change before the debt compounds beyond recovery.

Coding Agents in the Monorepo: Why Context Windows and 50-Service Repos Don't Mix

April 17, 2026 · 9 min read

Tian Pan

Software Engineer

Here's a failure mode that happens silently: you ask a coding agent to update the authentication service's token refresh endpoint. The agent produces clean-looking code — confident, well-commented, type-safe. It also calls a method signature that was renamed three months ago in a shared library three directories up. The tests for that endpoint pass because the mock still uses the old signature. The bug surfaces in staging when the real library gets pulled in.

This isn't a hallucination in the abstract sense. The model knew about that method — it existed somewhere in the training data or was briefly visible in context. The problem is architectural: the agent never had access to the current version of the interface it was calling.

The Deprecated API Trap: Why AI Coding Agents Break on Library Updates

April 17, 2026 · 10 min read

Tian Pan

Software Engineer

Your AI coding agent just generated a pull request. The code looks right. It compiles. Tests pass. You merge it. Two days later, your CI pipeline in staging starts throwing AttributeError: module 'openai' has no attribute 'ChatCompletion'. The agent used an API pattern that was deprecated a year ago and removed in the latest major version.

This is the deprecated API trap, and it bites teams far more often than the conference talks about AI code quality suggest. An empirical study evaluating seven frontier LLMs across 145 API mappings found that most models exhibit API Usage Plausibility (AUP) below 30% across popular Python libraries. When explicitly given deprecated context, all tested models demonstrated 70–90% deprecated usage rates. The problem is structural, not a quirk of a particular model or library.
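One cheap guard is a static scan for known-removed names before the PR ever reaches review. The sketch below is a minimal version under stated assumptions: REMOVED_APIS is a hand-maintained denylist (hypothetical, seeded here with the openai.ChatCompletion case from the example above), and the scanner only matches literal dotted names, so it misses aliased imports such as "import openai as oa". It uses the standard-library ast module and never imports the library being checked.

```python
import ast

# Hypothetical, hand-maintained denylist: dotted name -> migration hint.
REMOVED_APIS = {
    "openai.ChatCompletion": "removed in openai 1.x; use client.chat.completions.create",
}


def dotted_name(node):
    """Rebuild 'a.b.c' from nested ast.Attribute nodes, or None if the chain
    does not start at a plain name."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def find_removed_api_usage(code):
    """Return (line, dotted_name) for every use of a denylisted API in code."""
    hits = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Attribute):
            name = dotted_name(node)
            if name in REMOVED_APIS:
                hits.append((node.lineno, name))
    return hits
```

Because the scan is purely syntactic, it can run in the agent's inner loop or as a pre-merge check without installing the target library. Keeping the denylist current is the real cost, which is the same structural problem the study points at.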

Machine-Readable Project Context: Why Your CLAUDE.md Matters More Than Your Model

April 14, 2026 · 8 min read

Tian Pan

Software Engineer

Most teams that adopt AI coding agents spend the first week arguing about which model to use. They benchmark Opus vs. Sonnet vs. GPT-4o on contrived examples, obsess over the leaderboard, and eventually pick something. Then they spend the next three months wondering why the agent keeps rebuilding the wrong abstractions, ignoring their test strategy, and repeatedly asking which package manager to use.

The model wasn't the problem. The context file was.

Every AI coding tool — Claude Code, Cursor, GitHub Copilot, Windsurf — reads a project-specific markdown file at the start of each session. These files go by different names: CLAUDE.md, .cursor/rules/, .github/copilot-instructions.md, AGENTS.md. But they share the same purpose: teaching the agent what it cannot infer from reading the code alone. The quality of this file now predicts output quality more reliably than the model behind it. Yet most teams write them once, badly, and never touch them again.

Measuring Real AI Coding Productivity: The Metrics That Survive the 90-Day Lag

April 14, 2026 · 9 min read

Tian Pan

Software Engineer

Most teams adopting AI coding tools hit the same wall. Month one looks like a success story: PR throughput is up, sprint velocity is climbing, and the engineering manager is putting together a slide deck to share with leadership. By month three, something has quietly gone wrong. Incidents creep up. Senior engineers are spending more time in review. A simple bug fix now requires understanding code nobody on the team actually wrote. The productivity gains have evaporated — but the measurement system never caught it.

The problem is that the metrics most teams reach for first — lines generated, PRs merged, story points burned — are the wrong unit of measurement for AI-assisted development. They measure the cost of producing code, not the cost of owning it. And AI has made production nearly free while leaving ownership costs untouched.

Agentic Coding in Production: What SWE-bench Scores Don't Tell You

April 9, 2026 · 11 min read

Tian Pan

Software Engineer

When a frontier model scores 80% on SWE-bench Verified, it sounds like a solved problem. Four out of five real GitHub issues, handled autonomously. Ship it to your team. Except: that same model, on SWE-bench Pro — a benchmark specifically designed to resist contamination with long-horizon tasks from proprietary codebases — scores 23%. And a rigorous controlled study of experienced developers found that using AI coding tools made them 19% slower, not faster.

These numbers aren't contradictions. They're the gap between what benchmarks measure and what production software engineering actually requires. If you're building or buying into agentic coding tools, that gap is the thing worth understanding.

CLAUDE.md and AGENTS.md: The Configuration Layer That Makes AI Coding Agents Actually Follow Your Rules

February 25, 2026 · 9 min read

Tian Pan

Software Engineer

Your AI coding agent doesn't remember yesterday. Every session starts cold — it doesn't know you use yarn not npm, that you avoid any types, or that the src/generated/ directory is sacred and should never be edited by hand. So it generates code with the wrong package manager, introduces any where you've banned it, and occasionally overwrites generated files you'll spend an hour recovering. You correct it. Tomorrow it makes the same mistake. You correct it again.

This is not a model quality problem. It's a configuration problem — and the fix is a plain Markdown file.

CLAUDE.md, AGENTS.md, and their tool-specific cousins are the briefing documents AI coding agents read before every session. They encode what the agent would otherwise have to rediscover or be corrected on: which commands to run, which patterns to avoid, how your team's workflow is structured, and which directories are off-limits. They're the equivalent of a thorough engineering onboarding document, compressed into a form optimized for machine consumption.
