Agentic Coding in Production: What SWE-bench Scores Don't Tell You
When a frontier model scores 80% on SWE-bench Verified, it sounds like a solved problem. Four out of five real GitHub issues, handled autonomously. Ship it to your team. Except: that same model, on SWE-bench Pro — a benchmark specifically designed to resist contamination with long-horizon tasks from proprietary codebases — scores 23%. And a rigorous controlled study of experienced developers found that using AI coding tools made them 19% slower, not faster.
These numbers aren't contradictions. They're the gap between what benchmarks measure and what production software engineering actually requires. If you're building or buying into agentic coding tools, that gap is the thing worth understanding.
Why SWE-bench Stopped Meaning Anything
SWE-bench Verified became the de facto standard for evaluating AI coding agents. It's a reasonable benchmark on paper: take real GitHub issues from open-source Python repos, give the agent the issue text and codebase, and measure whether its fix passes the existing test suite.
The problem is what happened to scores over 18 months. In August 2024, top scores were around 20%. By early 2026, multiple systems exceeded 80%. That's not because AI coding capability increased fourfold — it's because the benchmark was gamed.
OpenAI retired SWE-bench Verified as their frontier evaluation, stating plainly that "improvements no longer reflect meaningful improvements in models' real-world software development abilities, but increasingly reflect how much the model was exposed to the benchmark at training time." The benchmark's problems are public, the solutions exist on GitHub, and any model trained on GitHub data after mid-2024 has likely seen a substantial portion of them. At least 59% of audited problems have flawed test cases that reject functionally correct submissions. Frontier models can reproduce the original human-written bug fixes verbatim.
The same model can score 69% standalone or 81% with a sophisticated agent harness that retries failures and explores files iteratively. You can't separate model capability from system engineering.
Scale AI's SWE-bench Pro attempts to fix this with 1,865 tasks requiring multi-step reasoning across proprietary and held-out codebases. Top frontier models score 23% on it. That's the real number. The gap between 80% and 23% is what benchmark saturation looks like.
A separate analysis from Epoch AI found that 87% of SWE-bench problems are bug fixes, over 80% come from five Python repositories, half the issues predate 2020, and the median task is something an experienced engineer could complete in under an hour. Multi-file changes, architectural decisions, and ambiguous requirements — the majority of real engineering work — are largely absent.
What Controlled Studies Found
The most rigorous real-world evaluation to date came from METR, an AI safety research organization. They recruited 16 experienced developers who actively maintain large open-source repositories (averaging 1 million lines of code, 22,000 GitHub stars). Developers completed 246 tasks averaging two hours each, using Cursor Pro with Claude 3.5/3.7 Sonnet.
Result: AI use increased task completion time by 19%.
The perception gap is what makes this data particularly valuable. Before the study, developers predicted AI would save them 24% of their time. After completing all 246 tasks under objective time measurement, they still believed it had saved them 20%. The subjective experience of using AI tools — the feeling of moving faster, of suggestions appearing instantly — creates a persistent illusion of productivity even when the objective data runs in the opposite direction.
Developers accepted fewer than 44% of AI-generated code suggestions. Much of their time was spent understanding what the agent had done, verifying it was correct, and cleaning up outputs that didn't fit the existing codebase's conventions.
This is a result about a specific condition: experienced developers, working in mature and complex codebases, on tasks requiring deep familiarity with existing code. The authors explicitly state it doesn't generalize to all contexts. But it's the condition that high-stakes production engineering most resembles.
Faros AI analyzed production data across real organizations and found a parallel paradox. Developers in high-AI-adoption teams merged 98% more pull requests. But PR review time increased 91%, PRs grew 154% larger on average, and bugs per developer increased 9%. The net result: no measurable company-level performance improvement. Individual velocity increased while system throughput stagnated, because code review and testing pipelines can't match AI-accelerated generation rates.
Where Agents Actually Deliver
None of this means coding agents don't work. It means they don't work uniformly across task types. The pattern is consistent across every data source that's honest about results.
Security vulnerability remediation with clear specs is a genuine win. Devin reports remediating vulnerabilities in 1.5 minutes versus 30 minutes for human engineers — a 20x improvement. The key precondition: the vulnerability type is known and the remediation pattern is well-defined. An agent can apply "add input validation here, use parameterized queries there" at machine speed when it knows exactly what to fix and how.
Database migrations and modernization show similar gains. Teams migrating Java applications to newer versions or moving between data frameworks report 10x to 14x speed improvements. The task structure is what makes this work: clear starting state, clear ending state, mechanical transformation pattern, verifiable output. The agent doesn't need to make judgment calls.
Boilerplate and scaffolding work well for the same reason. CRUD layers, API route handlers, serializer classes, test fixtures — these follow established patterns that agents can apply reliably. The code is low-risk and easy to verify.
Bug fixes with reproducible test cases are reliably solvable. When there's a failing test and a clear hypothesis about the cause, agents can iterate through fixes efficiently. Devin's PR merge rate improved from 34% to 67% over 18 months — still one in three rejected, but a genuine improvement. Cognition's own internal application sees roughly one-third of commits coming from the agent.
The common thread: narrow scope, clear success criteria, verifiable output, mechanical transformation. These are the conditions where benchmark evaluation and real-world performance actually converge.
Where Agents Systematically Fail
The failure modes are as consistent as the successes.
Cross-codebase changes collapse as file count grows. Agents "see" a slice of the codebase at any moment, constrained by context window size. In greenfield prototypes with a few files, this is invisible. In a production codebase with thousands of files, agents begin forgetting earlier decisions, mixing up component interfaces, and implementing things that already exist elsewhere in the codebase. Research from Columbia's DAPLab identified this as a primary failure pattern: agents "make increasingly more failures as the number of files in the codebase grows."
State management failures are pervasive. Agents refactor one component's state handling without updating all the code that depends on it. A drag-to-reorder feature that updates part of the state array but not the full list. A cache invalidation that covers three of four call sites. These bugs are hard to catch in code review because they look syntactically correct.
Business logic mismatch is the hardest failure to detect. The agent produces code that runs, passes the tests it can see, and appears to implement the feature — but implements it incorrectly relative to the actual business requirement. A discount applied to individual line items instead of the cart total. A permission check that passes for the wrong role. These failures require someone who understands the business domain to catch, and they tend to appear in production rather than in automated testing.
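The discount example is worth making concrete, because the two versions agree on most inputs and diverge only when per-item rounding kicks in. A minimal sketch (hypothetical functions, not from any cited study):

```python
from decimal import Decimal, ROUND_HALF_UP

CENTS = Decimal("0.01")

def discount_per_item(prices: list[Decimal], pct: Decimal) -> Decimal:
    """Buggy version: discounts each line item, rounding each to cents."""
    return sum((p * (1 - pct)).quantize(CENTS, rounding=ROUND_HALF_UP)
               for p in prices)

def discount_on_total(prices: list[Decimal], pct: Decimal) -> Decimal:
    """Intended version: discounts the cart total once, then rounds."""
    return (sum(prices) * (1 - pct)).quantize(CENTS, rounding=ROUND_HALF_UP)

# Three items at $0.95 with a 10% discount:
# per-item: each 0.855 rounds to 0.86, total 2.58
# on total: 2.85 * 0.9 = 2.565 rounds to 2.57
```

An off-by-a-cent divergence like this passes every test that happens to use round numbers, and surfaces later in billing reconciliation rather than in CI.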
Security flaws are disproportionately common in agent output. Veracode's analysis of AI-generated code found 2.74 times more vulnerabilities than human-written code. AI-generated code is 1.88 times more likely to introduce improper password handling, 1.91 times more likely to produce insecure object references, and 2.74 times more likely to add XSS vulnerabilities. CodeRabbit's analysis of 470 GitHub repositories found AI-generated PRs contained 75% more logic errors than human-written ones. Georgia Tech's Vibe Security Radar tracked 74 confirmed CVEs directly attributable to AI coding tools in public advisories.
Ambiguous requirements amplify all other failures. When the task specification is vague, agents don't ask for clarification — they make assumptions and proceed. The assumption surface area is large, and errors compound across multi-step tasks.
Integration Patterns That Work
Teams that get reliable value from coding agents share a set of practices that aren't about tool choice — they're about workflow design.
Treat agents as asynchronous collaborators, not synchronous pair programmers. The workflow that works looks like: write a structured task spec, hand it to the agent, let it run in isolation, review the PR output. The workflow that fails looks like: ask the agent to do something vague while watching it work and course-correcting in real time. The synchronous mode amplifies the perception-of-productivity problem; the async mode makes output reviewable.
Task specifications need explicit scope constraints. The difference between a task that succeeds and one that goes off the rails is often just how it's written. "Add pagination to the user list" invites the agent to make dozens of implicit decisions. "Add cursor-based pagination to /api/users. Page size: 25. Use /api/orders as the reference pattern. Do not modify auth middleware or database schema" gives the agent a bounded problem with a clear solution space.
Verification-driven development is the highest-leverage practice. Give the agent a way to verify its own work — a failing test it needs to make pass, a linter configuration it must satisfy, an integration test suite it can run. Multiple teams independently identified this as the single practice that most improved agent output quality. When agents can evaluate their own output, they self-correct.
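What "give the agent a failing test" looks like in practice can be sketched briefly. Here `slugify` is a hypothetical helper the human wants built; the test file is written first and handed over as the success criterion, and the implementation shown is one an agent might converge on:

```python
# test_slugify.py: written by the human BEFORE delegating the task.
# The agent's instruction is simply "make these tests pass without
# editing this file", which gives it a self-checkable target.
import re

def slugify(title: str) -> str:
    """One implementation an agent might produce: lowercase, then
    collapse every run of non-alphanumerics into a single hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def test_basic():
    assert slugify("Hello, World!") == "hello-world"

def test_collapses_punctuation_and_whitespace():
    assert slugify("  Agentic -- Coding  ") == "agentic-coding"
```

The point is not the helper itself but the loop: because the agent can run the tests after each attempt, it evaluates its own output and self-corrects instead of waiting on a human.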
Codebase context files pay compounding returns. Repository-specific instruction files (CLAUDE.md or equivalent) that document style conventions, architecture decisions, common patterns, and explicit anti-patterns dramatically improve output quality. Every successful team using agentic coding tools at scale independently developed some version of this practice. Keep them focused — under 2,500 tokens — and update them when the codebase changes.
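What goes in such a file varies by team, but a sketch might look like this (hypothetical project layout and conventions, shown only to illustrate the shape):

```markdown
# CLAUDE.md — project conventions (keep under ~2,500 tokens)

## Architecture
- API handlers live in `api/`, business logic in `services/`, never mixed.
- All database access goes through `repositories/`; no raw SQL in handlers.

## Style
- Use cursor-based pagination; `api/orders.py` is the reference pattern.
- Errors are returned as structured problem documents, never bare strings.

## Anti-patterns
- Do not add new global state; use the existing dependency container.
- Do not modify auth middleware or the database schema without an
  explicit instruction to do so.
```

The anti-patterns section tends to matter most: it preempts the implicit decisions an agent would otherwise make silently.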
Classify tasks before assigning them. The task types that are safe to delegate freely: scoped migrations with clear before/after states, bug fixes with reproducible test cases, boilerplate generation, documentation, security remediation with explicit specs. The task types that require human ownership: security-critical flows (auth, payments, access control), architectural decisions, compliance features, anything with subjective success criteria, and visual design requiring judgment calls.
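One way to make that classification operational is an allow-list gate in whatever routes tasks to the agent. A sketch with hypothetical task tags (the category names mirror the lists above; nothing here comes from a real tool):

```python
# Tags an incoming task may carry; names are illustrative only.
DELEGATE_FREELY = {"migration", "repro-bug-fix", "boilerplate",
                   "docs", "security-remediation-specced"}
HUMAN_OWNED = {"auth", "payments", "access-control",
               "architecture", "compliance", "visual-design"}

def route(tags: set[str]) -> str:
    """Human-owned tags veto delegation even when safe tags coexist."""
    if tags & HUMAN_OWNED:
        return "human"
    if tags & DELEGATE_FREELY:
        return "agent"
    return "human-review-first"  # conservative default for unknown work
```

The veto ordering is the design choice that matters: a task tagged both "boilerplate" and "auth" is still human-owned, because the risky label dominates.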
The Measurement Problem
One reason the benchmark-to-production gap persists is that most teams don't measure the right things. PR count and lines of code generated are easy to measure and easy to game. What matters for system-level productivity is harder to attribute: deployment frequency, change failure rate, mean time to restore, and review cycle time.
Faros AI's finding — that high-AI teams merged 98% more PRs but saw review time increase 91% and no improvement in delivery performance — illustrates the risk of optimizing for the visible metric. More PRs merged looks like a win until you realize each one requires 91% more human review time and introduces more bugs.
Useful signals for evaluating coding agent ROI: how often do agent-generated PRs get merged without revision versus require significant rework? What fraction of agent-generated code passes security review without findings? How does change failure rate compare for agent-generated versus human-written code? These are harder to collect but reflect actual value delivered.
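These signals can be computed from PR metadata you likely already have. A sketch, assuming a hypothetical record shape with `author_type`, `revision_count`, and `caused_incident` fields (your review tooling's actual schema will differ):

```python
from dataclasses import dataclass

@dataclass
class PR:
    author_type: str        # "agent" or "human"
    merged: bool
    revision_count: int     # review rounds that required code changes
    caused_incident: bool   # linked to a post-deploy failure

def clean_merge_rate(prs: list[PR], author_type: str) -> float:
    """Fraction of merged PRs that needed no revision after review."""
    merged = [p for p in prs if p.merged and p.author_type == author_type]
    if not merged:
        return 0.0
    return sum(p.revision_count == 0 for p in merged) / len(merged)

def change_failure_rate(prs: list[PR], author_type: str) -> float:
    """Fraction of merged PRs linked to a production incident."""
    merged = [p for p in prs if p.merged and p.author_type == author_type]
    if not merged:
        return 0.0
    return sum(p.caused_incident for p in merged) / len(merged)
```

Comparing these two numbers for agent-authored versus human-authored PRs is a far better ROI signal than raw merge counts, because it prices in the rework and failure costs that inflate throughput metrics.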
What This Means in Practice
The honest picture of agentic coding in 2026 is not "AI replaces engineers" and not "AI tools are useless hype." It's more specific than either.
AI coding agents reliably accelerate a subset of engineering work — the part that's mechanical, well-specified, and verifiable. For that subset, the gains are real and sometimes dramatic: 10x on migrations, 20x on vulnerability remediation with clear specs. The gains compound when you design your workflows to maximize that subset.
AI coding agents reliably struggle with another subset — the work requiring system-level understanding, business domain judgment, security sensitivity, or cross-cutting architectural awareness. For that subset, agents don't just fail to help; they can actively create work in the form of subtle bugs, security vulnerabilities, and code that looks correct but isn't.
The teams getting the most value from these tools are doing two things: they've developed rigorous task classification to route work appropriately, and they've built infrastructure — structured task specs, codebase context files, automated verification — that matches agent capability to task requirements.
The teams getting the least value are applying agents uniformly across all task types and measuring success by throughput rather than quality. That path leads to the Faros paradox: more output, no improvement in delivery performance, and a code review bottleneck that absorbs all the time saved on generation.
SWE-bench at 80% tells you that AI systems can solve well-defined bug fixes in public Python repos. SWE-bench Pro at 23% tells you what happens when the problems get harder. The real number for your production codebase is somewhere in between, and finding it requires measurement rather than faith in benchmark scores.
- https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/
- https://labs.scale.com/leaderboard/swe_bench_pro_public
- https://epoch.ai/blog/what-skills-does-swe-bench-verified-evaluate/
- https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/
- https://arxiv.org/abs/2507.09089
- https://www.faros.ai/blog/ai-software-engineering
- https://survey.stackoverflow.co/2025/ai/
- https://cognition.ai/blog/devin-annual-performance-review-2025
- https://www.anthropic.com/news/how-anthropic-teams-use-claude-code
- https://daplab.cs.columbia.edu/general/2026/01/08/9-critical-failure-patterns-of-coding-agents.html
- https://www.veracode.com/blog/genai-code-security-report/
- https://codegen.com/blog/how-to-build-agentic-coding-workflows/
- https://github.blog/ai-and-ml/github-copilot/github-copilot-coding-agent-101-getting-started-with-agentic-workflows-on-github/
- https://stackoverflow.blog/2026/01/28/are-bugs-and-incidents-inevitable-with-ai-coding-agents/
- https://arxiv.org/abs/2509.16941
