Agentic Coding in Production: What SWE-bench Scores Don't Tell You
When a frontier model scores 80% on SWE-bench Verified, it sounds like a solved problem. Four out of five real GitHub issues, handled autonomously. Ship it to your team. Except: that same model scores 23% on SWE-bench Pro, a benchmark designed to resist contamination and built around long-horizon tasks, some of them from proprietary codebases. And a rigorous controlled study of experienced developers found that using AI coding tools made them 19% slower, not faster.
These numbers aren't contradictions. They're the gap between what benchmarks measure and what production software engineering actually requires. If you're building or buying into agentic coding tools, that gap is the thing worth understanding.
Why SWE-bench Stopped Meaning Anything
SWE-bench Verified became the de facto standard for evaluating AI coding agents. It's a reasonable benchmark on paper: take real GitHub issues from open-source Python repos, give the agent the issue text and codebase, and measure whether its fix passes the existing test suite.
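To make the scoring rule concrete, here is a minimal sketch of how a pass/fail verdict per task rolls up into a headline percentage. It is a simplification, not the official harness: the `fail_to_pass` and `pass_to_pass` arguments mirror the FAIL_TO_PASS and PASS_TO_PASS test lists in the SWE-bench dataset, while the function names and the three-task run are hypothetical.

```python
from typing import Dict, List


def is_resolved(
    test_results: Dict[str, bool],
    fail_to_pass: List[str],
    pass_to_pass: List[str],
) -> bool:
    """A task counts as resolved only if every required test passes.

    fail_to_pass: tests that fail before the fix and must pass after it.
    pass_to_pass: tests that already passed and must not regress.
    A test missing from test_results is treated as a failure.
    """
    required = fail_to_pass + pass_to_pass
    return all(test_results.get(name, False) for name in required)


def resolve_rate(outcomes: List[bool]) -> float:
    """Fraction of tasks resolved; 0.0 for an empty run."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# Hypothetical run over three tasks (test names are made up).
outcomes = [
    # Fix works and nothing regresses.
    is_resolved({"test_bug": True, "test_old": True}, ["test_bug"], ["test_old"]),
    # Fix works but breaks an existing test.
    is_resolved({"test_bug": True, "test_old": False}, ["test_bug"], ["test_old"]),
    # Target test still fails.
    is_resolved({"test_bug": False, "test_old": True}, ["test_bug"], ["test_old"]),
]
print(f"resolve rate: {resolve_rate(outcomes):.0%}")
```

Note what the verdict depends on: only the listed tests. A patch that is functionally correct but trips a flawed test counts as a failure, and a memorized patch counts as a success.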
The problem is what happened to scores over 18 months. In August 2024, top scores were around 20%. By early 2026, multiple systems exceeded 80%. That jump does not mean AI coding capability increased fourfold. Much of it reflects training-data contamination and harness engineering rather than better models.
OpenAI retired SWE-bench Verified as their frontier evaluation, stating plainly that "improvements no longer reflect meaningful improvements in models' real-world software development abilities, but increasingly reflect how much the model was exposed to the benchmark at training time." The benchmark's problems are public, the solutions exist on GitHub, and any model trained on GitHub data after mid-2024 has likely seen a substantial portion of them. At least 59% of audited problems have flawed test cases that reject functionally correct submissions. Frontier models can reproduce the original human-written bug fixes verbatim.
Scaffolding confounds the scores as well. The same model can score 69% standalone or 81% with a sophisticated agent harness that retries failures and explores files iteratively. A single leaderboard number cannot separate model capability from system engineering.
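A retry loop is the simplest version of such a harness. The sketch below is an illustration under stated assumptions, not any vendor's implementation: `propose_patch` and `run_tests` are hypothetical callables standing in for the model and the test runner, and the stub model is rigged to succeed only after it has seen a failure log.

```python
from typing import Callable, List, Optional, Tuple


def solve_with_retries(
    propose_patch: Callable[[str, List[str]], str],
    run_tests: Callable[[str], Tuple[bool, str]],
    issue: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Return the first patch that passes the tests, or None if all attempts fail."""
    feedback: List[str] = []  # failure logs from earlier attempts
    for _ in range(max_attempts):
        patch = propose_patch(issue, feedback)
        passed, log = run_tests(patch)
        if passed:
            return patch
        feedback.append(log)  # give the model the failure output on the next try
    return None


# Hypothetical stand-ins: the "model" only gets it right once it has seen a failure.
def stub_model(issue: str, feedback: List[str]) -> str:
    return "good patch" if feedback else "bad patch"


def stub_tests(patch: str) -> Tuple[bool, str]:
    ok = patch == "good patch"
    return ok, "" if ok else "test_x failed"


single_shot = solve_with_retries(stub_model, stub_tests, "issue text", max_attempts=1)
with_harness = solve_with_retries(stub_model, stub_tests, "issue text", max_attempts=3)
print(single_shot, with_harness)
```

The model is identical in both calls. Only the harness changed, and the task flips from failed to solved.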
Scale AI's SWE-bench Pro attempts to fix this with 1,865 tasks requiring multi-step reasoning across proprietary and held-out codebases. Top frontier models score 23% on it. That is much closer to the real number. The gap between 80% and 23% is what benchmark saturation looks like.
A separate analysis from Epoch AI found that 87% of SWE-bench problems are bug fixes, over 80% come from five Python repositories, half the issues predate 2020, and the median task is something an experienced engineer could complete in under an hour. Multi-file changes, architectural decisions, and ambiguous requirements — the majority of real engineering work — are largely absent.
What Controlled Studies Found
The most rigorous real-world evaluation to date came from METR, an AI safety research organization. They recruited 16 experienced developers who actively maintain large open-source repositories (averaging 1 million lines of code, 22,000 GitHub stars). Developers completed 246 tasks averaging two hours each, using Cursor Pro with Claude 3.5/3.7 Sonnet.
Result: AI use increased task completion time by 19%.
The perception gap is what makes this data particularly valuable. Before the study, developers predicted AI would save them 24% of time. After completing all 246 tasks with objective time measurements, they still believed it had saved them 20%. The subjective experience of using AI tools — the feeling of moving faster, having suggestions generated instantly — creates a persistent illusion of productivity even when the objective data runs the opposite direction.
Developers accepted fewer than 44% of AI-generated code suggestions. Much of their time was spent understanding what the agent had done, verifying it was correct, and cleaning up outputs that didn't fit the existing codebase's conventions.
This is a result about a specific condition: experienced developers, working in mature and complex codebases, on tasks requiring deep familiarity with existing code. The authors explicitly state it doesn't generalize to all contexts. But it's the condition that high-stakes production engineering most resembles.
Faros AI analyzed production data across real organizations and found a parallel paradox. Developers in high-AI-adoption teams merged 98% more pull requests. But PR review time increased 91%, PRs grew 154% larger on average, and bugs per developer increased 9%. The net result: no measurable company-level performance improvement. Individual velocity increased while system throughput stagnated, because code review and testing pipelines can't match AI-accelerated generation rates.
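A toy queue model shows why faster generation alone does not raise delivery. It assumes a fixed downstream capacity for review and testing; the weekly figures are invented for illustration, and only the 98% increase in upstream output is borrowed from the Faros numbers above.

```python
from typing import Tuple


def simulate(weeks: int, produced_per_week: int, downstream_capacity: int) -> Tuple[int, int]:
    """Return (total delivered, backlog left) after the given number of weeks.

    Each week, new changes join the backlog and the downstream stage
    (review, testing) clears at most `downstream_capacity` of them.
    """
    backlog = 0
    delivered_total = 0
    for _ in range(weeks):
        backlog += produced_per_week
        delivered = min(backlog, downstream_capacity)
        backlog -= delivered
        delivered_total += delivered
    return delivered_total, backlog


# Illustrative numbers: capacity of 120 changes per week is an assumption.
baseline = simulate(weeks=4, produced_per_week=100, downstream_capacity=120)
with_ai = simulate(weeks=4, produced_per_week=198, downstream_capacity=120)  # +98% output

print("baseline (delivered, backlog):", baseline)
print("with AI  (delivered, backlog):", with_ai)
```

In this sketch, upstream output nearly doubles while delivery rises only to the capacity ceiling, and the rest piles up as a growing backlog. Longer waits in review are what that backlog looks like from the inside.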
Where Agents Actually Deliver
None of this means coding agents don't work. It means they don't work uniformly across task types. The pattern is consistent across every data source that's honest about results.
Security vulnerability remediation with clear specs is a genuine win. Cognition reports that its agent, Devin, remediates vulnerabilities in 1.5 minutes versus 30 minutes for human engineers, a 20x improvement. The key precondition: the vulnerability type is known and the remediation pattern is well-defined. An agent can apply "add input validation here, use parameterized queries there" at machine speed when it knows exactly what to fix and how.
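For a sense of how mechanical this kind of fix is, here is a small example of the parameterized-query pattern using Python's standard `sqlite3` module. The table, function names, and payload are hypothetical and chosen only for illustration.

```python
import sqlite3


def find_user_unsafe(conn: sqlite3.Connection, name: str) -> list:
    # Vulnerable: user input is concatenated directly into the SQL text.
    query = f"SELECT id, name FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn: sqlite3.Connection, name: str) -> list:
    # Remediated: the value is bound as a parameter and never parsed as SQL.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

payload = "x' OR '1'='1"  # classic injection string

print(find_user_unsafe(conn, payload))  # the injected condition matches every row
print(find_user_safe(conn, payload))    # no user has this literal name
```

The before and after differ by one line, the transformation is the same at every call site, and a test with a known payload verifies the result. That is the task shape agents handle well.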
Code migrations and modernization show similar gains. Teams migrating Java applications to newer versions or moving between data frameworks report 10x to 14x speed improvements. The task structure is what makes this work: clear starting state, clear ending state, mechanical transformation pattern, verifiable output. The agent doesn't need to make judgment calls.
Boilerplate and scaffolding work well for the same reason. CRUD layers, API route handlers, serializer classes, test fixtures — these follow established patterns that agents can apply reliably. The code is low-risk and easy to verify.
Bug fixes with reproducible test cases are reliably solvable. When there's a failing test and a clear hypothesis about the cause, agents can iterate through fixes efficiently. Devin's PR merge rate improved from 34% to 67% over 18 months. That still leaves one in three rejected, but it is a genuine improvement. Cognition reports that roughly one-third of commits to its own internal application come from the agent.
The common thread: narrow scope, clear success criteria, verifiable output, mechanical transformation. These are the conditions where benchmark evaluation and real-world performance actually converge.
- https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/
- https://labs.scale.com/leaderboard/swe_bench_pro_public
- https://epoch.ai/blog/what-skills-does-swe-bench-verified-evaluate/
- https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/
- https://arxiv.org/abs/2507.09089
- https://www.faros.ai/blog/ai-software-engineering
- https://survey.stackoverflow.co/2025/ai/
- https://cognition.ai/blog/devin-annual-performance-review-2025
- https://www.anthropic.com/news/how-anthropic-teams-use-claude-code
- https://daplab.cs.columbia.edu/general/2026/01/08/9-critical-failure-patterns-of-coding-agents.html
- https://www.veracode.com/blog/genai-code-security-report/
- https://codegen.com/blog/how-to-build-agentic-coding-workflows/
- https://github.blog/ai-and-ml/github-copilot/github-copilot-coding-agent-101-getting-started-with-agentic-workflows-on-github/
- https://stackoverflow.blog/2026/01/28/are-bugs-and-incidents-inevitable-with-ai-coding-agents/
- https://arxiv.org/abs/2509.16941
