The Jagged Frontier: Why AI Fails at Easy Things and What It Means for Your Product

· 10 min read
Tian Pan
Software Engineer

A common assumption in AI product development goes something like this: if a model can handle a hard task, it can definitely handle an easier one nearby. This assumption is wrong, and it's responsible for a category of production failures that no amount of benchmark reading prepares you for.

The research term for the underlying phenomenon is the "jagged frontier" — AI's capability boundary isn't a smooth line that hard tasks sit outside of and easy tasks sit inside. It's a ragged, unpredictable shape. AI systems can write production-grade database query optimizers and still miscalculate whether two line segments on a diagram intersect. They can pass PhD-level science exams and fail children's riddle questions that involve spatial relationships. They can synthesize 50-page documents and then confidently hallucinate a summary of a paragraph they just read.

This jaggedness isn't a bug that will be patched in the next release. It reflects something structural about how these models learn, and it has direct consequences for how you should design, test, and ship AI-powered features.

Where the Concept Comes From

The term "jagged technological frontier" comes from a field experiment by researchers from Harvard Business School, MIT, and Wharton, published in Organization Science in 2025. The study enrolled 758 knowledge workers from Boston Consulting Group and had them complete realistic management consulting tasks — market analysis, synthesizing research, writing reports.

The finding everyone quotes: AI-assisted consultants completed tasks 25% faster, produced outputs rated 40% higher quality, and improved their success rate by 12.5 points. But the finding that matters for product teams is the second one: for tasks that fell outside the AI's capability frontier, consultants who used AI anyway were 19 percentage points less likely to produce correct solutions than consultants who worked without AI at all.

AI didn't just fail to help on those tasks. It made experienced professionals perform worse than they would have alone. The confidence and fluency of AI output masked the incorrectness.

What "Jagged" Actually Means in Practice

The capability frontier is not defined by task complexity in any way humans intuitively understand. The jaggedness comes from the distribution of training data, the nature of the objective function, and the specific failure modes of next-token prediction.

A few examples that illustrate the shape:

Where AI routinely exceeds human expert performance:

  • Writing, editing, and business ideation (AI-generated startup ideas rated better than business school students' by independent judges)
  • Emotional support and reappraisal (performs better than 85% of humans in controlled studies)
  • Reading comprehension on well-formatted documents
  • Code generation for common patterns (state-of-the-art agents now exceed 80% on SWE-bench)
  • Mathematical competition problems (o1-preview outperformed GPT-4o by 43 points on AIME 2024)

Where AI fails in ways that surprise practitioners:

  • Spatial reasoning: models that generate flawless geometry proofs fail when asked whether two rendered lines cross on an actual image
  • Sequential planning: calendar scheduling, maze navigation, and constraint satisfaction problems show minimal improvement even with extended reasoning
  • Encoding and format edge cases: both o1 and o3 fail silently on CSVs with hidden encoding issues
  • Visual perception of thin or suspended objects: Waymo recalled 1,212 fifth-generation autonomous vehicles after collisions with thin barriers — chains, utility poles, suspended gates — that its perception stack couldn't reliably detect despite excellent performance on standard obstacles

The pattern is not random. Tasks that have abundant, consistent training examples tend to sit inside the frontier. Tasks that require grounded, physical-world reasoning, or careful sequential logic without shortcuts, tend to sit outside it. But the exact shape is not predictable from first principles, which is the core problem for product teams.

The Product Design Traps

Once you understand the jagged frontier, several failure modes in AI product design become predictable.

The coherence assumption trap. Users who see a system handle a hard task conclude it can handle easier adjacent tasks. This is rational behavior in a world of human expertise, where capability is relatively smooth. It's wrong for AI. A writing assistant that produces excellent long-form analysis will not necessarily produce a reliable short summary of the same text. A code agent that passes complex algorithm challenges will still introduce subtle off-by-one errors in simple array traversal. Users who have learned to trust the system's hard-task output will extend that trust inappropriately and catch errors later — sometimes much later.

The over-reliance cascade. In the Harvard study, the damage from over-reliance wasn't that people used AI on tasks it was bad at. It was that AI's confident, fluent output suppressed the human's independent judgment. An experienced consultant who would have caught an error while working solo missed it because the AI's answer sounded authoritative. This is a UX design problem, not just a model problem. When your interface presents AI output as finished work rather than a draft requiring judgment, you're designing for over-reliance.

The under-utilization shadow. The same jaggedness that causes over-trust in some domains causes under-trust in others. Teams that have witnessed AI fail at tasks they expected it to handle will avoid AI for superficially similar tasks — including ones where AI would actually outperform the team. The capability cliff teaches the wrong lesson when users can't distinguish where the frontier actually runs.

The McDonald's problem. McDonald's deployed AI voice ordering at over 100 drive-throughs. The system performed adequately in controlled conditions. In production, background noise, regional accents, and edge-case orders pushed it outside the frontier. It placed absurdly incorrect orders — adding hundreds of dollars of incorrect items, making substitutions that didn't exist in the request — and the failures went viral. The program was pulled in July 2024. The error wasn't deploying AI voice ordering. The error was not knowing where the frontier was for that specific environment before shipping at scale.

How to Map the Frontier Before You Ship

Most teams don't map the capability frontier before shipping because they don't have a systematic way to do it. Here's a practical framework.

Task decomposition is the starting point. Decompose your feature's workflow into discrete subtasks. A document summarization feature might decompose into: extract key claims, verify factual consistency, condense without distortion, format for the target audience. Each subtask has a different position relative to the frontier. Don't evaluate the end-to-end output alone — that conflates tasks that are inside the frontier with tasks that are outside it, and you won't know which is which until a user finds the failure.
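One way to make per-subtask evaluation concrete is to give each subtask its own cases and scorer. This is a minimal sketch, not a real eval framework; the subtask name and the `fake_model` stand-in are illustrative assumptions:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Subtask:
    """One discrete step of the feature's workflow, scored in isolation."""
    name: str
    cases: list          # (input, expected) pairs
    score: Callable      # (model_output, expected) -> bool

def evaluate(subtask: Subtask, run_model: Callable) -> float:
    """Accuracy on one subtask alone, not the end-to-end pipeline."""
    hits = sum(subtask.score(run_model(x), want) for x, want in subtask.cases)
    return hits / len(subtask.cases)

# Stand-in "model" for the sketch; real use would call your LLM here.
fake_model = str.upper

extract_claims = Subtask(
    name="extract_key_claims",
    cases=[("revenue grew", "REVENUE GREW"), ("costs fell", "COSTS FELL")],
    score=lambda got, want: got == want,
)
print(evaluate(extract_claims, fake_model))  # 1.0
```

Scoring each subtask separately is what tells you which step of the pipeline crossed the frontier when the end-to-end output looks wrong.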

Build a capability inventory, not a demo. The demo is always the best case. The capability inventory is systematic: for each subtask, build at least 20 realistic test cases, drawn from messy real-world examples, not cleaned-up showcases. Run the model. Measure accuracy, not impression. Track which subtask types generate confident-but-wrong outputs — these are your frontier crossings.
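A capability inventory can track confident-but-wrong outputs explicitly. In this sketch the confidence signal is an assumption (a stand-in for logprobs or a self-reported score), as is the `stub_model`:

```python
def capability_inventory(cases, run_model, conf_threshold=0.8):
    """Accuracy plus the rate of confident-but-wrong answers, the
    signature of a frontier crossing, for one subtask's eval set."""
    correct = confident_wrong = 0
    for inp, want in cases:
        answer, confidence = run_model(inp)
        if answer == want:
            correct += 1
        elif confidence >= conf_threshold:
            confident_wrong += 1
    n = len(cases)
    return {"accuracy": correct / n, "confident_wrong": confident_wrong / n}

# Stub model: wrong on "2+2" but reports high confidence.
def stub_model(inp):
    return ("5", 0.95) if inp == "2+2" else ("ok", 0.9)

report = capability_inventory([("2+2", "4"), ("greet", "ok")], stub_model)
print(report)  # {'accuracy': 0.5, 'confident_wrong': 0.5}
```

A high `confident_wrong` rate on a subtask is the strongest signal you have found a frontier crossing rather than ordinary noise.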

Specifically probe for the failure modes that aren't obvious. Spatial and visual reasoning, sequential multi-step planning, format and encoding edge cases, and tasks involving temporal reasoning with real dates are disproportionately likely to sit outside the frontier even when the model performs well on nearby tasks. Build explicit tests for these even when your main use case doesn't seem to involve them. They appear as side effects more than as primary tasks.
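One lightweight way to enforce this is a checklist that your eval set is audited against. The probe prompts below are made-up illustrations; the point is the coverage check, not the specific wording:

```python
# Probe categories disproportionately likely to sit outside the frontier,
# each with an illustrative (invented) probe.
FRONTIER_PROBES = {
    "spatial": "Do segments (0,0)-(2,2) and (0,2)-(2,0) intersect?",
    "planning": "Fit meetings A-D into two rooms given conflicts A/B and B/C.",
    "encoding": "Sum the second column of this UTF-16 CSV that has a BOM.",
    "temporal": "How many days fall between 2024-02-27 and 2024-03-02?",
}

def missing_probe_categories(eval_set):
    """Frontier-prone categories your eval set doesn't cover yet."""
    covered = {case["category"] for case in eval_set}
    return sorted(set(FRONTIER_PROBES) - covered)

print(missing_probe_categories([{"category": "spatial"}]))
# ['encoding', 'planning', 'temporal']
```

Running this as a CI check keeps the non-obvious failure modes from quietly dropping out of your eval set as it evolves.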

Test on realistic, messy inputs. This is not the same as testing on hard inputs. Frontier crossings often happen not because the task is hard but because the input is slightly unusual — a different file encoding, an unusual name format, a multi-sentence question instead of a single-sentence one. Your eval set needs to include inputs that would be trivially handled by a human but represent edge cases in the training distribution.
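Messy variants can be generated mechanically from each clean case. The specific perturbations below are examples of the kind of human-trivial, distribution-unusual noise worth testing, not an exhaustive list:

```python
def messy_variants(text: str):
    """Perturbations trivial for a human reader but slightly off the
    training distribution: exactly the inputs that expose frontier edges."""
    yield text.replace(" ", "\u00a0")        # non-breaking spaces
    yield "\ufeff" + text                    # stray byte-order mark
    yield text.upper()                       # unusual casing
    yield f"Quick question. {text} Thanks!"  # multi-sentence wrapper

variants = list(messy_variants("summarize this memo"))
print(len(variants))  # 4
```

Each clean eval case then fans out into several messy cases, and a model that passes the clean case but fails a variant has shown you where the frontier runs.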

Treat the frontier as dynamic. Models update. The frontier shifts — usually shrinking as bottleneck capabilities improve, but occasionally shifting in ways that move previously-inside tasks to the outside. The frontier you mapped last quarter is not necessarily the frontier today.
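Detecting frontier movement can be as simple as diffing per-subtask scores across model versions. A sketch, with invented subtask names and scores:

```python
def frontier_drift(old_scores, new_scores, tol=0.05):
    """Subtasks whose accuracy moved more than `tol` since the last mapping."""
    return {
        task: round(new_scores.get(task, 0.0) - old, 3)
        for task, old in old_scores.items()
        if abs(new_scores.get(task, 0.0) - old) > tol
    }

last_quarter = {"summarize": 0.92, "extract_figures": 0.60}
today = {"summarize": 0.93, "extract_figures": 0.41}
print(frontier_drift(last_quarter, today))  # {'extract_figures': -0.19}
```

Re-running the capability inventory after every model update and alerting on drift turns frontier mapping into the ongoing process it needs to be.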

Designing Around the Frontier You Found

Once you know where the frontier runs in your feature set, you have two structural choices: design around the capability cliffs, or design verification into the cliff-crossing points.

Designing around means routing. Tasks that fall outside the frontier reliably should not be handed to the AI alone. The centaur pattern — a human-AI collaboration model where the human and AI each own the tasks they're better at, with a clear handoff boundary — outperforms both full automation and no automation in the research. This requires knowing the boundary, marking it explicitly in your product design, and making it easy for the human to pick up at that point without having to reconstruct context.
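The routing side can be sketched as a dispatch on the mapped frontier. The task-type names and the inside-frontier set are assumptions standing in for the output of your own capability inventory:

```python
# Task types your capability inventory placed inside the frontier (assumed).
INSIDE_FRONTIER = {"summarize", "draft_reply"}

def route(task_type: str, payload: dict):
    """Centaur handoff: the AI owns inside-frontier tasks; everything else
    goes to a human queue with context attached, not silently to the model."""
    owner = "ai" if task_type in INSIDE_FRONTIER else "human"
    return {"owner": owner, "task": task_type, "context": payload}

print(route("summarize", {"doc": "q3 memo"})["owner"])        # ai
print(route("extract_figures", {"doc": "q3 memo"})["owner"])  # human
```

Attaching the payload to the human handoff is the important design choice: the person picking up the task shouldn't have to reconstruct context from scratch.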

Designing verification in means friction. At cliff-crossing points — tasks where the model is capable most of the time but fails silently when it doesn't — the product needs to surface the output for human review rather than presenting it as final. The key design principle is reducing the cognitive load of verification: the human should be able to spot errors quickly, not validate from scratch. Diff views, confidence indicators, citations back to source material, and forced acknowledgment flows all serve this function. The goal is not to make humans verify everything. It's to make verification fast and reliable at the specific points where the frontier is ragged.
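At the interface layer, this amounts to gating output by task type rather than presenting everything uniformly. A sketch, where the cliff-crossing set and field names are illustrative assumptions:

```python
# Tasks the mapping flagged as cliff-crossings: usually right, fails silently.
CLIFF_CROSSINGS = {"extract_figures"}

def present(task_type: str, output: str, sources: list):
    """Gate cliff-crossing output as a reviewable draft with citations,
    so verification is a quick spot-check rather than work from scratch."""
    if task_type in CLIFF_CROSSINGS:
        return {"status": "needs_review", "draft": output, "sources": sources}
    return {"status": "final", "draft": output, "sources": sources}

result = present("extract_figures", "Q3 revenue: $4.2M", ["memo p.3"])
print(result["status"])  # needs_review
```

Carrying the source citations through to the review surface is what makes the spot-check fast: the reviewer jumps to the cited passage instead of re-deriving the figure.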

Don't hide the frontier from users. There's a strong product management instinct to minimize the visible limitations of AI features. This instinct produces over-reliance. Users who understand that the system is excellent at synthesis but unreliable at extracting exact figures will use it appropriately. Users who believe the system is uniformly capable will not catch the figure extraction errors until the downstream consequences appear. Transparency about the frontier isn't a weakness in your product narrative. It's the mechanism that keeps your users' trust calibrated to reality.

The Frontier Is Narrowing, But It Isn't Going Away

Ethan Mollick, who helped coin the jagged frontier concept, has observed that the frontier is narrowing — tasks that clearly illustrated AI's gaps twelve months ago have largely been resolved. o3 and Gemini 2.5 handle tasks that would have required careful human judgment when the original frontier research was published.

This creates a natural temptation to treat the jagged frontier as a temporary problem. It isn't. The frontier is structural. As models become capable enough to handle the tasks that currently define the frontier's rough edges, new capabilities develop and new rough edges appear. The specific shape changes; the jaggedness persists.

The implication for engineers is that frontier mapping is not a one-time activity before launch. It's an ongoing operational process, the same way you'd continuously monitor model output quality or latency. The frontier you're shipping against today will be different from the frontier your feature runs against after the next major model update. Build the measurement infrastructure to detect when it moves.

The teams that ship AI products safely are not the ones who assume capability is coherent. They're the ones who build the discipline of finding out exactly where it isn't.
