
The Intent Gap: When Your LLM Answers the Wrong Question Perfectly

Tian Pan
Software Engineer

Intent misalignment is the single largest failure category in production LLM systems, responsible for 32% of all unsatisfactory responses according to a large-scale analysis of real user interactions. It's not hallucination, not refusal, not format errors. It's models answering a question correctly while missing entirely what the user actually needed.

This is the intent gap: the distance between what a user says and what they mean. It's invisible to most eval suites, invisible to error logs, and invisible to the users themselves until they've wasted enough cycles to realize the output was technically right but practically useless.

The Four Layers of What a User Actually Wants

The most useful framework for thinking about intent comes from how leading AI labs have trained their models to parse user requests. There are four distinct layers to any user input:

Immediate desires are the literal request — what the user typed. Fix this bug. Summarize this document. Write a regex for email validation.

Final goals are the deeper motivation. The user asking to fix a bug wants a working program. They want to ship by Friday. They want to not be paged again at 3am.

Background desiderata are the unstated constraints that the user assumes you know: don't change my language from Python to Ruby. Don't rewrite the entire function because you found a cleaner approach. Keep the existing API surface.

Autonomy is the user's right to make their own decisions, even when you disagree. If a user asks you to implement something in a way you consider suboptimal, the right response isn't silent substitution.

Production LLMs are excellent at immediate desires. They are mediocre at final goals. They frequently violate background desiderata. The result is outputs that land in an uncanny valley: technically responsive, practically unhelpful.
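To make the four layers concrete, here is a minimal sketch of how a request could be recorded with each layer as an explicit field. The `RequestIntent` class, its field names, and the preamble wording are all hypothetical illustrations, not something prescribed by the framework described above.

```python
from dataclasses import dataclass, field


@dataclass
class RequestIntent:
    """Hypothetical container for the four layers of a user request."""

    immediate_desire: str  # the literal request, as typed
    final_goal: str = ""  # the deeper motivation, if known
    # Unstated constraints the user assumes you know.
    background_desiderata: list[str] = field(default_factory=list)
    # Choices the user has made that should not be silently replaced (autonomy).
    user_decisions: list[str] = field(default_factory=list)

    def missing_layers(self) -> list[str]:
        """Return the names of layers with nothing recorded yet."""
        missing = []
        if not self.final_goal:
            missing.append("final_goal")
        if not self.background_desiderata:
            missing.append("background_desiderata")
        return missing

    def to_prompt_preamble(self) -> str:
        """Render the recorded layers as plain text, one item per line."""
        lines = [f"Request: {self.immediate_desire}"]
        if self.final_goal:
            lines.append(f"Underlying goal: {self.final_goal}")
        for constraint in self.background_desiderata:
            lines.append(f"Constraint: {constraint}")
        for decision in self.user_decisions:
            lines.append(f"User decision (do not silently override): {decision}")
        return "\n".join(lines)


intent = RequestIntent(
    immediate_desire="Fix this bug",
    final_goal="A working program that ships by Friday",
    background_desiderata=["Stay in Python", "Keep the existing API surface"],
    user_decisions=["Implement it the way the user asked"],
)
print(intent.to_prompt_preamble())
```

A request that arrives with only the first field filled in, which is the common case, reports `final_goal` and `background_desiderata` as missing. That is exactly the information a literal-minded response never looks for.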

The Numbers Are Worse Than You'd Expect

Intent misalignment isn't a minor edge case. A 2023 study analyzing real ChatGPT dissatisfaction patterns found it accounted for 32.18% of all negative responses — the single largest category, ahead of factual errors and refusals. The average user dissatisfaction score for intent failures was 5.56 out of 10, and only 67% of these failures were resolved even after user correction.

Multi-turn conversations make it dramatically worse. A 2026 study on multi-turn intent found LLMs suffer approximately a 30% performance drop in multi-turn versus single-turn settings, consistently across model families and sizes. This isn't a capability problem that newer models solve. The researchers described it as structural: the longer a conversation runs, the more context accumulates, and the more likely the model is to follow the most recent statement rather than maintaining coherent awareness of the underlying goal.
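One way to picture the drift described above is to look at what a context window contains once older turns are trimmed. The sketch below is an illustrative assumption, not a method from the cited study: the `build_context` helper, its `max_recent` parameter, and the sample turns are hypothetical. It keeps a stated goal pinned at the top while only the most recent turns survive.

```python
def build_context(goal: str, turns: list[str], max_recent: int = 3) -> list[str]:
    """Keep the stated goal first, followed by only the most recent turns."""
    # A slice of turns[-0:] would return the whole list, so handle zero explicitly.
    recent = turns[-max_recent:] if max_recent > 0 else []
    return [f"Underlying goal: {goal}"] + recent


turns = [
    "user: the export job fails on large files",
    "assistant: it looks like a timeout",
    "user: raise the timeout then",
    "assistant: done, timeout raised",
    "user: also rename the config key",
]
context = build_context("Make the export job reliable for large files", turns)
for line in context:
    print(line)
```

Without the pinned first line, the trimmed context would contain only the timeout and rename requests, and nothing about why they were being made.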

State-of-the-art models achieve only 40% success on benchmarks measuring proactive problem-solving — the ability to identify and address the underlying issue without being explicitly asked. Salesforce's AI deployments in customer experience showed single-turn success rates of 58% that collapsed to 35% in multi-turn resolution tasks.

Coding agents show a specific version of this. Misclassification rates for user intent rise from 11.7% with 5 possible intents to 30% with 160 intents. That is a 32-fold increase in intents for roughly a 2.6-fold increase in error rate: well short of linear, but a steady degradation as system complexity grows.

The XY Problem, Reproduced at Scale

Software engineers have a name for a specific variant of the intent gap: the XY problem. The user encounters problem X, develops a hypothesis that Y would solve it, then asks about Y — never mentioning X. A good engineer notices the mismatch and asks what you're actually trying to do. LLMs do not.

A documented example: two frontier models were each asked to "echo the last three characters of a filename." Both did exactly that. Neither questioned whether the actual need was to extract the file extension, a semantically adjacent but structurally different operation that would handle edge cases correctly. The user got what they asked for, not what they needed.
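The difference between the two readings is easy to see in code. The sketch below is in Python rather than whatever shell the original request targeted, and the function names and sample filenames are hypothetical. It contrasts the literal request with the likely underlying need.

```python
from pathlib import PurePosixPath


def last_three_chars(filename: str) -> str:
    """Literal reading of the request: the final three characters."""
    return filename[-3:]


def file_extension(filename: str) -> str:
    """Likely underlying need: the extension, without the leading dot."""
    return PurePosixPath(filename).suffix.lstrip(".")


for name in ["report.pdf", "photo.jpeg", "archive.tar.gz", "Makefile"]:
    print(
        f"{name!r}: last three = {last_three_chars(name)!r}, "
        f"extension = {file_extension(name)!r}"
    )
```

The two agree on `report.pdf` and diverge on everything else: `photo.jpeg` yields `peg` versus `jpeg`, `archive.tar.gz` yields `.gz` versus `gz`, and `Makefile` yields `ile` versus an empty string. A test suite built only from three-letter extensions would pass either implementation.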

This failure mode scales across entire codebases. When coding agents operate over many turns, each individually reasonable decision accumulates into what one practitioner called "slop creep" — gradual architectural degradation caused by an agent that has no memory of design history and no model of long-term consequences. Every change is technically defensible. The result is a codebase that drifts.

The underlying cause is that modern LLMs are optimized to be helpful and productive. Being helpful means answering questions. Being productive means not asking unnecessary clarifying questions. These incentives point directly at the intent gap.
