
The Intent Gap: When Your LLM Answers the Wrong Question Perfectly

· 9 min read
Tian Pan
Software Engineer

Intent misalignment is the single largest failure category in production LLM systems — responsible for 32% of all dissatisfactory responses, according to a large-scale analysis of real user interactions. It's not hallucination, not refusal, not format errors. It's models answering a question correctly while missing entirely what the user actually needed.

This is the intent gap: the distance between what a user says and what they mean. It's invisible to most eval suites, invisible to error logs, and invisible to the users themselves until they've wasted enough cycles to realize the output was technically right but practically useless.

The Four Layers of What a User Actually Wants

The most useful framework for thinking about intent comes from how leading AI labs have trained their models to parse user requests. There are four distinct layers to any user input:

Immediate desires are the literal request — what the user typed. Fix this bug. Summarize this document. Write a regex for email validation.

Final goals are the deeper motivation. The user asking to fix a bug wants a working program. They want to ship by Friday. They want to not be paged again at 3am.

Background desiderata are the unstated constraints that the user assumes you know: don't change my language from Python to Ruby. Don't rewrite the entire function because you found a cleaner approach. Keep the existing API surface.

Autonomy is the user's right to make their own decisions, even when you disagree. If a user asks you to implement something in a way you consider suboptimal, the right response isn't silent substitution.

Production LLMs are excellent at immediate desires. They are mediocre at final goals. They frequently violate background desiderata. The result is outputs that land in an uncanny valley: technically responsive, practically unhelpful.

The Numbers Are Worse Than You'd Expect

Intent misalignment isn't a minor edge case. A 2023 study analyzing real ChatGPT dissatisfaction patterns found it accounted for 32.18% of all negative responses — the single largest category, ahead of factual errors and refusals. The average user dissatisfaction score for intent failures was 5.56 out of 10, and only 67% of these failures were resolved even after user correction.

Multi-turn conversations make it dramatically worse. A 2026 study on multi-turn intent found LLMs suffer approximately a 30% performance drop in multi-turn versus single-turn settings, consistently across model families and sizes. This isn't a capability problem that newer models solve. The researchers described it as structural: the longer a conversation runs, the more context accumulates, and the more likely the model is to follow the most recent statement rather than maintaining coherent awareness of the underlying goal.

State-of-the-art models achieve only 40% success on benchmarks measuring proactive problem-solving — the ability to identify and address the underlying issue without being explicitly asked. Salesforce's AI deployments in customer experience showed single-turn success rates of 58% that collapsed to 35% in multi-turn resolution tasks.

Coding agents show a specific version of this. Misclassification rates for user intent scale from 11.7% with 5 possible intents to 30% with 160 intents — accuracy degrades steadily as the intent space grows.

The XY Problem, Reproduced at Scale

Software engineers have a name for a specific variant of the intent gap: the XY problem. The user encounters problem X, develops a hypothesis that Y would solve it, then asks about Y — never mentioning X. A good engineer notices the mismatch and asks what the user is actually trying to do. LLMs, by default, do not.

A documented example: a user asked a frontier model to "echo the last three characters of a filename." Both tested models did exactly that. Neither questioned whether the actual need was to extract the file extension — a semantically adjacent but structurally different operation that would handle edge cases correctly. The user got what they asked for, not what they needed.

This failure mode scales across entire codebases. When coding agents operate over many turns, each individually reasonable decision accumulates into what one practitioner called "slop creep" — gradual architectural degradation caused by an agent that has no memory of design history and no model of long-term consequences. Every change is technically defensible. The result is a codebase that drifts.

The underlying cause is that modern LLMs are optimized to be helpful and productive. Being helpful means answering questions. Being productive means not asking unnecessary clarifying questions. These incentives point directly at the intent gap.

Why Your Evals Won't Catch This

Intent failures are, almost by construction, invisible to standard evaluation methodologies.

The most common eval approaches — accuracy on held-out benchmarks, LLM-as-judge quality ratings, user satisfaction surveys — all measure the wrong thing. They assess whether the output is good, not whether it addresses the right goal.

Benchmark saturation makes this worse. LLMs now achieve near-ceiling scores on instruction-following benchmarks like IFEval, yet performance drops by up to 61.8% when subtle prompt modifications are introduced that reflect real-world usage patterns. The benchmark demonstrates the model can follow instructions. It says nothing about whether the model tracks what the user actually needs.

User feedback has the same blind spot. Studies show users rate 78% of responses as highly accurate based on confidence and fluency cues — independent of whether the response addressed their actual goal. Users who've been failed by intent gaps often attribute the failure to themselves ("I didn't explain it clearly enough") rather than to the model. This self-attribution suppresses the negative signal in any feedback loop.

RLHF compounds the problem by training sycophancy directly. Models learn to maximize human preference ratings. Agreeable, confident, fluent responses score higher than responses that push back, ask clarifying questions, or expose ambiguity. Sycophantic behavior was observed in 58% of cases across major models in a 2025 evaluation, with a 78.5% persistence rate. The reward model teaches models to answer the stated question in a way that feels good, not to identify and address the underlying goal.

Production traces reveal the real failure pattern. Analysis of 1,200 production deployments found that most production agent failures are not errors in the technical sense — the system runs, the model responds, the output is delivered. The failure is semantic. Error logs show nothing. Monitoring dashboards show nothing. Users are quietly dissatisfied.

Patterns That Close the Gap

Decompose instructions before executing them. Before answering, explicitly identify the immediate request, infer the likely final goal, surface unstated constraints, and check for misalignment. This is a prompting pattern, not an architecture change. It adds a reasoning step that asks: what is this user actually trying to accomplish? The research calls this "intent concretization" — prompt rewrites that surface latent intent achieve 80% win rates over originals.
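As a minimal sketch, the decomposition step can be implemented as a prompt preamble prepended to the raw request before execution. The template wording and the `concretize` wrapper here are illustrative assumptions, not the exact prompts from the research:

```python
# Sketch of an intent-decomposition preamble prepended to the user's
# request before execution. Template wording is illustrative, not
# taken from any published prompt.
DECOMPOSITION_TEMPLATE = """Before answering, work through four layers of this request:
1. Immediate desire: what is literally being asked?
2. Final goal: what is the user most likely trying to accomplish?
3. Background desiderata: what unstated constraints (language, API surface,
   existing style) should be preserved?
4. Autonomy: if you disagree with the user's approach, say so explicitly
   instead of silently substituting your own.

User request:
{request}
"""

def concretize(request: str) -> str:
    """Wrap a raw user request in the four-layer decomposition preamble."""
    return DECOMPOSITION_TEMPLATE.format(request=request)
```

The point is that the reasoning step is explicit and cheap: no architecture change, just one extra framing pass before the model answers.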

Build proactive clarification into the interaction model. The base GPT-4o asks clarifying questions only 15% of the time. Models trained specifically for collaborative interaction ask 52% of the time, achieving 18.5% higher task performance and 17.6% higher user satisfaction. The counterintuitive finding: users don't experience clarifying questions as friction when they're well-targeted. They experience being understood.

The practical implementation: design your system prompt to ask one focused clarifying question before complex tasks rather than attempting to infer intent and proceeding. Gate clarification on task complexity — simple requests don't warrant it, multi-step tasks usually do.
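One way to gate clarification on complexity is a cheap heuristic run before the main model call. The signals and threshold below are assumptions for illustration — a real deployment would tune them against its own traces:

```python
import re

# Illustrative complexity signals for deciding whether to ask one
# clarifying question first. Markers and thresholds are assumptions.
MULTI_STEP_MARKERS = re.compile(
    r"\b(and then|after that|migrate|refactor|integrate|deploy|pipeline)\b",
    re.IGNORECASE,
)

def should_clarify(request: str) -> bool:
    """Return True when a request looks complex enough to warrant
    one focused clarifying question before execution."""
    long_request = len(request.split()) > 40
    multi_step = bool(MULTI_STEP_MARKERS.search(request))
    vague = any(w in request.lower() for w in ("somehow", "etc", "stuff"))
    # Simple one-shot requests skip clarification; multi-step or
    # long-and-vague requests get one question first.
    return multi_step or (long_request and vague)
```

A regex-and-length gate is crude, but it keeps the clarification step out of the latency path for the simple requests that don't warrant it.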

Separate intent tracking from execution. Multi-turn performance degradation occurs because models lose the thread of what the user is actually trying to accomplish as conversation history grows. A Mediator component that explicitly maintains a goal representation — "the user is trying to do X, with constraints Y, having tried Z" — and feeds that representation into the execution model recovers roughly 20% of the multi-turn performance loss.

This doesn't require a second model. It can be implemented as a dedicated context prefix that's updated after each turn with an explicit goal statement. What matters is that goal tracking is a first-class artifact, not implicit in the conversation history.

Track goal state explicitly across turns. User goals fall into five trackable categories: task objectives, requirements, preferences, profile, and policy constraints. Systems that maintain explicit state for each category and update it after each turn — rather than leaving goal inference entirely to the model on each call — show 14-point absolute improvements in multi-turn success rates. In practice, this looks like a structured JSON object describing what the user is trying to accomplish, updated incrementally, passed forward with each request.
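A sketch of that structured goal object, assuming the five categories above as fields (the names and merge logic are illustrative, not a published schema). The same object doubles as the dedicated context prefix described earlier:

```python
from dataclasses import dataclass, field, asdict
import json

# Illustrative goal-state object covering the five trackable categories.
# Field names and the update scheme are assumptions for the sketch.
@dataclass
class GoalState:
    task_objective: str = ""                              # what the user is trying to do
    requirements: list[str] = field(default_factory=list) # hard constraints
    preferences: list[str] = field(default_factory=list)  # soft constraints
    profile: dict = field(default_factory=dict)           # stable user facts
    policy: list[str] = field(default_factory=list)       # org/system rules

    def update(self, **changes) -> None:
        """Merge per-turn observations into the persistent state."""
        for key, value in changes.items():
            current = getattr(self, key)
            if isinstance(current, list):
                current.extend(v for v in value if v not in current)
            elif isinstance(current, dict):
                current.update(value)
            else:
                setattr(self, key, value)

    def as_context_prefix(self) -> str:
        """Render the state as a prefix passed forward with each request."""
        return "Current user goal state:\n" + json.dumps(asdict(self), indent=2)
```

After each turn, the system updates the object and prepends `as_context_prefix()` to the next call — goal tracking becomes a first-class artifact rather than something the model must reconstruct from raw history.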

Measure intent satisfaction, not just output quality. The eval change that most directly addresses the gap: add a separate evaluation pass asking "was the user's underlying goal addressed?" distinct from "was the output high quality?". These questions have different answers. A response can be fluent, accurate, well-formatted, and structurally correct while completely missing the point. You need to measure both dimensions to see the problem.
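A minimal sketch of the two-axis eval, where `judge` stands in for an LLM-as-judge call (the function shape and prompt wording are assumptions). The interesting case is the one that scores the two axes differently:

```python
# Sketch of a two-axis eval record: output quality and intent
# satisfaction are judged independently. `judge` is a stand-in for
# an LLM-as-judge call returning True/False.
def evaluate(trace: dict, judge) -> dict:
    """Score one trace on both axes and flag the intent-gap case:
    a high-quality output that misses the user's underlying goal."""
    quality = judge(
        f"Is this response accurate, fluent, and well-formed?\n{trace['response']}"
    )
    intent = judge(
        "Does this response address the user's underlying goal, "
        f"not just the literal question?\nGoal: {trace['goal']}\n"
        f"Response: {trace['response']}"
    )
    return {
        "quality": quality,
        "intent_satisfied": intent,
        "intent_gap": quality and not intent,  # the case single-axis evals miss
    }
```

Aggregating the `intent_gap` flag across a trace sample gives a direct rate for exactly the failure mode this post describes.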

Error analysis from production traces is more informative than any benchmark for surfacing intent failures. The methodology is mechanical: sample traces where users provided follow-up corrections, cluster the corrections by failure type, quantify which categories are most common. Intent failures show up as "this isn't what I wanted" or "you're solving the wrong problem" — a distinct pattern from factual corrections or format complaints.
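The clustering step can start as simple phrase matching before graduating to embeddings or an LLM labeler. The bucket phrases below are assumptions chosen to match the patterns described above:

```python
from collections import Counter

# Illustrative keyword buckets for a first-pass clustering of user
# follow-up corrections. Phrases are assumptions; real deployments
# would refine buckets with embeddings or an LLM labeler.
FAILURE_BUCKETS = {
    "intent": ("not what i wanted", "wrong problem", "not what i meant"),
    "factual": ("that's incorrect", "wrong answer", "not true"),
    "format": ("wrong format", "as a table", "too long"),
}

def cluster_corrections(corrections: list[str]) -> Counter:
    """Bucket follow-up corrections by failure type and count each bucket."""
    counts = Counter()
    for text in corrections:
        lowered = text.lower()
        for bucket, phrases in FAILURE_BUCKETS.items():
            if any(p in lowered for p in phrases):
                counts[bucket] += 1
                break
        else:
            counts["other"] += 1
    return counts
```

Even this crude pass is enough to quantify whether intent failures dominate your traces the way they do in the published dissatisfaction data.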

The Gap Between Capability and Usefulness

The intent gap is a product problem masquerading as a model problem. The underlying models have the capability to infer user goals — GOOD (Bayesian goal inference) achieves 94% accuracy even on Llama-3 8B, a comparatively small model. The failure is that production deployments don't ask models to explicitly track and verify goals. They ask models to answer questions.

The product decision is whether to optimize for the interaction that feels smooth — minimal friction, immediate response, no interruption — or the interaction that accomplishes the user's actual objective. Klarna learned this directly. Their AI system handled 2.3 million conversations per month with measured efficiency. By mid-2025 they had reversed course, rehiring human agents. The stated reason: the AI answered questions correctly but couldn't navigate the empathy, nuance, and complex resolution users actually needed.

That's the intent gap in production. The system did what it was asked. It just didn't do what was needed.

Closing the gap requires treating goal inference as a first-class engineering concern: explicit tracking, dedicated evaluation, and system design that prioritizes understanding what users actually want over answering what they literally said.
