
Clarification Budgets: When Your Agent Should Ask Instead of Guess

10 min read
Tian Pan
Software Engineer

The two worst agent failure modes feel like opposites, but they originate from the same broken policy. The first agent asks four follow-up questions before doing anything and trains its users to abandon it. The second agent never asks, confidently produces output the user has to redo, and trains its users to mistrust it. Same policy, different implicit settings of one parameter no one wrote down: the cost of a question relative to the cost of a wrong answer.

Most agents do not have a policy at all. The model is asked to "be helpful" and is left to negotiate ambiguity on its own. Because next-token prediction rewards committing to an answer, the agent leans toward guessing. Because RLHF rewards politeness, the agent occasionally over-corrects and asks a question for safety. The result is unprincipled behavior that varies from session to session, with no team-level intuition about when the agent will pause and when it will charge ahead.

A clarification budget is the missing parameter. It is a per-task allowance for how much friction the agent is permitted to impose, paired with a decision rule for when a question is worth spending that budget on. Think of it as the conversational analog of a latency budget — every product has one, even if no one wrote it down, and the team that writes it down stops shipping confused agents.

Why "ask vs. guess" cannot be left to the model

The model knows nothing about your product economics. It does not know whether your user is a doctor placing a controlled-substance order or a teenager picking a movie. It does not know that your support agent's wrong answer triggers a $40 human escalation, or that a high-friction onboarding agent loses 12% of signups with each additional question. Those are facts about your product, and they belong in a policy layer that wraps the model.

This is why "just ask the model to ask when uncertain" produces such inconsistent behavior in production. The model is being asked to decide a routing question — answer now, or pause for input — without any of the inputs that should govern the decision. Researchers studying clarification in coding agents have framed this as a fundamental gap: without explicit modeling of which questions matter, how confident the model needs to be to commit, and what the cost of asking is, the agent has no principled criterion for when to ask and when to commit. The work on structured uncertainty in LLM agents formalizes this as a partially observable decision problem with a Bayesian Value of Information objective — academic language for what product teams already know intuitively. Asking is an action with a cost, answering is an action with a different cost, and the right tradeoff depends on the situation.

The takeaway is not that every team needs to implement a POMDP. It is that the decision is structural, and structure beats vibes.

The four ingredients of a clarification budget

A workable clarification budget has four parts. None of them are exotic, but all four need to be explicit for the system to behave consistently; a short code sketch below shows how they combine.

A budget per task. Every task type starts with a finite allowance — say, two questions for a support flow, zero for a quick lookup, four for a complex onboarding configuration. The budget decays as questions get asked. Once it is spent, the agent must commit, even if uncertain. This forces the agent to spend its questions on the most valuable disambiguation, rather than spreading them across every minor uncertainty.

An information-gain heuristic. Before asking, the agent should estimate how much its expected error would drop if the user answered. A question that distinguishes between two equally likely interpretations of a request is high-value. A question that picks between two interpretations where one is 95% likely is low-value — just commit to the likely one and offer a graceful undo. This is the place where the academic Value-of-Information framing genuinely earns its keep, even if the implementation is a coarse heuristic rather than a true Bayesian computation.

A confidence threshold for committing. Below a threshold, the agent asks. Above it, the agent commits with a "tell me if I got that wrong" affordance. The threshold is a product decision, not a model property — it should be tuned per surface based on the cost of a wrong answer. A surgical-records lookup has a high threshold; a "suggest a song" feature has a low one.

A graceful-correction affordance. When the agent commits without asking, the user must have a frictionless path to redirect. "I assumed you meant X — say 'no, I meant Y' if that's wrong" is dramatically less annoying than a question, because it lets the right-on-first-try path stay fast. The willingness of users to tolerate occasional wrong guesses depends almost entirely on how cheap it is to correct them.
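Concretely, here is a minimal sketch of how the four ingredients might combine into one decision rule. All of it is illustrative: the names, the two-threshold scheme (one bar for committing silently, a lower one for committing with a correction affordance), and the coarse information-gain estimate for a two-way ambiguity are assumptions, not a prescribed implementation.

```python
from dataclasses import dataclass

@dataclass
class SurfacePolicy:
    max_questions: int        # per-task question budget; decays as questions get asked
    commit_threshold: float   # below this confidence, prefer asking (if budget allows)
    silent_threshold: float   # above this confidence, commit without even an affordance
    min_info_gain: float      # expected error reduction a question must buy to be worth it

def two_way_info_gain(p_top: float) -> float:
    # For a request with two plausible readings, committing to the likelier one
    # errs with probability 1 - p_top, and a perfect answer drives that to ~0.
    # The expected gain of asking is therefore roughly min(p_top, 1 - p_top):
    # 0.5 for a 50/50 split (high-value question), 0.05 for a 95/5 split (just commit).
    return min(p_top, 1.0 - p_top)

def decide(policy: SurfacePolicy, confidence: float, info_gain: float,
           questions_asked: int) -> str:
    if confidence >= policy.silent_threshold:
        return "commit_silently"
    budget_left = policy.max_questions - questions_asked
    if (confidence < policy.commit_threshold
            and budget_left > 0
            and info_gain >= policy.min_info_gain):
        return "ask"
    # Budget spent, or the question would not buy enough: commit anyway,
    # but surface the assumption so correction stays cheap.
    return "commit_with_correction_affordance"
```

The specific arithmetic matters less than the shape: every branch reads off an explicit, tunable number instead of the model's mood.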

The two failure modes at the extremes

Pull any one of those four levers too far and the agent fails in a recognizable way.

The over-asker is the agent that has set its confidence threshold too high or its information-gain bar too low. It asks "did you mean the iOS app or the web app?" when 99% of users only have the web app installed. It asks "would you like that scheduled for this week or next?" when the user already said "tomorrow." Each individual question seems defensible in isolation, which is why over-asking is hard to debug — every PM who reviews a single trace says "yeah, asking made sense there." The pathology only shows up at the session level, in the form of users abandoning the conversation or requesting a human three turns in.

The over-guesser is the inverse. Its confidence threshold is too low, its question budget goes unspent, and its correction affordance is weak or buried. The agent confidently dispatches a returns request for the wrong order, books the wrong meeting room, drafts a customer email with a guessed-wrong product name. Each individual guess seems defensible in isolation, too — the model picked the most likely interpretation. But the user-side experience is one of an agent that does not listen, because every wrong guess feels like the system ignoring information the user thought they had provided.

The diagnostic question that separates the two is: does the user know what the agent committed to? An over-guesser that surfaces "I assumed X — change?" is much closer to a well-tuned agent than one that quietly executes and surprises the user later. The corrigibility of the commitment matters as much as the accuracy.

Why eval discipline has to track both axes

The most common eval mistake here is to track accuracy and call it done. Accuracy on its own is satisfied by the over-asker, which never commits to anything until it has interrogated the user into compliance. It is also satisfied, weirdly, by the over-guesser whose users have learned to fix every output by hand — the eval scores look fine because the post-correction output is correct.

A real eval suite for clarification has to track at least four numbers in parallel. Clarification rate, sliced by task type, so an over-asker shows up as an outlier on a particular flow. Post-clarification accuracy, so a team that asks more questions can prove the questions are buying something. Post-guess accuracy, so a team that asks fewer questions can prove they are not just shipping more wrong answers. And user-reported correction rate, the catch-all signal that whatever you are doing is annoying users at the session level, even if individual turns look fine.
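As a sketch of what instrumenting those four numbers might look like, assume a per-turn log with an offline correctness grade and a flag for whether the user had to redirect the agent after a commit; every field name here is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    task_type: str
    asked_clarification: bool
    committed_correctly: bool   # graded offline against a labeled outcome
    user_corrected: bool        # the user had to redirect the agent after it committed

def clarification_signals(turns: list[Turn], task_type: str) -> dict[str, float]:
    scoped = [t for t in turns if t.task_type == task_type]
    asked = [t for t in scoped if t.asked_clarification]
    guessed = [t for t in scoped if not t.asked_clarification]

    def rate(subset: list[Turn], field) -> float:
        return sum(field(t) for t in subset) / len(subset) if subset else 0.0

    return {
        "clarification_rate": len(asked) / len(scoped) if scoped else 0.0,
        "post_clarification_accuracy": rate(asked, lambda t: t.committed_correctly),
        "post_guess_accuracy": rate(guessed, lambda t: t.committed_correctly),
        "user_correction_rate": rate(scoped, lambda t: t.user_corrected),
    }
```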

Without all four, the team optimizes one number while another silently degrades. The most painful version is a team that lifts accuracy by raising the clarification rate, ships the change, and then watches user retention drop because the agent has become exhausting to talk to. The accuracy dashboard is green and the product is dying. This is exactly the dynamic Nielsen Norman has flagged in its broader work on AI chat UX — the friction is invisible to the metrics that look most authoritative.

A policy layer is the architectural fix

The biggest mistake teams make is treating "should I ask?" as a prompt-engineering problem. It is not. It is a routing decision the model is structurally unqualified to make alone, because the model does not have access to the product economics that should govern it.

The architectural fix is a policy layer between the model and the user that owns the ask-vs-guess decision. The model still proposes — it generates the candidate action plus its own self-reported confidence and a list of disambiguating questions it would ask given infinite budget. The policy layer disposes — it consults the per-surface budget, the cost-of-wrong-answer parameter, the user's session history (have they already been asked twice?), and decides whether to forward the question, commit with a correction affordance, or commit silently.

This separation is what makes the system tunable. When the support team finds that the agent is over-asking on returns, they tune the threshold for returns specifically. When the legal team requires confirmation for any data deletion, they set the threshold for deletion to "always ask." When a new high-value enterprise customer demands a faster, less-interrupted experience, the budget for that account can be raised without retraining anything. None of this is possible if the ask-vs-guess decision is buried in the system prompt of an undifferentiated model.
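As a sketch of what that tunability looks like in practice, reusing the SurfacePolicy shape from the earlier sketch, with invented surfaces and numbers:

```python
# Illustrative per-surface settings; every value is a product decision, not a model property.
SURFACE_POLICIES = {
    "returns":       SurfacePolicy(max_questions=2, commit_threshold=0.80,
                                   silent_threshold=0.97, min_info_gain=0.15),
    "data_deletion": SurfacePolicy(max_questions=10, commit_threshold=1.01,  # >1.0: always ask
                                   silent_threshold=1.01, min_info_gain=0.0),
    "quick_lookup":  SurfacePolicy(max_questions=0, commit_threshold=0.50,
                                   silent_threshold=0.90, min_info_gain=0.50),
}
```

Tuning the returns flow touches one entry. Nothing retrains.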

The policy layer is also where the eval signals close the loop. The four numbers above feed back into per-surface threshold adjustments, ideally automatically. Most teams will start with manual tuning and a quarterly review. The mature endpoint is a closed-loop system where surfaces with rising user-correction rates have their thresholds tightened and surfaces with rising clarification rates relative to information gain have their thresholds loosened.
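A first cut of that loop can be as blunt as the following; the trigger values are placeholders a team would set per surface, not recommendations.

```python
def retune(policy: SurfacePolicy, signals: dict[str, float],
           step: float = 0.02) -> SurfacePolicy:
    # Users are correcting too many guesses: tighten, i.e. ask more.
    if signals["user_correction_rate"] > 0.10:
        policy.commit_threshold = min(1.0, policy.commit_threshold + step)
    # Questions are not buying accuracy over guessing: loosen, i.e. ask less.
    elif (signals["clarification_rate"] > 0.30
          and signals["post_clarification_accuracy"]
              - signals["post_guess_accuracy"] < 0.05):
        policy.commit_threshold = max(0.0, policy.commit_threshold - step)
    return policy
```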

What to build first

Most teams reading this do not have a clarification budget. They have a system prompt that says something like "ask clarifying questions when needed," and they have user complaints in both directions — too chatty and too presumptuous, sometimes from the same user in the same week.

The smallest useful first step is to write down, for each major task type, two numbers: the maximum number of questions the agent is allowed to ask, and the cost of a wrong answer in that flow expressed in something concrete (dollars, minutes of human review, churn risk). Then pick one surface where the disagreement between PM, eng, and support is loudest, and instrument the four eval signals on that surface for two weeks.
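The written-down artifact can be as small as this, with values invented for illustration:

```python
# The two numbers per task type: question allowance, and the cost of being wrong
# expressed in something concrete enough to argue about.
FIRST_PASS_BUDGETS = {
    "password_reset":     {"max_questions": 1, "wrong_answer_cost": "$40 human escalation"},
    "order_modification": {"max_questions": 2, "wrong_answer_cost": "~15 min support review"},
    "song_suggestion":    {"max_questions": 0, "wrong_answer_cost": "one annoyed skip"},
}
```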

You will discover two things. First, that the team has more agreement on the right behavior than the conversation suggested — once the budget and the cost of wrongness are written down, "should we ask here?" becomes answerable rather than aesthetic. Second, that the model's existing behavior is not actually that close to the budget you wrote down, which gives you a concrete tuning target.

Asking-vs-guessing will not disappear as a problem when models get smarter. Smarter models will guess more confidently, which makes the cost of an unchecked wrong guess higher, not lower. The teams that build a policy layer now will spend the next decade adjusting two numbers per surface. The teams that don't will spend the next decade explaining to their users why the agent didn't listen — or why it asked too much.
