
The Agent That Narrated a Number It Should Have Computed

· 10 min read
Tian Pan
Software Engineer

Ask your agent for last quarter's churn rate and it answers 4.2% in one clean sentence. The number is plausible. The prose around it is confident. The dashboard, when someone finally checks, says 6.8%. The agent never queried anything — it produced a churn-shaped token sequence because, to a language model, narrating a number and computing one look identical on the way out.

This is the quiet failure mode that survives every demo. A hallucinated tool name throws an error you can catch. A malformed argument fails a schema check. But a fabricated figure, delivered in fluent English, passes through your entire pipeline looking exactly like a real one. There is no exception, no log line, no red text. The only signal that something went wrong is a human who happens to know the right answer — and the whole point of the agent was that no human had to.

The reason this is worth a dedicated post, rather than a footnote under "hallucination," is that the usual hallucination advice does not apply. You cannot RAG your way out of it, because the model had the tool and chose not to use it. You cannot prompt your way out of it reliably, because "always use the analytics tool for numbers" is an instruction the model will follow most of the time and quietly drop the rest. The fix is structural: you have to make narrating a number impossible where it matters, and make every number the agent does produce carry its receipt.

Why Fluency Hides the Missing Tool Call

A language model generates the next token by sampling from a probability distribution conditioned on everything before it. When the context asks for "last quarter's churn," the high-probability continuation is a percentage in the low single digits, because that is what churn rates look like in the training data. The model is not consulting a fact. It is completing a pattern. The output 4.2% and the output of an actual SQL query are the same data type — a string — and the generation process that produced the fake one is the same process that would have formatted the real one.

This is why fluency is the camouflage. We have learned to treat hesitation, hedging, and malformed output as smells. A fabricated number has none of those smells. It arrives with the same cadence as a correct answer because, mechanically, it is the same kind of output. Research on LLM-agent hallucinations breaks this into two families: faithfulness errors, where the agent contradicts its own retrieved evidence, and factuality errors, where it asserts something with no evidence at all. The narrated number is the second kind, and it is the harder one to catch because there is no evidence to contradict — there is just absence, and absence does not render.

It gets worse for quantitative questions specifically, because models are bad at the very arithmetic they skip the tool to do. State-of-the-art models score under 40% on 7-digit division. A study of 48 medical calculation tasks found wrong answers in roughly a third of trials. The model is not being lazy in some correctable way: a transformer is a probabilistic text generator, not a calculator, and no prompt makes its arithmetic reliably deterministic. So the narrated number is doubly dangerous: the model both declines to call the tool and is structurally incapable of producing the answer the tool would have.

An Answer Is Not a Sourced Answer

The core distinction your system needs to encode is the one between an answer and a sourced answer. They are not two grades of the same thing. They are different objects.

An answer is a string that occupies the slot where a number should go. A sourced answer is a string plus a verifiable pointer to the computation that produced it: which tool ran, with which arguments, against which data, at what time. The first is a claim. The second is a claim with a receipt.
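One way to make the distinction concrete is to give the receipt its own type. The sketch below is illustrative only: `ToolReceipt`, `Answer`, and every field name are hypothetical, chosen to mirror the four questions above (which tool, which arguments, which data, what time), and a real agent framework will have its own trace shape.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass(frozen=True)
class ToolReceipt:
    """Hypothetical record of the computation behind a figure."""
    tool_name: str             # which tool ran
    arguments: dict[str, Any]  # with which arguments
    data_source: str           # against which data
    executed_at: datetime      # at what time
    result: float              # the raw value the tool returned


@dataclass(frozen=True)
class Answer:
    """A string in the slot where a number should go."""
    text: str
    receipt: Optional[ToolReceipt] = None  # None means nothing backs the claim

    @property
    def is_sourced(self) -> bool:
        # A claim becomes a sourced answer only when a receipt is attached.
        return self.receipt is not None


# A narrated figure: fluent, plausible, and backed by nothing.
narrated = Answer(text="Churn last quarter was 4.2%.")

# A sourced figure: the same kind of sentence, plus a pointer to the computation.
sourced = Answer(
    text="Churn last quarter was 6.8%.",
    receipt=ToolReceipt(
        tool_name="analytics.query",            # assumed tool name
        arguments={"metric": "churn_rate"},     # assumed arguments
        data_source="warehouse.subscriptions",  # assumed table
        executed_at=datetime(2024, 4, 1, tzinfo=timezone.utc),
        result=6.8,
    ),
)
```

Making the receipt an explicit optional field means downstream code has to ask whether it is present, rather than inferring provenance from the fact that a database happens to be connected.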

Most agent stacks today produce the first and present it as if it were the second. The user sees 4.2% and reasonably assumes the system knows it, because the system is connected to the analytics database — surely it looked. But "connected to" and "consulted" are different facts, and the output does not distinguish them. The interface implies provenance the data does not have.

Treating these as different objects changes the engineering. A sourced answer can be validated: you can check that a tool call exists in the trace, that its result matches the figure in the prose, that the data it touched is fresh enough. An unsourced answer can only be trusted. And trust, applied to a system whose failure mode is confident fabrication, is not a safeguard. It is the vulnerability.
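The three checks named above (a tool call exists in the trace, its result matches the figure in the prose, the data is fresh enough) can be sketched as a single function. This is a minimal illustration under assumptions: `ToolCall`, `validate_sourced_figure`, the 24-hour freshness limit, and the tolerance are all hypothetical and would need tuning to a real tracing layer.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical trace entry; a real tracing layer will have its own shape."""
    call_id: str
    result: float
    data_as_of: datetime  # timestamp of the data the tool read


def validate_sourced_figure(
    figure: float,
    cited_call_id: str,
    trace: list[ToolCall],
    now: datetime,
    max_age: timedelta = timedelta(hours=24),  # assumed freshness limit
    tolerance: float = 0.05,                   # assumed rounding allowance
) -> list[str]:
    """Return a list of problems; an empty list means the figure checks out."""
    # Check 1: the cited tool call must actually exist in the trace.
    call = next((c for c in trace if c.call_id == cited_call_id), None)
    if call is None:
        return [f"no tool call {cited_call_id!r} in trace"]

    problems = []
    # Check 2: the figure in the prose must match what the tool returned.
    if abs(call.result - figure) > tolerance:
        problems.append(f"prose says {figure}, tool returned {call.result}")
    # Check 3: the data the tool touched must be fresh enough.
    if now - call.data_as_of > max_age:
        problems.append("data is older than the freshness limit")
    return problems


now = datetime(2024, 4, 2, tzinfo=timezone.utc)
trace = [ToolCall("call-1", 6.8, datetime(2024, 4, 1, 12, tzinfo=timezone.utc))]

ok = validate_sourced_figure(6.8, "call-1", trace, now)
mismatch = validate_sourced_figure(4.2, "call-1", trace, now)
missing = validate_sourced_figure(6.8, "call-9", trace, now)
```

Returning a list of problems rather than a boolean keeps the failure reason available for logging, which matters because the narrated number otherwise leaves no log line at all.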

So the design rule is blunt: an unsourced number in agent output is a defect. Not a style nit, not an acceptable approximation — a defect, the same severity as a 500 error. It should fail a check, not earn a shrug.
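A check of that severity might look like the sketch below, which scans the agent's prose for numeric tokens and raises if any of them is not backed by a tool result from the same run. The names `UnsourcedNumberError` and `check_numbers_are_sourced` are hypothetical, and the regex is deliberately naive: it would also flag dates, ordinals, and quarter labels like "Q3", and it splits "1,200" into two tokens, so a real check would need an allowlist and better number parsing.

```python
import re

# Naive pattern for integers and decimals; an assumption, not a full number grammar.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


class UnsourcedNumberError(Exception):
    """Raised when prose contains a figure that no tool result backs."""


def check_numbers_are_sourced(
    prose: str,
    tool_results: list[float],
    tolerance: float = 1e-9,
) -> None:
    """Fail loudly if any number in the prose matches no tool result."""
    unsourced = [
        token
        for token in NUMBER.findall(prose)
        if not any(abs(float(token) - r) <= tolerance for r in tool_results)
    ]
    if unsourced:
        raise UnsourcedNumberError(f"unsourced figures in output: {unsourced}")


# Passes: the figure in the prose is the one the tool returned.
check_numbers_are_sourced("Churn last quarter was 6.8%.", tool_results=[6.8])

# Fails: the prose contains a figure the tool never produced.
try:
    check_numbers_are_sourced("Churn last quarter was 4.2%.", tool_results=[6.8])
except UnsourcedNumberError as exc:
    failure = str(exc)
```

Raising an exception, rather than returning a warning, is the point: it puts the fabricated figure in the same bucket as the hallucinated tool name and the malformed argument, the failures the pipeline already knows how to catch.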

Make Narrating a Number Impossible
