When AI Sounds Right but Isn't: LLM Confabulation in Technical and Scientific Domains

Tian Pan
Software Engineer

The insidious thing about LLM confabulation in technical domains isn't that the model produces obviously wrong answers. It's that the model produces beautifully structured, confidently stated, technically plausible answers that are subtly wrong in ways that only domain experts catch — and often only after the damage is done.

A Monte Carlo physics simulation that initializes correctly but resamples particle positions from scratch at each step rather than making incremental updates. A chemical formula that follows the right naming conventions but has an incorrect oxidation state. An engineering specification that cites the right standard, references the right units, and has exactly the wrong load coefficient. Each output looks right. Each sounds authoritative. Each is wrong in ways that won't surface until someone runs the experiment, stress-tests the component, or critically reads the derivation.
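To see how subtle the first of these failures is, here is a minimal sketch (a toy 1-D sampler, not drawn from any real codebase) contrasting a correct incremental Metropolis update with the resample-from-scratch variant. Both run cleanly; only the statistics differ:

```python
import numpy as np

rng = np.random.default_rng(42)

def log_target(x):
    # Toy target density (standard normal), standing in for a real physics posterior.
    return -0.5 * x ** 2

def metropolis(n_steps, propose, x0=0.0):
    """Random-walk Metropolis. The acceptance rule below assumes the
    proposal is symmetric around the current state."""
    x, chain = x0, []
    for _ in range(n_steps):
        x_new = propose(x)
        if np.log(rng.random()) < log_target(x_new) - log_target(x):
            x = x_new
        chain.append(x)
    return np.array(chain)

# Correct: propose an incremental perturbation of the current state.
incremental = lambda x: x + rng.normal(0.0, 0.5)

# Subtly wrong: resample from scratch, ignoring the current state.
# This runs without error and produces plausible-looking samples, but the
# proposal is no longer symmetric around x, so the acceptance rule above is
# silently invalid and the chain is biased for any target that differs from
# the proposal distribution.
from_scratch = lambda x: rng.normal(0.0, 3.0)

good = metropolis(10_000, incremental)
bad = metropolis(10_000, from_scratch)
```

Nothing crashes, both chains have the right shape, and only someone who knows the sampler's invariants will notice the second one is broken.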

This is the core problem with using LLMs for technical and scientific knowledge work: the failure modes are systematically invisible in proportion to how dangerous they are.

Why Technical Domains Are Uniquely Vulnerable

General confabulation — an LLM that invents a historical date or misquotes a CEO — is annoying and correctable. Technical confabulation is structurally different for three reasons.

The training data doesn't distinguish consensus from fringe. An LLM absorbs a peer-reviewed Physical Review Letters paper and an alternative-physics blog post as equivalent text patterns. Both describe Newtonian mechanics in confident prose. Both use the right vocabulary. The model has no signal to weight one over the other, so when it generates explanations of gravitational acceleration, it's drawing on a distribution that includes both correct and incorrect accounts of the phenomenon, with equal fluency.

Unit and dimensional errors compound silently. In prose domains, an error introduced in sentence three is usually self-contained. In technical domains, an error introduced in derivation step three propagates through every subsequent step. An LLM that confuses newtons and kilograms early in an engineering calculation will produce a structurally complete derivation with internally consistent algebra that arrives at a physically nonsensical answer — and the algebra being clean makes the result look credible.
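One defense is to make such errors fail loudly by carrying units through the computation instead of using bare floats. A small sketch using the pint library, with illustrative numbers:

```python
import pint

ureg = pint.UnitRegistry()

mass = 120 * ureg.kilogram                    # payload mass
g = 9.81 * ureg.meter / ureg.second ** 2

# Correct: weight is a force.
weight = mass * g
print(weight.to(ureg.newton))                 # ~1177 N

area = 0.002 * ureg.meter ** 2
stress_ok = (weight / area).to(ureg.pascal)   # fine: N/m^2 is a pressure

# The classic silent error: treating the 120 kg figure as if it were a
# force. With bare floats the arithmetic goes through and compounds;
# with units attached it fails immediately.
try:
    stress_bad = (mass / area).to(ureg.pascal)  # kg/m^2 is not a pressure
except pint.DimensionalityError as e:
    print("caught:", e)
```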

Confidence is decoupled from correctness, and alignment makes it worse. A crucial and underappreciated finding from recent calibration research: RLHF alignment — the preference training that makes models "helpful" and "safe" — actively degrades calibration. Pre-trained base models are typically better calibrated than aligned chat models. After preference alignment, models become systematically overconfident. The fine-tuning that makes Claude sound helpful is the same process that makes it sound certain about things it shouldn't be certain about. In technical domains, where the cost of overconfidence is high, this is a direct structural hazard.
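Calibration has a concrete operational meaning here: over many claims, stated confidence should match empirical accuracy. A minimal sketch of the standard expected-calibration-error measurement, assuming you have already elicited per-answer confidences and graded correctness:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bucket answers by stated confidence and compare each bucket's
    average confidence to its empirical accuracy. A well-calibrated
    model has ECE near zero; an overconfident one has bucket confidence
    consistently above bucket accuracy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece
```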

What Failure Looks Like in Practice

Benchmarks tell part of the story. On expert-level STEM evaluations — the kind that require genuine domain knowledge rather than pattern completion — frontier LLMs perform substantially below human specialists. On Humanity's Last Exam, a benchmark of 2,500 expert-level questions spanning physics, chemistry, biology, and advanced mathematics, the best available models still fail close to half the questions. On GPQA (Google-Proof Question Answering), which uses questions written by domain PhDs to require genuine reasoning rather than retrieval, frontier models score significantly below the roughly 65% baseline that domain-expert humans achieve.

But benchmarks understate the problem because they measure average error rates on isolated questions. Real technical work involves chains of reasoning where one wrong step corrupts everything downstream, and where the absence of a visible error message is a false signal of correctness.

Case studies paint a more concrete picture. Research examining LLM-assisted workflows in astrophysical data analysis documented a specific failure mode: agentic systems generating physically impossible posterior distributions from under-constrained tasks, without any error or warning, and presenting the outputs as valid results. The system ran to completion. The outputs had the right format. The physics was wrong. Only an astrophysicist examining the distributions would notice.
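The general lesson is that running to completion and having the right format are not validity checks; those have to be written separately, against the physics. A hypothetical sketch of the kind of hard-constraint assertion that would have flagged outputs like these (the parameter names and bounds are invented for illustration):

```python
import numpy as np

# Illustrative physical bounds for a hypothetical stellar posterior.
PHYSICAL_BOUNDS = {
    "mass_solar": (0.08, 300.0),   # below ~0.08 M_sun, no hydrogen fusion
    "metallicity": (0.0, 0.1),
    "distance_kpc": (0.0, 1e3),
}

def validate_posterior(samples):
    """Reject posterior samples that violate hard physical constraints,
    instead of letting a well-formatted but impossible result flow on.
    `samples` maps parameter name -> array of posterior draws."""
    problems = []
    for name, values in samples.items():
        lo, hi = PHYSICAL_BOUNDS[name]
        frac_bad = np.mean((values < lo) | (values > hi))
        if frac_bad > 0.001:  # tolerate a sliver of numerical noise
            problems.append(f"{name}: {frac_bad:.1%} of samples outside [{lo}, {hi}]")
    return problems
```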

In medical contexts, research testing models on clinical vignettes with planted errors — incorrect lab values, nonexistent diagnoses, missing critical context — found that models repeated or elaborated on the errors up to 83% of the time. Not only did the models not catch the errors; they confidently built clinical reasoning on top of them.

Chemistry-specific research found persistent fundamental knowledge gaps that weren't addressable by reprompting or chain-of-thought elaboration. The model didn't know what it didn't know, and couldn't distinguish the limits of its chemistry knowledge from the body of chemistry knowledge it had absorbed correctly.

The "Sounds Right" Asymmetry

What makes all of this easy to understand but hard to solve is a fundamental asymmetry: the very properties that make LLMs useful in technical domains — fluent generation, confident prose, correct domain vocabulary, familiarity with standard notation and structure — are the same properties that make their errors hard to catch.

A junior engineer asking an LLM to explain a load calculation gets back an answer that uses the right terminology, cites the right factors, structures the derivation logically, and has the wrong number buried in step four. They have no reason to distrust the output. The vocabulary is correct. The format is right. The confidence level signals accuracy.

This asymmetry is worst for teams operating in domains adjacent to their expertise. A software team building chemistry simulation tools. A mechanical engineer consulting an LLM on regulatory compliance outside their specialty. A researcher using an LLM to summarize literature in a subfield they're entering. These are exactly the use cases that look most valuable and carry the most risk.

Architectures That Reduce Confident-Wrong Outputs

The good news is that the failure modes are addressable — not by making LLMs better calibrated through prompting, but by building grounding architectures that constrain what the model can claim.

Citation-required generation. The core idea is to require any claim to be backed by a specific cited source, and then mechanically verify that the cited source actually supports the claim. This pattern — retrieve, generate with citation constraints, verify citation-claim alignment — doesn't eliminate LLM generation errors, but it eliminates a large class of confident confabulations because the model can't cite sources that don't say what it claims. Systems implementing this pattern see meaningful reductions in verifiable false claims. For internal technical documentation, this translates to requiring LLMs to point to specific line numbers, equations, or sections — not paraphrase — and automating the check.
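A minimal sketch of the verification half of this pattern, assuming the generation step emits (claim, source, quote) triples; the entailment check is left pluggable because implementations vary (an NLI model, a second-model judge, or exact containment for quantitative claims):

```python
from dataclasses import dataclass

@dataclass
class CitedClaim:
    claim: str       # the generated assertion
    source_id: str   # which indexed document it cites
    quote: str       # the exact span the model says supports it

def verify_claims(claims, corpus, entails):
    """Mechanically check each cited claim.
    `corpus` maps source_id -> full document text.
    `entails(premise, hypothesis) -> bool` is pluggable."""
    failures = []
    for c in claims:
        doc = corpus.get(c.source_id)
        if doc is None:
            failures.append((c, "cited source does not exist"))
        elif c.quote not in doc:
            failures.append((c, "quoted span not found in source"))
        elif not entails(c.quote, c.claim):
            failures.append((c, "quote does not support the claim"))
    return failures  # empty list = every claim survived the mechanical checks
```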

Step-level derivation verification. For mathematical and physical derivations, requiring the model to show its work at each step enables checks that inspecting only the final answer cannot provide. Research on step-level verification models shows that evaluating intermediate reasoning steps, not just outputs, surfaces errors that are invisible at the conclusion. Practically, this means structuring prompts to produce numbered derivation steps and running a verification pass, automated or human, against each step rather than accepting the conclusion wholesale.
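In practice this can be as simple as forcing a structured, numbered output and recomputing each step mechanically. A sketch assuming a hypothetical "step N: expression = value" format (a real system would use a proper expression parser such as sympy rather than a restricted eval):

```python
import re

def check_derivation(steps, tolerance=1e-6):
    """Verify each step of the form 'step N: <expr> = <value>' by
    recomputing <expr> numerically. Catches arithmetic slips that are
    invisible if you only read the conclusion."""
    failures = []
    for i, step in enumerate(steps, start=1):
        m = re.match(r"step \d+:\s*(?P<expr>.+?)\s*=\s*(?P<val>[-\d.eE]+)$", step)
        if not m:
            failures.append((i, "unparseable step"))
            continue
        # eval() restricted to bare arithmetic; fine for a sketch only.
        recomputed = eval(m["expr"], {"__builtins__": {}}, {})
        if abs(recomputed - float(m["val"])) > tolerance * max(1.0, abs(recomputed)):
            failures.append((i, f"claimed {m['val']}, recomputed {recomputed}"))
    return failures

# The slip is in step 2; step 3 is internally consistent with it, so
# checking only the final line would never reveal the error.
steps = [
    "step 1: 120 * 9.81 = 1177.2",
    "step 2: 1177.2 / 0.002 = 58860.0",   # wrong: should be 588600.0
    "step 3: 58860.0 * 1.5 = 88290.0",    # clean algebra on a bad input
]
print(check_derivation(steps))
```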

Expert-in-the-loop gates for consequential outputs. Some categories of technical output simply require human domain experts to review before use. The architectural question isn't whether to include humans but where to place the gate. A useful heuristic: gate on claim type, not output length. A three-sentence chemical interpretation by a PhD researcher requires the same gate as a ten-page engineering specification if both make specific quantitative claims that will drive real decisions.
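A sketch of that heuristic as a routing rule; the claim categories and their boundaries are assumptions, and in a real system the classification step itself has to be conservative:

```python
from enum import Enum, auto

class ClaimType(Enum):
    QUANTITATIVE = auto()   # specific numbers that will drive decisions
    REGULATORY = auto()     # compliance / standards interpretations
    QUALITATIVE = auto()    # background explanation, no load-bearing figures

ALWAYS_GATED = {ClaimType.QUANTITATIVE, ClaimType.REGULATORY}

def requires_expert_review(claim_types: set) -> bool:
    """Gate on what the output asserts, not how long it is. A
    three-sentence answer containing one load coefficient is gated;
    ten pages of purely qualitative background may not be."""
    return bool(claim_types & ALWAYS_GATED)
```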

Retrieval grounded in authoritative sources. Standard RAG that retrieves from broad document collections provides some grounding but doesn't solve the peer-reviewed-versus-fringe problem. Technical RAG implementations that index specifically from peer-reviewed literature, official standards bodies, or domain-specific authoritative databases change the reliability profile of the output: the retrieval corpus becomes the discriminator between consensus and fringe that the model itself lacks. Combining dense retrieval with source filtering by publisher or authority type is more effective than increasing retrieval count from undifferentiated corpora.
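In code, this often reduces to a hard authority filter applied before ranking rather than after. A sketch with hypothetical metadata fields:

```python
from dataclasses import dataclass

@dataclass
class Doc:
    text: str
    source_type: str   # e.g. "peer_reviewed", "standards_body", "blog"
    score: float       # dense-retrieval similarity, filled in by the retriever

AUTHORITATIVE = {"peer_reviewed", "standards_body", "internal_validated"}

def retrieve_grounded(query, retriever, k=8):
    """Filter to authoritative sources first, then take the top-k.
    Raising k over an undifferentiated corpus just retrieves more
    fringe text more fluently; the curated corpus is the consensus
    filter the model itself lacks."""
    candidates = retriever(query)                       # -> list[Doc], scored
    trusted = [d for d in candidates if d.source_type in AUTHORITATIVE]
    return sorted(trusted, key=lambda d: d.score, reverse=True)[:k]
```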

Semantic entropy routing. Research on hallucination detection via semantic entropy offers a practical operational signal: when a model is uncertain about a claim, generating multiple outputs for the same query will produce high divergence across outputs. Confident-and-wrong outputs often show paradoxically low entropy (the model confidently generates the same wrong answer multiple times), but uncertain outputs — where the model is genuinely unsure — show high divergence. Measuring divergence across multiple generations and routing high-entropy outputs to expert review before use catches a meaningful fraction of uncertain outputs that would otherwise flow through undetected.
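A sketch of this signal, following the semantic-entropy recipe: sample several answers, cluster those that mean the same thing (the equivalence check is pluggable; the published work uses bidirectional entailment), and compute entropy over the clusters:

```python
import math

def semantic_entropy(answers, same_meaning):
    """Cluster sampled answers by meaning, then compute entropy over
    cluster frequencies. `same_meaning(a, b) -> bool` is pluggable:
    an NLI-based bidirectional-entailment check for prose, or exact
    match for short numeric answers."""
    clusters = []
    for a in answers:
        for c in clusters:
            if same_meaning(a, c[0]):
                c.append(a)
                break
        else:
            clusters.append([a])
    n = len(answers)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)

def route(query, generate, same_meaning, n_samples=8, threshold=0.7):
    """High entropy -> the model is genuinely unsure -> expert review.
    Low entropy is NOT proof of correctness (confident-wrong outputs
    cluster tightly), so this catches only one of the two failure modes."""
    samples = [generate(query) for _ in range(n_samples)]
    if semantic_entropy(samples, same_meaning) > threshold:
        return "expert_review", samples
    return "auto", samples[0]
```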

What This Means for Engineering Teams

The practical implication isn't that LLMs can't be used for technical work. They can be, and they produce genuine value when used correctly. The implication is that the safety architecture matters as much as the capability level.

Teams that get this right treat the LLM output layer as a generator, not an oracle, and invest equivalent effort in the verification layer. They build explicit policies for which output types require citation verification, step-level checking, or domain expert review. They don't rely on asking the model to check its own work — self-evaluation by the same model that generated the output shows similar failure modes.

Teams that get this wrong do so in a predictable pattern: they establish trust on a class of tasks where errors are low-stakes and visible, then extend that trust to tasks where errors are high-stakes and invisible. The model's confidence level doesn't help distinguish these cases; it behaves the same in both. The difference is in the consequences, which means the team needs to do the risk classification, not defer it to the model's apparent certainty.

The core discipline is treating domain expertise as a required component of the architecture, not a cost to be eliminated. LLMs amplify what experts can produce; they don't yet replace what experts can validate. In physics, chemistry, and engineering, that distinction is the difference between a useful tool and a liability.

Technical domains are precisely where the "sounds right" property of LLM generation diverges most sharply from correctness. Building systems that account for this — structurally, not just through prompting — is the engineering work that separates teams that use AI well from teams that eventually get burned by it.
