The Transcript Layer Lie: Why Your Multimodal Pipeline Hallucinates Downstream
Your ASR system returned "the patient takes metaformin twice daily." The correct word was metformin. The transcript looked clean — no [INAUDIBLE] markers, no error flags. Confidence was 0.73 on that word. Your pipeline discarded that number and handed clean text to the LLM. The LLM, treating it as ground truth, reasoned about a medication that doesn't exist.
This is the transcript layer lie: the implicit assumption that intermediate text representations — whether produced by speech recognition, OCR, or vision models parsing a document — are reliable enough to pass downstream without qualification. They aren't. But almost every production pipeline treats them as if they are.
The problem isn't that ASR and OCR are inaccurate. It's that they lie about which parts are inaccurate. These systems return confidence scores per word, per field, per token. And the standard engineering pattern is to immediately throw those scores away.
The Confidence Signal You're Discarding
Every major ASR provider attaches per-word confidence scores to its output. Whisper produces log-probability scores per token; word-level confidence is derived by averaging those values. AWS Transcribe and Google Speech-to-Text both expose word-level confidence in the range 0 to 1. The same is true for OCR: Azure Document Intelligence returns confidence scores down to the individual field level. Google Cloud Document AI does too.
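The Whisper-style derivation above can be sketched in a few lines. This is a minimal illustration, assuming natural-log token probabilities already grouped per word (the grouping itself comes from the decoder's tokenization); the numbers are made up for the example:

```python
import math

def word_confidence(token_logprobs):
    """Derive a 0-1 word confidence from per-token log-probabilities,
    Whisper-style: average the natural-log probs of the tokens that
    make up the word, then exponentiate back to probability space."""
    avg_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_logprob)

# Hypothetical: "metaformin" split into three BPE tokens.
print(round(word_confidence([-0.2, -0.5, -0.25]), 2))  # → 0.73
```

Note that this average is a heuristic, not a calibrated probability; providers differ in how well their scores track actual error rates.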
None of this matters if you build your pipeline like most teams do:
audio → ASR → transcript string → LLM prompt
The transcript string has no memory of which words were confident and which were guesses. The 0.73 on "metaformin" gets stripped. The LLM sees flat text and has no way to know that one word in the middle is unreliable. It reasons from the input as given.
OCR pipelines make this mistake in a slightly different way. Document-level confidence scores are commonly surfaced — "this page extracted at 92% confidence" — but that number obscures everything interesting. A form that's 92% accurate overall might have three critical fields extracted at 0.40. The document-level aggregate hides the field-level failures behind a reassuring number.
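The arithmetic of that hiding is worth seeing once. A minimal sketch with hypothetical field names and scores, assuming the document-level number is a plain mean of field scores (providers vary in how they aggregate):

```python
def doc_confidence(field_scores):
    """Document-level aggregate as many pipelines report it: a mean."""
    return sum(field_scores.values()) / len(field_scores)

def risky_fields(field_scores, floor=0.80):
    """The field-level view the aggregate hides."""
    return {f: c for f, c in field_scores.items() if c < floor}

# Hypothetical extraction result: nine clean fields, one bad one.
fields = {
    "patient_name": 0.99, "dob": 0.98, "address": 0.99,
    "provider": 0.97, "visit_date": 0.98, "diagnosis_code": 0.99,
    "insurer": 0.98, "policy_id": 0.97, "signature": 0.97,
    "dosage": 0.40,
}
print(round(doc_confidence(fields), 2))  # → 0.92, looks reassuring
print(risky_fields(fields))              # → {'dosage': 0.4}
```

A 92% document score and an unusable dosage field coexist without contradiction, which is exactly why the aggregate alone is not actionable.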
Microsoft's documentation for its Document Intelligence platform is explicit about this: document AI systems fail silently. The downstream consumer acts on wrong data without any signal that it happened.
How Uncertainty Compounds Downstream
A misrecognized word isn't just a data quality problem. It's a corruption event that propagates.
When an LLM receives a noisy transcript without any uncertainty signal, it does one of two things: it takes the text at face value, or it tries to "fix" it. The second behavior is actually worse. Research on LLM-based ASR error correction shows that when models attempt to correct transcripts, they hallucinate over correct words while trying to repair incorrect ones. The hallucination rate on accurate text increases as overall input quality decreases — the LLM can't distinguish which parts to trust.
The mechanism is straightforward. Low-confidence words disrupt the attention patterns that LLMs rely on to build coherent interpretations. And uncertainty in the input tends to concentrate exactly where it hurts most: the words the model is least sure about are often the ones that carry the most semantic weight, such as names, numbers, domain-specific terms, and medications.
In agent-based pipelines, the compounding is worse. A perception step that produces noisy text feeds a reasoning step that builds a plan on top of it. The plan feeds an action step. Errors accumulate through each handoff. By the time the system acts, the original uncertainty has been laundered — it looks like confident, structured output, but the lineage includes a coin-flip transcription two steps back.
Distribution shift makes this dramatically worse. ASR systems hallucinate most aggressively under conditions they weren't tuned on: heavy accents, background noise, specialized vocabularies, cross-lingual audio. Under these conditions, some systems generate spurious words during silence. The LLM downstream receives what looks like normal text and has no basis for skepticism.
Three Design Patterns That Actually Work
Confidence-Gated Routing
The simplest fix is to treat confidence scores as routing signals rather than metadata to discard. Instead of passing all transcript output to the LLM uniformly, build a threshold-based router:
- Words or fields above a confidence threshold (typically 0.80 for general use, 0.95 for high-stakes domains) proceed directly to the LLM pipeline.
- Words below the threshold are flagged, quarantined, or sent to a separate handling path — human review, a slower and more capable model, or an explicit request for clarification.
AWS Lex V2 exposes this pattern natively: transcription confidence scores from AWS Transcribe can gate whether the bot attempts to respond or escalates to a human agent. The same pattern generalizes to any pipeline. The key insight is that you don't need the LLM to handle everything — you need it to handle the things it can handle correctly.
This approach does require you to actually keep the confidence scores through your processing pipeline, which means changing how you represent intermediate state. A transcript is not a string. It's a list of (word, confidence) pairs. Your data model should reflect that.
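A minimal sketch of that data model and router, with illustrative thresholds (the 0.80 cutoff mirrors the general-use figure above; pick yours per domain):

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    confidence: float  # 0-1, as returned by the ASR provider

def route(words, threshold=0.80):
    """Split a transcript into words safe to hand to the LLM and
    words that need a separate path: human review, a slower and
    more capable model, or an explicit clarification request."""
    confident = [w for w in words if w.confidence >= threshold]
    flagged = [w for w in words if w.confidence < threshold]
    return confident, flagged

transcript = [
    Word("the", 0.98), Word("patient", 0.96), Word("takes", 0.99),
    Word("metaformin", 0.73), Word("twice", 0.97), Word("daily", 0.95),
]
ok, flagged = route(transcript)
print([w.text for w in flagged])  # → ['metaformin']
```

The point is not the ten lines of code; it is that once intermediate state is a list of `Word` objects rather than a string, the routing decision becomes possible at all.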
Confidence-Aware Prompting
When you must send uncertain content to an LLM, the context window is the right place to surface what you know. Instead of:
Transcript: the patient takes metaformin twice daily
Send:
Transcript (word confidence in brackets): the [0.98] patient [0.96] takes [0.99] metaformin [0.73] twice [0.97] daily [0.95]
Words below 0.80 confidence may be misrecognized. Do not make clinical inferences from low-confidence words without flagging the uncertainty.
Research on confidence-augmented prompting shows measurable reductions in downstream error rates when models are given explicit signals about input reliability. The model can't always use this information correctly — it doesn't fundamentally "understand" that 0.73 means uncertainty — but explicit uncertainty labels in the prompt reduce the probability that the model treats borderline inputs as authoritative.
For OCR pipelines, the same technique applies at the field level. Instead of passing a flat JSON of extracted fields, include a parallel confidence map. Instruct the model to treat low-confidence fields as absent unless confirmed by other evidence in the document.
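A small helper can render the bracketed format shown above directly from (word, confidence) pairs; this is a sketch, and the exact wording of the instruction should be tuned to your domain:

```python
def confidence_prompt(words, floor=0.80):
    """Render (word, confidence) pairs with bracketed scores and
    append an instruction about the low-confidence cutoff."""
    line = " ".join(f"{text} [{conf:.2f}]" for text, conf in words)
    return (
        "Transcript (word confidence in brackets): " + line + "\n"
        f"Words below {floor:.2f} confidence may be misrecognized. "
        "Do not make clinical inferences from low-confidence words "
        "without flagging the uncertainty."
    )

words = [("the", 0.98), ("patient", 0.96), ("takes", 0.99),
         ("metaformin", 0.73), ("twice", 0.97), ("daily", 0.95)]
print(confidence_prompt(words))
```

One caution: bracketed annotations add tokens and can distract the model on long transcripts, so some teams annotate only the words below the floor rather than every word.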
Ambiguity Signaling Over Silent Correction
The worst pattern in noisy input pipelines is silent best-guess correction: the system picks the most likely interpretation and proceeds as if it were certain. This is how confident-looking output masks a probabilistic chain of assumptions.
A better approach is to make the ambiguity explicit and push it downstream in a form the next component can act on. Instead of resolving "metaformin" to its best guess, represent it as an unresolved ambiguity: [metaformin|metformin|metroformin]. The downstream system — whether it's an LLM, a human reviewer, or a business rules engine — can then decide how to handle the ambiguity based on the stakes of that decision.
Research on ambiguity detection pipelines formalizes this as a three-stage process: identify ambiguous inputs, generate clarifying questions or candidate options, resolve with context. The resolution step can happen at the LLM layer (using surrounding context to pick the most plausible option), at the human layer (presenting options to a reviewer), or at the business logic layer (escalating when any ambiguity exists in high-stakes fields).
The important shift is architectural: ambiguity is not something to eliminate before the LLM sees the input. It's information to carry through the pipeline until a component with sufficient context can resolve it.
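The candidate-set representation above is easy to produce if your ASR provider exposes n-best alternatives per word. A sketch, assuming a hypothetical (best_guess, confidence, alternatives) shape that you would adapt to your provider's actual output:

```python
def mark_ambiguous(words, threshold=0.80):
    """Replace low-confidence words with an explicit candidate set
    instead of silently committing to the top hypothesis.
    Each entry is (best_guess, confidence, alternatives), where
    `alternatives` is the ASR n-best list for that slot."""
    out = []
    for text, conf, alts in words:
        if conf < threshold and alts:
            out.append("[" + "|".join([text] + alts) + "]")
        else:
            out.append(text)
    return " ".join(out)

words = [
    ("the", 0.98, []), ("patient", 0.96, []), ("takes", 0.99, []),
    ("metaformin", 0.73, ["metformin", "metroformin"]),
    ("twice", 0.97, []), ("daily", 0.95, []),
]
print(mark_ambiguous(words))
# → the patient takes [metaformin|metformin|metroformin] twice daily
```

Downstream components then see the full candidate set and can resolve it with whatever context or escalation policy fits the stakes.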
Graceful Degradation in Practice
All three patterns above require pipeline changes, which means upfront engineering investment. Teams often resist this because the problem isn't visible — pipelines appear to work, errors look like model issues rather than data quality issues, and the connection between a 0.73 confidence score two steps back and a hallucinated output at the end is hard to trace.
The practical path is to start with graceful degradation as a fallback rather than a redesign:
Route uncertain cases out of the automated path. Any input where aggregate confidence falls below a threshold should not be processed automatically. This doesn't require rethinking your data model — just add a confidence gate before the LLM call.
Log confidence scores even if you don't use them. Production debugging of multimodal pipelines is extremely difficult without this data. If you know that a run where the LLM produced a bad output had a transcript with three words below 0.70, that's a signal you can act on. If you threw the scores away, you're debugging a black box.
Distinguish document-level from field-level confidence. If your OCR pipeline returns an aggregate score, check whether the underlying API supports field-level scores. Most modern document AI providers do. Aggregate scores hide failures in exactly the fields that matter most.
Test under distribution shift, not just benchmark conditions. Your ASR system's benchmark accuracy is measured on clean audio from the provider's training distribution. It tells you nothing about how the system performs on your call center audio, your noisy warehouse environment, or your users with non-native accents. Measure word error rate (WER) and confidence calibration on your actual input distribution before treating the system as reliable.
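The first two steps above, the gate and the logging, fit in one small function. A sketch with illustrative thresholds (the 0.80 word floor echoes the routing section; the 0.85 aggregate floor is arbitrary for the example):

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def gate(words, aggregate_floor=0.85, word_floor=0.80):
    """Fallback confidence gate: log the scores either way, and
    refuse automatic processing when aggregate confidence is low
    or any individual word falls below the word floor.
    `words` is a list of (text, confidence) pairs."""
    scores = [c for _, c in words]
    aggregate = sum(scores) / len(scores)
    low = [(w, c) for w, c in words if c < word_floor]
    log.info("aggregate=%.2f low_confidence_words=%s", aggregate, low)
    if aggregate < aggregate_floor or low:
        return None  # escalate: human review or a better model
    return " ".join(w for w, _ in words)  # safe to send to the LLM

gate([("the", 0.98), ("patient", 0.96), ("takes", 0.99),
      ("metaformin", 0.73), ("twice", 0.97), ("daily", 0.95)])
# → None: the 0.73 word trips the gate, and the scores are logged
```

Even when the gate passes, the log line survives, which is what makes the three-months-later debugging session tractable.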
The Deeper Pattern
The transcript layer problem is an instance of a more general failure mode in composed AI systems: intermediate representations lose the uncertainty metadata that the original model produced, and downstream components reason from the lossy representation as if it were authoritative.
This happens at every composition boundary. Vector embeddings lose the confidence scores of the retrieval model. Structured extraction loses the uncertainty of the extraction step. Summaries lose the caveats in the source material. Each transformation compresses uncertainty information, and by the time the final output is produced, the system appears confident about things it should be uncertain about.
The fix isn't to abandon composition — it's to treat uncertainty as a first-class data type that passes through pipelines rather than getting stripped at each boundary. Confidence scores aren't metadata to be discarded after the ASR call returns. They're signals about the reliability of everything that follows.
Building multimodal pipelines that stay honest about uncertainty is harder than building ones that don't. But it's the difference between a system that fails loudly on bad inputs and one that confidently produces wrong outputs until a human notices something went wrong three months later.
- https://arxiv.org/abs/2505.24347
- https://arxiv.org/html/2502.12414v1
- https://arxiv.org/abs/2502.13446
- https://arxiv.org/html/2407.21414v1
- https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/concept/accuracy-confidence
- https://landing.ai/blog/introducing-confidence-scores-surface-parsing-uncertainty-before-it-becomes-a-problem
- https://docs.aws.amazon.com/lexv2/latest/dg/using-transcript-confidence-scores.html
- https://docs.cloud.google.com/speech-to-text/docs/word-confidence
- https://www.nature.com/articles/s43856-025-01021-3
- https://arxiv.org/html/2503.02863v1
- https://openaccess.thecvf.com/content/CVPR2024/papers/Gao_Embracing_Unimodal_Aleatoric_Uncertainty_for_Robust_Multimodal_Fusion_CVPR_2024_paper.pdf
- https://blog.vllm.ai/2025/12/14/halugate.html
- https://arxiv.org/html/2503.15124v1
