Building a Hallucination Detection Pipeline for Production LLMs
Your LLM application passes every eval. The demo looks flawless. Then a user asks about a niche regulatory requirement and the model confidently cites a statute that doesn't exist. The support ticket lands in your inbox twelve hours later, long after the fabricated answer has been forwarded to a compliance team. This is the hallucination problem in production: not that models get things wrong, but that they get things wrong with the same fluency and confidence as when they get things right.
Most teams treat hallucination as a prompting problem — add more context, tune the temperature, tell the model to "only use provided information." These measures help, but they don't solve the fundamental issue. Post-hoc verification — checking claims after generation rather than hoping the model won't make them — is cheaper, more reliable, and composes better with existing infrastructure than any prevention-only strategy.
The Three Categories of Hallucination
Not all hallucinations are created equal, and your detection strategy depends on which type you're dealing with.
Fabrication is the most dangerous category. The model invents entities, citations, statistics, or events that have no basis in any source. A fabricated legal citation looks syntactically correct — proper reporter format, plausible volume numbers — but points to a case that was never decided. Research estimates that AI-hallucinated citations now number in the hundreds of thousands across published literature. Fabrication requires existence verification: checking that referenced entities actually exist in an authoritative source.
Contradiction occurs when the model's output conflicts with information in its provided context. A RAG system retrieves a document stating quarterly revenue was $4.2 billion, then the model's summary says $4.8 billion. This is the most mechanically detectable category because you have the ground truth right there in the context window. Natural Language Inference (NLI) models excel here — they classify whether a claim is entailed by, neutral to, or contradicted by a source passage.
Outdated knowledge is the subtlest failure mode. The model's parametric knowledge was correct at training time but the world has moved on. A company changed its CEO, a regulation was amended, a library deprecated a function. This category is particularly insidious because the model states facts it was trained on with high confidence, and no amount of prompt engineering fixes a training data problem. Detection requires freshness-aware retrieval: checking claims against current sources rather than trusting the model's internal knowledge.
Each category demands different detection machinery. A pipeline that catches contradictions but ignores fabrication provides false confidence — you catch the easy problems and miss the ones that actually damage trust.
Anatomy of a Production Detection Pipeline
The most effective hallucination detection architecture follows a three-stage pattern: classify, detect, explain. This approach, pioneered in systems like vLLM's HaluGate, balances latency with precision by avoiding expensive detection on queries that don't need it.
Stage 1: Sentinel classification. Not every LLM response needs factual verification. Creative writing, brainstorming, and opinion-seeking queries can pass through unchecked. A lightweight binary classifier — a fine-tuned encoder model with a LoRA adapter — determines whether the incoming query is fact-seeking. This classifier runs in roughly 12ms at P50 and filters out about 35% of requests that don't require verification, cutting downstream detection costs by over 70%.
Stage 2: Token-level detection. For fact-seeking queries, a token classifier examines each token in the response against the provided context (tool outputs, RAG results, or other grounding sources). Rather than producing a binary "hallucinated or not" label for the entire response, token-level detection identifies exactly which spans lack support. The input format concatenates context, question, and answer with separator tokens, and the classifier labels each answer token as supported or unsupported. Consecutive unsupported tokens merge into spans for human-readable output.
Stage 3: NLI explanation. Token-level detection alone produces roughly a 33% false positive rate and misses about half of actual hallucinations. The explanation layer applies Natural Language Inference to each flagged span, classifying it as entailed (likely false positive — filter it out), neutral (unverifiable claim), or contradicted (direct factual error). This three-way classification maps to severity scores that drive downstream decisions: contradictions block or flag the response, neutral claims get logged for review, and entailment results get cleared.
The total pipeline latency runs 76ms at P50 and 162ms at P99 — negligible against typical LLM generation times of 5–30 seconds. Latency scales linearly with context length: about 125ms for 4K tokens, rising to 365ms at 16K tokens.
Grounding: Where Your Ground Truth Comes From
A detection pipeline is only as good as its reference material. The three practical sources of ground truth in production are tool outputs, RAG retrieval results, and structured data.
Tool outputs as ground truth. If your agent calls a function that returns structured data — a database query, an API response, a calculation result — that output is your highest-confidence reference. The model's response should be consistent with these tool results. This is the easiest verification path because tool outputs are typically precise and machine-readable. In function-calling architectures, the tool response content serves directly as the context for hallucination detection.
RAG retrieval results. Retrieved passages provide grounding for knowledge-intensive queries, but they introduce a complication: the retrieval itself might be insufficient. Research shows that over 50% of complex retrieval queries lack sufficient context for correct generation even when retrieval "succeeds" by similarity metrics. Your detection pipeline needs to distinguish between three failure modes: the model hallucinated despite having correct context (generation failure), the retrieved context didn't contain the answer (retrieval failure), or the model correctly abstained from answering (the desired behavior when context is insufficient).
Structured data and knowledge bases. For entity verification — checking that a cited person, company, statute, or product actually exists — you need lookup against authoritative sources. This doesn't require a model; a database query or API call suffices. The challenge is coverage: you need the right reference data for your domain.
One critical design principle: never verify claims against the model's own output. Using the same LLM to both generate and verify creates a closed loop where the model confirms its own fabrications. Cross-model verification (using a different model for checking) helps, but grounding against external sources is always preferable.
Measuring Hallucination Rate Without a Human-in-the-Loop Bottleneck
You can't fix what you can't measure, and you can't measure hallucination at scale if every claim requires a human reviewer. Six automated metrics form a practical measurement stack.
Faithfulness is the primary metric. For each sentence in the model's output, check whether it's supported by the provided context. Implementation ranges from NLI-based entailment scoring to LLM-as-a-judge with constrained rubrics. The ratio of unsupported claims to total claims gives you a faithfulness score per response. Aggregate across traffic for a system-level hallucination rate.
Span-level attribution goes beyond binary faithfulness to verify that citations actually support their claims. Citation presence has near-zero correlation with factual accuracy — models cite passages at nearly identical rates whether the citation supports the claim or not. You need to verify that (a) the cited passage exists and (b) it actually entails the claim it's attached to.
Consistency catches hallucinations through self-agreement. Sample multiple generations for the same query and measure agreement. Inconsistent answers suggest the model is generating from its prior rather than from grounded context. This metric is computationally expensive (requiring multiple generations) but catches a class of hallucination that single-pass metrics miss.
- https://blog.vllm.ai/2025/12/14/halugate.html
- https://www.nature.com/articles/s41586-024-07421-0
- https://arxiv.org/abs/2406.15927
- https://arxiv.org/abs/2510.06265
- https://www.getmaxim.ai/articles/measuring-llm-hallucinations-the-metrics-that-actually-matter-for-reliable-ai-apps/
- https://deepeval.com/docs/metrics-hallucination
- https://engineering.salesforce.com/grounding-enterprise-ai-with-live-web-retrieval-and-verifiable-citations/
- https://arxiv.org/abs/2603.22812
