
Building a Hallucination Detection Pipeline for Production LLMs

Tian Pan · Software Engineer · 12 min read

Your LLM application passes every eval. The demo looks flawless. Then a user asks about a niche regulatory requirement and the model confidently cites a statute that doesn't exist. The support ticket lands in your inbox twelve hours later, long after the fabricated answer has been forwarded to a compliance team. This is the hallucination problem in production: not that models get things wrong, but that they get things wrong with the same fluency and confidence as when they get things right.

Most teams treat hallucination as a prompting problem — add more context, tune the temperature, tell the model to "only use provided information." These measures help, but they don't solve the fundamental issue. Post-hoc verification — checking claims after generation rather than hoping the model won't make them — is cheaper, more reliable, and composes better with existing infrastructure than any prevention-only strategy.

The Three Categories of Hallucination

Not all hallucinations are created equal, and your detection strategy depends on which type you're dealing with.

Fabrication is the most dangerous category. The model invents entities, citations, statistics, or events that have no basis in any source. A fabricated legal citation looks syntactically correct — proper reporter format, plausible volume numbers — but points to a case that was never decided. Research estimates that AI-hallucinated citations now number in the hundreds of thousands across published literature. Fabrication requires existence verification: checking that referenced entities actually exist in an authoritative source.

Contradiction occurs when the model's output conflicts with information in its provided context. A RAG system retrieves a document stating quarterly revenue was $4.2 billion, then the model's summary says $4.8 billion. This is the most mechanically detectable category because you have the ground truth right there in the context window. Natural Language Inference (NLI) models excel here — they classify whether a claim is entailed by, neutral to, or contradicted by a source passage.

Outdated knowledge is the subtlest failure mode. The model's parametric knowledge was correct at training time but the world has moved on. A company changed its CEO, a regulation was amended, a library deprecated a function. This category is particularly insidious because the model states facts it was trained on with high confidence, and no amount of prompt engineering fixes a training data problem. Detection requires freshness-aware retrieval: checking claims against current sources rather than trusting the model's internal knowledge.

Each category demands different detection machinery. A pipeline that catches contradictions but ignores fabrication provides false confidence — you catch the easy problems and miss the ones that actually damage trust.

Anatomy of a Production Detection Pipeline

The most effective hallucination detection architecture follows a three-stage pattern: classify, detect, explain. This approach, pioneered in systems like vLLM's HaluGate, balances latency with precision by avoiding expensive detection on queries that don't need it.

Stage 1: Sentinel classification. Not every LLM response needs factual verification. Creative writing, brainstorming, and opinion-seeking queries can pass through unchecked. A lightweight binary classifier — a fine-tuned encoder model with a LoRA adapter — determines whether the incoming query is fact-seeking. This classifier runs in roughly 12ms at P50 and filters out about 35% of requests that don't require verification, cutting downstream detection costs by over 70%.
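The gating logic is simple once the classifier exists. Here's a minimal sketch — the keyword heuristic stands in for the fine-tuned encoder classifier, and `detect` stands in for the downstream detection stages; neither is a real library API:

```python
from typing import Callable

def verify_if_needed(
    query: str,
    response: str,
    is_fact_seeking: Callable[[str], bool],   # stand-in for the LoRA-adapted encoder
    detect: Callable[[str, str], list],       # stand-in for stages 2 and 3
) -> list:
    """Gate expensive detection behind a cheap fact-seeking classifier."""
    if not is_fact_seeking(query):
        return []  # creative/opinion queries pass through unverified
    return detect(query, response)

def keyword_sentinel(query: str) -> bool:
    """Toy stand-in classifier: flags queries with fact-seeking markers."""
    fact_markers = ("who", "when", "how many", "cite", "statute", "revenue")
    return any(m in query.lower() for m in fact_markers)
```

In production the heuristic would be replaced by the actual classifier model; the control flow stays the same.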

Stage 2: Token-level detection. For fact-seeking queries, a token classifier examines each token in the response against the provided context (tool outputs, RAG results, or other grounding sources). Rather than producing a binary "hallucinated or not" label for the entire response, token-level detection identifies exactly which spans lack support. The input format concatenates context, question, and answer with separator tokens, and the classifier labels each answer token as supported or unsupported. Consecutive unsupported tokens merge into spans for human-readable output.
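The span-merging step is worth making concrete. A minimal sketch, assuming the token classifier's output arrives as a boolean list aligned with the answer tokens:

```python
def merge_spans(tokens: list[str], unsupported: list[bool]) -> list[tuple[int, int, str]]:
    """Merge consecutive unsupported tokens into (start, end, text) spans.

    `end` is exclusive. Labels come from the upstream token classifier.
    """
    spans = []
    start = None
    for i, flag in enumerate(unsupported):
        if flag and start is None:
            start = i                     # open a new span
        elif not flag and start is not None:
            spans.append((start, i, " ".join(tokens[start:i])))
            start = None                  # close the current span
    if start is not None:                 # span runs to end of answer
        spans.append((start, len(tokens), " ".join(tokens[start:])))
    return spans
```

Real tokenizers complicate the `" ".join` (subword pieces, whitespace markers), but the merge logic is the same.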

Stage 3: NLI explanation. Token-level detection alone produces roughly a 33% false positive rate and misses about half of actual hallucinations. The explanation layer applies Natural Language Inference to each flagged span, classifying it as entailed (likely false positive — filter it out), neutral (unverifiable claim), or contradicted (direct factual error). This three-way classification maps to severity scores that drive downstream decisions: contradictions block or flag the response, neutral claims get logged for review, and entailment results get cleared.
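A sketch of that filtering step, with `nli(premise, hypothesis)` as a stand-in for whatever NLI model you deploy (it returns one of the three standard labels), and severities taken from the routing scheme described later:

```python
SEVERITY = {"entailment": 0, "neutral": 2, "contradiction": 4}

def explain_spans(spans: list[str], context: str, nli) -> list[dict]:
    """Run NLI on each flagged span; drop entailed spans as false positives.

    `nli(premise, hypothesis)` is a stand-in returning
    "entailment" | "neutral" | "contradiction".
    """
    findings = []
    for span in spans:
        label = nli(context, span)
        if label == "entailment":
            continue  # likely token-classifier false positive — clear it
        findings.append({"span": span, "label": label, "severity": SEVERITY[label]})
    return findings
```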

The total pipeline latency runs 76ms at P50 and 162ms at P99 — negligible against typical LLM generation times of 5–30 seconds. Latency scales linearly with context length: about 125ms for 4K tokens, rising to 365ms at 16K tokens.

Grounding: Where Your Ground Truth Comes From

A detection pipeline is only as good as its reference material. The three practical sources of ground truth in production are tool outputs, RAG retrieval results, and structured data.

Tool outputs as ground truth. If your agent calls a function that returns structured data — a database query, an API response, a calculation result — that output is your highest-confidence reference. The model's response should be consistent with these tool results. This is the easiest verification path because tool outputs are typically precise and machine-readable. In function-calling architectures, the tool response content serves directly as the context for hallucination detection.

RAG retrieval results. Retrieved passages provide grounding for knowledge-intensive queries, but they introduce a complication: the retrieval itself might be insufficient. Research shows that over 50% of complex retrieval queries lack sufficient context for correct generation even when retrieval "succeeds" by similarity metrics. Your detection pipeline needs to distinguish between three failure modes: the model hallucinated despite having correct context (generation failure), the retrieved context didn't contain the answer (retrieval failure), or the model correctly abstained from answering (the desired behavior when context is insufficient).
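That triage can be expressed as a small decision function. The three boolean inputs are themselves judgments from upstream checks (context sufficiency, abstention detection, faithfulness) — each a model or heuristic in its own right, which this sketch assumes you already have; the `over_abstention` branch is an extra case beyond the three modes above:

```python
def triage(context_has_answer: bool, model_abstained: bool, answer_supported: bool) -> str:
    """Classify a RAG response into a failure mode for downstream reporting."""
    if not context_has_answer:
        # Context was insufficient: abstaining is correct, answering is a retrieval problem.
        return "correct_abstention" if model_abstained else "retrieval_failure"
    if model_abstained:
        return "over_abstention"  # answer was available but the model refused
    return "ok" if answer_supported else "generation_failure"
```

Splitting failures this way matters operationally: retrieval failures route to the search team, generation failures to prompt/model work.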

Structured data and knowledge bases. For entity verification — checking that a cited person, company, statute, or product actually exists — you need lookup against authoritative sources. This doesn't require a model; a database query or API call suffices. The challenge is coverage: you need the right reference data for your domain.

One critical design principle: never verify claims against the model's own output. Using the same LLM to both generate and verify creates a closed loop where the model confirms its own fabrications. Cross-model verification (using a different model for checking) helps, but grounding against external sources is always preferable.

Measuring Hallucination Rate Without a Human-in-the-Loop Bottleneck

You can't fix what you can't measure, and you can't measure hallucination at scale if every claim requires a human reviewer. Six automated metrics form a practical measurement stack.

Faithfulness is the primary metric. For each sentence in the model's output, check whether it's supported by the provided context. Implementation ranges from NLI-based entailment scoring to LLM-as-a-judge with constrained rubrics. The ratio of unsupported claims to total claims gives you a faithfulness score per response. Aggregate across traffic for a system-level hallucination rate.
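The aggregation itself is trivial once you have a per-sentence support check; a sketch, with `is_supported` standing in for the NLI or judge call:

```python
def faithfulness(sentences: list[str], is_supported) -> float:
    """Fraction of output sentences supported by the provided context.

    `is_supported(sentence)` is a stand-in for an NLI entailment check
    or a constrained LLM-as-judge call.
    """
    if not sentences:
        return 1.0  # vacuously faithful; pick your own convention for empty output
    supported = sum(1 for s in sentences if is_supported(s))
    return supported / len(sentences)
```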

Span-level attribution goes beyond binary faithfulness to verify that citations actually support their claims. Citation presence has near-zero correlation with factual accuracy — models cite passages at nearly identical rates whether the citation supports the claim or not. You need to verify that (a) the cited passage exists and (b) it actually entails the claim it's attached to.

Consistency catches hallucinations through self-agreement. Sample multiple generations for the same query and measure agreement. Inconsistent answers suggest the model is generating from its prior rather than from grounded context. This metric is computationally expensive (requiring multiple generations) but catches a class of hallucination that single-pass metrics miss.
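A minimal sketch of the agreement score, assuming you already have a semantic-equivalence check (`agree` here is any callable returning a score in [0, 1] — bidirectional entailment in practice, exact match in this toy test):

```python
from itertools import combinations

def consistency(samples: list[str], agree) -> float:
    """Mean pairwise agreement across sampled generations for one query."""
    pairs = list(combinations(samples, 2))
    if not pairs:
        return 1.0  # fewer than two samples: nothing to disagree
    return sum(agree(a, b) for a, b in pairs) / len(pairs)
```

Low scores flag queries where the model is guessing; those are the ones worth routing into deeper verification.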

Context relevance, precision, and recall measure retrieval quality as a leading indicator. Track "retrieved but unused" passages (wasted context budget) and "claims with no supporting source in the context window" (potential hallucination sources). Low context recall directly predicts higher hallucination rates.

Entailment and contradiction scores provide per-claim severity classification. Wire these into alerting: a contradiction-level finding in a high-stakes domain should trigger immediate escalation, while neutral (unverifiable) claims can be logged and reviewed in batches.

Operational metrics — latency, token counts, and tool call counts per request — correlate with quality in ways that aren't obvious until you measure them. Responses that consume more tokens or trigger more tool calls tend to have higher hallucination rates, likely because the model is attempting more complex reasoning. These metrics serve as cheap proxy signals for prioritizing which responses to send through expensive verification.

Wire these metrics into your CI/CD pipeline: run evaluations on every prompt change and block merges when faithfulness drops below your threshold.

Semantic Entropy: Uncertainty as a Hallucination Signal

One of the most promising detection approaches doesn't analyze the output at all — it analyzes the model's uncertainty during generation.

Semantic entropy, published in Nature in 2024, computes entropy over the meanings of generated sentences rather than the tokens themselves. When a model is confident, multiple samples express the same meaning in different words (low semantic entropy). When a model is hallucinating, samples diverge semantically (high semantic entropy). This detects confabulations — arbitrary incorrect generations — which are among the hardest hallucinations to catch with output-only analysis.
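The core computation is clustering samples by meaning and taking entropy over cluster sizes. A simplified sketch — `same_meaning` stands in for the bidirectional-entailment check the method uses to decide two generations express the same claim:

```python
import math

def semantic_entropy(samples: list[str], same_meaning) -> float:
    """Entropy (in nats) over meaning clusters of sampled generations."""
    clusters: list[list[str]] = []
    for s in samples:
        for c in clusters:
            if same_meaning(s, c[0]):   # greedy assignment to first matching cluster
                c.append(s)
                break
        else:
            clusters.append([s])        # no match: start a new cluster
    n = len(samples)
    # High entropy = samples diverge semantically = likely confabulation.
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)
```

Four samples that all mean the same thing give entropy 0; four samples split across two meanings give ln 2 ≈ 0.69, a signal worth thresholding on.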

The original implementation had a production-killing limitation: it required 5–10x the normal inference compute, since you need multiple generations per query to estimate entropy. Two recent advances have made it deployable.

Semantic Entropy Probes (SEPs) are linear probes trained on the model's hidden states that capture semantic uncertainty without generating multiple samples. They nearly eliminate the computational overhead at inference time while maintaining detection quality. This is the approach most likely to see production adoption at scale.

Adaptive Bayesian estimation dynamically adjusts sampling requirements based on observed uncertainty. In low-uncertainty scenarios, it terminates early — requiring about 50% fewer samples to achieve comparable detection performance. For high-uncertainty cases, it invests more samples where they're most informative.

The practical tradeoff: semantic entropy methods require access to model internals (hidden states or logits), which rules them out for closed-source API providers unless the provider exposes them. For self-hosted models, they're increasingly viable. For API-based deployments, you're limited to output-analysis methods like NLI and consistency checking.

Designing the Decision Layer

Detection without action is just logging. Your pipeline needs a decision layer that translates detection results into system behavior.

Severity-based routing maps detection output to actions. Contradictions (severity 4) against grounded context warrant blocking the response or replacing the flagged spans with a disclaimer. Neutral/unverifiable claims (severity 2) deserve a softer treatment — a header annotation, a confidence indicator in the UI, or a logged flag for batch review. Entailed claims (severity 0) pass through.
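As code, the routing is a threshold on the worst finding — a sketch, with action names chosen for illustration:

```python
def route(findings: list[dict]) -> str:
    """Map the maximum-severity finding to a response-level action."""
    max_sev = max((f["severity"] for f in findings), default=0)
    if max_sev >= 4:
        return "block"     # contradiction against grounded context
    if max_sev >= 2:
        return "annotate"  # unverifiable claim: UI indicator, batch review
    return "pass"          # everything entailed
```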

The header pattern is the lowest-friction integration for existing systems. Rather than modifying response bodies, attach detection metadata as HTTP headers: whether fact-checking was performed, whether hallucinations were detected, which spans were flagged, and the maximum severity score. Downstream consumers — your frontend, your logging pipeline, your safety layer — can each decide how to act on this metadata independently.
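A sketch of the metadata encoding — the header names here are illustrative, not a standard, and findings are assumed to carry `span` and `severity` fields:

```python
def detection_headers(checked: bool, findings: list[dict]) -> dict[str, str]:
    """Encode detection results as response headers for downstream consumers."""
    if not checked:
        # Transparent degradation: absence of verification is explicit, not silent.
        return {"X-Factcheck-Performed": "false"}
    return {
        "X-Factcheck-Performed": "true",
        "X-Hallucination-Detected": "true" if findings else "false",
        "X-Flagged-Spans": ";".join(f["span"] for f in findings),
        "X-Max-Severity": str(max((f["severity"] for f in findings), default=0)),
    }
```

The frontend might highlight `X-Flagged-Spans` in the rendered response while the safety layer keys off `X-Max-Severity` alone.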

Selective abstention is the underused pattern that delivers the biggest accuracy improvement. When your pipeline detects that the provided context is insufficient to answer the query — low context recall, high retrieval uncertainty — the correct behavior is refusing to answer rather than generating a best-guess response. Research shows selective abstention improves real-world accuracy 2–10x over always generating, because removing low-confidence responses raises the accuracy of the responses you do return.

Transparent degradation matters for trust. When your pipeline encounters a fact-seeking query but has no grounding context (no tool results, no RAG passages), don't silently pass the response through. Flag it explicitly as unverified. Users and downstream systems can then apply their own risk tolerance. Silent passthrough is the opposite of safety — it makes the absence of verification indistinguishable from successful verification.

Operationalizing the Pipeline

Building the detection pipeline is half the work. Running it in production introduces engineering constraints that research papers rarely address.

Latency budgets are non-negotiable. For synchronous serving, your detection overhead needs to stay under 200ms at P99 to avoid perceptible delays. Running detection models in-process (via frameworks like Candle for Rust) rather than as separate microservices cuts cold-start time from seconds to under 500ms and reduces memory per model from 2–4GB to under 1GB. True parallelism without Python's GIL is a meaningful advantage at scale.

Context length determines cost. Detection latency scales linearly with context length. At 4K tokens you're comfortable; at 16K tokens you're approaching the upper bound for synchronous detection. For longer contexts, consider asynchronous verification: return the initial response immediately and send detection results as a follow-up, allowing the UI to retroactively annotate flagged claims.

Monitor the monitors. Your detection pipeline has its own failure modes — false positive rates that drift as model behavior changes, NLI models that degrade on domain-specific language, sentinel classifiers that miscategorize new query patterns. Track detection precision and recall on a holdout set of human-labeled examples, and alert when metrics shift. A detection pipeline that generates excessive false positives will be disabled by frustrated engineers faster than one that misses occasional hallucinations.

Close the feedback loop. Every flagged hallucination is a training signal. Aggregate flagged spans by topic, query pattern, and source type to identify systematic weaknesses. If your model consistently hallucinates about dates, that's a targeted fine-tuning or prompt engineering opportunity. If hallucinations cluster around a specific data source, that source's retrieval quality needs investigation. The detection pipeline's long-term value isn't just catching individual hallucinations — it's revealing the patterns that let you prevent them.

The Honest Tradeoff

No detection pipeline eliminates hallucinations entirely. Current best-in-class systems reduce undetected hallucinations by roughly 35% in production — meaningful, but far from complete. The remaining gap is inherent to the technology: models hallucinate because they're generating from probability distributions, not querying a database of facts.

The practical question isn't "how do I prevent all hallucinations?" but "how do I make my system's reliability measurable, improvable, and honest about its limitations?" A system that flags 20% of its responses as potentially hallucinated and lets users verify them is more trustworthy than one that presents every output with equal confidence. Detection, measurement, severity routing, and selective abstention — these four capabilities compose into a pipeline that makes LLM reliability an engineering problem rather than a prayer.
