The Cross-User Consistency Problem: When Your AI Gives Different Answers to the Same Question
Two analysts at the same company both ask your AI assistant: "What was our Q3 churn rate?" One gets 4.2%. The other gets 4.8%. Neither is wrong — they just queried at different times, in different session contexts, against a retrieval index that ranked slightly different chunks. The AI answered both confidently, without hedging, without flagging the discrepancy. The analysts go into the same meeting with different numbers, and your tool has just become a liability.
This is the cross-user consistency problem, and it's one of the most common reasons enterprise AI deployments quietly lose trust. The failure isn't a hallucination in the classic sense — no facts were invented. The failure is that your system is non-deterministic at scale, and that non-determinism is invisible until two users compare notes.
Why Temperature=0 Doesn't Save You
The standard instinct when building enterprise AI is to set temperature to zero and call the problem solved. It doesn't work. Recent research examining five LLMs configured for maximum determinism found accuracy variations of up to 15% across otherwise identical runs. The gap between best and worst observed performance reached 70% in some tasks.
The technical reasons run deeper than sampling:
Floating-point non-associativity. GPUs process tensor operations with floating-point arithmetic, which is not associative — (a + b) + c can produce a different result than a + (b + c) depending on execution order. When the same forward pass runs on different hardware threads or under different memory layouts, the logits shift. The model itself is deterministic in principle; the hardware executing it is not.
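You can see the underlying arithmetic property without a GPU at all. This is only a toy illustration in plain Python, not what an actual kernel does, but the rounding behavior is the same one that shifts logits at scale:

```python
# Floating-point addition is not associative: the same three values,
# summed in a different order, round to different results.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c)                  # 0.6000000000000001
print(a + (b + c))                  # 0.6
print((a + b) + c == a + (b + c))   # False
```

Scale that up to billions of accumulations per forward pass and the low-order bits stop being negligible.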
Batch contamination. In batched inference, a prompt's output depends in part on what other requests are being processed concurrently. Forward passes are not "batch invariant" — the same input produces different results depending on what's in the batch alongside it. Two users asking the same question a few seconds apart will land in different batches and get slightly different outputs.
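A rough analogy, not a real batched kernel: when the same reduction is split into differently sized blocks (roughly what happens when batch shape changes a kernel's tiling), the float32 result drifts. The array sizes and block sizes below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float32)
w = rng.standard_normal(4096).astype(np.float32)

def dot_with_block_size(block: int) -> np.float32:
    """Same dot product, but the reduction is split into chunks of `block`
    and the partial sums are accumulated in float32."""
    total = np.float32(0.0)
    for i in range(0, x.size, block):
        total += np.dot(x[i:i + block], w[i:i + block])
    return total

print(dot_with_block_size(64))
print(dot_with_block_size(1024))
print(dot_with_block_size(64) == dot_with_block_size(1024))  # frequently False
```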
Retrieval index drift. RAG-based systems compound this. The top-k chunks retrieved for a query can change based on index updates, embedding model patches, or similarity threshold tuning. Your retrieval tier is effectively a moving target, and small shifts in chunk rankings produce notably different contexts, which produce different answers.
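A toy illustration of the drift, with random vectors standing in for chunk embeddings (the dimensions and counts are arbitrary): the same query's top-k changes once a re-ingest appends new chunks to the index.

```python
import numpy as np

def top_k(query_vec, index, k=3):
    """Indices of the k most cosine-similar rows in the index."""
    index_norm = index / np.linalg.norm(index, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    return np.argsort(-(index_norm @ q))[:k]

rng = np.random.default_rng(42)
query = rng.standard_normal(64)

# Morning: the index holds 500 chunk embeddings.
morning_index = rng.standard_normal((500, 64))
print("morning top-3:  ", top_k(query, morning_index))

# Afternoon: a re-ingest appends 50 chunks. New neighbors can crowd into
# the top-k, so the prompt context shifts even though the question did not.
afternoon_index = np.vstack([morning_index, rng.standard_normal((50, 64))])
print("afternoon top-3:", top_k(query, afternoon_index))
```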
Mixture-of-Experts routing. Models using MoE architectures route tokens to different expert subnetworks. Under load, routing decisions change as expert capacity fills — a token that routes to expert A under low traffic routes to expert B when A is saturated. The resulting distribution shifts.
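A deliberately simplified sketch of capacity-based routing (real routers are learned and far more involved; the capacities and preferences here are made up) shows how the expert a token lands on depends on what else is in flight:

```python
def route(token_preferences, capacity):
    """Assign each token to its most preferred expert that still has room.
    token_preferences: per-token list of experts, best first."""
    load = {}
    assignments = []
    for ranked in token_preferences:
        for expert in ranked:
            if load.get(expert, 0) < capacity:
                load[expert] = load.get(expert, 0) + 1
                assignments.append(expert)
                break
    return assignments

our_token = [0, 1]  # prefers expert 0, falls back to expert 1

# Quiet period: expert 0 has room, so the token gets its first choice.
print(route([our_token], capacity=2))                  # [0]

# Busy period: two other tokens fill expert 0 first, and the identical
# token now spills to expert 1 and flows through different weights.
print(route([[0, 1], [0, 1], our_token], capacity=2))  # [0, 0, 1]
```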
The combined effect: even in production systems explicitly designed for determinism, the same query from two users in different sessions will often produce semantically divergent outputs.
The Personalization Trap
Here's where the problem becomes architectural rather than just technical. Many teams, when confronted with inconsistency complaints, reach for personalization as the justification: "Of course the answers differ — they're tailored to the user."
Personalization is a legitimate design goal. A compliance officer asking about exposure should get different framing than an analyst building a dashboard. Different roles, different data access, different context — different answers. That's feature behavior, not a bug.
The trap is using "personalization" as cover for unintentional variance. When two users with identical roles, identical data access, and identical intent ask the same question and get inconsistent answers, the problem isn't feature richness — it's undisclosed non-determinism presenting itself as personalization. These are not the same thing, and keeping them distinct in your mental model matters a great deal.
The engineering question is: which dimensions of variance are intentional? Role-based context injection is intentional. Permission-filtered retrieval is intentional. Session-to-session floating-point jitter is not. Retrieval index drift between morning and afternoon is not. Most teams have no clear answer to this question because they've never explicitly mapped it.
A useful framing is authorized variance versus ambient variance. Authorized variance is variance you designed: role gates, context injection, permission filtering. Ambient variance is variance you inherited from the stack: hardware non-determinism, batch effects, index drift. The problem isn't that variance exists — it's that most systems ship ambient variance while advertising authorized variance.
Consistency Mechanisms That Actually Work
Reducing ambient variance in production requires operating at multiple layers simultaneously.
Cache at the query level. The most reliable consistency mechanism is avoiding the model entirely for repeated queries. Semantic query caching — hashing the embedding of the user's question and returning a cached answer for high-similarity matches — eliminates variance for frequently asked questions completely. This isn't a trick; it's a legitimate architectural choice. For institutional knowledge queries where the underlying data hasn't changed, returning a cached answer is both faster and more consistent than rolling the inference dice again.
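A minimal sketch of the idea, assuming you already have an embedding function to hand (embed_fn, generate_answer, and the 0.97 threshold below are placeholders, not recommendations):

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.97  # hypothetical default; tune per domain

class SemanticCache:
    """Return a cached answer when a new query embeds close enough to a prior one."""

    def __init__(self, embed_fn, threshold=SIMILARITY_THRESHOLD):
        self.embed_fn = embed_fn   # any text -> vector function
        self.threshold = threshold
        self.entries = []          # list of (normalized embedding, answer) pairs

    def lookup(self, query: str):
        q = self._normalize(self.embed_fn(query))
        for emb, answer in self.entries:
            if float(q @ emb) >= self.threshold:
                return answer      # cache hit: no inference, no variance
        return None

    def store(self, query: str, answer: str):
        self.entries.append((self._normalize(self.embed_fn(query)), answer))

    @staticmethod
    def _normalize(vec):
        vec = np.asarray(vec, dtype=np.float32)
        return vec / (np.linalg.norm(vec) + 1e-12)

# Usage sketch (embed_fn and generate_answer stand in for your embedding
# model and LLM call):
# cache = SemanticCache(embed_fn)
# answer = cache.lookup(user_question)
# if answer is None:
#     answer = generate_answer(user_question)
#     cache.store(user_question, answer)
```

A real deployment would back the lookup with an ANN index rather than a linear scan, and would invalidate cache entries when the underlying data changes.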
Ensemble with deterministic selection. For high-stakes queries where caching isn't appropriate, generate three to five candidate responses and select the winner using deterministic rules: a structured schema check, keyword presence validation, or a constrained scoring rubric. The goal is to use the model's generative capability for generation while removing it from the selection step. Selection based on random sampling is where ambient variance concentrates.
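One way such a deterministic selector might look, sketched with a made-up JSON schema and keyword rubric (REQUIRED_KEYS, REQUIRED_TERMS, and generate_candidates are all hypothetical):

```python
import json

REQUIRED_KEYS = {"metric", "value", "period"}   # hypothetical schema check
REQUIRED_TERMS = ("churn", "Q3")                # hypothetical keyword checks

def score(candidate: str) -> tuple:
    """Deterministic rubric: schema validity first, then keyword coverage,
    then shorter answers win. No randomness anywhere in selection."""
    try:
        parsed = json.loads(candidate)
        schema_ok = isinstance(parsed, dict) and REQUIRED_KEYS.issubset(parsed)
    except json.JSONDecodeError:
        schema_ok = False
    keyword_hits = sum(term.lower() in candidate.lower() for term in REQUIRED_TERMS)
    return (int(schema_ok), keyword_hits, -len(candidate))

def select_answer(candidates: list[str]) -> str:
    # The rubric plus a lexicographic tie-break makes the winner a pure
    # function of the candidate set, independent of generation order.
    return max(candidates, key=lambda c: (score(c), c))

# Usage sketch (generate_candidates stands in for your n-sample LLM call):
# candidates = generate_candidates(prompt, n=5)
# final = select_answer(candidates)
```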
Sources
- https://blog.kilo.ai/p/enterprise-ai-has-a-trust-problem
- https://orcaworks.ai/why-ai-answers-feel-inconsistent-even-when-the-data-is-right/
- https://arxiv.org/html/2408.04667v5
- https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/
- https://deepchecks.com/glossary/llm-output-consistency/
- https://galileo.ai/blog/production-llm-monitoring-strategies
- https://www.gooddata.ai/blog/how-context-management-builds-trust-in-ai-decisions/
- https://aws.amazon.com/blogs/machine-learning/can-your-governance-keep-pace-with-your-ai-ambitions-ai-risk-intelligence-in-the-agentic-era/
- https://arxiv.org/html/2510.25506v3
