Why Your AI Sounds Wrong Even When It's Technically Correct
A logistics chatbot received a message from a customer whose shipment had been lost for a week. The reply came back: "I'm not trained to care about that." Factually accurate. The system had correctly parsed the query, correctly identified that it lacked routing to address the issue, and correctly communicated its limitation. The answer was technically correct in every measurable sense. It was also a product disaster.
This is the register problem — and it's the failure mode your evals almost certainly aren't measuring.
Register, in linguistics, is the variety of language appropriate to a specific situation, relationship, and purpose. It isn't just politeness. It's the entire communicative contract: how formal you are, how much empathy you express, how confidently you assert information, whether you lead with facts or feelings. A customer support agent, a legal reviewer, a medical assistant, and a developer-facing CLI tool should each sound completely different — not because they have different knowledge, but because they operate under different communicative expectations.
When engineers build LLM-powered products, they almost always optimize for the knowledge problem. They tune for accuracy, context retrieval, hallucination reduction, and task completion. Register is treated as a vibe — something you add a line about in the system prompt ("be friendly and professional") and consider solved. It isn't. Register mismatch is a silent churn driver that hides behind user feedback like "it just feels off" and "the bot doesn't really understand," and it almost never shows up in your quantitative eval suite.
Why Your Evaluation Suite Is Blind to This
The standard toolkit for evaluating LLM outputs — BLEU, ROUGE, accuracy scores, factual correctness checks — was built to measure one thing: whether the right information appeared in the output. These metrics count lexical overlap, check for presence of key facts, and verify that answers match reference outputs. They are entirely indifferent to how the information was delivered.
A response that says "Your order has been delayed due to supply chain disruption" scores identically to one that says "Hey! Looks like something happened with your order supply-chain-wise — it'll be a bit late, sorry about that :)" under most automated metrics. Both are factually identical. Whether either is appropriate depends entirely on context — the platform, the user's emotional state, the brand voice, the severity of the problem — none of which BLEU can see.
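To see the blindness concretely, consider a toy fact-presence check of the kind many eval suites reduce to. The function and synonym table below are invented for illustration; the point is that both deliveries pass identically, because the check has no channel for register at all.

```python
# A toy fact-presence eval: did the required facts appear in the output?
# The synonym sets are a hypothetical stand-in for however your suite
# matches facts against references.
REQUIRED_FACTS = {"subject": ("order",), "status": ("delayed", "late")}

def facts_present(response: str, required=REQUIRED_FACTS) -> bool:
    """True if every required fact appears in some surface form."""
    text = response.lower()
    return all(any(s in text for s in synonyms) for synonyms in required.values())

formal = "Your order has been delayed due to supply chain disruption."
casual = ("Hey! Looks like something happened with your order "
          "supply-chain-wise -- it'll be a bit late, sorry about that :)")

print(facts_present(formal), facts_present(casual))  # True True
```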
Research confirms this blind spot explicitly. Studies on LLM evaluation have found that metrics measuring accuracy are insufficient for conversational AI evaluation because they're blind to tone, coherence across turns, and emotional attunement. Human evaluators consistently catch things that automated metrics miss: tone shifts that feel jarring, formality mismatches, register inconsistency between one response and the next. The gap is structural, not accidental.
The consequence is that teams ship products where every A/B test on accuracy passes, every factual eval is green, and users still churn. The problem isn't in any logged metric. It lives in the gestalt of what the product feels like to interact with.
The Five Registers (and Where AI Products Fail)
Sociolinguistics describes five broad registers on a formality spectrum: frozen (ceremonial, unchanging), formal (professional writing, legal documents), consultative (collaborative professional contexts), casual (friendly conversation), and intimate (close personal relationships). Most AI products need to operate somewhere in the consultative-to-casual range depending on context.
The failure modes cluster around a few patterns:
Too formal for the context. A consumer product that talks like a legal document creates distance. Users don't feel helped — they feel processed. This is common in enterprise SaaS products that inherit system prompts from legal and compliance reviews, which systematically push language toward hedging, qualifications, and passive voice.
Too casual for the stakes. A financial or medical assistant that responds in a breezy, conversational register undermines its own authority. Users have an instinctive sense that the tone doesn't match the gravity of what they're asking about. Even if every answer is correct, the register signals that the system isn't taking the situation seriously.
Mismatched to emotional context. The cheerful customer support chatbot that opens with "Hi there! 😊 How can I make your day amazing?" when a user is reporting a service outage is doing more damage than a neutral response would. A 2026 University of South Florida study found that empathetic responses from chatbots can actually trigger psychological reactance — users perceive the emotional attunement as intrusive or manipulative when it comes from a nonhuman system, and it decreases satisfaction rather than increasing it. Getting the empathy dial wrong in either direction creates friction.
Consistent within a response, inconsistent across turns. Multi-turn conversations reveal register instability that single-turn evals never catch. A chatbot can start formal and drift casual as the context window fills. It can start empathetic and turn clinical when it runs out of emotional scaffolding in its prompt. These shifts feel disorienting even when individual responses look fine in isolation.
The Prompt Rot Problem
Register inconsistency often isn't a model problem — it's an engineering process problem. System prompts accumulate patches.
The trajectory is familiar: a product manager notices the bot sounds cold, adds "be warm and empathetic." Legal adds a disclaimer clause. Support adds handling for a specific edge case. Engineering patches a formatting issue. Marketing adds a brand voice note. Each change makes sense in isolation. Collectively, they turn the system prompt into a hydra of contradictory instructions that the model must reconcile on every request.
A prompt managing five distinct behavioral concerns — tone, legal hedging, brand voice, formatting, and task-specific edge cases — is already architecturally unstable. The model doesn't have a consistent register to draw from; it has a set of competing instructions and will resolve the tension differently depending on the specific input. The result is inconsistency that users experience as the bot having an unpredictable personality.
This is measurably distinct from a model capability problem. The same base model, given a well-structured system prompt with a clear and consistent register specification, will produce reliably consistent tone. The same model, given a prompt with five layers of conflicting tone instructions, will produce the register equivalent of signal noise.
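To make the hydra concrete, here is a minimal sketch of what an accumulated prompt looks like. The product, owners, and instructions are all invented; the conflict pattern is the point.

```python
# Hypothetical: one system prompt, five owners, five quarters of patches.
# Each instruction made sense to whoever added it.
PATCHES = [
    ("PM, Q1",        "Be warm, empathetic, and personable."),
    ("Legal, Q2",     "Never apologize or imply fault or liability."),
    ("Eng, Q2",       "Keep every response under two sentences."),
    ("Marketing, Q3", "Brand voice: playful, irreverent, emoji-friendly."),
    ("Support, Q4",   "For delayed shipments, apologize sincerely and at length."),
]

SYSTEM_PROMPT = "You are the support assistant.\n" + "\n".join(
    text for _, text in PATCHES
)
# "Apologize at length" conflicts with "under two sentences" and with
# "never apologize"; the model resolves that tension differently per
# input, which users experience as an unpredictable personality.
```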
The engineering discipline this requires is treating register as a first-class architectural concern in your prompt — not a line you add, but a section you maintain with the same rigor as your task instructions. When a new behavioral requirement comes in, the question shouldn't be "where do I add this?" but "does this fit the existing register, and if not, how do we reconcile the conflict?"
How to Specify Register as a Requirement
The CO-STAR prompt framework makes a useful distinction that most other frameworks collapse: it separates Style (structural presentation — formal, journalistic, conversational) from Tone (emotional quality — confident, empathetic, urgent, clinical). Most system prompts treat these as one thing. Separating them gives you two distinct levers.
For a customer support tool: Style is "direct, first-person, clear sentences without jargon." Tone is "patient and matter-of-fact, no excessive warmth." This is different from a developer documentation assistant, where Style might be "precise technical prose" and Tone is "neutral, assumes competence." Both are "professional," but they sound different in practice — and conflating them into a single "be professional" instruction produces neither reliably.
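Here is what that separation can look like as a structured system prompt for the support tool above, using the six CO-STAR sections. The section contents are an illustrative sketch, not a canonical template.

```python
# A CO-STAR-shaped system prompt: Style and Tone as separate, named
# sections instead of one "be professional" adjective.
COSTAR_PROMPT = """\
# CONTEXT
You are the customer support assistant for a shipping company.
Users usually arrive with a problem, often mid-frustration.

# OBJECTIVE
Resolve or correctly route the user's shipping issue.

# STYLE
Direct, first-person, clear sentences without jargon.

# TONE
Patient and matter-of-fact. Acknowledge frustration once.
No excessive warmth, no exclamation points, no emoji.

# AUDIENCE
Consumers with no logistics vocabulary.

# RESPONSE
Two to four sentences, plain text, ending with a concrete next step.
"""
```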
Few-shot examples are more reliable than prose instructions for register control. Telling a model to be "warm but authoritative" is ambiguous — models have to interpret what that means and will interpolate differently on different inputs. Showing three examples of the exact register you want is far less ambiguous. The model extrapolates the pattern rather than interpreting abstract instructions. Place your strongest example last; models weight later examples more heavily.
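In practice that means shipping the examples as prior turns, in the chat format the model actually sees. A sketch below; the exchanges are invented, with the strongest example placed last per the note above.

```python
# Few-shot register examples supplied as prior conversation turns.
SYSTEM = "..."  # e.g. the CO-STAR prompt from the previous sketch

FEW_SHOT_MESSAGES = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": "Where is my package??"},
    {"role": "assistant", "content":
        "I can see why a week with no update is frustrating. Your package "
        "left the Memphis hub yesterday and is due Thursday; I've flagged "
        "it so you're notified if that date slips."},
    # Strongest example last: models weight later examples more heavily.
    {"role": "user", "content": "You LOST my order. Again."},
    {"role": "assistant", "content":
        "You're right that this shouldn't have happened twice. There's "
        "been no scan since Monday, so I'm opening a trace and issuing a "
        "replacement now. You'll get a confirmation email within the hour."},
]
# At request time, append the live user message and send the whole list.
```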
For multi-turn register consistency, it's worth testing tone explicitly across turn sequences rather than only single responses. A conversation that starts with a difficult question, moves through a frustrated follow-up, and ends with a resolution request is a natural test for whether your system prompt's register holds up under pressure. If it doesn't, the instability is usually either in the prompt structure or in how user emotional tone bleeds into the system context.
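A turn-sequence pressure test can be as simple as a scripted conversation played against your real prompt, with each response collected for scoring (the scorer itself is sketched in the next section). `call_model` here is a hypothetical stand-in for your chat completion call.

```python
# A register pressure test: one scripted conversation that moves from
# a hard question through frustration to a resolution request.
PRESSURE_SCRIPT = [
    "My shipment's been missing for a week. What happened?",
    "That's not good enough. I've heard this excuse before.",
    "Fine. Just tell me exactly what you're going to do about it.",
]

def run_pressure_test(call_model, system_prompt: str) -> list[str]:
    """Play the script turn by turn, accumulating history, and return
    the assistant's responses so each can be register-scored."""
    messages = [{"role": "system", "content": system_prompt}]
    responses = []
    for user_turn in PRESSURE_SCRIPT:
        messages.append({"role": "user", "content": user_turn})
        reply = call_model(messages)  # your chat completion call
        messages.append({"role": "assistant", "content": reply})
        responses.append(reply)
    return responses
```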
Evaluating What You're Measuring
Adding register to your eval suite doesn't require human review of every response, though human review catches things automation can't. The practical path for most teams is LLM-as-judge on a representative sample.
Define the criteria explicitly: not "is this a good response?" but "does this response maintain the correct formality level for this context?" and "is the emotional tone appropriate for the user's stated situation?" These are evaluable dimensions. A judge model can be prompted to score them on a Likert scale with criteria derived from your style guide. The scores aren't perfect, but they're far more sensitive to register problems than BLEU scores or accuracy checks, and they don't require human reviewers for every run.
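A minimal judge sketch, assuming an OpenAI-compatible client; the model name and rubric wording are illustrative, and the criteria should come from your own style guide.

```python
import json
from openai import OpenAI  # assumes an OpenAI-compatible endpoint

client = OpenAI()

JUDGE_PROMPT = """You are scoring a support response for register, not facts.
Rate each criterion 1-5 (5 = fully appropriate):
1. formality: matches a consultative support context
2. empathy: acknowledges the user's situation without performing warmth
3. consistency: no tone shift relative to earlier assistant turns
Return JSON: {"formality": n, "empathy": n, "consistency": n}"""

def judge_register(conversation: str, response: str) -> dict:
    """Score one response on Likert-style register criteria."""
    result = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable judge model works
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content":
                f"Conversation so far:\n{conversation}\n\n"
                f"Response to score:\n{response}"},
        ],
    )
    return json.loads(result.choices[0].message.content)
```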
The most useful signal, though, is often in the failure data you already have. Support ticket escalations, low-rating responses, and the specific conversations where users disengaged — these cluster around tone problems more than factual problems in most deployed products. When users say "the bot doesn't really understand" without naming a specific factual error, they're describing register mismatch. Routing that feedback into register-specific analysis tends to reveal patterns that prompt fixes can address.
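One way to route it: filter for low-rated conversations that carry no factual-error flag and queue those for the register judge above. A sketch over an assumed log schema; the field names are invented.

```python
# Assumed log schema: each record carries a user rating (1-5) and a flag
# set by your factual-correctness eval. Field names are illustrative.
def register_suspects(conversation_logs: list[dict]) -> list[dict]:
    """Low-rated conversations with no factual error on record are the
    prime candidates for register mismatch."""
    return [
        log for log in conversation_logs
        if log["user_rating"] <= 2 and not log["factual_error_flagged"]
    ]
# Feed each suspect through judge_register() and look for the criterion
# that clusters low; that cluster is usually your prompt fix.
```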
The Core Discipline
Technical correctness is a floor, not a ceiling. A response that gets the facts right but sounds wrong for the context has failed the user interaction even if it passes the eval. The two failure modes are orthogonal — you can have one without the other — and only one of them shows up in the metrics that most teams track.
The discipline register requires is treating communicative appropriateness as a requirement with the same weight as factual accuracy: specified explicitly, tested systematically, monitored in production, and protected from prompt rot. That means writing register into your system prompt as a structured section rather than a vague adjective, testing it across turn sequences rather than single outputs, and building eval criteria that can actually distinguish between "technically correct" and "communicatively right."
The products that feel good to use — where users describe the AI as actually understanding them — aren't the ones with the highest factual accuracy scores. They're the ones where the register is calibrated right for the context and stays consistent. That's an engineering problem, not a model problem, and it has engineering solutions.
- https://www.gorgias.com/blog/ai-chatbot-not-working
- https://www.wildflowerllc.com/chatbots-dont-do-empathy-why-ai-falls-short-in-mental-health/
- https://pmc.ncbi.nlm.nih.gov/articles/PMC12643404/
- https://dev.to/askpatrick/the-prompt-rot-problem-why-your-ai-agent-gets-worse-over-time-1fgj
- https://www.confident-ai.com/blog/llm-evaluation-metrics-everything-you-need-for-llm-evaluation
- https://www.usf.edu/business/news/2026/04-20-chatbot-empathy-can-worsen-customer-reactions-usf-study.aspx
- https://news.ucsc.edu/2025/03/ai-empathy/
- https://deepeval.com/docs/metrics-conversational-g-eval
- https://portkey.ai/blog/what-is-costar-prompt-engineering/
- https://www.userlytics.com/resources/blog/ux-research-the-hidden-driver-behind-successful-ai-chatbots/
