The Chatbot That Inherited Your Support Team's Worst Habits

10 min read
Tian Pan
Software Engineer

You fine-tuned on a year of real customer-service transcripts because that is where the domain knowledge lives. The model now sounds like your support team. It also apologizes before it has a reason to, offers a goodwill credit it has no authority to grant, says "I've escalated this to our tier-two queue" — a queue that does not exist for it — and writes back in the half-sentence shorthand your agents use to ping each other in Slack. Domain accuracy on your eval set looks great. Three weeks into production the refunds line is up and legal wants a word.

The chatbot did not go rogue. It learned exactly what you trained it on. The problem is that a transcript is not a record of domain knowledge — it is a record of organizational behavior, and the two are stapled together at the token level in a way that supervised fine-tuning cannot separate. The same gradient step that teaches the model your return policy also teaches it that the appropriate response to a frustrated customer is a reflexive "I'm so sorry to hear that," whether or not the situation warrants apology. Your agents had reasons for those reflexes. The model has only the surface.

This is the failure mode that does not show up in your AUC. Domain accuracy says the model knows the policy. Customer satisfaction surveys say the bot is friendly. Refund leakage, escalation rates to non-existent queues, and a slow erosion of brand voice — those numbers live in different dashboards owned by different teams, and by the time anyone notices the correlation you have been training on six months of behavior nobody signed off on.

A transcript is a workflow, not a knowledge base

When you fine-tune on transcripts, you are doing behavioral cloning. The dataset does not contain "the correct answer to this query." It contains "what a human agent did, given everything they knew about that customer, the queue depth, the policy exceptions live that week, the supervisor on shift, the fact that the customer had already been transferred twice, and a private guess about whether this person was about to escalate to Twitter." All of that conditioning is invisible to the loss function. The loss only sees the tokens.
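Writing out the standard token-level objective for supervised fine-tuning makes the omission visible. In this sketch, $x$ is the visible conversation so far, $y_t$ are the agent's tokens, and $c$ is a symbol introduced here purely for illustration: it stands for everything the agent knew that the transcript never recorded.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid x,\, y_{<t}\right),
\qquad \text{while the agent actually produced } y \sim \pi_{\text{agent}}\left(\,\cdot \mid x,\, c\right).
```

Because $c$ never appears in the conditioning of the loss, the model can only fit the agent's behavior averaged over every hidden context that happened to share the same visible text.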

So the model learns the tokens. It learns that when a transcript contains the phrase "I've been waiting three weeks," the agent's next turn opens, ninety percent of the time, with "I'm so sorry to hear that, let me take a look right away." That is a fine response from a human agent, because the human can in fact take a look right away. The model says it and then cannot follow through, because the tool it would need either does not exist in its action surface or returns an error it does not know how to surface. The apology and the absence of follow-through are now wired into the same forward pass.

Worse, the transcripts contain shortcuts that work because both ends were humans on the same team. "Bumping to T2" is meaningful in your queue management system. The model picks it up as a phrase that produces customer placation. It will tell the customer the ticket has been bumped to T2. There is no T2 for the bot. The customer waits. Nothing happens. The ticket dies in a state your dashboards do not have a column for.

The same problem shows up at every layer of the transcript. Agents elide policy details that the customer would not understand because the supervisor will fill them in later. The model elides those details and there is no supervisor. Agents make commitments — "I'll personally make sure this gets resolved by Friday" — that are routed through a human escalation network the model does not have. The commitment shows up in the customer-facing reply anyway. The model has been trained on a thousand examples that taught it commitment language is what calms an angry customer, and it has zero examples that taught it the cost of an uncovered commitment.

The Air Canada problem is not a one-off

The case where the bot promises something the company has to honor is now precedent. A Canadian tribunal ruled in 2024 that the airline was bound by a chatbot's hallucinated bereavement-fare policy. The interesting thing is not that the bot hallucinated — every model does, given enough surface area. The interesting thing is the shape of the hallucination. It sounded exactly like a customer service answer. The hallucinated policy was internally coherent, sympathetic in tone, ended with a confident next step, and contradicted the actual policy that lived two clicks away on the same site.

That shape is the signature of supervised fine-tuning on real transcripts. A model trained on customer service text has been pulled toward "the kind of thing an agent would say in this situation," which is exactly the wrong inductive bias when the question is "what is true." The objective optimized was conversational plausibility under organizational pressure. Truth was never on the gradient.

The DPD incident earlier the same year went the other direction (a customer service bot swearing and writing poems about how useless its employer was) and was widely framed as a guardrails failure. It was the same underlying mechanism. The model had been trained to be cooperative and to follow the user's lead, and it took the user's lead into territory the brand voice could not have survived if anyone had read the training data with that risk in mind.

These are not edge cases. They are the predictable consequence of fine-tuning a probability distribution over text on a corpus collected for a different purpose.

The curation step nobody scopes

Most teams that fine-tune on transcripts spend weeks on the cleanup pipeline. They strip personally identifiable information. They de-duplicate. They token-balance across intents. They filter out tickets without a clear resolution. Every one of those steps is necessary, and none of them touches the actual problem.

The actual problem is removing the artifacts of human workflow from a record that was generated by humans doing a workflow. That requires labels you do not have. Which sentences are commitments? Which are reflexive apologies? Which references to internal tools are surfacing real product behavior versus calling on infrastructure the model does not have? Which escalation phrases name capabilities the model can actually invoke? Each of these is a per-sentence judgment, and the only way to get them is to read the transcripts with a different question in mind than "is this a good customer service answer."

A practical version of this looks like a behavior-stripping pass on top of the standard cleanup. You build a list of forbidden moves: commitment language the model cannot back, escalation phrases that name non-existent paths, references to internal tools the model has no access to, and conversational shortcuts that only work between humans on the same team. For each example that contains one of those moves, you filter it out, rewrite the agent turn, or annotate it with a counter-signal that says "do not imitate this part." Each option costs more than the cleanup most teams ship. The alternative is shipping the worst habits of your support floor as if they were product features.
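A first pass at the forbidden-move list can be sketched in a few lines of Python. Everything named here is an assumption made for illustration: the pattern names, the regular expressions, and the `Example` record are hypothetical, and a real list would come from reading your own transcripts. Pattern matching will also miss paraphrases, so treat this as a triage step that routes examples to a human reviewer, not as the curation itself.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field

# Hypothetical forbidden moves. These patterns are illustrative only.
FORBIDDEN_MOVES = {
    "uncovered_commitment": re.compile(
        r"\bI(?:'ll| will) personally\b|\bby (?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day\b",
        re.IGNORECASE,
    ),
    "phantom_escalation": re.compile(
        r"\bbump(?:ed|ing)? to T2\b|\btier[- ]two queue\b", re.IGNORECASE
    ),
    "unauthorized_credit": re.compile(r"\bgoodwill credit\b", re.IGNORECASE),
}


@dataclass
class Example:
    """One training pair: a customer message and the agent turn to imitate."""
    customer: str
    agent: str
    flags: list[str] = field(default_factory=list)


def flag_moves(agent_turn: str) -> list[str]:
    """Return the sorted names of forbidden moves found in an agent turn."""
    return sorted(
        name for name, pattern in FORBIDDEN_MOVES.items() if pattern.search(agent_turn)
    )


def strip_behaviors(examples: list[Example]) -> tuple[list[Example], list[Example]]:
    """Split examples into (kept, needs_review) based on flagged moves."""
    kept, needs_review = [], []
    for ex in examples:
        ex.flags = flag_moves(ex.agent)
        (needs_review if ex.flags else kept).append(ex)
    return kept, needs_review
```

Examples in `needs_review` are the ones a person then filters, rewrites, or annotates. The flags record which move triggered the review, so the reviewer reads each transcript with the right question in mind.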

You also need to be honest about which behaviors you actually want to inherit. "Sounds like our brand" is a goal that sneaks in a lot of riders. Do you want the bot to inherit the tendency to over-apologize because that scored well on CSAT surveys that the model will never take? Do you want it to inherit the "let me see what I can do" hedge that gave agents room to check with a supervisor — when the bot has no supervisor to check with? The honest version of the spec separates the voice you want from the workflow that produced the voice, and your training data does not respect that separation unless you make it.

Evals for inherited habits

The eval suites that ship with these models test for things like policy correctness, refusal behavior, tone consistency, and safety. They almost never test for inherited workflow artifacts, because the people building the evals are working from the same transcript world the model trained on. If your eval is "given this customer message, did the model produce a reasonable agent reply," then a reflexive apology, a non-existent escalation, and a phantom commitment all score as reasonable agent replies. They are exactly what an agent would have said.

The evals you need are adversarial against the inherited behavior. Test the model on prompts where the policy is to decline, and check whether it offers a goodwill credit it cannot grant. Test it on the kinds of messages where your agents would normally escalate, and grade not the answer but whether the answer names a real path. Test it for commitments — every "I'll personally" and "by Friday" and "I'm going to make sure" — and have a judge check whether the bot has any mechanism to keep that commitment. Test it for shortcuts: when your agents wrote "BUMPED TO T2" in the resolution notes, does the model now produce text that mentions T2 to the customer? If so, you have shipped an internal vocabulary externally.
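One way to start on these checks is to compare what a reply claims against an explicit inventory of what the bot can do. The sketch below is a simplified stand-in: it uses regular expressions where the text above suggests a judge, and the capability names, patterns, and `run_suite` helper are all hypothetical. The `model` argument is assumed to be any callable that maps a customer message to a reply string.

```python
from __future__ import annotations

import re
from typing import Callable

# Hypothetical inventory of actions the deployed bot can actually perform.
BOT_CAPABILITIES = {"lookup_order", "open_ticket"}

# Hypothetical mapping from reply phrasing to the capability it presupposes.
CLAIM_PATTERNS = [
    (re.compile(r"\b(?:T2|tier[- ]two)\b", re.IGNORECASE), "escalate_tier2"),
    (re.compile(r"\bgoodwill credit\b", re.IGNORECASE), "grant_credit"),
    (
        re.compile(
            r"\bI(?:'ll| will) personally\b|\bI'm going to make sure\b|\bby Friday\b",
            re.IGNORECASE,
        ),
        "schedule_followup",
    ),
    (re.compile(r"\bopened a ticket\b", re.IGNORECASE), "open_ticket"),
]


def unbacked_claims(reply: str, capabilities: set[str] = BOT_CAPABILITIES) -> list[str]:
    """Return capabilities the reply implies but the bot does not have."""
    implied = {cap for pattern, cap in CLAIM_PATTERNS if pattern.search(reply)}
    return sorted(implied - capabilities)


def run_suite(model: Callable[[str], str], prompts: list[str]) -> dict[str, list[str]]:
    """Run each adversarial prompt and collect replies with unbacked claims."""
    failures = {}
    for prompt in prompts:
        missing = unbacked_claims(model(prompt))
        if missing:
            failures[prompt] = missing
    return failures
```

A reply that says "I've opened a ticket" passes here because `open_ticket` is in the inventory, while a mention of T2 fails because nothing backs it. The design choice is that the eval grades the reply against the bot's capabilities, not against what an agent would plausibly have said.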

None of these evals are hard to build. The reason they get skipped is that they grade behaviors the existing eval has implicitly approved of. Adding them feels like raising the bar on a system that already passed. That is precisely the situation in which the eval is most likely to be wrong, because what passes as "agent-like" in the training data is the same thing the eval has been silently rewarding the model for producing.

What you actually transferred

The cleanest way to think about the whole project is to ask what was actually transferred from your support floor to your model. The intended transfer is domain knowledge: the product, the policies, the recurring problems and their resolutions. The unintended transfer is everything else that lived in the same tokens — workflow shortcuts, undocumented escalation customs, conversational habits that worked because both ends were humans on the same payroll, commitment language whose contract was enforced by a network of supervisors and ticket queues the model does not participate in.

A useful instinct after this transfer is to assume that any phrase your support team uses with each other has either already been transferred to the bot or will be the next time you retrain. If you would not be comfortable with a customer reading your team's Slack channel, you should not be comfortable with the fine-tuned model talking to that customer. The model is not editorializing on its training data; it is reproducing it. Whatever your support floor sounded like when nobody from product or legal was watching is what your customers are about to hear, with extra confidence and zero memory of why any of those habits existed.

The fix is not to train less. Fine-tuning on real transcripts is still the best way to teach a model your domain. The fix is to treat the curation step as a behavioral-engineering problem rather than a data-cleaning one, and to write evals that grade what the model imitates rather than how well it imitates. Both of those costs are real, and you should staff them before you ship. The companies that skip them are not saving time; they are deferring an incident to a quarter when fixing it will be substantially more expensive.

What your team does is not always what you want a model to learn. The transcripts cannot tell the difference. You have to.