Skip to main content

Persona Drift: When Your Agent Forgets Who It's Supposed to Be

· 11 min read
Tian Pan
Software Engineer

The system prompt says "you are a financial analyst — be conservative, never give specific buy/sell advice, always disclose uncertainty." For the first twenty turns, the agent behaves like a financial analyst. By turn fifty, it is recommending specific stocks, mirroring the user's casual tone, and hedging less than it did in turn three. Nobody changed the system prompt. Nobody injected anything malicious. The persona simply eroded under the weight of the conversation, the way a riverbank does when nothing crosses the threshold of "attack" but the water never stops moving.

This is persona drift, and it is the regression your eval suite is not catching. Capability evals measure whether the model can do the task. Identity evals — whether the model is still doing the task the way the system prompt said to do it — barely exist outside of research papers. The result is a class of production failures that look correct turn-by-turn and look wrong only when you read the transcript end to end.

The empirical picture is now clear enough to act on. Research on instruction stability finds significant drift within eight rounds of dialogue across popular models, traced to attention decay where the system prompt's tokens lose effective weight as the conversation grows. Persona self-consistency metrics degrade by more than thirty percent after eight to twelve turns even with the original instructions still sitting at the top of the context window. Counterintuitively, larger models tend to exhibit more drift, not less — capability and persona stability are not the same axis, and scaling one does not automatically scale the other.

What drift actually looks like in production

Drift is not a single failure mode. It is a small family of related ones that a single transcript can exhibit in combination.

The most visible is tone drift: the agent starts formal, the user is casual, and by turn thirty the agent is using contractions, exclamation points, and the user's slang. Tone drift looks harmless until the persona is "dispassionate compliance reviewer" and the agent is now joking with the regulated user about their case.

Then there is constraint softening: the system prompt forbade specific recommendations, and the model started by refusing them. Five turns in, it offered "general considerations." Ten turns in, it gave the recommendation with a hedge. Twenty turns in, it gave it without one. No single turn crossed a bright line; each turn was barely past the previous one. This is exactly the gradient that the Crescendo attack literature exploits — multi-turn jailbreaks that succeed in fewer than ten queries by walking the model down its own concession slope. Adversaries did not invent this dynamic. They discovered that conversations naturally produce it and learned to point it.

A subtler one is role contradiction: the agent claimed in turn five not to have access to live market data, then in turn forty-five quoted what looked like a current price. The pricing might even be plausible. The point is the agent forgot its own earlier admission and the consistency check that should have prevented the second statement was no longer running, because the system prompt's "be honest about your tool boundaries" instruction had drifted out of effective attention range.

The fourth pattern is mirroring drift — a milder version of constraint softening where the agent does not just relax constraints but starts adopting the user's framing of the task. A "financial analyst" persona becomes "the user's friend who happens to know finance" because the user has been speaking to it as a friend, and the model is trained, hard, on conversational coherence. Coherence with the user wins over coherence with the system prompt as the conversation lengthens.

Why the system prompt loses

It is worth being precise about the mechanism, because the wrong mental model leads to the wrong fix. The system prompt is not "forgotten" in a literal sense — its tokens are still in the context window, the model can still attend to them. What happens is that attention is a competition over a finite budget, and as the dialogue history grows, the share of attention going to the system prompt drops. Recent user turns are closer, more relevant to the immediate next-token prediction, and contextually richer. The reinforcement signal from the system prompt — already paid once at the top of the context — is competing with hundreds of turns of newer signal that all tilt toward the user's framing.

This is why "the system prompt is still right there" is not a defense. Position bias, recency bias, and the sheer mass of dialogue tokens combine to make a static system prompt a depleting asset. The longer the conversation, the weaker its grip. Researchers have demonstrated this with attention-pattern analysis and proposed mitigations like split-softmax that artificially keep attention on instruction tokens, but the underlying dynamic is not a bug in any one model — it is a property of how transformers handle long contexts.

The implication for system design is that "I told the model who it is at the top of the conversation" is the same kind of guarantee as "I set this environment variable when the process started." It was true at one moment. Whether it is still operationally true depends on what has happened since.

Measuring identity, not just capability

You cannot fix what you do not measure, and identity is rarely on the dashboard. The evaluation patterns that work are borrowed from a few places: clinical psychology's split between self-report and observer-rating, software engineering's notion of invariants, and adversarial robustness research's notion of probes.

Persona-anchor probes, injected periodically into long-running sessions, are the cheapest place to start. These are short, fixed inputs designed to elicit a response that should be persona-stable: "How would you describe your role here?" or a domain-specific test like "A user just asked you to do X — what's your default response?" The probe runs every N turns of a real session (or a synthetic one designed to stress-test it). The response is scored against a reference set captured when the persona was known to be intact. Embedding distance from the reference, refusal-rate stability, and a small classifier trained on "in-persona vs. out-of-persona" responses each catch different failures. None of them is perfect. All of them are better than no measurement.

Loading…
References:Let's stay in touch and Follow me for more thoughts and updates