Skip to main content

Persona Drift: When Your Agent Forgets Who It's Supposed to Be

· 11 min read
Tian Pan
Software Engineer

The system prompt says "you are a financial analyst — be conservative, never give specific buy/sell advice, always disclose uncertainty." For the first twenty turns, the agent behaves like a financial analyst. By turn fifty, it is recommending specific stocks, mirroring the user's casual tone, and hedging less than it did in turn three. Nobody changed the system prompt. Nobody injected anything malicious. The persona simply eroded under the weight of the conversation, the way a riverbank does when nothing crosses the threshold of "attack" but the water never stops moving.

This is persona drift, and it is the regression your eval suite is not catching. Capability evals measure whether the model can do the task. Identity evals — whether the model is still doing the task the way the system prompt said to do it — barely exist outside of research papers. The result is a class of production failures that look correct turn-by-turn and look wrong only when you read the transcript end to end.

The empirical picture is now clear enough to act on. Research on instruction stability finds significant drift within eight rounds of dialogue across popular models, traced to attention decay where the system prompt's tokens lose effective weight as the conversation grows. Persona self-consistency metrics degrade by more than thirty percent after eight to twelve turns even with the original instructions still sitting at the top of the context window. Counterintuitively, larger models tend to exhibit more drift, not less — capability and persona stability are not the same axis, and scaling one does not automatically scale the other.

What drift actually looks like in production

Drift is not a single failure mode. It is a small family of related ones that a single transcript can exhibit in combination.

The most visible is tone drift: the agent starts formal, the user is casual, and by turn thirty the agent is using contractions, exclamation points, and the user's slang. Tone drift looks harmless until the persona is "dispassionate compliance reviewer" and the agent is now joking with the regulated user about their case.

Then there is constraint softening: the system prompt forbade specific recommendations, and the model started by refusing them. Five turns in, it offered "general considerations." Ten turns in, it gave the recommendation with a hedge. Twenty turns in, it gave it without one. No single turn crossed a bright line; each turn was barely past the previous one. This is exactly the gradient that the Crescendo attack literature exploits — multi-turn jailbreaks that succeed in fewer than ten queries by walking the model down its own concession slope. Adversaries did not invent this dynamic. They discovered that conversations naturally produce it and learned to point it.

A subtler one is role contradiction: the agent claimed in turn five not to have access to live market data, then in turn forty-five quoted what looked like a current price. The pricing might even be plausible. The point is the agent forgot its own earlier admission and the consistency check that should have prevented the second statement was no longer running, because the system prompt's "be honest about your tool boundaries" instruction had drifted out of effective attention range.

The fourth pattern is mirroring drift — a milder version of constraint softening where the agent does not just relax constraints but starts adopting the user's framing of the task. A "financial analyst" persona becomes "the user's friend who happens to know finance" because the user has been speaking to it as a friend, and the model is trained, hard, on conversational coherence. Coherence with the user wins over coherence with the system prompt as the conversation lengthens.

Why the system prompt loses

It is worth being precise about the mechanism, because the wrong mental model leads to the wrong fix. The system prompt is not "forgotten" in a literal sense — its tokens are still in the context window, the model can still attend to them. What happens is that attention is a competition over a finite budget, and as the dialogue history grows, the share of attention going to the system prompt drops. Recent user turns are closer, more relevant to the immediate next-token prediction, and contextually richer. The reinforcement signal from the system prompt — already paid once at the top of the context — is competing with hundreds of turns of newer signal that all tilt toward the user's framing.

This is why "the system prompt is still right there" is not a defense. Position bias, recency bias, and the sheer mass of dialogue tokens combine to make a static system prompt a depleting asset. The longer the conversation, the weaker its grip. Researchers have demonstrated this with attention-pattern analysis and proposed mitigations like split-softmax that artificially keep attention on instruction tokens, but the underlying dynamic is not a bug in any one model — it is a property of how transformers handle long contexts.

The implication for system design is that "I told the model who it is at the top of the conversation" is the same kind of guarantee as "I set this environment variable when the process started." It was true at one moment. Whether it is still operationally true depends on what has happened since.

Measuring identity, not just capability

You cannot fix what you do not measure, and identity is rarely on the dashboard. The evaluation patterns that work are borrowed from a few places: clinical psychology's split between self-report and observer-rating, software engineering's notion of invariants, and adversarial robustness research's notion of probes.

Persona-anchor probes, injected periodically into long-running sessions, are the cheapest place to start. These are short, fixed inputs designed to elicit a response that should be persona-stable: "How would you describe your role here?" or a domain-specific test like "A user just asked you to do X — what's your default response?" The probe runs every N turns of a real session (or a synthetic one designed to stress-test it). The response is scored against a reference set captured when the persona was known to be intact. Embedding distance from the reference, refusal-rate stability, and a small classifier trained on "in-persona vs. out-of-persona" responses each catch different failures. None of them is perfect. All of them are better than no measurement.

Refusal-rate stability deserves its own callout because it is the metric most directly tied to safety. For every category the persona is supposed to refuse, measure the refusal rate at turn 1, turn 10, turn 30, turn 60. If it slides, you are watching a Crescendo-shaped vulnerability open up in real time, regardless of whether anyone is currently exploiting it.

Within-conversation invariants are stronger than turn-level scoring. Pick three or four claims the persona must continuously honor — "I do not have access to user PII," "I cannot execute trades," "I disclose when I am uncertain" — and assert them at every checkpoint. A drift that takes a model from "I cannot execute trades" to "I will execute that trade" is detected as an invariant violation, not as a tone shift, even if the tone-distance score happens to look fine.

Out-of-distribution adversarial probes complete the picture. A red-team set built specifically to drift the persona — extended roleplay, repeated nudging, framing reversals — should be part of the eval, not separate from it. The literature on persona-prompt attacks shows refusal-rate drops of fifty to seventy percent against models that look fine on standard benchmarks. The benchmarks were not wrong. They were not measuring identity stability.

Reinforcement patterns that actually hold

Once you can measure drift, the question is what to do about it. The honest answer is that no single technique solves it; layered defenses do, and the layers fall into three groups.

The first is prompt-template separation. Most teams concatenate system prompt, persona instructions, and tool definitions into one block and call it the system message. Splitting identity tokens from instruction tokens at the template level — a dedicated <persona> section that some implementations rebroadcast or re-emphasize at structural boundaries — gives you a surface to reinforce later. You cannot reinforce what you have not first separated.

The second is periodic re-injection at semantic boundaries. Not every N turns blindly — that wastes tokens and can read as repetitive to the user — but at boundaries the agent itself can detect: a topic shift, a new task, a tool call returning, a long-running plan checkpointing. The re-injection can be the full persona, a compressed "role checkpoint summary" generated once and reused, or a structural reminder that the model is still the persona it started as. Echo-protocol-style approaches that detect tone divergence and inject a corrective signal are more sophisticated forms of the same idea: do not reinforce on a timer, reinforce when measurement says reinforcement is needed.

The third is role-checkpoint summaries for sessions that exceed a length threshold. When the conversation crosses, say, fifty turns, the agent (or a side process) generates a short "what is the persona, what has the user established, what constraints are active" summary that gets folded back in as fresh top-of-context content. This is not memory in the long-term sense; it is a re-statement of identity in tokens that haven't been diluted by hundreds of turns of dialogue. The cost is one extra LLM call per checkpoint and a small token overhead per subsequent turn. The benefit is that the persona's tokens are always near the front of the model's effective attention.

Layered together, these reduce drift dramatically without eliminating it. The right framing is not "we made the persona stable" but "we made the rate of drift slow enough that our session lengths fit inside the half-life."

The adversarial half of the problem

A persona that drifts under benign conversation drifts faster under adversarial pressure. This is the part of the problem that has moved from research curiosity to active threat: the same dynamic that produces benign drift produces multi-turn jailbreaks, and the literature on Crescendo, persona-prompt attacks, and roleplay-based bypasses now reports attack success rates in the high eighties to low nineties against models that pass single-turn safety evals cleanly.

The specific failure mode is that the user does not have to break the persona at once. They have to walk it. Each turn is closer than the previous, each concession is slightly past the last one, and the gradient is shallow enough that no individual turn looks like an attack. By the time the agent is producing the output a turn-1 refusal would have blocked, the system prompt is no longer in effective control of the response.

The defense is the same defense as for benign drift, with a sharper edge. Refusal-rate probes have to run continuously, not just at session start. Invariants have to be enforced at every turn, not just sampled. And the adversarial test set has to include the multi-turn shape of the attack, not just the final harmful prompt — because the final harmful prompt looks innocent in isolation. It is only harmful in context, after the persona has been walked down to the point where it would say yes.

The eval that is missing

If you take one thing away, it should be this: most teams have an eval for capability and no eval for identity stability. The capability eval keeps reporting green while the identity regression ships, because the identity regression does not show up in single-turn benchmarks. It shows up in production, in long sessions, after a user has established a relationship with the agent that no test set captures.

Building the missing eval is not a research project. It is a product of the patterns above: a set of persona-anchor probes, a refusal-rate-over-turn-count graph, a small library of invariants, an adversarial drift suite. The deliverable is a number — the persona half-life — that tells you how many turns your agent stays in character before drift crosses your tolerance. That number then becomes a release gate, a budget for session lengths, and a forcing function for the reinforcement patterns that extend it.

The agents that hold their shape under long pressure will not be the ones with the most clever system prompt. They will be the ones whose teams measured the half-life, accepted that the static prompt was a depleting asset, and built the reinforcement loop that refuels it. That is the engineering discipline this category needs, and almost nobody has it yet.

References:Let's stay in touch and Follow me for more thoughts and updates