
The Persona Your System Prompt Offered That the Model Picked the Same Way Every Time

10 min read
Tian Pan
Software Engineer

A product team I talked to recently ran a three-arm A/B test on response personas — concise, thorough, conversational — for three weeks across every cohort. The system prompt described all three and asked the model to pick the one that best matched the user. When they opened the dataset to write the readout, one number stopped them cold: the "thorough" arm had 91% of the traffic. The other two were rounding error.

Their experiment platform had not flagged anything. No alert fired. The pipeline did exactly what they had told it to do. Three weeks of supposed multi-persona testing had produced a dataset that could only tell them about thorough. The other two arms were too thin to power any inference at all.

The instinct in the room was that the prompt needed work — better instructions, sharper distinctions between personas, a more deliberate example for the conversational case. That diagnosis would have been right ten years ago in a rules-driven router. It is wrong for a model. The prompt was not the variable. The router was.

When the model is the randomizer, it isn't random

If you have ever read a paper on causal inference, you know the cardinal rule of A/B testing: treatment assignment has to be independent of anything about the unit that could influence the outcome. You toss the coin outside the system you are measuring. Once you let the system you are measuring participate in the toss, the coin is no longer fair, and the comparison you think you are running is not the comparison you are actually running.

Asking a language model to choose a persona is exactly that violation. The model is not a coin. It is a high-dimensional function whose prior over its own behavior is shaped by terabytes of training data. When you write "pick concise, thorough, or conversational based on the user," you are not delegating to a random process. You are deferring to a learned preference that was baked in months before your product existed.

The literature on this is not subtle. Verbosity bias in LLM outputs is documented across model families and training regimes — RLHF, DPO, and supervised fine-tuning on long chain-of-thought traces all reinforce a tendency toward longer, more thorough responses, often regardless of whether the added length improves quality. Verbosity compensation has been measured at frequencies between 13.6% and 74% depending on task and prompt, and the pattern persists even when prompts explicitly request concision. The model's prior on output length is not weak. It is, in many production settings, the strongest signal in the room.

So when your system prompt offers three personas as a free choice, the model is not running your experiment. It is running an observational study of its own training preferences, and reporting the results as if the user had asked for them.

Why your experiment platform didn't notice

The most uncomfortable part of the story is not that the imbalance happened. It is that the imbalance went undetected for three weeks in a platform that was, in principle, instrumented to catch it.

Sample Ratio Mismatch (SRM) detection is the standard guardrail against this exact failure mode. A chi-squared test against the expected allocation, run continuously on accumulated traffic, flags any experiment whose split has drifted past a reasonable threshold, and the platform can then refuse to score it. SRM detectors are not exotic; they ship in most modern experimentation platforms, and the commonly recommended significance threshold is strict, around p < 0.01 or lower.
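To make the check concrete, here is a minimal sketch of an SRM test for a three-arm experiment with an intended even split. The function names and the example counts are my own illustration, not any platform's API. With three arms the chi-squared test has two degrees of freedom, so the p-value has the closed form exp(-x/2) and no statistics library is needed.

```python
import math

def srm_p_value_three_arms(observed, expected_share=(1 / 3, 1 / 3, 1 / 3)):
    """Chi-squared goodness-of-fit test for a three-arm split.

    Returns (chi2, p_value). Three arms give 2 degrees of freedom, and the
    chi-squared survival function for df=2 is exactly exp(-chi2 / 2).
    """
    if len(observed) != 3 or len(expected_share) != 3:
        raise ValueError("this sketch only handles three arms")
    total = sum(observed)
    chi2 = sum(
        (obs - total * share) ** 2 / (total * share)
        for obs, share in zip(observed, expected_share)
    )
    return chi2, math.exp(-chi2 / 2)

def has_srm(observed, alpha=0.01):
    """True when the observed split is implausible under the expected one."""
    _, p_value = srm_p_value_three_arms(observed)
    return p_value < alpha

# Hypothetical counts: one arm holding 91% of 10,000 requests.
print(has_srm([9100, 450, 450]))    # True
# Hypothetical counts: a split that is even up to noise.
print(has_srm([3340, 3330, 3330]))  # False
```

The test is only as good as the counts it is fed, which is where this story goes wrong.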

The reason the detector did not fire is simple. The platform was watching the arm the user was assigned to, not the arm the model picked. From the platform's point of view, every user who entered the experiment was assigned to a single "persona-router" treatment, and the persona that came out the other end was a property of the model's response, not a property of the experimental allocation. The randomizer that mattered — the model — was downstream of the randomizer the platform was monitoring.

This is the structural lesson. SRM detection only works when the assignment it is checking is the assignment that actually drives the comparison. Any experiment whose effective arms are determined inside an opaque function downstream of the platform's randomizer is invisible to SRM by construction. The platform is not broken. It is correctly silent about a question it was never asked.
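One way to make the hidden split visible, sketched under assumptions: log the persona the model actually used next to the arm the platform assigned, then tally each. The record shape, the field names `assigned_arm` and `effective_persona`, and the counts are all hypothetical; only the 91% share mirrors the anecdote above.

```python
from collections import Counter

def tally(records, field):
    """Count how many logged responses fall under each value of `field`."""
    return Counter(record[field] for record in records)

def largest_share(counts):
    """Fraction of traffic held by the single biggest arm."""
    total = sum(counts.values())
    return max(counts.values()) / total if total else 0.0

# Illustrative log of 100 responses. Every request carries the same
# platform-level arm; the persona is whatever the model chose.
records = (
    [{"assigned_arm": "persona-router", "effective_persona": "thorough"}] * 91
    + [{"assigned_arm": "persona-router", "effective_persona": "concise"}] * 5
    + [{"assigned_arm": "persona-router", "effective_persona": "conversational"}] * 4
)

platform_view = tally(records, "assigned_arm")        # what SRM was watching
effective_view = tally(records, "effective_persona")  # what drove the comparison

print(dict(platform_view))
print(dict(effective_view))
```

The platform view contains a single arm, so there is no split for a ratio check to question. The imbalance only exists in the effective view, and that is the field a mismatch check would need to read.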

The architectural fix: route outside the model

The cleanest correction is also the most boring. Persona assignment should be performed by your application, with the platform's randomizer, before the model is invoked. The chosen persona is then passed into the system prompt as a parameter the model must obey, not as a menu it may choose from.

In practical terms, the difference looks like this. The wrong shape is a system prompt that says: "Pick concise, thorough, or conversational based on the user's apparent preference." The right shape is a system prompt that says: "Respond in the {persona} persona. Definition: {definition}." The persona variable is filled by the same router that assigns every other experimental treatment in your product.
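A minimal sketch of the right shape, assuming a hash-based bucket as a stand-in for whatever randomizer your experiment platform provides. The persona definitions, the experiment name, and the function names are hypothetical placeholders; the prompt template is the one quoted above.

```python
import hashlib

# Hypothetical persona definitions, for illustration only.
PERSONAS = {
    "concise": "Answer in as few sentences as the question allows.",
    "thorough": "Cover the reasoning, edge cases, and caveats in full.",
    "conversational": "Use a friendly, informal tone with short turns.",
}
ARMS = sorted(PERSONAS)  # fixed order so bucket numbers stay stable

def assign_persona(user_id, experiment="persona-test"):
    """Deterministically map a user to an arm, outside the model.

    Stand-in for the platform's randomizer: hash the experiment name and
    user id, then take the bucket modulo the number of arms.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(ARMS)
    return ARMS[bucket]

def build_system_prompt(persona):
    """Pass the persona in as a parameter, not as a menu."""
    return f"Respond in the {persona} persona. Definition: {PERSONAS[persona]}"

persona = assign_persona("user-12345")
print(persona)
print(build_system_prompt(persona))
```

Because the arm is fixed before the model is invoked, the assignment the platform logs is the assignment that drives the comparison, and the usual split checks apply to it directly.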
