Synthetic Users for Multi-Turn Agent Eval: When Your Test Fixture Has To Push Back
Single-turn evals are great at one thing: ranking models on tasks where the user types once and walks away. They are useless for the failure modes you actually ship with. The agent that loses track of the user's goal by turn three. The agent that capitulates under polite repetition ("are you sure? could you check again?") and reverses a correct answer. The agent that asks the same clarifying question on turn four that it already asked on turn two, because it can't read its own history. None of these show up in a benchmark where the conversation ends after one exchange.
You can run real-user eval, but it costs hundreds of hours of human review per release and surfaces problems three weeks after they shipped. Or you can build LLM-driven synthetic users — bots with personas, goals, patience, and abandonment thresholds — and run thousands of conversations against a candidate agent every night. This is the approach behind τ-bench, AgentChangeBench, and most production-grade conversational eval setups in 2025–2026. It works, until it doesn't, and the ways it stops working tell you more about your eval pipeline than they do about the synthetic user.
What a Synthetic User Actually Has To Be
A useful synthetic user is not "a prompt that says you are a customer." It is four things, each of which has to be specified before the simulator becomes load-bearing in CI.
Persona. Demographic and behavioral attributes — domain expertise, vocabulary register, emotional tone, willingness to provide details unprompted. Personas should be seeded from clusters of real production traffic, not invented from imagination. AgentChangeBench, for example, defines five personas with explicit behavioral traits, interaction styles, and cooperation levels — not because five is the right number, but because one is obviously wrong.
Goal. What the simulated user is trying to accomplish. The goal must be specific enough to score against (cancel order #4421 and request a refund to original payment method), and the agent must not be told the goal — it has to be elicited through conversation. τ-bench's evaluation works precisely because it compares the database state at the end of a conversation against the annotated goal state, not the agent's self-report.
Patience budget. How many turns the user will tolerate before getting frustrated. How they express that frustration — terser replies, repeating themselves, threatening to leave. Real users have finite patience, and the failure mode of a too-patient simulator is a pipeline that scores agents as 'eventually correct' on a path no real customer would walk.
Abandonment threshold. When the user gives up. This is calibrated against production drop-off curves — if 40% of your real users abandon after three unanswered clarification questions, your simulator should too. Without this, you optimize for a population that politely tolerates whatever the agent puts them through.
If three of these four are missing, you do not have a user simulator. You have an LLM monologue with one role wearing a hat.
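A minimal sketch of what "specified" looks like in practice, as a plain Python config. The field names and example values are illustrative only, not a schema any of the benchmarks above mandate:

```python
from dataclasses import dataclass, field

@dataclass
class SyntheticUser:
    """One fully specified synthetic user. All example values are illustrative."""
    # Persona: behavioral attributes, ideally sampled from real-traffic clusters.
    persona: dict = field(default_factory=lambda: {
        "domain_expertise": "novice",
        "register": "informal",
        "tone": "mildly irritated",
        "volunteers_details": False,
    })
    # Goal: specific enough to score against, and never told to the agent up front.
    goal: str = "cancel order #4421 and refund to the original payment method"
    # End-state to check against the database, tau-bench style, rather than
    # against the agent's self-report.
    goal_state: dict = field(default_factory=lambda: {
        "order_4421.status": "cancelled",
        "refund.method": "original_payment",
    })
    # Patience budget: turns tolerated before frustration behaviors kick in.
    patience_turns: int = 6
    # Abandonment threshold: unhelpful responses tolerated before leaving,
    # calibrated against production drop-off curves.
    abandon_after_unhelpful: int = 3
```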
The Architecture That Catches Real Failures
The most useful synthetic-user pipelines look more like a small distributed system than a prompt template. There are four pieces, and a sketch of the driver loop that ties them together follows the list:
- Persona library seeded from real-traffic clusters. Cluster on intent, vocabulary, and turn-length distributions, then sample personas proportionally. This guarantees that "long-winded confused first-time user" gets eval coverage in proportion to how often they actually arrive.
- Goal injection that mimics how users actually arrive. Real users do not announce their goal in turn one ("I would like to cancel order 4421 and receive a refund"). They arrive obliquely ("hi, I have a problem with something I bought last week"). Encoding this delay forces the agent through clarification turns it would otherwise skip.
- Patience and abandonment sampling. Each conversation samples a patience budget and abandonment threshold from a distribution calibrated against production logs. Conversations that hit the threshold end as user-abandoned, not agent-failed — and the agent's score reflects that.
- Trajectory-level judges. Don't score per-turn quality and average. Score the conversation. Did the user reach their goal? Did the agent contradict itself? Did it ask the same question twice? Was the user's emotional state at the end better, worse, or the same as the start? These are properties of the trajectory, not any single turn.
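Here is a hedged sketch of the driver loop those pieces imply. The callables `agent`, `simulator`, `is_unhelpful`, and `goal_reached` are stand-ins for your own agent endpoint, user-simulator call, unhelpfulness check, and goal check (in τ-bench's style, the last one would diff database state rather than read the transcript):

```python
import random

def run_conversation(agent, simulator, persona_library, drop_off_curve,
                     is_unhelpful, goal_reached, max_turns=20):
    """Drive one synthetic conversation and label how it ended.

    `agent` and `simulator` are callables that return the next message;
    `persona_library` is a list of (persona, weight) pairs built from
    real-traffic clusters; `drop_off_curve` is the list of abandonment
    thresholds observed in production logs.
    """
    personas, weights = zip(*persona_library)
    persona = random.choices(personas, weights=weights, k=1)[0]
    # Patience and abandonment are sampled per conversation, not fixed globally.
    abandon_threshold = random.choice(drop_off_curve)

    history, unhelpful_streak = [], 0
    for _ in range(max_turns):
        history.append({"role": "user", "content": simulator(persona, history)})
        agent_msg = agent(history)
        history.append({"role": "assistant", "content": agent_msg})

        unhelpful_streak = unhelpful_streak + 1 if is_unhelpful(agent_msg, history) else 0
        if unhelpful_streak >= abandon_threshold:
            # Scored as user-abandoned, not as a generic agent failure.
            return {"history": history, "outcome": "user_abandoned"}
        if goal_reached(history):
            return {"history": history, "outcome": "goal_reached"}
    return {"history": history, "outcome": "max_turns_exceeded"}
```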
The judging-by-trajectory part is where most teams are still rebuilding 2024-era infrastructure. Per-turn quality scores aggregate into numbers that look great while the conversation as a whole is incoherent. A goal-success-rate metric, plus a root-cause-of-failure taxonomy, is roughly the floor for a useful pipeline.
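One sketch of what a trajectory-level judge can look like: a single rubric over the whole transcript plus an explicit failure taxonomy. The taxonomy entries and JSON fields here are examples, not a standard, and `llm_judge` is whatever judge model call you already have:

```python
import json

# Example failure taxonomy; yours should come from triaging real failed conversations.
FAILURE_TAXONOMY = [
    "lost_user_goal",
    "capitulated_under_repetition",
    "repeated_clarification",
    "self_contradiction",
    "tool_or_policy_error",
]

TRAJECTORY_RUBRIC = """You are grading an entire conversation, not a single turn.
Return JSON: {{"goal_reached": true|false, "contradicted_self": true|false,
"repeated_question": true|false, "user_sentiment_delta": -1|0|1,
"failure_causes": [subset of {taxonomy}]}}

Conversation transcript:
{transcript}
"""

def judge_trajectory(llm_judge, transcript: str) -> dict:
    """Score the whole trajectory with one judge call; `llm_judge` is text-in, text-out."""
    prompt = TRAJECTORY_RUBRIC.format(taxonomy=FAILURE_TAXONOMY, transcript=transcript)
    return json.loads(llm_judge(prompt))
```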
The Hall of Mirrors Trap
Here is where it gets interesting. The single biggest failure mode of synthetic-user eval is not bad personas — it's that the simulator and the agent share priors. They were trained on the same internet, fine-tuned with overlapping data, prompted by people from the same engineering culture. So when the simulator generates a question, the agent finds it convenient to answer; when the agent emits a wandering response, the simulator finds it polite to accept.
The eval becomes a hall of mirrors. Both sides are LLMs, both sides know the conventions, and the conversation feels coherent because two well-aligned models are co-producing it. Then a real user arrives, asks the same thing in a different register, and the agent falls apart.
A 2026 study titled Lost in Simulation quantified this directly: agent success rates can vary by up to nine percentage points purely from swapping which LLM is playing the user, and the simulators systematically underestimate agent performance on hard tasks while overestimating it on medium-hard ones. Worse, simulators introduce conversational artifacts — phrasings, hedges, completion patterns — that real users do not produce, and the agent's strengths against those artifacts have nothing to do with how it performs against humans. The same study showed measurable disparity for users speaking African American Vernacular English versus Standard American English: the simulator is silently more competent at the dialect it was trained on, and its measurements of the agent inherit that asymmetry.
Practically, this means the most important metric in your synthetic-user pipeline is not goal-success-rate against the simulator. It is simulator-versus-production divergence: a recurring check that compares the simulator's distribution of conversations (turn lengths, abandonment rates, clarification patterns, lexical diversity) against actual production logs. If they drift, your eval is no longer measuring what you think it is measuring, and any improvement on the synthetic benchmark needs to be re-validated before you trust it.
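One way to make that check concrete is to compare per-conversation statistics from the simulator against the same statistics from production. The sketch below uses a two-sample Kolmogorov-Smirnov test on turn counts plus a raw comparison of abandonment rates; the choice of statistics, field names, and significance threshold are assumptions, not a standard:

```python
from scipy.stats import ks_2samp

def divergence_report(sim_convos, prod_convos, alpha=0.01):
    """Compare simulator and production conversation statistics.

    Each argument is a list of dicts with keys like "n_turns" and "abandoned".
    A significant gap on any metric means the simulated population has
    drifted away from the real one.
    """
    report = {}
    for metric in ("n_turns",):
        sim = [c[metric] for c in sim_convos]
        prod = [c[metric] for c in prod_convos]
        stat, p_value = ks_2samp(sim, prod)
        report[metric] = {"ks_stat": stat, "p_value": p_value, "drifted": p_value < alpha}
    # Abandonment rate is a simple proportion; compare it directly.
    sim_rate = sum(c["abandoned"] for c in sim_convos) / len(sim_convos)
    prod_rate = sum(c["abandoned"] for c in prod_convos) / len(prod_convos)
    report["abandonment_rate"] = {"sim": sim_rate, "prod": prod_rate}
    return report
```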
Calibration Discipline: The Part Nobody Wants To Pay For
A synthetic-user pipeline is calibrated, not configured. That distinction is where most teams underinvest.
Calibrate patience and abandonment against production drop-off curves. If real users abandon after the agent's third unhelpful response 30% of the time, your simulator should too. The numbers come from your logs, not from the prompt-engineer's intuition. Re-calibrate quarterly, because user behavior changes — especially after a UI change or a competitor launch.
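The drop-off curve itself can be read straight out of the logs. A sketch, assuming each logged conversation records whether the user abandoned and how many unhelpful responses they saw first (the field names are illustrative):

```python
from collections import Counter

def abandonment_distribution(prod_logs):
    """Empirical distribution of 'unhelpful responses tolerated before abandoning'.

    `prod_logs` is a list of dicts like
    {"abandoned": True, "unhelpful_before_exit": 3}. The output feeds the
    simulator's abandonment-threshold sampler and gets rebuilt quarterly.
    """
    counts = Counter(
        log["unhelpful_before_exit"] for log in prod_logs if log["abandoned"]
    )
    total = sum(counts.values())
    return {threshold: n / total for threshold, n in sorted(counts.items())}
```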
Calibrate goal distributions against real intent clusters. The intent mix in your eval set should match the intent mix in your production traffic. If 12% of real conversations are "where is my refund," 12% of your synthetic conversations should be too. When that ratio drifts (and it will), the eval ranks improvements that don't matter and misses regressions that do.
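Matching the goal mix is then just proportional sampling from those intent clusters. A sketch, where `intent_share` comes from whatever intent classifier already labels your production traffic and `goal_templates` maps each intent to concrete, scoreable goals:

```python
import random

def sample_goals(intent_share: dict, goal_templates: dict, n: int):
    """Sample n synthetic goals so the eval's intent mix tracks production.

    `intent_share` maps intent -> fraction of production traffic
    (e.g. {"refund_status": 0.12, ...}); `goal_templates` maps intent -> list
    of concrete goals for that intent.
    """
    intents = list(intent_share)
    weights = [intent_share[i] for i in intents]
    sampled = random.choices(intents, weights=weights, k=n)
    return [random.choice(goal_templates[i]) for i in sampled]
```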
Calibrate vocabulary against real-traffic n-grams. Synthetic users default to a clean, mid-register vocabulary. Real users include misspellings, regional phrasing, code-switching, mid-sentence corrections, and emojis. A simulator that strips all of that produces an eval that ranks agents on their performance against a population that doesn't exist.
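A cheap first-order check on that gap is to compare unigram distributions between simulated and real user messages. Jensen-Shannon distance is one reasonable choice, sketched below under that assumption:

```python
from collections import Counter
from scipy.spatial.distance import jensenshannon

def vocab_gap(sim_texts, prod_texts):
    """Jensen-Shannon distance between simulator and production unigram distributions.

    0 means identical vocabularies; values near 1 mean the simulator is
    speaking a language your users don't.
    """
    sim_counts, prod_counts = Counter(), Counter()
    for t in sim_texts:
        sim_counts.update(t.lower().split())
    for t in prod_texts:
        prod_counts.update(t.lower().split())
    vocab = sorted(set(sim_counts) | set(prod_counts))
    sim_total, prod_total = sum(sim_counts.values()), sum(prod_counts.values())
    p = [sim_counts[w] / sim_total for w in vocab]
    q = [prod_counts[w] / prod_total for w in vocab]
    return jensenshannon(p, q)
```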
Calibrate against held-out human-judged conversations. Quarterly, take a hundred real conversations, have humans rate them on the same trajectory rubric your synthetic-user judge uses, and check whether your synthetic-user judge agrees. If the correlation drops below where it was last quarter, your judge has drifted, your simulator has drifted, or both — and the right move is to fix the judge before you trust another release decision.
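The agreement check itself is small; the discipline is running it every quarter on fresh conversations. A sketch using Spearman correlation between human and judge scores on the same held-out set (the statistic is one reasonable choice among several):

```python
from scipy.stats import spearmanr

def judge_agreement(human_scores, judge_scores, last_quarter_rho):
    """Check whether the trajectory judge still tracks human raters.

    `human_scores` and `judge_scores` are parallel lists of per-conversation
    rubric scores on the same ~100 held-out real conversations.
    """
    rho, _ = spearmanr(human_scores, judge_scores)
    return {
        "rho": rho,
        "regressed": rho < last_quarter_rho,  # fix the judge before trusting a release
    }
```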
None of this is glamorous. None of it is the part of the pipeline that gets a flashy demo. All of it is the difference between an eval that catches regressions before users do and an eval that congratulates you while the production graphs sag.
The Org Failure Mode To Watch For
The classic way this whole apparatus goes wrong is straightforward: the team fine-tunes the agent against the synthetic-user population, the agent gets very good at answering the kinds of questions that synthetic population asks, and the team ships a regression on the kinds of questions real users ask. The CI graph goes up. The retention graph goes down. The team stares at the divergence and slowly realizes that the simulator was the model of the user, the agent was optimized against the model, and the real user was never in the loop.
The defenses are mostly procedural. Treat simulator-versus-production divergence as a release-blocking metric, not a vibes-check dashboard. Refresh the persona library from real traffic on a fixed cadence, not "when someone notices it's stale." Keep a small, expensive, slow human-eval cohort that runs on every release candidate, even after the synthetic-user pipeline scales — especially after, because that's when the temptation to skip the human check is strongest.
A synthetic-user eval that pushes back well — that has personas your real users would recognize, patience budgets calibrated against your real drop-off curves, and a judge that correlates with human raters — is one of the highest-leverage pieces of infrastructure a serious agent team can build. A synthetic-user eval that doesn't is an expensive way to ratify your agent's existing behavior. The infrastructure looks identical from the outside. The difference is whether anybody on the team is doing the unglamorous calibration work, every release, in perpetuity.
Where This Is Going
The honest end-state is hybrid. Synthetic users for the inner loop — fast, cheap, every PR, with broad coverage of intent space and personality space. A small live-traffic shadow pipeline that mirrors production conversations against the candidate agent and surfaces regressions the synthetic users miss. A periodic human-rated cohort that audits both. The synthetic-user pipeline is not trying to replace the other two; it is trying to keep them rare enough that you can afford to do them well.
The teams that get this right will ship fewer multi-turn regressions. The teams that don't will keep finding out from users which conversations their agent can't have.
- https://arxiv.org/abs/2406.12045
- https://arxiv.org/pdf/2506.07982
- https://arxiv.org/abs/2601.17087
- https://arxiv.org/html/2511.00222v1
- https://aclanthology.org/2025.emnlp-industry.16/
- https://arxiv.org/html/2510.18170
- https://arxiv.org/html/2507.20152v2
- https://aws.amazon.com/blogs/machine-learning/simulate-realistic-users-to-evaluate-multi-turn-ai-agents-in-strands-evals/
- https://aclanthology.org/2025.findings-emnlp.368.pdf
