
Synthetic Users for Multi-Turn Agent Eval: When Your Test Fixture Has To Push Back

9 min read
Tian Pan
Software Engineer

Single-turn evals are great at one thing: ranking models on tasks where the user types once and walks away. They are useless for the failure modes you actually ship with. The agent that loses track of the user's goal by turn three. The agent that capitulates under polite repetition ("are you sure? could you check again?") and reverses a correct answer. The agent that asks the same clarifying question on turn four that it already asked on turn two, because it can't read its own history. None of these show up in a benchmark where the conversation ends after one exchange.
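
To make the capitulation failure concrete, here is a minimal sketch of a pushback probe: re-ask the agent after a polite challenge and see whether it abandons an answer it had right. The agent interface, the probe phrasing, and the substring-based answer check are all illustrative assumptions, not part of any specific benchmark.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]                 # {"role": "user" | "assistant", "content": ...}
Agent = Callable[[List[Message]], str]   # candidate agent: message history -> reply

PUSHBACK = "Are you sure? Could you check again?"

def capitulates(agent: Agent, history: List[Message], correct_answer: str) -> bool:
    """Return True if a polite challenge makes the agent reverse a correct answer."""
    first = agent(history)
    if correct_answer not in first:
        return False  # the agent was never right, so this is a different failure
    challenged = history + [
        {"role": "assistant", "content": first},
        {"role": "user", "content": PUSHBACK},
    ]
    second = agent(challenged)
    return correct_answer not in second  # dropped the correct answer under pressure

if __name__ == "__main__":
    # Toy agent that folds the moment it is questioned, to show the probe firing.
    def flaky_agent(history: List[Message]) -> str:
        if any(PUSHBACK in m["content"] for m in history if m["role"] == "user"):
            return "You're right, I apologize. The refund window is 14 days."
        return "The refund window is 30 days."

    convo = [{"role": "user", "content": "What is the refund window for annual plans?"}]
    print(capitulates(flaky_agent, convo, correct_answer="30 days"))  # True
```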

You can run real-user eval, but it costs hundreds of hours of human review per release and surfaces problems three weeks after they shipped. Or you can build LLM-driven synthetic users — bots with personas, goals, patience, and abandonment thresholds — and run thousands of conversations against a candidate agent every night. This is the approach behind τ-bench, AgentChangeBench, and most production-grade conversational eval setups in 2025–2026. It works, until it doesn't, and the ways it stops working tell you more about your eval pipeline than they do about the synthetic user.
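
As a rough illustration of what such a harness looks like, here is a minimal sketch of an LLM-driven synthetic user loop in the spirit of the setup described above. The `SyntheticUser` fields, the prompt wording, and the `agent_reply` / `complete_user_turn` callables are assumptions for illustration; a real harness would back them with an LLM client and the candidate agent's API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Message = Dict[str, str]

@dataclass
class SyntheticUser:
    persona: str                    # e.g. "terse small-business owner, mildly annoyed"
    goal: str                       # what the user is actually trying to accomplish
    patience: int = 8               # max turns before giving up entirely
    abandon_after_repeats: int = 2  # walk away if the agent repeats itself this often
    history: List[Message] = field(default_factory=list)

    def system_prompt(self) -> str:
        return (
            f"You are role-playing a user. Persona: {self.persona}. "
            f"Your goal: {self.goal}. Stay in character, never reveal you are simulated, "
            "and say 'DONE' once the goal is met or you decide to give up."
        )

def run_episode(
    user: SyntheticUser,
    agent_reply: Callable[[List[Message]], str],              # candidate agent under test
    complete_user_turn: Callable[[str, List[Message]], str],  # LLM playing the user
) -> List[Message]:
    """Drive one conversation until the goal is met, the user abandons, or patience runs out."""
    seen_replies: Dict[str, int] = {}
    for _ in range(user.patience):
        user_msg = complete_user_turn(user.system_prompt(), user.history)
        user.history.append({"role": "user", "content": user_msg})
        if "DONE" in user_msg:
            break
        reply = agent_reply(user.history)
        user.history.append({"role": "assistant", "content": reply})
        # Abandonment heuristic: a user who keeps getting the same answer walks away.
        seen_replies[reply] = seen_replies.get(reply, 0) + 1
        if seen_replies[reply] > user.abandon_after_repeats:
            user.history.append({"role": "user", "content": "Never mind, I'll call support. DONE"})
            break
    return user.history
```

A nightly run would sample many persona/goal pairs, call `run_episode` for each against the candidate agent, and score the resulting transcripts for goal completion, repeated questions, and abandonment rate.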