
Synthetic Users for Multi-Turn Agent Eval: When Your Test Fixture Has To Push Back

9 min read
Tian Pan
Software Engineer

Single-turn evals are great at one thing: ranking models on tasks where the user types once and walks away. They are useless for the failure modes you actually ship with. The agent that loses track of the user's goal by turn three. The agent that capitulates under polite repetition ("are you sure? could you check again?") and reverses a correct answer. The agent that asks the same clarifying question on turn four that it already asked on turn two, because it can't read its own history. None of these show up in a benchmark where the conversation ends after one exchange.
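
To make the capitulation failure concrete, here is a minimal sketch of a pushback probe: re-ask the agent after a polite challenge and see whether it abandons an answer it had right. The agent interface, the probe phrasing, and the substring-based answer check are all illustrative assumptions, not part of any specific benchmark.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]                 # {"role": "user" | "assistant", "content": ...}
Agent = Callable[[List[Message]], str]   # candidate agent: message history -> reply

PUSHBACK = "Are you sure? Could you check again?"

def capitulates(agent: Agent, history: List[Message], correct_answer: str) -> bool:
    """Return True if a polite challenge makes the agent reverse a correct answer."""
    first = agent(history)
    if correct_answer not in first:
        return False  # the agent was never right, so this is a different failure
    challenged = history + [
        {"role": "assistant", "content": first},
        {"role": "user", "content": PUSHBACK},
    ]
    second = agent(challenged)
    return correct_answer not in second  # dropped the correct answer under pressure

if __name__ == "__main__":
    # Toy agent that folds the moment it is questioned, to show the probe firing.
    def flaky_agent(history: List[Message]) -> str:
        if any(PUSHBACK in m["content"] for m in history if m["role"] == "user"):
            return "You're right, I apologize. The refund window is 14 days."
        return "The refund window is 30 days."

    convo = [{"role": "user", "content": "What is the refund window for annual plans?"}]
    print(capitulates(flaky_agent, convo, correct_answer="30 days"))  # True
```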

You can run real-user eval, but it costs hundreds of hours of human review per release and surfaces problems three weeks after they shipped. Or you can build LLM-driven synthetic users — bots with personas, goals, patience, and abandonment thresholds — and run thousands of conversations against a candidate agent every night. This is the approach behind τ-bench, AgentChangeBench, and most production-grade conversational eval setups in 2025–2026. It works, until it doesn't, and the ways it stops working tell you more about your eval pipeline than they do about the synthetic user.
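
As a rough illustration of what such a harness looks like, here is a minimal sketch of an LLM-driven synthetic user loop in the spirit of the setup described above. The `SyntheticUser` fields, the prompt wording, and the `agent_reply` / `complete_user_turn` callables are assumptions for illustration; a real harness would back them with an LLM client and the candidate agent's API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Message = Dict[str, str]

@dataclass
class SyntheticUser:
    persona: str                    # e.g. "terse small-business owner, mildly annoyed"
    goal: str                       # what the user is actually trying to accomplish
    patience: int = 8               # max turns before giving up entirely
    abandon_after_repeats: int = 2  # walk away if the agent repeats itself this often
    history: List[Message] = field(default_factory=list)

    def system_prompt(self) -> str:
        return (
            f"You are role-playing a user. Persona: {self.persona}. "
            f"Your goal: {self.goal}. Stay in character, never reveal you are simulated, "
            "and say 'DONE' once the goal is met or you decide to give up."
        )

def run_episode(
    user: SyntheticUser,
    agent_reply: Callable[[List[Message]], str],              # candidate agent under test
    complete_user_turn: Callable[[str, List[Message]], str],  # LLM playing the user
) -> List[Message]:
    """Drive one conversation until the goal is met, the user abandons, or patience runs out."""
    seen_replies: Dict[str, int] = {}
    for _ in range(user.patience):
        user_msg = complete_user_turn(user.system_prompt(), user.history)
        user.history.append({"role": "user", "content": user_msg})
        if "DONE" in user_msg:
            break
        reply = agent_reply(user.history)
        user.history.append({"role": "assistant", "content": reply})
        # Abandonment heuristic: a user who keeps getting the same answer walks away.
        seen_replies[reply] = seen_replies.get(reply, 0) + 1
        if seen_replies[reply] > user.abandon_after_repeats:
            user.history.append({"role": "user", "content": "Never mind, I'll call support. DONE"})
            break
    return user.history
```

A nightly run would sample many persona/goal pairs, call `run_episode` for each against the candidate agent, and score the resulting transcripts for goal completion, repeated questions, and abandonment rate.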