Skip to main content

2 posts tagged with "multi-turn"

View all tags

Intent Drift in Long Conversations: Why Your Agent's Goal Representation Goes Stale

· 9 min read
Tian Pan
Software Engineer

Most conversations about context windows focus on what the model can hold. The harder problem is what the model does with what it holds — specifically, how it tracks the evolving goal of the person it's talking to.

Intent isn't static. Users start vague, refine iteratively, contradict themselves, digress, and revise. What they need at message 40 is not necessarily what they expressed at message 2. An agent that treats context as a flat append log will accumulate all of that — and still get the current intent wrong.

Synthetic Users for Multi-Turn Agent Eval: When Your Test Fixture Has To Push Back

· 9 min read
Tian Pan
Software Engineer

Single-turn evals are great at one thing: ranking models on tasks where the user types once and walks away. They are useless for the failure modes you actually ship with. The agent that loses track of the user's goal by turn three. The agent that capitulates under polite repetition ("are you sure? could you check again?") and reverses a correct answer. The agent that asks the same clarifying question on turn four that it already asked on turn two, because it can't read its own history. None of these show up in a benchmark where the conversation ends after one exchange.

You can run real-user eval, but it costs hundreds of hours of human review per release and surfaces problems three weeks after they shipped. Or you can build LLM-driven synthetic users — bots with personas, goals, patience, and abandonment thresholds — and run thousands of conversations against a candidate agent every night. This is the approach behind τ-bench, AgentChangeBench, and most production-grade conversational eval setups in 2025–2026. It works, until it doesn't, and the ways it stops working tell you more about your eval pipeline than they do about the synthetic user.