Task Completion Goes Green While Users Quietly Suffer
Your agent dashboard says 94% task completion. Leadership is happy. The roadmap gets funded. And yet support tickets are climbing, power users have gone quiet, and the one engineer who actually watches traces keeps muttering that something is wrong. Both things are true at once. The agent is completing tasks. It is also taking twelve minutes and four thousand tokens to do a two-step job, backtracking three times, and asking the user to confirm a fact it could have inferred from the first message.
Task completion is a binary that hides a distribution. "The agent finished" tells you nothing about the path it took to finish, and the path is most of what users actually experience. A completion-rate dashboard is structurally incapable of seeing a slow, expensive, annoying agent. It will stay green right up until users churn.
This is not a measurement gap you can patch with a better prompt. It is a category error in what you chose to measure. Completion is the easiest thing to instrument and the least of what people are paying for.
Completion Is the Terminal State, Not the Trajectory
When teams first instrument an agent, they ask the obvious question: did it complete the task? That question produces a clean number, and a clean number is easy to put on a slide. The trouble is that an agent reaches a terminal state through a sequence of decisions, and the terminal state erases the sequence.
Two runs both end in "task completed." Run A: the agent read the request, called one tool, returned the answer in eight seconds. Run B: the agent called a search tool, summarized, searched again for the same information, asked the user a clarifying question, got an answer it then ignored, called the search tool a third time, and returned the same final answer ninety seconds later. Your completion metric scores these identically. Your users do not.
Researchers studying agentic systems have a name for Run B: a silent failure — a correct output produced through an incorrect or wasteful process. The output passes inspection. The process is rotten. And because the only thing you graded was the output, the rot compounds invisibly across thousands of runs.
The fix is to grade the trajectory, not just the terminal state. The trajectory is the ordered list of reasoning steps, tool calls, and user turns the agent produced on its way to "done." It is fully present in your traces already. You are simply not scoring it.
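To make the gap concrete, here is a minimal sketch that encodes Run A and Run B from above. The `Step` and `Run` types, the step labels, and both helper functions are hypothetical, invented for illustration and not taken from any tracing library. The completion score cannot tell the two runs apart; even the crudest trajectory number can.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Step:
    kind: str        # "reasoning", "tool_call", or "user_turn" (assumed labels)
    name: str = ""   # tool name for tool calls, empty otherwise


@dataclass
class Run:
    completed: bool
    steps: list = field(default_factory=list)


def completion_score(run: Run) -> int:
    """The binary a completion dashboard records."""
    return 1 if run.completed else 0


def tool_call_count(run: Run) -> int:
    """One trajectory-level number the binary throws away."""
    return sum(1 for step in run.steps if step.kind == "tool_call")


run_a = Run(True, [Step("reasoning"), Step("tool_call", "lookup")])
run_b = Run(True, [
    Step("tool_call", "search"),
    Step("reasoning"),             # summarize
    Step("tool_call", "search"),   # same information again
    Step("user_turn"),             # clarifying question and its answer
    Step("tool_call", "search"),   # third search
])

print(completion_score(run_a), completion_score(run_b))  # 1 1
print(tool_call_count(run_a), tool_call_count(run_b))    # 1 3
```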
Four Metrics That See What Completion Cannot
If completion is the wrong number, what are the right ones? Four, each targeting a class of suffering that a binary cannot detect.
Step efficiency, budgeted per task class. Not every task should cost the same. A "look up an order status" task has a natural floor of one or two steps. A "reconcile three systems and draft a summary" task might justifiably take fifteen. So you do not set one global step ceiling — you set a budget per task class and flag runs that blow past it. A 2-step job that took 14 steps is a defect even though it "succeeded." Without the per-class budget, the 14-step run hides inside the average; with it, the run gets flagged the moment it crosses the line.
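A per-class budget check can be very small. The sketch below assumes each run has already been reduced to a task class and a step count; the class names, the budget values, and the `RunSummary` type are illustrative assumptions, not recommendations.

```python
from dataclasses import dataclass

# Hypothetical budgets: task class -> maximum acceptable step count.
STEP_BUDGETS = {
    "order_status_lookup": 2,
    "multi_system_reconcile": 15,
}


@dataclass(frozen=True)
class RunSummary:
    run_id: str
    task_class: str
    steps: int
    completed: bool


def over_budget(runs, budgets):
    """Return completed runs whose step count exceeds their class budget."""
    flagged = []
    for run in runs:
        budget = budgets.get(run.task_class)
        if budget is None:
            continue  # unknown task class: nothing to compare against
        if run.completed and run.steps > budget:
            flagged.append(run)
    return flagged


runs = [
    RunSummary("r1", "order_status_lookup", 2, True),
    RunSummary("r2", "order_status_lookup", 14, True),     # the 2-step job that took 14
    RunSummary("r3", "multi_system_reconcile", 14, True),  # 14 steps, but within budget
]

flagged = over_budget(runs, STEP_BUDGETS)
print([run.run_id for run in flagged])  # ['r2']
```

Note that `r2` and `r3` have the same step count. Only the per-class budget separates the defect from the legitimate run.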
Path quality. Step count alone is blunt — fourteen productive steps differ from fourteen steps of thrashing. Path quality scores the shape of the trajectory: backtracks (the agent undoing or contradicting an earlier decision), redundant tool calls (the same tool invoked with effectively the same arguments), and dead-end loops (search-summarize-search cycles that indicate weak stopping criteria). The useful part: most of this is computable from the trace structure alone, no LLM judge required. A loop is a loop. A duplicate call is a duplicate call. You can detect both with deterministic code.
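Two of those checks are sketched below under simplifying assumptions: tool calls are available as `(tool, args)` pairs in order, arguments are JSON-serializable, and "effectively the same arguments" is approximated by an exact match after sorting keys. The function names are hypothetical. Backtrack detection is left out because it depends on how your traces record decisions.

```python
import json
from collections import Counter


def call_key(tool: str, args: dict) -> str:
    """Canonical key so argument order does not hide a duplicate call."""
    return tool + ":" + json.dumps(args, sort_keys=True)


def redundant_calls(calls: list) -> int:
    """Count calls that repeat an earlier (tool, args) pair exactly."""
    counts = Counter(call_key(tool, args) for tool, args in calls)
    return sum(n - 1 for n in counts.values())


def has_cycle(tools: list, length: int = 2, repeats: int = 2) -> bool:
    """True if a window of `length` tool names occurs `repeats` times back to back."""
    span = length * repeats
    for start in range(len(tools) - span + 1):
        window = tools[start:start + length]
        if all(
            tools[start + k * length:start + (k + 1) * length] == window
            for k in range(repeats)
        ):
            return True
    return False


calls = [
    ("search", {"q": "refund policy", "limit": 5}),
    ("summarize", {"doc": "d1"}),
    ("search", {"limit": 5, "q": "refund policy"}),  # same call, keys reordered
    ("summarize", {"doc": "d1"}),
]

print(redundant_calls(calls))                    # 2
print(has_cycle([tool for tool, _ in calls]))    # True
print(has_cycle(["search", "summarize", "answer"]))  # False
```

Exact matching will miss near-duplicates such as a reworded query. Treat the count as a lower bound on redundancy, not a complete measure.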
User effort. This is the metric teams most consistently skip, and it is the one closest to churn. It counts how many times the human had to intervene to keep the agent on track: clarifications the agent demanded, corrections the user issued, retries, rephrasings. An agent that completes every task but makes the user work for each completion has a great completion rate and a terrible product. Customer-experience research has said for years that effort predicts loyalty better than satisfaction does — the agent world has simply not wired effort into its dashboards yet.
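Once user turns carry a label, the count itself is trivial. The sketch below assumes some upstream step has already tagged each user turn; the label names and the `user_effort` helper are hypothetical, and how you classify a turn as a correction versus a rephrasing is the hard part this example does not address.

```python
from collections import Counter

# Hypothetical labels for user turns that represent work the agent pushed onto the human.
EFFORT_KINDS = {"clarification", "correction", "retry", "rephrase"}


def user_effort(turn_labels: list) -> Counter:
    """Count user turns per effort kind, ignoring turns that are not interventions."""
    return Counter(label for label in turn_labels if label in EFFORT_KINDS)


# One completed run: the initial request, then three interventions, then an acknowledgement.
labels = ["request", "clarification", "correction", "clarification", "ack"]

effort = user_effort(labels)
total = sum(effort.values())
print(total)                    # 3
print(effort["clarification"])  # 2
```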
Trajectory-graded eval slices. Your offline eval suite should not only check final answers. Add slices that assert on the path: "this task class must finish in ≤ N steps," "this trajectory must contain zero redundant calls to the pricing tool," "the agent must not ask for a value already present in the input." These convert path quality from a thing you notice in a postmortem into a thing that fails a build.
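The three example assertions above could look roughly like this in an eval harness. The `Trace` shape, its field names, and `check_trace` are assumptions made for the sketch; adapt them to whatever your tracing layer actually records.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Trace:
    input_fields: dict                                 # values present in the user's input
    steps: int = 0
    tool_calls: list = field(default_factory=list)     # (tool, args) pairs, in order
    asked_fields: list = field(default_factory=list)   # fields the agent asked the user for


def check_trace(trace: Trace, max_steps: int, no_repeat_tool: str) -> list:
    """Return a list of path failures; an empty list means the trajectory passes."""
    failures = []
    if trace.steps > max_steps:
        failures.append(f"steps {trace.steps} > budget {max_steps}")
    seen = set()
    for tool, args in trace.tool_calls:
        if tool != no_repeat_tool:
            continue
        key = json.dumps(args, sort_keys=True)
        if key in seen:
            failures.append(f"redundant call to {tool}")
        seen.add(key)
    for name in trace.asked_fields:
        if name in trace.input_fields:
            failures.append(f"asked for known input '{name}'")
    return failures


good = Trace({"order_id": "A17"}, steps=2, tool_calls=[("pricing", {"sku": "X"})])
bad = Trace(
    {"order_id": "A17"},
    steps=5,
    tool_calls=[("pricing", {"sku": "X"}), ("pricing", {"sku": "X"})],
    asked_fields=["order_id"],
)

print(check_trace(good, max_steps=2, no_repeat_tool="pricing"))       # []
print(len(check_trace(bad, max_steps=2, no_repeat_tool="pricing")))   # 3
```

In a test suite, `assert not check_trace(...)` on each recorded trace is enough to make a path regression fail the build.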
The Org Seam: Funding Against a Number That Cannot See
Here is where the measurement problem becomes an organizational problem. Leadership funds against the metrics it can see. If the only visible metric is completion rate, then every investment decision — staffing, model budget, the choice to ship a migration — gets made against a number that is structurally blind to cost, latency, and friction.
The agent team knows the agent is slow and expensive. They watch the traces. But "the traces feel bad" does not survive contact with a quarterly review where the headline number is 94% and climbing. The seam is not malice or incompetence; it is that the team measuring quality and the team allocating budget are looking at different artifacts, and only one of those artifacts made it onto the slide.
This gets worse as agents get more autonomous. Anthropic's analysis of agent autonomy in practice found that between late 2025 and early 2026, the 99.9th percentile turn duration nearly doubled, from under 25 minutes to over 45. Longer autonomous runs mean more trajectory per completion — more room for backtracks, more tokens burned, more chances to annoy a user who is no longer watching. The completion metric does not move. The hidden cost per completion does. A team funding against completion alone is, quarter over quarter, increasing its exposure to a failure mode its dashboard was built not to show.
There is also a subtler trap. Completion rate is gameable in the direction of more suffering. An agent that asks more clarifying questions will tend to complete more tasks correctly, and it imposes more user effort along the way. Optimize the visible metric hard enough and you actively degrade the invisible ones. The number goes up as the product gets worse.
Instrument the Path Before You Trust the Number
The practical move is not to throw out completion rate. It is a real signal — an agent that fails to finish is worse than one that finishes badly. The move is to refuse to let completion stand alone.
A workable structure, drawn from how mature teams now layer their agent evaluation, has three tiers:
- Outcome: did the task complete, and was the final answer correct.
- Trajectory: how efficient and well-shaped was the path (step budget, path quality, loop detection).
- Effort and cost: how much did the human and the wallet pay (intervention count, tokens, wall-clock latency).
A run is only "good" when all three tiers are green. A green outcome on top of a red trajectory is precisely the silent failure you are trying to catch.
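One way to express the three tiers in code is sketched below. The `RunMetrics` and `Thresholds` types and every threshold value are hypothetical; the example run mirrors the twelve-minute, four-thousand-token, two-step job from the opening.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunMetrics:
    completed: bool
    correct: bool
    steps: int
    redundant_calls: int
    loop_detected: bool
    interventions: int
    tokens: int
    latency_s: float


@dataclass(frozen=True)
class Thresholds:
    max_steps: int
    max_interventions: int
    max_tokens: int
    max_latency_s: float


def grade(m: RunMetrics, t: Thresholds) -> dict:
    """Grade each tier independently; True means green."""
    return {
        "outcome": m.completed and m.correct,
        "trajectory": (
            m.steps <= t.max_steps
            and m.redundant_calls == 0
            and not m.loop_detected
        ),
        "effort_cost": (
            m.interventions <= t.max_interventions
            and m.tokens <= t.max_tokens
            and m.latency_s <= t.max_latency_s
        ),
    }


def is_good(tiers: dict) -> bool:
    return all(tiers.values())


def is_silent_failure(tiers: dict) -> bool:
    """Green outcome on top of a red trajectory."""
    return tiers["outcome"] and not tiers["trajectory"]


run = RunMetrics(True, True, steps=14, redundant_calls=2, loop_detected=True,
                 interventions=1, tokens=4000, latency_s=720.0)
limits = Thresholds(max_steps=2, max_interventions=0, max_tokens=1500, max_latency_s=30.0)

tiers = grade(run, limits)
print(tiers)  # {'outcome': True, 'trajectory': False, 'effort_cost': False}
print(is_good(tiers), is_silent_failure(tiers))  # False True
```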
Three concrete steps to get there:
- Put a non-completion metric on the same slide as completion. Pick one (user-effort count is the highest-leverage choice) and give it equal billing. The point is not the metric itself; it is forcing the budget conversation to happen against something other than a binary.
- Add trajectory assertions to your eval suite. Start with the deterministic ones: max steps per task class, zero redundant calls, no asking for known inputs. These cost almost nothing to compute and they fail loudly.
- Review traces, not just scores, on a fixed cadence. Sample completed runs weekly and read the path. The 14-step version of a 2-step task is obvious to a human in ten seconds and invisible to every aggregate you own.
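The sampling in the last step can be as simple as a seeded random draw over completed runs. This sketch assumes runs are available as dictionaries with a `completed` flag; the `weekly_sample` name, the record shape, and the seed value are illustrative only.

```python
import random


def weekly_sample(runs: list, k: int, seed: int) -> list:
    """Pick up to k completed runs for human review; the seed makes the draw reproducible."""
    completed = [run for run in runs if run["completed"]]
    rng = random.Random(seed)
    return rng.sample(completed, min(k, len(completed)))


# Fifty synthetic runs, every fifth one incomplete.
runs = [{"run_id": f"r{i}", "completed": i % 5 != 0} for i in range(50)]

sample = weekly_sample(runs, k=10, seed=20)
print(len(sample))  # 10
```

Sampling only completed runs is deliberate: those are the ones the completion metric has already waved through.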
Completion answers a narrow question: did the agent give up? Whether it was good (fast, cheap, low-friction, direct) lives entirely in the path. If your dashboard cannot see the path, your dashboard cannot see your product. The agents are getting more autonomous and the runs are getting longer, which means the gap between "finished" and "good" is widening every quarter. Measure the trajectory now, while the number of traces you have to read by hand is still small enough to read.
- https://arxiv.org/html/2512.12791v1
- https://cloud.google.com/blog/topics/developers-practitioners/a-methodical-approach-to-agent-evaluation
- https://www.braintrust.dev/articles/ai-agent-evaluation-framework
- https://aws.amazon.com/blogs/machine-learning/evaluating-ai-agents-real-world-lessons-from-building-agentic-systems-at-amazon/
- https://www.snowflake.com/en/engineering-blog/ai-agent-evaluation-gpa-framework/
- https://www.anthropic.com/research/measuring-agent-autonomy
- https://runcycles.io/blog/ai-agent-silent-failures-why-200-ok-is-the-most-dangerous-response
- https://arize.com/blog/common-ai-agent-failures/
- https://galileo.ai/blog/ai-agent-metrics
