
Tool Call Convergence: Designing Agents That Know When to Stop

· 10 min read
Tian Pan
Software Engineer

A LangChain analyzer/verifier agent pair ran for 264 hours straight and racked up $47,000 in API costs. It produced nothing useful. The verifier kept rejecting the analyzer's output without saying what was wrong. The analyzer defaulted to trying again. No one had written a stopping criterion. The loop ran until someone noticed the invoice.

This is the failure mode that doesn't make it into architecture diagrams: agents that know how to call tools but don't know when to stop. The canonical agent loop is a while True that asks the model "should I call a tool?" — but that question has no built-in answer for "I've seen enough." Without convergence logic, you're not building an agent. You're building an expensive polling function.

The Architecture of the Problem

Every agent loop has the same shape: observe, decide, act, repeat. The model reads the current context, decides whether to call a tool, executes it, appends the result to context, and does it again. The problem is structural. Each iteration re-reads the entire context before deciding whether to make another call, so while the context itself grows linearly, the cumulative tokens you pay to read grow quadratically: original prompt, then original plus the first result, then all of that plus the second. Input tokens dominate cost, not output. By iteration ten of a retrieval loop, most of what you're paying for is re-reading context you already processed.
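
A quick back-of-envelope makes the cost shape concrete. The token counts below are made up for illustration; the arithmetic is the point, because the loop re-reads everything on every pass.

```python
# Back-of-envelope: cumulative input tokens for a naive ten-iteration loop.
# Token counts are illustrative, not tied to any model or pricing table.
prompt_tokens = 2_000    # original prompt + system context
result_tokens = 1_500    # tokens appended per tool result

total_input = 0
for i in range(10):
    # each iteration re-reads everything accumulated so far
    total_input += prompt_tokens + i * result_tokens

print(total_input)  # 87,500 tokens read to ingest 15,000 tokens of new results
```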

Stochastic trajectories compound this. The same agent on the same task can vary token consumption by up to 30x across runs. Agents hit p95 latencies that are ten times their median because a small percentage of queries spiral into retrieval thrash — the pattern where search queries alternate between broadening and narrowing without new evidence accumulating. If your median retrieval iterations per query is two but your p95 is eight, you have a thrash problem. It's just not visible unless you're looking at the right histogram.

The model itself can't reliably self-diagnose whether it has enough information. It was trained to be helpful, which means it defaults to doing more. Asking the model "do you have enough yet?" will usually get a "let me check one more thing."

What Convergence Signals Actually Exist

Information sufficiency is not a single metric — it's a family of signals that each catch different failure modes.

Diminishing returns on retrieval. The most practical heuristic is marginal insight: after each retrieval iteration, ask what new evidence was gained. The Microsoft 365 Researcher agent implements this explicitly — it monitors "What new evidence was gained in this step?" and concludes when the marginal gain falls below a threshold. This requires defining what "new evidence" means for your task, which is task-specific work, but the pattern generalizes. If the latest retrieval results have high semantic overlap with results from two iterations ago, you're fetching the same knowledge from different angles.
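
One way to operationalize this is to score each retrieval pass by how many of its chunks are not near-duplicates of evidence you already have. A minimal sketch, assuming an embed() function that maps text to a vector (any embedding API works) and an illustrative 0.85 overlap threshold:

```python
import numpy as np

def _cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def marginal_gain(new_chunks, seen_embeddings, embed, overlap_threshold=0.85):
    """Fraction of newly retrieved chunks that are not near-duplicates
    of evidence gathered in earlier passes."""
    novel = 0
    for chunk in new_chunks:
        vec = embed(chunk)
        if all(_cosine(vec, seen) < overlap_threshold for seen in seen_embeddings):
            novel += 1
        seen_embeddings.append(vec)
    return novel / max(len(new_chunks), 1)

# Stop retrieving when marginal_gain() stays below ~0.2 for consecutive passes.
```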

Coverage thresholds. For retrieval-heavy agents, production teams have converged on a three-pass cap: beyond three retrieval cycles, the probability of finding genuinely new information approaches zero for most query types. The right number is empirical for your domain, but the point is to pick a number and enforce it rather than leaving the loop open.

Confidence convergence. Self-consistency is a practical proxy: if the model gives the same answer when prompted multiple times with the same context, confidence has converged. This is cheap to implement — run the final synthesis step twice with different random seeds and compare outputs semantically. If they agree, you're done. If they diverge significantly, another retrieval pass might help.
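
A sketch of that check, assuming a synthesize() callable that invokes the model with sampling randomness enabled and an embed() function for comparing the two answers; the 0.9 agreement threshold is illustrative:

```python
import numpy as np

def _cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answers_agree(context, synthesize, embed, threshold=0.9):
    """Run the final synthesis twice and compare the two answers semantically.
    `synthesize` is assumed to sample with some randomness (e.g. temperature > 0)."""
    first = synthesize(context)
    second = synthesize(context)
    return _cosine(embed(first), embed(second)) >= threshold

# answers_agree(...) -> True: stop. False: one more retrieval pass might help.
```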

No-progress detection. The simplest and most underused signal: if the agent calls the same tool with the same (or semantically identical) arguments and gets the same output, it's stuck. First detection warrants a self-correction prompt ("you've tried this before without new results — what's a different approach?"). Second detection within the same session warrants a hard stop. This catches the feedback loop failure mode — the $47,000 case was fundamentally a no-progress loop that no one was watching.
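
A minimal version of that detector hashes the tool name plus its arguments and counts repeats. Exact-repeat hashing already catches the worst cases; matching semantically identical arguments would need embeddings on top. The escalation thresholds below mirror the policy above:

```python
import hashlib
import json

class NoProgressDetector:
    """Counts repeated calls to the same tool with identical arguments."""
    def __init__(self):
        self.counts = {}   # (tool_name, args_hash) -> number of times seen

    def record(self, tool_name, args):
        args_hash = hashlib.sha256(
            json.dumps(args, sort_keys=True, default=str).encode()
        ).hexdigest()
        key = (tool_name, args_hash)
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]

detector = NoProgressDetector()
hits = detector.record("search_docs", {"query": "refund policy"})
if hits == 2:
    ...   # inject the self-correction prompt into the next model turn
elif hits >= 3:
    ...   # hard stop: force final-answer mode and end the session
```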

Semantic drift. Recent research formalizes agent trajectories as dynamical systems in embedding space. A well-converging agent shows a contractive pattern: high similarity between successive states, bounded global drift from the original intent. An oscillating agent alternates between two states. A diverging agent shows each iteration semantically distant from the last with no cluster structure. You can measure this with embedding distance between successive context snapshots, though few teams do.
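
Measuring it is less exotic than it sounds. Given embeddings of successive context snapshots, two lists of cosine similarities tell you which pattern you're in; the helper below is a sketch, with the embedding step left to whatever you already use:

```python
import numpy as np

def _cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def trajectory_signals(snapshots):
    """`snapshots` is a list of embeddings of successive context states,
    with snapshots[0] embedding the original intent."""
    step_similarity = [_cosine(a, b) for a, b in zip(snapshots, snapshots[1:])]
    drift_from_intent = [_cosine(snapshots[0], s) for s in snapshots[1:]]
    return step_similarity, drift_from_intent

# Contractive: step similarities stay high, drift from intent stays bounded.
# Oscillating: step similarities alternate between high and low.
# Diverging: step similarities keep falling while drift keeps growing.
```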

The Termination Contract

Every tool call should satisfy a termination contract: a machine-readable exit condition that the framework enforces regardless of what the model decides. The frameworks have different approaches.

LangChain's max_iterations (default 15) is the floor of protection, not the ceiling: a safety net against runaway loops, not a convergence strategy. When it triggers, early_stopping_method="generate" gives the model one final synthesis pass with a "what have you found so far?" prompt; early_stopping_method="force" halts immediately. "Generate" is usually right for user-facing applications because it produces something coherent even at the hard limit.
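
For reference, a sketch of how those knobs attach to the classic AgentExecutor. Constructor details have shifted across LangChain versions, and the agent and tools objects are assumed to be built elsewhere, so treat this as illustrative rather than copy-paste ready:

```python
from langchain.agents import AgentExecutor

executor = AgentExecutor(
    agent=agent,                       # a tool-calling agent constructed earlier (assumed)
    tools=tools,                       # the tool list the agent may call (assumed)
    max_iterations=15,                 # hard cap on the observe-decide-act loop
    max_execution_time=120,            # wall-clock ceiling, in seconds
    early_stopping_method="generate",  # final synthesis pass instead of an abrupt halt
)
```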

AutoGen has the richest termination vocabulary: MaxMessageTermination, TokenUsageTermination, TimeoutTermination, TextMentionTermination (stop when the agent says "TERMINATE"), and FunctionalTermination for arbitrary boolean expressions. These are composable with AND/OR logic. The pattern that works in practice is layered termination: a token budget (hard cost limit), a message count (prevent context explosion), and a text signal (let the agent declare completion), combined so any single trigger stops the loop.
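
A sketch of that layered composition using the autogen-agentchat conditions API. Module paths follow the 0.4-era packages and may differ in your version; analyzer and verifier are assumed, already-constructed agents:

```python
from autogen_agentchat.conditions import (
    MaxMessageTermination,
    TextMentionTermination,
    TokenUsageTermination,
)
from autogen_agentchat.teams import RoundRobinGroupChat

# Layered termination: any single trigger stops the run.
termination = (
    TokenUsageTermination(max_total_token=50_000)  # hard cost ceiling
    | MaxMessageTermination(max_messages=25)       # prevent context explosion
    | TextMentionTermination("TERMINATE")          # agent declares completion
)

team = RoundRobinGroupChat([analyzer, verifier], termination_condition=termination)
```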

The Vercel AI SDK's stopWhen with stepCountIs(20) is blunt but at least defaults to a limit rather than none.

Cost-based termination is underused in open-source frameworks but critical in production. The pattern is two thresholds: a soft limit (triggers an alert to engineering) and a hard limit (forces "final answer" mode, synthesizing with whatever context exists). $0.50 and $2.00 per session are numbers that production teams have found workable, but your thresholds depend on your margin model.
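
A minimal two-threshold guard looks like this. The prices, limits, and return values are illustrative; wiring the "alert" and "finalize" signals into your loop is the part that actually matters:

```python
SOFT_LIMIT_USD = 0.50   # alert engineering, keep running
HARD_LIMIT_USD = 2.00   # force final-answer mode

class CostGuard:
    """Tracks per-session spend; prices are per 1K tokens and illustrative."""
    def __init__(self, input_price_per_1k=0.003, output_price_per_1k=0.015):
        self.input_price = input_price_per_1k
        self.output_price = output_price_per_1k
        self.spent = 0.0
        self.alerted = False

    def record_step(self, input_tokens, output_tokens):
        self.spent += (input_tokens / 1000) * self.input_price
        self.spent += (output_tokens / 1000) * self.output_price
        if self.spent >= HARD_LIMIT_USD:
            return "finalize"          # synthesize with whatever context exists
        if not self.alerted and self.spent >= SOFT_LIMIT_USD:
            self.alerted = True
            return "alert"             # page or log, but keep going
        return "continue"
```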

The critical gap across all frameworks: none have built-in "marginal insight" heuristics. You have to implement diminishing-return detection yourself. It requires a task-specific utility function — what counts as new evidence in a customer support agent differs from what counts in a code review agent.

Instrumenting for Production Visibility

The loop problem is invisible without the right metrics. Standard APM tracing tells you what happened; agent observability tells you whether what happened was useful.

The minimum viable instrumentation set:

  • Tool call repetition: count calls to each tool per session, and flag sessions where the same tool is called more than N times with high argument similarity. Cosine similarity or argument hashing both work.
  • Context growth rate: tokens added per iteration. If context is growing faster than the rate at which the agent is citing retrieved content in its reasoning, you're accumulating bloat.
  • Retrieval iterations per query: track the distribution, not the mean. The mean masks tail latency spirals. You want p50, p95, and p99.
  • Cost per step: calculate model input plus output cost after each action. When cost per step exceeds a threshold for a session, route to a "final answer" interrupt.
  • Step-to-completion ratio: for agent tasks where ground truth exists (e.g., "retrieve the answer to this specific question"), track how many steps each session takes versus the minimum theoretically required. High ratios indicate inefficient convergence.

Each tool invocation should emit structured telemetry: tool name, latency, input arguments (sanitized), output data, retry count, and whether the output was novel relative to prior calls. Nest tool spans under parent LLM spans so you can reconstruct the full decision chain for any session.
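
A sketch of a wrapper that emits that telemetry per call. The field names and the redaction list are placeholders to map onto whatever tracing backend you already run:

```python
import hashlib
import json
import time

SENSITIVE_KEYS = {"email", "token"}   # placeholder redaction list

def emit_tool_span(logger, session_id, parent_span_id, tool_name, args, fn, seen_hashes):
    """Run a tool call and log one structured span for it."""
    start = time.monotonic()
    output = fn(**args)
    output_hash = hashlib.sha256(
        json.dumps(output, sort_keys=True, default=str).encode()
    ).hexdigest()
    logger.info(json.dumps({
        "session_id": session_id,
        "parent_span_id": parent_span_id,  # nest under the LLM span that requested the call
        "tool": tool_name,
        "latency_ms": round((time.monotonic() - start) * 1000, 1),
        "args": {k: "<redacted>" if k in SENSITIVE_KEYS else v for k, v in args.items()},
        "novel_output": output_hash not in seen_hashes,
    }))
    seen_hashes.add(output_hash)
    return output
```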

The goal isn't to debug individual failures — it's to detect systemic patterns. One agent looping is a bug. Ten agents looping with a common tool is a design problem. The p95 histogram makes this visible; individual traces don't.

The Three-Layer Defense

No single mechanism is sufficient. Production systems need defense in depth.

Layer 1: Policy gate. Before the agent runs, define budgets, permissions, and risk rules. Maximum token budget, maximum cost, which tools are accessible, which actions require human confirmation. This layer doesn't require any model intelligence — it's pure policy enforcement.

Layer 2: Bounded planning. Rather than an open-ended while loop, generate a plan with a fixed maximum step count and expected iteration count before executing. A planner step that produces "retrieve information (max 3 passes) → synthesize → output" is structurally bounded in a way that "search until confident" is not. The planner can still use the model for planning; the constraint is that the plan output is a bounded structure, not a recursive intention.

Layer 3: Executor loop with convergence checks. Within each retrieval step, apply the heuristics above — no-progress detection, marginal-insight thresholding, cost tracking. This layer does the runtime monitoring and applies interrupts when convergence signals fire.
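
Sketched together, Layers 2 and 3 can be as small as a bounded loop with two exit checks. The retrieve, score_novelty, synthesize, and cost_guard pieces below are assumed callables and objects; see the earlier sketches for possible shapes:

```python
MAX_PASSES = 3            # Layer 2: the plan itself is bounded
MIN_MARGINAL_GAIN = 0.2   # below this, another pass isn't worth paying for

def run_bounded_retrieval(query, retrieve, score_novelty, synthesize, cost_guard):
    """Layer 3 executor: a bounded number of passes, each gated by
    cost and diminishing-return checks."""
    context = [query]
    for _ in range(MAX_PASSES):
        chunks, input_tokens, output_tokens = retrieve(context)
        context.extend(chunks)

        if cost_guard.record_step(input_tokens, output_tokens) == "finalize":
            break                          # hard cost limit: stop retrieving
        if score_novelty(chunks, context) < MIN_MARGINAL_GAIN:
            break                          # diminishing returns: stop retrieving

    return synthesize(context)             # always end with a synthesis pass
```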

The interplay matters. Layer 1 prevents runaway cost. Layer 2 prevents architectural unboundedness. Layer 3 detects when the execution within a bounded plan is still converging inefficiently.

What "Enough Information" Actually Means

The fundamental difficulty is that information sufficiency is not a universal concept — it's task-specific. For a customer support agent answering a billing question, "enough" might mean having the account record and the relevant policy text. For a research synthesis agent, "enough" might mean having three independent sources that agree. For a code review agent, "enough" might mean having analyzed all changed files.

This specificity is why no framework ships convergence heuristics out of the box. The heuristic that works for one domain breaks for another. The engineering discipline is to make sufficiency explicit: define it per agent type, encode it as a termination condition, and test it with representative queries to measure the distribution of steps-to-convergence before deployment.

Agents that feel smart often hide bad convergence behavior behind good final answers. The model gets to the right output eventually, so the loop problem goes unnoticed. You only find it on the invoice. The teams that catch this early are the ones who look at step count distributions, not just output quality.

The Practical Starting Point

If you're deploying a retrieval agent and have no convergence logic today, the minimum viable improvement is three changes:

  1. Cap retrieval iterations at three. Override this for agents where you have data showing three isn't enough.
  2. Add no-progress detection: same tool plus same args within a session triggers a self-correction prompt; two occurrences trigger a hard stop.
  3. Instrument the step count distribution. Run a week of production traffic and look at the p95. If p95 is more than four times p50, you have a convergence problem worth solving.
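
The check in item 3 is a few lines once step counts are logged; the numbers below are hypothetical:

```python
import numpy as np

# One entry per session from a week of traffic (hypothetical numbers here;
# pull yours from the telemetry described above).
step_counts = [2, 2, 3, 1, 2, 8, 2, 3, 2, 11, 2, 2, 3, 2, 9]

p50, p95 = np.percentile(step_counts, [50, 95])
print(f"p50={p50:.1f}, p95={p95:.1f}, ratio={p95 / p50:.1f}")
if p95 > 4 * p50:
    print("Tail sessions are spiraling; convergence work will pay for itself.")
```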

Everything else — marginal-insight heuristics, semantic drift analysis, confidence convergence — is optimization you can add once you have baseline visibility. The $47,000 loop ran for eleven days. It would have been caught in two hours with a hard step count limit and a cost alert.

Convergence is not a model capability problem. The model can't tell you when it has enough information any more reliably than a junior analyst can tell you when a project is "done." It's a systems problem: you have to build the stopping criteria that the model doesn't have.
