Tool Calling in Production: The Loop, the Pitfalls, and What Actually Works
The first time your agent silently retries the same broken tool call three times before giving up, you realize that "just add tools" is not a production strategy. Tool calling unlocks genuine capabilities — external data, side effects, guaranteed-shape outputs — but the agentic loop that makes it work has sharp edges that don't show up in demos.
This post is about those edges: how the loop actually runs, the formatting rules that quietly destroy parallel execution, how to write tool descriptions that make the model choose correctly, and how to handle errors in a way that lets the model recover instead of spiral.
The Contract: Your Code Runs, Not the Model's
The foundational thing to understand about tool calling is that the model never executes anything. It emits a structured request. Your code executes the operation. You return the result. This is not a detail — it is the entire architecture.
When you provide tools in an API request, the model evaluates whether it needs one and, if so, returns a response with stop_reason: "tool_use" alongside a tool_use block containing the tool name and JSON arguments. Your application reads that block, dispatches to the right function, and sends back the result in a tool_result block on the next turn. The model then continues — potentially calling more tools, or producing a final answer.
This creates a client-side loop:
- Send request with tools defined
- If the response's `stop_reason == "tool_use"`, execute the requested tools
- Return results in a new user message with `tool_result` blocks
- Repeat until `stop_reason` is `"end_turn"` or another terminal value
The simplicity here is deceptive. Nearly every production issue with tool calling traces back to something going wrong in this loop: results formatted incorrectly, errors swallowed instead of surfaced, or the loop allowed to run without bounds. Get the loop right, and the rest follows.
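The loop above can be sketched in a few lines. This is a minimal, illustrative driver, not SDK code: `send` stands in for your API call (e.g. `client.messages.create`) and `execute` for your tool dispatch; the dict shapes mirror the Messages API response format described above.

```python
def run_loop(send, execute, messages, max_turns=10):
    """Drive the tool-use loop until the model stops requesting tools."""
    for _ in range(max_turns):
        response = send(messages)
        if response["stop_reason"] != "tool_use":
            return response  # terminal: "end_turn", "max_tokens", etc.

        # Record the assistant turn, including its tool_use blocks.
        messages.append({"role": "assistant", "content": response["content"]})

        # Execute every requested tool; return ALL results in ONE user message.
        results = [
            {
                "type": "tool_result",
                "tool_use_id": block["id"],
                "content": execute(block["name"], block["input"]),
            }
            for block in response["content"]
            if block["type"] == "tool_use"
        ]
        messages.append({"role": "user", "content": results})
    raise RuntimeError("agentic loop exceeded max_turns")
```

Note the hard `max_turns` bound — more on that below.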
Parallel Tool Calls: The Formatting Rule That Breaks Everything
Modern models can request multiple tools in a single turn — for example, fetching the weather in three cities simultaneously rather than one by one. This matters for latency. A sequential chain of four network calls at 200ms each costs 800ms; parallel execution collapses that to 200ms.
But parallel tool execution only stays parallel if you format the results correctly. The rule is strict: all tool results from a single model turn must be returned in a single user message.
If you split them into separate messages, the conversation history looks like a sequential exchange — one request, one result, one request, one result — and the model learns from that history that it should make tool calls one at a time. You don't get an error. The model just quietly becomes less efficient over time as it adapts to the pattern you've shown it.
The correct format looks like this:
[assistant turn] → [tool_use_1, tool_use_2, tool_use_3]
[user turn] → [tool_result_1, tool_result_2, tool_result_3] // single message
The incorrect format:
[assistant turn] → [tool_use_1, tool_use_2]
[user turn] → [tool_result_1] // ❌ separate message
[user turn] → [tool_result_2] // ❌ separate message
There is also a secondary constraint: within a user message containing tool results, the tool_result blocks must come before any text. Putting text first causes a 400 error. This catches people off guard when they try to add context like "Here are the results:" above the results — that text needs to go after, not before.
If your application drives the agentic loop manually, check your message assembly code against both of these rules. They are the most common source of silent degradation in production tool-calling systems.
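Both rules are easy to enforce in one place. A small assembly helper, assuming the `tool_result` block shape from the Messages API (the helper itself and its signature are illustrative):

```python
def build_tool_result_message(results, note=None):
    """Pack all results from one model turn into a single user message.

    results: list of (tool_use_id, content, is_error) tuples.
    note: optional text — appended AFTER the tool_result blocks, since
    text before them causes a 400 error.
    """
    blocks = []
    for tool_use_id, content, is_error in results:
        block = {"type": "tool_result", "tool_use_id": tool_use_id, "content": content}
        if is_error:
            block["is_error"] = True
        blocks.append(block)
    if note:
        blocks.append({"type": "text", "text": note})
    return {"role": "user", "content": blocks}
```

Routing every result through a function like this makes it structurally impossible to split results across messages or put text first.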
Writing Tool Descriptions That Actually Work
The model decides which tool to call — and whether to call one at all — primarily based on your tool descriptions. Schema correctness matters, but it is not the limiting factor. The description is.
A description like "Gets the stock price for a ticker" leaves the model with open questions: What exchanges does this cover? What does it return if the ticker is invalid? Should I use it for historical prices or only current prices? When the model is uncertain, it either guesses wrong or avoids the tool entirely.
A description that works fills in those questions explicitly:
Retrieves the current stock price for a given ticker symbol. The ticker must be a valid symbol for a publicly traded company on a major US stock exchange (NYSE or NASDAQ). Returns the latest trade price in USD. Use this when the user asks about the current or most recent price of a specific stock. It will not return historical prices, options data, or company fundamentals.
That last sentence — what the tool does not return — is often the most important part. It tells the model when not to call the tool, which prevents it from calling the wrong tool for a related but different task.
A few additional description principles that reduce misuse in practice:
- Describe edge behavior. If the tool returns an empty result for unknown inputs rather than an error, say so. The model needs to know what "no results" looks like.
- Use namespacing for related tools. If you have multiple tools spanning different services, prefix with the service name: `github_list_prs`, `slack_send_message`, `db_query_users`. This makes tool selection unambiguous as your tool set grows.
- Consolidate related operations. Rather than creating `create_pr`, `review_pr`, and `merge_pr` as separate tools, consider a single `github_pr` tool with an `action` parameter. Fewer tools reduce selection ambiguity and make the surface area easier to navigate.
For tools with complex inputs — nested objects, optional parameters that change behavior significantly, format-sensitive strings — add concrete examples. Example inputs are included in the prompt alongside the schema and give the model a pattern to follow rather than having to infer the shape from the schema alone.
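Putting the pieces together, a full definition for the stock-price example might look like this — the `name` / `description` / `input_schema` shape follows the Messages API tool format, while the example string in the parameter description is the kind of concrete hint discussed above:

```python
get_stock_price_tool = {
    "name": "get_stock_price",
    "description": (
        "Retrieves the current stock price for a given ticker symbol. "
        "The ticker must be a valid symbol for a publicly traded company "
        "on a major US stock exchange (NYSE or NASDAQ). Returns the latest "
        "trade price in USD. Use this when the user asks about the current "
        "or most recent price of a specific stock. It will not return "
        "historical prices, options data, or company fundamentals."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "ticker": {
                "type": "string",
                "description": "Stock ticker symbol, e.g. 'AAPL' or 'MSFT'.",
            }
        },
        "required": ["ticker"],
    },
}
```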
Error Handling: Give the Model Enough Context to Recover
When a tool fails, you have two choices: return the error and let the model adapt, or swallow it and leave the model confused. The right choice is almost always to return the error.
The mechanism is the is_error: true flag on the tool_result block. Set this when your tool threw an exception, hit a rate limit, received an upstream HTTP error, or otherwise failed to produce a valid result. The model reads the error and adjusts its behavior — retrying with different parameters, trying a different tool, or surfacing the failure to the user.
What you put in the error message determines whether the model can recover. Generic messages like "failed" give the model nothing to work with. Instructive messages do:
- "Rate limit exceeded. Retry after 60 seconds." — the model knows to wait, not try again immediately
- "Location not found. Try a more specific city name or include the country." — the model knows to refine the input
- "Database connection timed out after 5s. The query was: SELECT * FROM orders WHERE user_id = 123." — gives both the failure mode and enough context to retry
When tool inputs are invalid — missing required parameters, wrong types — you can return an is_error: true result with the validation failure, and the model will typically retry 2-3 times with corrections before giving up. If you want to eliminate invalid calls entirely, use strict: true on your tool definitions. Strict mode enforces your JSON schema exactly, so the model either produces a valid call or doesn't call the tool at all. No malformed inputs reach your application.
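A thin wrapper around tool execution keeps failures inside the loop instead of crashing it. The `is_error` flag on `tool_result` is the API mechanism described above; the exception-to-message mapping here is a sketch you'd adapt to your own error types:

```python
def execute_tool_safely(fn, tool_use_id, tool_input):
    """Run a tool function; turn any failure into an is_error tool_result."""
    try:
        output = fn(**tool_input)
        return {
            "type": "tool_result",
            "tool_use_id": tool_use_id,
            "content": str(output),
        }
    except Exception as exc:
        # Surface an instructive message the model can act on, not just "failed".
        return {
            "type": "tool_result",
            "tool_use_id": tool_use_id,
            "content": f"{type(exc).__name__}: {exc}",
            "is_error": True,
        }
```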
When Not to Use Tools
Every tool call is at least one additional API round trip. For a tool that makes a network request, you're stacking latency: the model turn, your network call, and another model turn for the response. If the task is lightweight, the overhead can exceed the work.
Tools don't fit when:
- The model can answer from training. Summarization, translation, general-knowledge questions. Adding a tool here adds latency with no benefit.
- The interaction is one-shot Q&A with no side effects. If there's nothing to fetch or execute, there's nothing for a tool to do.
- You're using tools to extract structure from model output. If you find yourself parsing the model's free-form response to extract a decision, that decision should have been a tool call — but the solution is to restructure, not to add a parsing tool on top.
The tell that you're overusing tools: you have a tool named something like provide_answer or return_result that does nothing except receive structured output from the model. This pattern exists, and it solves a real need (guaranteed output shape), but it should be replaced by structured output mode or tool_choice: {"type": "tool", "name": "..."} with a purpose-built schema. Don't create a fake tool to get a JSON response — use the API features designed for that.
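As a sketch of the restructured version: instead of a fake `provide_answer` tool, define a purpose-built schema and force it with `tool_choice`. The `tool_choice` shape is the Messages API format quoted above; the `record_sentiment` schema is an illustrative example, not a real tool:

```python
record_sentiment = {
    "name": "record_sentiment",
    "description": "Record the sentiment classification of the user's message.",
    "input_schema": {
        "type": "object",
        "properties": {
            "sentiment": {
                "type": "string",
                "enum": ["positive", "neutral", "negative"],
            },
            "confidence": {"type": "number", "minimum": 0, "maximum": 1},
        },
        "required": ["sentiment", "confidence"],
    },
}

# Passed alongside tools in the request, this guarantees the model's reply
# is a single tool_use call against the schema above — no free-form text
# to parse.
forced_choice = {"type": "tool", "name": "record_sentiment"}
```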
The Loop Needs Bounds
One last production concern: the agentic loop must terminate. A model that keeps calling tools — hitting errors, retrying, getting different errors — can run indefinitely if your application doesn't stop it.
Set a maximum iteration count before you start. Five to ten turns is reasonable for most workflows; complex research or coding tasks might warrant more. When the limit is hit, return what you have or surface the failure explicitly. Running without bounds risks unexpected API costs, downstream rate limits, and stuck sessions that hold resources without resolving.
Timeouts on individual tool calls are equally important. An external API that hangs for 30 seconds doesn't just slow one request — it ties up the conversation state and any UI waiting on it. Set per-call timeouts, return the error, and let the model decide whether to retry or continue without that data.
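One way to get per-call timeouts with only the standard library is to run the tool in a worker thread and bound the wait; a hung call becomes an `is_error` result the model can react to. The timeout value and message wording here are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def run_with_timeout(fn, tool_use_id, tool_input, timeout_s=10.0):
    """Execute a tool with a deadline; timeouts become is_error results."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn, **tool_input)
        try:
            content = str(future.result(timeout=timeout_s))
            return {
                "type": "tool_result",
                "tool_use_id": tool_use_id,
                "content": content,
            }
        except FutureTimeout:
            return {
                "type": "tool_result",
                "tool_use_id": tool_use_id,
                "content": (
                    f"Tool call timed out after {timeout_s}s. "
                    "Retry, or continue without this data."
                ),
                "is_error": True,
            }
```

One caveat with thread-based timeouts: the hung call isn't killed, only abandoned, so truly stuck I/O should also carry its own socket-level timeout.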
Tool calling is one of the highest-leverage features you can add to an LLM application. On complex benchmarks, even basic tools produce outsized capability gains. But that leverage only materializes when the loop is sound: results formatted correctly, descriptions written precisely, errors returned with enough context to recover from, and bounds in place to prevent runaway execution. Treat the agentic loop like the async state machine it actually is, and most of the sharp edges disappear.
- https://platform.claude.com/docs/en/agents-and-tools/tool-use/how-tool-use-works
- https://platform.claude.com/docs/en/agents-and-tools/tool-use/handle-tool-calls
- https://platform.claude.com/docs/en/agents-and-tools/tool-use/parallel-tool-use
- https://platform.claude.com/docs/en/agents-and-tools/tool-use/define-tools
- https://www.anthropic.com/engineering/building-effective-agents
