Tool Use in Production: Function Calling Patterns That Actually Work
The most surprising thing about LLM function calling failures in production is where they come from. Not hallucinated reasoning. Not the model picking the wrong tool. The number one cause of agent flakiness is argument construction: wrong types, missing required fields, malformed JSON, hallucinated extra fields. The model is fine. Your schema is the problem.
This is good news, because schemas are cheap to fix.
The Six-Phase Lifecycle You're Actually Operating
Before patterns, the mental model: the LLM never executes code. It proposes. You execute. Tool calling is a protocol:
- Context Preparation — system prompt + tool definitions injected into the context window
- Decision Phase — the model decides whether and which tool to invoke
- Argument Construction — the LLM generates structured JSON matching your schema
- Execution — your application code runs the actual function
- Observation Injection — the result is appended to conversation history
- Continuation — the model either answers or makes more tool calls
Every production bug in this pipeline is a handoff bug. Either the schema was underspecified (step 3), the execution didn't validate the output (step 4), or the result was injected without sanitization (step 5). The model rarely fails on its own — it fails when the contract between proposer and executor is ambiguous.
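The handoff structure can be sketched as a minimal proposer/executor loop. Everything here is a hypothetical stand-in, not a specific provider SDK: `call_model` represents the LLM API, and `TOOL_REGISTRY` maps tool names to your implementation callables.

```python
import json

# Hypothetical registry: tool name -> implementation callable
TOOL_REGISTRY = {"get_weather": lambda city: {"temp_c": 21, "city": city}}

def run_turn(messages, tools, call_model):
    while True:
        response = call_model(messages=messages, tools=tools)  # phases 1-3
        if response["type"] != "tool_call":
            return response["content"]                         # phase 6: final answer
        fn = TOOL_REGISTRY[response["name"]]
        result = fn(**response["arguments"])                   # phase 4: we execute
        messages.append({                                      # phase 5: observation
            "role": "tool",
            "content": json.dumps(result),
        })
```

The key property: the model never touches `fn` directly. Every tool call passes through your code, which is where validation, authorization, and logging belong.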
Schema Design Is Your Highest-Leverage Investment
In one production analysis, GPT-4 achieved less than 40% schema compliance without enforced output schemas. With OpenAI Structured Outputs enforced: 100% structural compliance. Error rates in multi-step workflows dropped from ~5% to under 0.3%. Complex multi-step accuracy improved from 10% to 70%. The schema is not a formality. It's the interface contract.
Name tools with intent
search_customer_orders beats search. The tool name is part of the routing signal the model uses to decide which tool to call. Vague names produce vague routing.
Use additionalProperties: false on state-modifying tools
On read-only tools this is optional. On tools that write to databases or call external APIs with side effects, lock down the schema. A hallucinated extra field silently passed through to a downstream API can corrupt state in ways that are hard to debug.
```json
{
  "name": "create_order",
  "description": "Create a new customer order. Only call after confirming item availability.",
  "input_schema": {
    "type": "object",
    "properties": {
      "customer_id": { "type": "string", "description": "UUID of the authenticated customer" },
      "item_sku": { "type": "string", "description": "Product SKU from the catalog" },
      "quantity": { "type": "integer", "minimum": 1, "maximum": 100 }
    },
    "required": ["customer_id", "item_sku", "quantity"],
    "additionalProperties": false
  }
}
```
Use enum for categorical parameters
Never use an open string where a bounded set exists. If the model can call update_ticket_status with any string it invents, it will. If you constrain it to ["open", "in_progress", "resolved", "closed"], the option space collapses to the correct one.
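For the hypothetical `update_ticket_status` tool, the constrained parameter would look like:

```json
"status": {
  "type": "string",
  "enum": ["open", "in_progress", "resolved", "closed"],
  "description": "New ticket status. Must be one of the listed values."
}
```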
Prefer semantic parameter names
user_email over user_id. The model can reason about an email address and detect when it has the wrong thing. It cannot reason about an opaque UUID. When a parameter is unclear, the model guesses — and guesses wrong.
Add concrete examples directly to descriptions
Anthropic's internal testing showed accuracy improvements from 72% to 90% on complex parameter handling just by adding concrete examples to schema descriptions. Don't just say what a parameter is. Show what a valid value looks like:
```json
"order_date": {
  "type": "string",
  "description": "Order date in ISO 8601 format. Example: '2025-10-12'. Do not include time."
}
```
The Schema Gate Pattern
Position JSON Schema validation as a mandatory gate between LLM output and tool execution. Invalid arguments never pass through. They return a structured error that the model can act on.
```python
import jsonschema

def execute_tool(tool_name: str, arguments: dict, schema: dict):
    try:
        jsonschema.validate(instance=arguments, schema=schema)
    except jsonschema.ValidationError as e:
        # Return the error to the LLM as a tool result, not a crash
        return {
            "error": "invalid_arguments",
            "message": e.message,
            "path": list(e.absolute_path),
        }
    # TOOL_REGISTRY maps tool names to their implementation callables
    return TOOL_REGISTRY[tool_name](**arguments)
```
The model receives this error as a tool result and self-corrects. Cap self-correction attempts at 3. After that, route to a failure handler or escalate to a human. An uncapped correction loop is an infinite loop waiting to happen.
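The capped loop can be sketched as below. The helper names are hypothetical: `execute` stands in for running the tool with validation, and `reformulate` stands in for sending the error back to the model and getting corrected arguments.

```python
MAX_ATTEMPTS = 3  # cap from the text: after 3 failed reformulations, escalate

def call_with_correction(execute, arguments, reformulate):
    """Run a tool, feeding validation errors back for up to MAX_ATTEMPTS tries.

    `execute` returns either a result or a dict with an "error" key;
    `reformulate` asks the model for corrected arguments given that error.
    """
    for _ in range(MAX_ATTEMPTS):
        result = execute(arguments)
        if not (isinstance(result, dict) and result.get("error")):
            return result                # success: pass the result through
        arguments = reformulate(result)  # model proposes corrected arguments
    # Route to a failure handler or human escalation beyond this point
    raise RuntimeError(f"arguments still invalid after {MAX_ATTEMPTS} attempts")
```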
Parallel vs. Sequential: Architect Explicitly
Most LLMs default to sequential tool calling because sequential reasoning dominates training data. They won't parallelize unless you design for it.
The rule is simple: if Tool B does not depend on Tool A's output, run them in parallel. Read-only lookups are almost always safe to parallelize. State mutations almost never are.
Real-world impact: MiniMax-M2.5 reported end-to-end runtime decreasing from 31.3 minutes to 22.8 minutes, roughly a 27% reduction, by better utilizing parallel tool calling. The gain is real, but you have to explicitly signal when parallel is safe.
Tell the model in your system prompt:
```
When retrieving independent records (e.g., user profiles, product details, external data),
you may call multiple tools simultaneously. Do not serialize independent lookups.
```
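On the execution side, independent read-only lookups can be dispatched concurrently. A sketch with `asyncio`, where `fetch_profile` and `fetch_orders` are hypothetical read-only tools:

```python
import asyncio

async def fetch_profile(user_id: str) -> dict:
    await asyncio.sleep(0.05)  # simulated I/O latency
    return {"user_id": user_id, "name": "Ada"}

async def fetch_orders(user_id: str) -> list:
    await asyncio.sleep(0.05)
    return [{"sku": "A-1", "qty": 2}]

async def run_parallel(user_id: str):
    # Both calls are read-only and independent: safe to parallelize.
    # Total latency is max(call latencies), not their sum.
    profile, orders = await asyncio.gather(
        fetch_profile(user_id), fetch_orders(user_id)
    )
    return profile, orders
```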
For more complex orchestration, consider programmatic tool calling — letting the model emit a code block that expresses loops and parallel dispatches, rather than making one API round-trip per operation. Anthropic's research found this reduced token consumption by 37% on complex research tasks while naturally expressing parallelism.
Error Handling: Two Different Problems
Most engineers apply a single retry pattern to all failures. This is wrong. There are two fundamentally different failure types, and they need different responses.
Network/infrastructure failures (rate limits, 5xx errors, timeouts): retry with exponential backoff and jitter. Base delay 1–2 seconds, double each attempt, cap at 5–7 retries. Add jitter to prevent thundering herd.
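A sketch of that retry policy, with the numbers from the text as defaults (1s base, doubling, capped attempts, jitter added to each delay):

```python
import random
import time

def backoff_delays(base=1.0, factor=2.0, max_retries=5, jitter=0.5):
    """Yield the sleep duration before each retry attempt."""
    for attempt in range(max_retries):
        delay = base * (factor ** attempt)
        yield delay + random.uniform(0, jitter)  # jitter avoids thundering herd

def call_with_retry(fn, delays=None, retryable=(TimeoutError,)):
    # Only infrastructure failures are retryable; argument failures
    # should go back to the model instead (see below).
    for delay in (delays if delays is not None else backoff_delays()):
        try:
            return fn()
        except retryable:
            time.sleep(delay)
    return fn()  # final attempt; let the exception propagate
```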
Tool argument failures (validation errors, business logic rejections): don't retry the same call. Send the error back to the LLM as a tool result. The model needs to reformulate, not repeat.
```python
import json

class MaxReformulationsExceeded(Exception):
    """Raised when the model keeps producing failing tool calls."""

def handle_tool_result(tool_name: str, result: dict, reformulation_count: int):
    if "error" in result:
        if reformulation_count >= 3:
            raise MaxReformulationsExceeded(tool_name, result)
        # Inject error as tool result message; the LLM will self-correct
        return {"type": "tool_result", "content": json.dumps(result), "is_error": True}
    return {"type": "tool_result", "content": json.dumps(result)}
```
For persistent downstream failures, add circuit breakers. Monitor failure rates over time. When a downstream service is consistently failing, fail fast rather than queuing requests. The circuit breaker sits above the retry layer and prevents it from firing indefinitely.
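A minimal sketch of that breaker, with illustrative threshold and cooldown values: open after N consecutive failures, reject calls fast while open, then let one probe through after the cooldown.

```python
import time

class CircuitBreaker:
    def __init__(self, threshold=5, cooldown=30.0):
        self.threshold = threshold    # consecutive failures before opening
        self.cooldown = cooldown      # seconds before allowing a probe
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            self.opened_at = None     # half-open: let one probe through
            self.failures = 0
            return True
        return False                  # fail fast while open

    def record(self, success: bool):
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
```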
The Token Cost You're Not Counting
Tool definitions consume context tokens at rest — before any user message. A large tool catalog loaded upfront means you're paying for every unused tool definition in every request.
The concrete numbers are stark: connecting three services (GitHub, Slack, and Sentry) via MCP can consume 143,000 of a 200,000-token context window — 72% overhead before the conversation starts. At 1,000 requests/day with substantial tool definitions, schema overhead alone can run $5,000+/month.
Three mitigation patterns:
Domain-grouped loading: don't load all tools upfront. Identify the user's intent first, then load only the relevant tool group. A customer support agent doesn't need code analysis tools.
Tool search instead of tool injection: Anthropic's research on dynamic tool discovery showed task accuracy improving from 49% to 74% on Opus 4 by letting the model search for the right tool at runtime rather than having all definitions pre-loaded. The tradeoff is an extra round-trip; the benefit is dramatically reduced context overhead.
Result summarization: in long agentic chains, intermediate tool results accumulate and compete for context with current reasoning. After injecting a tool result, summarize it if it's large. The full result is in your backend logs; the model needs the signal, not every token.
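One simple guard, sketched below: cap what reaches the context and keep the full result in your logs. The threshold and head-truncation strategy are illustrative choices; a real system might call a cheap model to summarize instead of truncating.

```python
import json

MAX_RESULT_CHARS = 2000  # illustrative budget per tool result

def compact_tool_result(result: dict) -> str:
    payload = json.dumps(result)
    if len(payload) <= MAX_RESULT_CHARS:
        return payload
    # Full result stays in backend logs; the model gets the head plus a marker.
    return payload[:MAX_RESULT_CHARS] + f"... [truncated, {len(payload)} chars total]"
```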
Security: Every Tool Is an Attack Surface
Treat tool execution as remote code execution controlled by a stochastic model. That's not an overstatement.
Prompt injection via tool results: malicious content can appear in tool output — a customer name containing instructions to "ignore previous instructions," a document containing adversarial text. Sanitize tool results before injecting them into context. Treat tool result content as untrusted user input, not trusted system output.
Parameter-level authorization: validate that user_id in the tool call arguments matches the authenticated user. Don't assume the model will only request what it's authorized to see — verify authorization at the execution layer.
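A sketch of that execution-layer gate, reusing the `customer_id` parameter from the `create_order` schema above; the session-user argument is a hypothetical stand-in for your auth context:

```python
class AuthorizationError(Exception):
    pass

def authorize_arguments(tool_name: str, arguments: dict, session_user_id: str):
    """Reject any tool call whose user-scoped argument doesn't match the
    authenticated session, regardless of what the model requested."""
    requested = arguments.get("customer_id")
    if requested is not None and requested != session_user_id:
        raise AuthorizationError(
            f"{tool_name}: customer_id {requested!r} does not match session user"
        )
    return arguments
```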
Minimal permissions: tools should expose only the capability needed for the task. A customer-facing agent doesn't need a delete_all_records tool. If the tool exists in the registry, the model can call it.
Comprehensive audit logging: log every tool invocation with the caller identity, arguments, result, and timestamp. This is table stakes for debugging production failures, and a regulatory requirement in many domains.
Failure Taxonomy for Debugging
When an agent fails in production, the failure type determines the fix:
| Failure Type | Example | Fix |
|---|---|---|
| Structural | Malformed JSON, missing required field | Schema enforcement + validation gate |
| Semantic | Correct tool, wrong argument values | Better descriptions + examples in schema |
| Selection | Wrong tool chosen for the task | Clearer tool names + explicit disambiguation in descriptions |
| Chaining | Tool A output used incorrectly as Tool B input | Add transformation step or specify data contracts |
| Loop | Tool fails → LLM retries same call forever | Cap reformulation attempts, implement failure handler |
| Context overflow | 200K window exhausted after 10 tool calls | Summarize intermediate results, lazy-load tool definitions |
Most production debugging time is spent on semantic failures, because they pass schema validation but produce wrong results. The fix is better examples in descriptions and behavioral testing with real production traffic patterns.
What the Stack Looks Like Now
Tool use has matured from a novelty to infrastructure. OpenAI's Structured Outputs enforces schema compliance at the model level. Anthropic's MCP is now the standard protocol for tool discovery, adopted by OpenAI (March 2025) and Google DeepMind (April 2025), with 97M+ monthly SDK downloads and 5,800+ servers. LangChain's old text-parsing approach is obsolete; native function calling and direct provider APIs are standard.
The frontier is managing context overhead and enabling genuine parallelism. Models can parallelize tool calls when you architect for it. Context costs compound in agentic loops when you don't manage them. These are solvable engineering problems, not model capability problems.
The schema is the cheapest fix you have. Start there.
