Skip to main content

191 posts tagged with "agents"

View all tags

Your Tool Descriptions Are Prompts, Not API Docs

· 10 min read
Tian Pan
Software Engineer

The tool description is not documentation. It is the prompt the model reads, every single turn, to decide whether this tool fires and how. You are not writing for the developer integrating against the tool — the developer already has the schema, the types, the examples in the PR. You are writing for a stochastic reader that has never seen this codebase, is holding twenty other tool descriptions in the same context window, and has to pick one in the next forward pass.

Most teams don't. They paste the OpenAPI summary into the description field, stick the JSON Schema under it, and ship. Then the agent undercalls the tool, confidently calls the wrong adjacent tool, or fires the right tool with parameters that were "obviously" wrong to any human reading the schema. The team blames the model. The model was reading exactly what you wrote.

Tool Schema Deprecation: Why You Can't Just Rename a Parameter

· 11 min read
Tian Pan
Software Engineer

You renamed query to search_query on a tool schema. The changelog says "non-breaking: clearer naming." The PR passed review. Three days later, your support queue fills up with reports that the assistant is "searching for blank results." What happened is not what anyone on the thread would tell you. The agents did not fail. They submitted the old field name, your tool server ignored the unknown key, defaulted search_query to the empty string, and returned zero hits. The model, seeing a legitimate-looking empty response, confidently explained to the user why their query returned nothing relevant.

This is the part of agent engineering that does not fit the mental model borrowed from REST API versioning. A REST client that sends a renamed field gets a 400 and a clear error — the field either exists in the validator or it doesn't. An agent that sends a renamed field gets a silent acceptance, a nonsense result, and a hallucinated rationalization. The failure is not at the wire; it is in the joint between the runtime schema and the model's in-context mental model of what the tool looks like.

Tool schemas live in two places. The first is the runtime spec — the JSON schema you publish to the MCP server or the function-calling registry. The second is the model's in-context representation of that spec, reinforced every turn by few-shot examples in your system prompt, by the serialized tool history the agent sees on multi-turn tasks, and by whatever the model already absorbed about your API during pretraining. You can atomically update the first. You cannot atomically update the second. That asymmetry is the whole problem, and it is why "additive only, reserve forever" — the discipline that protobuf and GraphQL operators internalized a decade ago — needs to migrate to the tool-schema layer now.

Voice Agents Are Not Chat Agents With a Microphone: The Half-Duplex Tax

· 10 min read
Tian Pan
Software Engineer

A voice agent that scores perfectly on every transcript-level benchmark can still feel subtly wrong on a real call. The words are right. The reasoning is right. The latency number on your dashboard reads 520ms end-to-end, which was the target. And yet the person on the other end keeps stumbling, talking over the agent, restarting their sentences, hanging up early. The team ships a better model, the numbers improve, the feeling does not.

The reason has almost nothing to do with what the model says and almost everything to do with when it says it. Voice is not text with audio attached. Human conversation runs on a tight half-duplex protocol with barge-in, backchannel, and overlapping speech, and the timing budgets are measured in milliseconds. Most voice agent problems, once you get past the first week of hallucination fixes, are turn-negotiation problems. And turn negotiation is architectural — you cannot prompt your way out of it.

Design Your Agent State Machine Before You Write a Single Prompt

· 10 min read
Tian Pan
Software Engineer

Most engineers building their first LLM agent follow the same sequence: write a system prompt, add a loop that calls the model, sprinkle in some tool-calling logic, and watch it work on a simple test case. Six weeks later, the agent is an incomprehensible tangle of nested conditionals, prompt fragments pasted inside f-strings, and retry logic scattered across three files. Adding a feature requires reading the whole thing. A production bug means staring at a thousand-token context window trying to reconstruct what the model was "thinking."

This is the spaghetti agent problem, and it's nearly universal in teams that start with a prompt rather than a design. The fix isn't a better prompting technique or a different framework. It's a discipline: design your state machine before you write a single prompt.

The CAP Theorem for AI Agents: Choosing Consistency or Availability When Your LLM Is the Bottleneck

· 10 min read
Tian Pan
Software Engineer

Every engineer who has shipped a distributed system has stared at the CAP theorem and made a choice: when the network partitions, do you keep serving stale data (availability) or do you refuse to serve until you have a consistent answer (consistency)? The theorem tells you that you cannot have both.

AI agents face an identical tradeoff, and almost nobody is making it explicitly. When your LLM call times out, when a tool returns garbage, when a downstream API is unavailable — what does your agent do? In most production systems, the answer is: it guesses. Quietly. Confidently. And often wrong.

The failure mode isn't dramatic. There's no exception in the logs. The agent "answered" the user. You only find out two weeks later when someone asks why the system booked the wrong flight, extracted the wrong entity, or confidently told a customer a price that no longer exists.

The Compound Accuracy Problem: Why Your 95% Accurate Agent Fails 40% of the Time

· 11 min read
Tian Pan
Software Engineer

Your agent performs beautifully in isolation. You've benchmarked each step. You've measured per-step accuracy at 95%. You demo the system to stakeholders and it looks great. Then you ship it, and users report that it fails almost half the time.

The failure isn't a bug in any individual component. It's the math.

Graceful Tool-Call Failure: The Error Contract Your Agent UI Is Missing

· 11 min read
Tian Pan
Software Engineer

Every agent demo you've ever seen ended with a clean result. The tool call returned exactly the data the model expected, the response arrived in well under two seconds, and the final answer was crisp and correct. That's the demo. Production is something else.

In production, tools time out. APIs return 403s because a service account was rotated last Tuesday. Third-party enrichment endpoints return a 200 with a body that says {"status": "degraded", "data": null}. OAuth tokens expire at 3 AM on a Saturday. These aren't edge cases — they're the normal operating conditions of any agent that talks to the real world. The failure modes are predictable. The problem is that most agent architectures treat them as afterthoughts, and most agent UIs have no vocabulary for communicating them to users at all.

The Token Economy of Multi-Turn Tool Use: Why Your Agent Costs 5x More Than You Think

· 10 min read
Tian Pan
Software Engineer

Every team that builds an AI agent does the same back-of-the-envelope math: take the expected number of tool calls, multiply by the per-call cost, add a small buffer. That estimate is wrong before it leaves the whiteboard — not by 10% or 20%, but by 5 to 30 times, depending on agent complexity. Forty percent of agentic AI pilots get cancelled before reaching production, and runaway inference costs are the single most common reason.

The problem is structural. Single-call cost estimates assume each inference is independent. In a multi-turn agent loop, they are not. Every tool call grows the context that every subsequent call must pay for. The result is a quadratic cost curve masquerading as a linear one, and engineers don't discover it until the bill arrives.

The Cascade Problem: Why Agent Side Effects Explode at Scale

· 12 min read
Tian Pan
Software Engineer

A team ships a document-processing agent. It works flawlessly in development: reads files, extracts data, writes results to a database, sends a confirmation webhook. They run 50 test cases. All pass.

Two weeks after deployment, with a hundred concurrent agent instances running, the database has 40,000 duplicate records, three downstream services have received thousands of spurious webhooks, and a shared configuration file has been half-overwritten by two agents that ran simultaneously.

The agent didn't break. The system broke because no individual agent test ever had to share the world with another agent.

The Agent Specification Gap: Why Your Agents Ignore What You Write

· 12 min read
Tian Pan
Software Engineer

You wrote a careful spec. You described the task, listed the constraints, and gave examples. The agent ran — and did something completely different from what you wanted.

This is the specification gap: the distance between the instructions you write and the task the agent interprets. It's not a model capability problem. It's a specification problem. Research on multi-agent system failures published in 2025 found that specification-related issues account for 41.77% of all failures, and that 79% of production breakdowns trace back to how tasks were specified, not to what models can do.

The majority of teams writing agent specs are committing the same category of mistake: writing instructions the way you'd write an email to a competent colleague, then expecting an autonomous system with no shared context to execute them correctly across thousands of runs.

Designing for Partial Completion: When Your Agent Gets 70% Done and Stops

· 10 min read
Tian Pan
Software Engineer

Every production agent system eventually ships a failure nobody anticipated: the agent that books the flight, fails to find a hotel, and leaves a user with half a confirmed itinerary and no clear way to finish. Not a crash. Not a refusal. Just a stopped agent with real-world side effects and no plan for what comes next.

The standard mental model for agent failure is binary — succeed or abort. Retry logic, exponential backoff, fallback prompts — all of these assume a clean boundary between "task running" and "task done." But real agents fail somewhere in the middle, and when they do, the absence of partial-completion design becomes the bug. You didn't need a smarter model. You needed a task state machine.

Distributed Tracing Across Agent Service Boundaries: The Context Propagation Gap

· 11 min read
Tian Pan
Software Engineer

Most distributed tracing setups work fine until you add agents. The moment your system has Agent A spawning Agent B across a microservice boundary—Agent B calling a tool server, that tool server fetching from a vector database—the coherent end-to-end view shatters into disconnected fragments. Your tracing backend shows individual operations, but you've lost the causal chain that tells you why something happened, which user request triggered it, and where in the pipeline 800 milliseconds went.

This isn't a monitoring configuration problem. It's a context propagation architecture problem, and it has a specific technical shape that most teams discover the hard way.