161 posts tagged with "agents"

Graceful Tool-Call Failure: The Error Contract Your Agent UI Is Missing

· 11 min read
Tian Pan
Software Engineer

Every agent demo you've ever seen ended with a clean result. The tool call returned exactly the data the model expected, the response arrived in well under two seconds, and the final answer was crisp and correct. That's the demo. Production is something else.

In production, tools time out. APIs return 403s because a service account was rotated last Tuesday. Third-party enrichment endpoints return a 200 with a body that says {"status": "degraded", "data": null}. OAuth tokens expire at 3 AM on a Saturday. These aren't edge cases — they're the normal operating conditions of any agent that talks to the real world. The failure modes are predictable. The problem is that most agent architectures treat them as afterthoughts, and most agent UIs have no vocabulary for communicating them to users at all.
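
One way to give those failures a vocabulary is a small, typed result that both the agent loop and the UI can read. The sketch below is a minimal illustration, not the post's actual contract: `ToolStatus`, `ToolResult`, and `classify_response` are hypothetical names, and the rule that a 200 with `"status": "degraded"` or null data counts as a failure is an assumption drawn from the example above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Optional


class ToolStatus(Enum):
    OK = "ok"
    DEGRADED = "degraded"        # HTTP 200, but the body reports a problem
    AUTH_FAILED = "auth_failed"  # 401/403, e.g. a rotated service account
    TIMEOUT = "timeout"
    ERROR = "error"


@dataclass
class ToolResult:
    status: ToolStatus
    data: Optional[Any] = None
    retryable: bool = False
    user_message: str = ""


def classify_response(http_status: Optional[int], body: Optional[dict]) -> ToolResult:
    """Map a raw tool response onto the error contract.

    http_status is None when the call timed out (an assumption of this sketch).
    """
    if http_status is None:
        return ToolResult(ToolStatus.TIMEOUT, retryable=True,
                          user_message="The tool did not respond in time.")
    if http_status in (401, 403):
        return ToolResult(ToolStatus.AUTH_FAILED,
                          user_message="The tool's credentials were rejected.")
    if http_status == 200:
        body = body or {}
        # A 200 is not success if the payload says otherwise.
        if body.get("status") == "degraded" or body.get("data") is None:
            return ToolResult(ToolStatus.DEGRADED, retryable=True,
                              user_message="The tool answered but returned no usable data.")
        return ToolResult(ToolStatus.OK, data=body["data"])
    # Treat 5xx as transient and everything else as a hard error.
    return ToolResult(ToolStatus.ERROR, retryable=http_status >= 500,
                      user_message=f"The tool failed with HTTP {http_status}.")
```

With a shape like this, the UI can branch on `status` and `retryable` instead of parsing free-form error text.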

The Token Economy of Multi-Turn Tool Use: Why Your Agent Costs 5x More Than You Think

· 10 min read
Tian Pan
Software Engineer

Every team that builds an AI agent does the same back-of-the-envelope math: take the expected number of tool calls, multiply by the per-call cost, add a small buffer. That estimate is wrong before it leaves the whiteboard — not by 10% or 20%, but by 5 to 30 times, depending on agent complexity. Forty percent of agentic AI pilots get cancelled before reaching production, and runaway inference costs are the single most common reason.

The problem is structural. Single-call cost estimates assume each inference is independent. In a multi-turn agent loop, they are not. Every tool call grows the context that every subsequent call must pay for. The result is a quadratic cost curve masquerading as a linear one, and engineers don't discover it until the bill arrives.
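
The arithmetic behind that curve can be sketched in a few lines. The numbers below are illustrative assumptions, not measurements from the post: a 2,000-token starting context, 1,500 tokens appended per tool call, and 20 turns. The function names are hypothetical.

```python
def naive_input_tokens(base_tokens: int, turns: int) -> int:
    """Whiteboard estimate: every call pays only for the starting context."""
    return base_tokens * turns


def cumulative_input_tokens(base_tokens: int, tokens_added_per_turn: int, turns: int) -> int:
    """Loop reality: call i re-reads the base context plus everything
    appended by the i earlier calls, so the total grows quadratically."""
    return sum(base_tokens + i * tokens_added_per_turn for i in range(turns))


if __name__ == "__main__":
    base, added, turns = 2_000, 1_500, 20  # illustrative values only
    naive = naive_input_tokens(base, turns)
    actual = cumulative_input_tokens(base, added, turns)
    print(naive, actual, actual / naive)  # 40000 325000 8.125
```

The closed form is `turns * base + added * turns * (turns - 1) / 2`; the second term is the one the linear estimate leaves out.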

The Cascade Problem: Why Agent Side Effects Explode at Scale

· 12 min read
Tian Pan
Software Engineer

A team ships a document-processing agent. It works flawlessly in development: reads files, extracts data, writes results to a database, sends a confirmation webhook. They run 50 test cases. All pass.

Two weeks after deployment, with a hundred concurrent agent instances running, the database has 40,000 duplicate records, three downstream services have received thousands of spurious webhooks, and a shared configuration file has been half-overwritten by two agents that ran simultaneously.

The agent didn't break. The system broke because no individual agent test ever had to share the world with another agent.

The Agent Specification Gap: Why Your Agents Ignore What You Write

· 12 min read
Tian Pan
Software Engineer

You wrote a careful spec. You described the task, listed the constraints, and gave examples. The agent ran — and did something completely different from what you wanted.

This is the specification gap: the distance between the instructions you write and the task the agent interprets. It's not a model capability problem. It's a specification problem. Research on multi-agent system failures published in 2025 found that specification-related issues account for 41.77% of all failures, and that 79% of production breakdowns trace back to how tasks were specified, not to what models can do.

The majority of teams writing agent specs are committing the same category of mistake: writing instructions the way you'd write an email to a competent colleague, then expecting an autonomous system with no shared context to execute them correctly across thousands of runs.

Designing for Partial Completion: When Your Agent Gets 70% Done and Stops

· 10 min read
Tian Pan
Software Engineer

Every production agent system eventually ships a failure nobody anticipated: the agent that books the flight, fails to find a hotel, and leaves a user with half a confirmed itinerary and no clear way to finish. Not a crash. Not a refusal. Just a stopped agent with real-world side effects and no plan for what comes next.

The standard mental model for agent failure is binary — succeed or abort. Retry logic, exponential backoff, fallback prompts — all of these assume a clean boundary between "task running" and "task done." But real agents fail somewhere in the middle, and when they do, the absence of partial-completion design becomes the bug. You didn't need a smarter model. You needed a task state machine.
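
A task state machine can be quite small. The following is a minimal sketch under stated assumptions: steps run in order, each step is a plain callable, and the names `Step`, `Task`, and `StepState` are hypothetical rather than taken from the post.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class StepState(Enum):
    PENDING = "pending"
    DONE = "done"
    FAILED = "failed"


@dataclass
class Step:
    name: str
    action: Callable[[], None]
    state: StepState = StepState.PENDING
    error: str = ""


@dataclass
class Task:
    steps: List[Step]

    def run(self) -> None:
        """Run steps in order; stop at the first failure but keep the record."""
        for step in self.steps:
            if step.state is StepState.DONE:
                continue  # already completed on an earlier run, do not repeat it
            try:
                step.action()
                step.state = StepState.DONE
                step.error = ""
            except Exception as exc:
                step.state = StepState.FAILED
                step.error = str(exc)
                return

    def status(self) -> str:
        states = {step.state for step in self.steps}
        if states == {StepState.DONE}:
            return "completed"
        if StepState.DONE in states:
            return "partially_completed"
        if StepState.FAILED in states:
            return "failed"
        return "not_started"
```

Because completed steps are skipped, calling `run()` again resumes at the failed step instead of rebooking the flight, and `status()` gives the UI a third answer besides success and abort.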

Distributed Tracing Across Agent Service Boundaries: The Context Propagation Gap

· 11 min read
Tian Pan
Software Engineer

Most distributed tracing setups work fine until you add agents. The moment your system has Agent A spawning Agent B across a microservice boundary—Agent B calling a tool server, that tool server fetching from a vector database—the coherent end-to-end view shatters into disconnected fragments. Your tracing backend shows individual operations, but you've lost the causal chain that tells you why something happened, which user request triggered it, and where in the pipeline 800 milliseconds went.

This isn't a monitoring configuration problem. It's a context propagation architecture problem, and it has a specific technical shape that most teams discover the hard way.
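
To make the shape of the problem concrete, here is a minimal sketch of carrying one trace across a hop using the W3C Trace Context `traceparent` header format (`version-traceid-parentid-flags`). The helper names are hypothetical, and a real system would normally lean on a tracing SDK rather than hand-rolled strings; the point is only that the trace ID must survive every agent-to-agent and agent-to-tool call.

```python
import re
import secrets

# version "00", 32 hex chars of trace id, 16 hex chars of parent span id, 2 hex flag chars
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent() -> str:
    """Start a new trace at the edge, e.g. when the user request arrives."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(incoming: str) -> str:
    """Keep the trace id, mint a fresh span id for the next hop."""
    match = _TRACEPARENT.match(incoming)
    if match is None:
        raise ValueError(f"malformed traceparent: {incoming!r}")
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


def outgoing_headers(incoming_headers: dict) -> dict:
    """Headers Agent A would attach when calling Agent B or a tool server."""
    incoming = incoming_headers.get("traceparent")
    value = child_traceparent(incoming) if incoming else new_traceparent()
    return {"traceparent": value}
```

If any hop drops the header and falls through to `new_traceparent()`, the backend sees a new, unrelated trace, which is exactly the fragmentation described above.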

The Idempotency Problem in Agentic Tool Calling

· 11 min read
Tian Pan
Software Engineer

The scenario plays out the same way every time. Your agent is booking a hotel room, and a network timeout occurs right after the payment API call returns 200 but before the confirmation is stored. The agent framework retries. The payment runs again. The customer is charged twice, support escalates, and someone senior says the AI "hallucinated a double charge" — which is wrong but feels right because nobody wants to say their retry logic was broken from the start.

This isn't an AI problem. It's a distributed systems problem that the AI layer imported wholesale, without the decades of hard-won patterns that distributed systems engineers developed to handle it. Standard agent retry logic assumes operations are idempotent. Most tool calls are not.
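
One of those long-standing patterns is the idempotency key: derive a stable key from the logical operation and refuse to execute the same key twice. The sketch below is illustrative only. `IdempotentExecutor` is a hypothetical name, and the in-memory dictionary stands in for a durable store; in the crash window described above, the key would also need to be sent to the payment provider so that the provider itself can deduplicate.

```python
import hashlib
import json
from typing import Any, Callable, Dict


class IdempotentExecutor:
    """Runs each (task, tool, args) combination at most once and replays the result."""

    def __init__(self) -> None:
        # Assumption: a real deployment would use a durable, shared store here.
        self._results: Dict[str, Any] = {}

    @staticmethod
    def key_for(task_id: str, tool_name: str, args: dict) -> str:
        # sort_keys makes the key independent of argument ordering.
        payload = json.dumps({"task": task_id, "tool": tool_name, "args": args},
                             sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def call(self, task_id: str, tool_name: str, args: dict,
             fn: Callable[..., Any]) -> Any:
        key = self.key_for(task_id, tool_name, args)
        if key in self._results:
            return self._results[key]  # retry: replay, do not re-execute
        result = fn(**args)
        self._results[key] = result
        return result
```

A retry of the same booking step then returns the stored result instead of charging the card a second time, while a genuinely new task with the same arguments still executes.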

Multi-Modal Agents in Production: What Text-Only Evals Never Catch

· 10 min read
Tian Pan
Software Engineer

Most teams building AI agents discover the same thing three months into production: their eval suite—carefully designed around text inputs and JSON outputs—tells them nothing useful about what happens when the agent encounters a blurry invoice, a scanned contract, or a screenshot of a UI it has never seen. The text-only eval passes. The user files a ticket.

Multi-modal inputs aren't just another modality to wire up. They introduce a distinct category of failure that requires different architecture decisions, different cost models, and different eval strategies. Teams that treat vision as a drop-in addition to a working text agent consistently underestimate the effort involved.

The Over-Tooled Agent Problem: Why More Tools Make Your LLM Dumber

· 9 min read
Tian Pan
Software Engineer

When a team at Writer instrumented their RAG-MCP benchmark, they found that baseline tool selection accuracy — with no special handling — was 13.62% when the agent had access to a large set of tools. Not 80%. Not 60%. Thirteen percent. The same agent, with retrieval-augmented tool selection exposing only the most relevant subset, reached 43%. The tools didn't change. The model didn't change. Only the number of tool definitions visible at reasoning time changed.

This is the over-tooled agent problem, and it's quietly wrecking production AI systems at scale.
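
The idea of exposing only a relevant subset can be sketched without any model at all. The example below uses plain word overlap as a stand-in for the embedding-based retrieval a real system would likely use; that substitution, the `select_tools` name, and the sample tool descriptions are all assumptions made for illustration.

```python
from typing import Dict, List


def _tokens(text: str) -> set:
    return set(text.lower().split())


def select_tools(query: str, tool_descriptions: Dict[str, str], k: int = 3) -> List[str]:
    """Return the names of the k tools whose descriptions best overlap the query.

    Word overlap is a deliberately crude relevance score; ties break by name
    so the result is deterministic.
    """
    query_tokens = _tokens(query)
    scored = [
        (len(query_tokens & _tokens(description)), name)
        for name, description in tool_descriptions.items()
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _score, name in scored[:k]]


if __name__ == "__main__":
    tools = {
        "send_email": "send an email message to a recipient",
        "query_database": "run a sql query against the database",
        "create_ticket": "create a support ticket",
    }
    # Only the selected definitions would be placed in the model's context.
    print(select_tools("send an email to the customer", tools, k=1))  # ['send_email']
```

Whatever the scoring method, the design choice is the same: the model reasons over a handful of tool definitions per request rather than the full catalog.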

Prompt Injection Is a Supply Chain Problem, Not an Input Validation Problem

· 9 min read
Tian Pan
Software Engineer

Five carefully crafted documents hidden among a million clean ones can achieve a 90% attack success rate against a production RAG system. Not through zero-days or cryptographic breaks — through plain text that instructs the model to behave differently than its operators intended. If your defense strategy is "sanitize inputs before they reach the LLM," you have already lost.

The framing matters. Teams that treat prompt injection as an input validation problem build perimeter defenses: regex filters, LLM-based classifiers, output scanners. These are useful but insufficient. The real problem is that modern AI systems are compositions of components — retrievers, knowledge bases, tool executors, external APIs — and each component is an ingestion point with its own attack surface. That is the definition of a supply chain vulnerability.

Sandboxing Agents That Can Write Code: Least Privilege Is Not Optional

· 12 min read
Tian Pan
Software Engineer

Most teams ship their first code-executing agent with exactly one security control: API key scoping. They give the agent a GitHub token with repo:read and a shell with access to a working directory, and they call it "sandboxed." This is wrong in ways that become obvious only after an incident.

The threat model for an agent that can write and execute code is categorically different from the threat model for a web server or a CLI tool. The attack surface isn't the protocol boundary anymore — it's everything the agent reads. That includes git commits, documentation pages, API responses, database records, and any file it opens. Any of those inputs can contain a prompt injection that turns your research agent into a data exfiltration pipeline.

Text-to-SQL at Scale: What Nobody Tells You Before Production

· 11 min read
Tian Pan
Software Engineer

Text-to-SQL demos are deceptively easy to build. You paste a schema into a prompt, ask GPT-4 a question, get back a clean SELECT statement, and suddenly your Slack is full of "what if we built this into our data platform?" messages. Then you try to actually ship it. The benchmark says 85% accuracy. Your internal data team reports that about half the answers are wrong. Your security team asks who reviewed the generated queries before they hit production. Nobody has a good answer.

This is the gap between text-to-SQL as a research problem and text-to-SQL as an engineering problem. The research problem is about getting models to produce syntactically valid SQL. The engineering problem is about schema ambiguity, access control, query validation, and the fact that your enterprise database looks nothing like Spider or BIRD.
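
On the query validation side, even a crude gate in front of the database is better than none. The following is a deliberately naive sketch, assuming generated queries should be single read-only statements. `is_read_only_select` is a hypothetical helper; keyword matching of this kind can reject legitimate queries (for example, a string literal containing "delete") and is no substitute for a real SQL parser plus database-level read-only permissions.

```python
import re

# Statements that modify data or schema; checked as whole words, case-insensitively.
_FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|truncate|grant|create)\b",
    re.IGNORECASE,
)
_STARTS_WITH_READ = re.compile(r"^(select|with)\b", re.IGNORECASE)


def is_read_only_select(sql: str) -> bool:
    """Accept only a single statement that starts with SELECT or WITH
    and contains no write or DDL keywords."""
    statement = sql.strip().rstrip(";").strip()
    if not statement:
        return False
    if ";" in statement:
        return False  # more than one statement
    if not _STARTS_WITH_READ.match(statement):
        return False
    return _FORBIDDEN.search(statement) is None
```

A gate like this answers only the narrowest version of the security team's question; schema ambiguity and per-user access control need their own checks.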