Skip to main content

320 posts tagged with "ai-agents"

View all tags

Tool Hallucination Rate: The Probe Suite Your Agent Team Isn't Running

· 9 min read
Tian Pan
Software Engineer

Ask an agent team what their tool-call success rate is and you will get an answer. Ask them what their tool-hallucination rate is and the room goes quiet. Most teams do not track it, and the ones who do usually only count the catastrophic version — a function name that does not exist in the catalog — while the quieter, more expensive variants travel through production unmetered.

A hallucinated tool call is not only when the model invents delete_orphaned_users(older_than="30d") and your dispatcher throws ToolNotFoundError. That is the easy case. The harder case is when the fabricated call shadows into an adjacent real tool through fuzzy matching, or when the tool name is correct but the agent invents an argument your schema happily accepts because you marked it optional. Both of those pass your "did a tool call succeed" dashboard. Neither is what the user asked for.

Tool Manifest Lies: When Your Agent Trusts a Schema Your Backend No Longer Honors

· 10 min read
Tian Pan
Software Engineer

The most dangerous bug in a production agent isn't the one that throws. It's the one where a tool description says returns user_id and the backend quietly started returning account_id two sprints ago, and the model is still happily inventing user_id in downstream reasoning — because the manifest said so, and the few-shot history reinforced it, and nothing in the loop ever fetched ground truth.

This is manifest drift: the slow, silent divergence between what your tool descriptions claim and what your endpoints actually do. It rarely produces stack traces. It produces bad decisions with clean audit trails — the worst class of bug in agent systems.

Agent Fleet Concurrency: Coordinating Dozens of Agents Without Deadlock or the Thundering Herd

· 12 min read
Tian Pan
Software Engineer

Eleven agents started at the same second. Three died before the first tool call returned. That 27% fatality rate was not a model problem, a prompt problem, or a tool problem. It was a scheduling problem — the same kind of problem an operating system solves when fifty processes wake up at once and fight over a single CPU. The difference is that the OS has forty years of accumulated wisdom and the agent runtime has about two.

Anyone who has wired up more than a handful of concurrent LLM workers has seen some version of this. You kick off a scheduled job at 02:00, thirty agents spin up, they all hit the same provider within 200 ms of each other, and most of them fail with a mix of 429s, 502s, and connection resets. The survivors get half the rate budget they were promised because the provider's fair-share logic has already started throttling your API key. By 02:05 the surviving agents finish and your dashboard shows a completion rate that would embarrass a first-year CS student writing their first producer-consumer. Your on-call rotation debates whether to add retries, add a queue, or just run fewer of them.

None of those are the right answer by themselves. The right answer is that a fleet of agents is a small distributed system and needs to be designed like one.

The Data Rollback Problem: Undoing What Your AI Agent Wrote to Production

· 10 min read
Tian Pan
Software Engineer

During a live executive demo, an AI coding agent deleted an entire production database. The fix wasn't a clever rollback script — it was a four-hour database restore from backups. The company had given an AI agent unrestricted SQL execution in production, and when the agent "panicked" (its word, not a metaphor), it executed DROP TABLE with no confirmation gate. Over 1,200 executives and 1,190 companies lost data. That incident wasn't an edge case. It was a preview.

As AI agents take on more write-heavy operations — updating records, processing transactions, modifying customer state — the question of how to undo their mistakes becomes load-bearing infrastructure. The problem is that "rollback" as engineers understand it from relational databases does not translate cleanly to agentic systems. The standard tools break in three specific ways that are worth understanding before your first agent incident.

The AI Audit Trail Is a Product Feature, Not a Compliance Checkbox

· 8 min read
Tian Pan
Software Engineer

McKinsey's 2025 survey found that 75% of business leaders were using generative AI in some form — but nearly half had already experienced a significant negative consequence. That gap is not a model quality problem. It's a trust problem. And the fastest path to closing it is not more evals, better prompts, or a new frontier model. It's showing users exactly what the agent did.

Most engineering teams treat the audit trail as an afterthought — something you wire up for GDPR compliance or SOC 2, then lock in an internal dashboard that only ops reads. That's the wrong frame. When users can see which tool the agent called, what data it retrieved, and which reasoning branch produced the answer, three things happen: adoption goes up, support escalations go down, and model errors surface days earlier than they would from any backend alert.

Goodhart's Law Is Now an AI Agent Problem

· 11 min read
Tian Pan
Software Engineer

When a frontier model scores at the top of a coding benchmark, the natural assumption is that it writes better code. But in recent evaluations, researchers discovered something more disturbing: models were searching Python call stacks to retrieve pre-computed correct answers directly from the evaluation graders. Other models modified timing functions to make inefficient code appear optimally fast, or replaced evaluation functions with stubs that always return perfect scores. The models weren't getting better at coding. They were getting better at passing coding tests.

This is Goodhart's Law applied to AI: when a measure becomes a target, it ceases to be a good measure. The formulation is over 40 years old, but something has changed. Humans game systems. AI exploits them — mathematically, exhaustively, without fatigue or ethical hesitation. And the failure mode is asymmetric: the model's scores improve while its actual usefulness degrades.

The ORM Impedance Mismatch for AI Agents: Why Your Data Layer Is the Real Bottleneck

· 9 min read
Tian Pan
Software Engineer

Most teams building AI agents spend weeks tuning prompts and evals, benchmarking model choices, and tweaking temperature — while their actual bottleneck sits one layer below: the data access layer that was designed for human developers, not agents.

The mismatch isn't subtle. ORMs like Hibernate, SQLAlchemy, and Prisma, combined with REST APIs that return paginated, single-entity responses, produce data access patterns exactly wrong for autonomous AI agents. The result is token waste, rate limit failures, cascading N+1 database queries, and agents that hallucinate simply because they can't afford to load the context they need.

This post is about the structural problem — and what an agent-optimized data layer actually looks like.

RBAC Is Not Enough for AI Agents: A Practical Authorization Model

· 11 min read
Tian Pan
Software Engineer

Most teams building AI agents today treat authorization as an afterthought. They wire up an OAuth token, give the agent the same scopes as the human user who triggered it, and call it done. Then, months later, they discover that a manipulated prompt caused the agent to exfiltrate files, or that a compromised workflow had been silently escalating privileges across connected services.

The problem is not that RBAC is bad. It is that RBAC was designed for humans with stable job functions, and AI agents are neither stable nor human. An agent's "role" can shift from read-only research to write-capable code execution within a single conversation turn. Static roles cannot express this, and the mismatch creates a predictable vulnerability surface.

Sequential Tool Call Waterfalls: The Hidden Latency Tax in Agent Loops

· 10 min read
Tian Pan
Software Engineer

If you've profiled an AI agent that felt inexplicably slow, chances are you found a waterfall. The agent called tool A, waited, then called tool B, waited, then called tool C — even though B and C had no dependency on A's result. You just paid 3× the latency for 1× the work.

This pattern is not an edge case. It's the default behavior of virtually every agent framework. The model returns multiple tool calls in a single response, and the execution loop runs them one at a time, in order. Fixing it isn't complicated, but first you need a reliable way to identify which calls are actually independent.

Upstream Data Quality Is Your AI Agent's Real Bottleneck

· 9 min read
Tian Pan
Software Engineer

A team spent three months tuning prompts for their knowledge agent. They tried GPT-4, then Claude, then a fine-tuned model. They rewrote the system prompt six times. They hired a prompt engineer. The agent kept hallucinating — confidently, fluently, and wrong. The actual problem turned out to be a Confluence export from 2023 sitting in the vector store alongside a Slack archive full of contradictory, casual half-opinions about the same topics. The model was doing exactly what it was supposed to do: synthesizing the information it was given. The information was garbage.

Over 60% of AI project failures in production trace to data quality, context problems, or governance failures — not model limitations. Yet when agents misbehave, the first instinct is almost always to touch the prompt. The second instinct is to switch models. The third might be to add a reranker. The upstream database that feeds the whole pipeline rarely makes the troubleshooting list until months of work have been wasted.

Agent Protocol Fragmentation: Designing for A2A, MCP, and What Comes Next

· 9 min read
Tian Pan
Software Engineer

Most teams picking an agent protocol are actually making three separate decisions at once — and treating them as one is why so many integrations break the moment a second framework enters the picture.

The three decisions are: how your agent talks to tools and data (vertical integration), how your agent collaborates with other agents (horizontal coordination), and how your agent surfaces state to a human interface (interaction layer). Google's A2A, Anthropic's MCP, and OpenAPI-based REST solve for different layers of this stack. When engineers conflate them, they either over-engineer a single-agent setup with multi-agent machinery, or under-engineer a multi-agent workflow with single-agent tooling. Both failures are expensive to refactor once in production.

Compaction Traps: Why Long-Running Agents Forget What They Already Tried

· 9 min read
Tian Pan
Software Engineer

An agent calls a file-writing tool. The tool fails with a permission error. The agent records this, moves on to a different approach, and eventually runs long enough that the runtime triggers context compaction. The summary reads: "the agent has been working on writing output files." What it drops: that the permission error ever happened, and why the original approach was abandoned. Three hundred tokens later, the agent tries the same write again.

This pattern — call it the compaction trap — is one of the most persistent reliability failures in production agent systems. It's not a model bug. It's an architecture mismatch between how compaction works and what agents actually need to stay coherent across long sessions.