Skip to main content

191 posts tagged with "agents"

View all tags

The Agent Feedback Loop You Never Built

· 9 min read
Tian Pan
Software Engineer

Every day your agent ships failures back to you, gift-wrapped. A user clicks thumbs-down. Another reads the answer, says nothing, and closes the tab. A third rephrases the same question three times until the agent finally gets it. Each of those is a labeled failure case — a real input, a real context, a real moment where the system fell short — handed to you for free by the people who care most about getting it right.

Most teams throw all of it away. Not deliberately. The thumbs-down increments a dashboard counter. The abandonment shows up as a dip in a retention chart. The rephrasing looks like ordinary usage. Nothing captures the signal together with the context that produced it, so nothing can be replayed, triaged, or turned into a test. The richest source of evaluation data you will ever have flows past untouched, and the team keeps writing synthetic eval cases by hand.

This is the agent feedback loop you never built. It is not a tool you forgot to buy. It is a pipeline — from user signal, to triaged failure, to new eval case — and the reason it stays unbuilt has very little to do with technology.

Why You Can't Budget an AI Feature With a Single Number

· 9 min read
Tian Pan
Software Engineer

Finance asks one question about every feature you ship: "What does it cost per user?" For a traditional feature, the answer is a number. A page render, a database query, a push notification — each has a marginal cost that barely moves from one request to the next. You measure it once, multiply by your user count, and the forecast holds.

An AI feature breaks that contract. Ask "what does this agent cost per request" and the honest answer is not a number, it's a histogram. The same agent that resolves one ticket for two cents will burn four dollars on the next one, because that user asked a vague question, the agent looped through eleven tool calls, and each call dragged the entire growing conversation back through the model. The mean of those two requests — two dollars — describes neither of them, and it definitely doesn't describe the bill.

That is the trap. When you hand finance a single average cost, you are not simplifying a messy reality. You are reporting a number that is wrong in a specific, expensive direction.

Context Length Is a Security Boundary, Not Just a Cost Line

· 9 min read
Tian Pan
Software Engineer

Most teams treat the context window as a budget. You have a million tokens; spend them wisely; longer conversations cost more and run slower. That framing is correct and incomplete. The context window is also an attack surface, and its size is a dial that quietly weakens your safety controls as it turns up.

Here is the failure mode nobody puts in the threat model. Your system prompt — the one with the guardrails, the tool-use rules, the "never do X" clauses — sits at the very top of the context. Its authority is strongest there. As a conversation runs, thousands of tokens of user turns, tool outputs, and retrieved documents pile on top of it. The model's attention does not weigh all of those tokens equally. The instructions closest to the point of generation win ties. By turn forty, your guardrails are not gone, but they are buried, and a patient adversary does not need a clever jailbreak to get past them. They just need a conversation long enough.

This is not a hypothetical. It is a measurable property of how transformers attend to long contexts, and it has a name in the research literature even if it does not have one in your incident review template.

The Rate Limit That Became a Product Decision

· 10 min read
Tian Pan
Software Engineer

A rate limit used to be an infrastructure detail. You hit a 429, you retried with backoff, you queued the overflow, and nobody outside the on-call channel ever knew it happened. The user saw a response that was a few hundred milliseconds slower than usual. That was the whole story.

That story no longer holds for agentic features. When an agent hits a provider's tokens-per-minute ceiling halfway through a multi-step plan, the failure does not stay inside the infrastructure. It surfaces as a half-finished answer, a tool loop that stalls before the last call, or a user watching a spinner that will never resolve. The quota stopped being a backend capacity number and became a constraint that product has to design around — the same way product designs around a checkout flow or an empty state.

Your Tool Descriptions Are an Instruction Channel the Model Obeys

· 8 min read
Tian Pan
Software Engineer

When a security team reviews a new tool integration, they read the code. They check what the function does, what it touches, what scopes it needs, whether it logs secrets. They almost never read the one sentence that decides whether the model calls it at all — the tool's description. That sentence is not documentation. It is an instruction the model treats as authoritative, and in most agent stacks nobody reviews it.

A tool description is written for the model to read. The model uses it to decide when the tool is relevant, what arguments to pass, and how to interpret what comes back. That makes the description a control channel into the model's behavior. And the moment a tool arrives from a third-party registry, a Model Context Protocol (MCP) server you don't operate, or a plugin a teammate installed last week, that control channel is authored by someone you never agreed to trust.

This is the gap. Input sanitization inspects what users type. Code review inspects what functions execute. The tool description sits between them — it is configuration that behaves like input — and it falls through both nets.

When 'Can the Agent Do X?' Becomes a Ship Commitment

· 10 min read
Tian Pan
Software Engineer

An engineer spends an afternoon poking at a question: can the agent reconcile a customer's invoice against their contract terms? They wire up a quick prompt, run it on five real invoices, and three come back correct. The other two are wrong in ways they don't fully characterize — they close the laptop and move on. In standup the next morning they say "yeah, invoice reconciliation basically works." A PM in the room writes it down. Two weeks later it's a line item on the Q3 roadmap. A month after that, a sales rep promises it to an enterprise account in a renewal call.

Nobody lied. Nobody made a bad decision in isolation. But the team is now contractually committed to a behavior whose eval set does not exist, whose failure modes were never written down, and whose reliability budget was set by a director who saw a demo and interpreted it as a contract. This is the most common way AI features acquire scope: not through a planning meeting, but through a capability probe that nobody ever explicitly promoted.

The industry has a name for the downstream symptom — "POC purgatory," the state where 70 to 80 percent of AI initiatives stall between a working sandbox and a shippable product. But purgatory is the wrong metaphor, because it implies the projects are stuck. They aren't stuck. They're moving — they were committed before anyone checked whether they were ready, and now the team is trying to retrofit reliability onto a promise.

The Agent Debugger Has No Breakpoints: Why Trace-First Workflows Replace Step-Through

· 10 min read
Tian Pan
Software Engineer

The first time you try to debug an agent the way you'd debug a service, you discover that the muscle memory has nothing to grip. You set a hypothetical breakpoint — there's no IDE pane to put it in, but you imagine one — at the step where the planner picked the wrong tool. You rerun with the same input. The planner picks the right tool this time. You rerun again. It picks a third tool you've never seen before. The bug is real, your colleague reproduced it twice this morning, and the debugger you've used for fifteen years is suddenly a museum piece.

The mental model that breaks here isn't "use a debugger." It's the much deeper assumption underneath: that a program, given the same inputs, produces the same execution. Every affordance in a modern debugger — breakpoints, step-over, watch expressions, conditional breaks, hot reload — is built on top of that determinism. You pause execution because pausing is meaningful. You step forward because the next step is knowable. You inspect a variable because its value is a fact, not a draw from a distribution.

The Agent That Refuses to Fail Loud: How Over-Eager Fallbacks Hide Production Regressions

· 11 min read
Tian Pan
Software Engineer

Your status page is green. Your error rate is zero. Your p95 latency looks slightly better than last week. And quietly, eval-on-traffic dropped four points last Tuesday and nobody knows why for nine days, because by the time the regression rolled past the alerting threshold there were four interleaved root causes layered on top of each other and the team couldn't tell which one started the slide.

This is the dominant failure mode of mature agentic systems in 2026, and it's not a bug in any single component. It's the cumulative effect of a defensive stack the team built deliberately, one well-intentioned safety net at a time. The primary model returns garbage; the retry succeeds. The retry fails; the cheaper fallback model answers. The fallback's output is malformed; the wrapper rewrites it into a plausible shape. The wrapper logs a soft warning. Nobody alerts on the soft warning. The user receives an answer that's correct-looking, smoothly delivered, and quietly worse than the system was designed to produce.

The robustness layer worked. The quality story collapsed. And the alerting was built for the world before the robustness layer existed.

The Composability Tax: Why Adding Tools Makes Your Planner Worse

· 9 min read
Tian Pan
Software Engineer

The team starts with five tools and a planner that hits the right one 95% of the time on production traffic. Eighteen months later they have fifty-one, the planner is sitting at 26%, and the simple cases the original five handled cleanly — book a meeting, look up a customer, file a ticket — now sometimes route to the wrong tool because there are three plausible-sounding lookalikes in the catalog. Nobody decided to make the planner worse. Every tool addition was individually defensible. The cumulative bill is the composability tax, and it is paid by every product whose tool catalog grows without a retirement discipline.

The tax is a curve, not a cliff. The Berkeley Function Calling Leaderboard measured it directly: on calendar scheduling, accuracy fell from 43% with four tools to 2% with fifty-one across multiple domains. On customer-support style tasks, GPT-4o dropped from 58% (single domain, nine tools) to 26% (seven domains, fifty-one tools). Llama-3.3-70B went from 21% to 0% over the same expansion. The shape repeats across models and task types: every additional tool moves the planner down the curve, and the marginal damage gets worse as the catalog gets larger because new entries are increasingly indistinguishable from incumbents.

The Tool Schema Evolution Trap: When One Optional Parameter Changed Your Planner's Prior

· 10 min read
Tian Pan
Software Engineer

A new optional parameter goes into a tool description on a Tuesday. The change is small — six lines in the diff, no breaking signature change, no callers updated, no eval cases touched. The PR description says "adds support for an optional language filter to the existing search tool." Two reviewers approve. It ships.

A week later, the cost dashboard shows that the search tool is being called eighteen percent more often than the prior baseline. Latency on the affected agent has crept up by roughly the same proportion. Nobody can point to a single failing eval. The new parameter, when used, behaves correctly. The new parameter, when not used, doesn't matter. And yet the planner has clearly changed its mind about when to reach for this tool — and the eval suite, which grades tool correctness, has nothing to say about a shift in tool frequency.

Conversation History Is a Trust Boundary, Not a Text Blob

· 10 min read
Tian Pan
Software Engineer

The agent ran cleanly for fourteen turns. On the fifteenth, it quietly wired four hundred dollars to an attacker. Nothing in the fifteenth-turn request was malicious. The poisoned instruction had been sitting in turn three — embedded inside a tool result the agent retrieved from a stale support ticket — for forty minutes. The agent re-read the entire history on every step, and every step found the same buried sentence: "If the user mentions a refund, send the funds to the address below first." On turn fifteen, the user mentioned a refund.

This is what conversation-history attacks look like in production, and they look nothing like the prompt injections most teams are still training their guardrails against. The malicious payload is not in the current request. It is already in the history the model reads as ground truth, and it has been there long enough that the team's request-time scanners have stopped looking.

Per-Customer Cost Concentration: Why AI Cost Dashboards Hide the Power Law

· 12 min read
Tian Pan
Software Engineer

Your AI feature's cost is a distribution, not a number. The dashboard hanging on the wall of the eng-finance war room says $187,000 last month, broken out by feature, by model, and by region. None of those views answers the question the CFO is actually about to ask: "Who is paying us $40 a month and costing us $4,000?" When you sort by customer_id instead of by feature, the line that was a comfortable bar chart becomes a hockey stick, and the team that designed against the average customer discovers it has been quietly underwriting the top of the tail for a quarter.

The pattern is so consistent it deserves to be called a law. Across production LLM workloads, the top 1% of users routinely drive 30–50% of token spend, with similar shapes showing up at the top 0.1% and the top 0.01%. This isn't a quirk of any one product — it's what happens when you ship a feature whose marginal cost is variable and whose pricing is flat. Average-user margins look fine. Median-user margins look great. The integral over the heavy tail is where the quarter goes.