Skip to main content

311 posts tagged with "ai-agents"

View all tags

Halted Is Not a Status: Why Agents Need a Typed Terminal-Reason Protocol

· 10 min read
Tian Pan
Software Engineer

Open the dashboard for an agent fleet and you will see a clean number: completion rate, 94%. Below it, a list of runs, each tagged with one of two states — running, or not running. The 6% that are "not running" all look identical. Some of them finished the task perfectly. Some of them hit a step limit two actions short of done. Some of them caught a tool error and gave up. Some of them decided the task was impossible — correctly. And some of them simply lost the thread and stopped emitting tokens.

Your monitoring cannot tell these apart. It knows the process is no longer running. It does not know why, and "why" is the only thing that matters when you are deciding whether to page someone.

The Undo Button Your Agent Assumes Exists

· 9 min read
Tian Pan
Software Engineer

Watch an agent reason through a multi-step task and you will notice something familiar: it plans the way you debug. Try an approach, look at the result, and if it is wrong, back out and try another. The agent talks about its plan as a tree of options it can explore, prune, and revisit. That mental model is correct inside a code sandbox, where every action has an implicit undo. It is dangerously wrong the moment the agent touches the world.

A sent email does not unsend. A charged card does not uncharge without a refund flow, a fee, and a customer who already saw the notification. A deleted row is gone unless someone wired up soft deletes. A posted Slack message has already been read. The agent's planning model has no native concept of the one-way door — the action that, once taken, removes the option of pretending it never happened.

This is not a model intelligence problem. A smarter model still does not know which of your tools is reversible, because reversibility is not a property of the action. It is a property of the system the action lands in. You have to tell it.

The Vector Index Has a Staleness SLO Nobody Set

· 10 min read
Tian Pan
Software Engineer

A user asks your agent what the current price tier is for an enterprise plan. The agent retrieves a chunk, reads it, and answers: "$2,000 per month." Confident, sourced, formatted nicely. The problem is that pricing changed four days ago. The number the agent quoted was true last week. The chunk it retrieved was embedded before the change, and the index has not caught up.

Nobody decided this would happen. There was no design review where someone said "the agent may answer from data up to four days old." There is just a re-indexing job that runs nightly, or weekly, and a content team that edits the source whenever they feel like it, and a gap between those two clocks that nobody measures. That gap is a service level objective. It exists whether or not you wrote it down. The only question is whether you set it on purpose or inherited it by accident.

When No One Answers the Escalation: Human-in-the-Loop Is a Staffing Problem

· 10 min read
Tian Pan
Software Engineer

Every agent architecture diagram has a box labeled "escalate to human." It is drawn with a clean arrow, it satisfies the reviewer, and it makes the system feel safe. What the diagram never shows is the person on the other end of that arrow — whether they exist, whether they are awake, and whether they will answer before the agent's patience runs out.

Human-in-the-loop is sold as a design pattern. In production it behaves like a staffing problem. The pattern assumes a human is standing by; the staffing reality is that escalations do not arrive when humans are available — they arrive on their own schedule. A burst at 2am when an overnight batch job trips a guardrail. A long tail through lunch when half the reviewers are away from their desks. A steady drip that quietly outgrows the two-person team that looked sufficient during the demo, when the agent handled ten requests a day instead of ten thousand.

The gap between "we have an escalation path" and "escalations get answered" is where agentic systems fail in ways no eval catches. The eval measures whether the agent escalates correctly. It never measures whether anyone was there.

Who Gets Paged When the Agent Is Wrong: On-Call for Non-Deterministic Systems

· 9 min read
Tian Pan
Software Engineer

The on-call rotation was built around a promise: failures reproduce. An alert fires, you re-run the request, you watch the bug happen, you find the bad commit, you roll back the deploy. Every part of that loop assumes determinism. The same input produces the same output, and the output is either right or wrong in a way you can stare at.

An agent fleet quietly breaks every link in that chain. The failure happened once, at a sampling temperature you can't replay, on a context window that has since been garbage-collected. There is no bad commit, because the code never changed — the model did, or the retrieved documents did, or the user phrased the request in a way nobody anticipated. You roll back the deploy and the deploy was never the problem.

So the page goes out, an engineer picks it up, and they discover the most uncomfortable fact about operating agents in production: they have been handed a system they cannot single-step, and the runbook in front of them was written for a different kind of machine.

Who Pays for the Model's Mistake: Designing Liability Into Agent Products

· 9 min read
Tian Pan
Software Engineer

An agent books the wrong flight. It sends an apology email to the wrong customer. It writes a database migration that drops a column three services still read from. In each case the model produced a plausible-looking action, executed it, and moved on. And in each case somebody absorbed a real cost — a rebooking fee, a damaged relationship, an incident bridge at 2 a.m.

Here is the uncomfortable part: most AI products have no answer for who that somebody is. The question never comes up in the design review. It surfaces later, one ticket at a time, in a support queue where an agent improvised a $40 credit because the customer sounded angry and the rep had no policy to point at. Multiply that by a few thousand tickets a month and the unit economics quietly rot — not from a dramatic failure, but from a slow leak nobody scoped.

"The model made a mistake" is not a support escalation. It is a billing event. And the products that survive the agentic era will be the ones that designed for that event before the first angry ticket, not the ones that improvised refunds by vibes until the gross margin went negative.

Agent Memory Is a Cache With No Invalidation Policy

· 9 min read
Tian Pan
Software Engineer

Every agent framework now ships "long-term memory" as a headline feature, and every team adopts it as an unambiguous good. The agent remembers the user's preferences, prior decisions, project context, and the corrections it was given last week, so each session starts warmer than the last. The demo is irresistible: a user says "set up the project the way I like it" and the agent just does it. Nobody asks the obvious question, because the framing of the feature actively discourages it.

The question is: when does any of that stop being true?

A memory store is a cache. It holds facts about a world that does not hold still. The agent recorded "the user prefers Postgres" eight months ago, and the team has since migrated to a different database. The agent remembers "the user is on the growth team," and the user changed roles in March. The agent stored a tidy summarized conclusion from a conversation whose premises were corrected two messages later. And the memory layer surfaces all of it with exactly the same confident freshness as a fact written this morning. We have spent fifty years learning that a cache without an invalidation policy is a correctness bug. Then we built agent memory and shipped it without one.

Latency-Aware Tool Selection: When 'Good Enough Now' Beats 'Best Available Later'

· 10 min read
Tian Pan
Software Engineer

The tool description in your agent's system prompt is a six-month-old eval artifact. It says search_pricing returns "fresh inventory data with structured pricing" and the planner believes it, because nothing in the prompt has updated since the day the description was tuned. The actual search_pricing endpoint has been sitting at p95 of 11 seconds for the last forty minutes because the upstream vendor is rate-limiting your account, and the cheaper search_cache tool — which the prompt describes as "may be slightly stale" — would return the same answer in 200ms. The planner picks search_pricing anyway, because the description still reads like it did during eval, and the planner has no signal about what either tool costs to call right now.

This is the structural failure of static tool descriptions. The planner is making routing decisions on a snapshot of a world that has moved on. Tool selection isn't really a capability question — most production agents have two or three tools that overlap heavily in what they can answer — it's a cost-of-waiting question, and the cost of waiting is the thing your prompt template doesn't see.

The Prompt Graph Inside Your Agent: Cross-Prompt Regression Chains Nobody Mapped

· 11 min read
Tian Pan
Software Engineer

A senior engineer ships a four-word edit to the planner prompt — "if uncertain, ask first." The planner's own eval set, which grades whether plans are reasonable, moves up by half a point. They merge. Two weeks later, the verifier's eval shows a three-point pass-rate regression and nobody can repro it. The root cause turns out to be that the planner now asks more clarifying questions, the executor receives shorter task descriptions on the second turn, the verifier's rubric was implicitly tuned against the previous executor's longer outputs, and an edit nobody flagged as risky has shifted three downstream distributions at once.

This is what happens when you treat the prompts inside an agent as a flat folder of files instead of as a graph with edges. The prompts have owners. The edges between them have nobody.

The Second-Draft Agent Pattern: Why Explore-Then-Commit Beats Self-Critique

· 12 min read
Tian Pan
Software Engineer

When a single-pass agent stops being good enough, the default move is to wrap it in a self-critique loop. Generate, critique, revise, repeat. Most teams I talk to assume the eval lift will be roughly linear with the number of revision rounds and stop there. The numbers rarely cooperate. By the third round of self-critique, accuracy is up two or three points and token cost is up 3–4x, and the failure modes that didn't get caught in round one mostly don't get caught in round three either — because the same context that produced the wrong answer is the one being asked to spot the wrongness.

A different shape works better and costs less: let the first pass be wasteful exploration, throw it away, and run a second pass from a clean context with just the lessons learned. Call it the second-draft pattern, or explore-then-commit. The first draft is permitted to be sloppy, to take dead ends, to dump scratch artifacts, to chase hypotheses that turn out to be wrong. The second draft is constrained — it gets the distilled findings and produces a clean execution. On the kinds of tasks where self-critique is tempting (multi-step reasoning, code that touches several files, research syntheses), this two-pass shape often beats n-of-k self-critique on both quality and cost.

Agent Circuit Breakers: Why Step Budgets Are Fuses, Not Breakers

· 12 min read
Tian Pan
Software Engineer

Every team that ships agents to production eventually wakes up to the same kind of incident. An agent enters a state it cannot exit. It re-calls the same tool with cosmetically different arguments for six hours. It oscillates between two plans whose preconditions reject each other. It retries a transient 429 every two hundred milliseconds until morning. It generates a million-token plan it never executes. By the time anyone notices, the token bill is four figures, the downstream API is rate-limited, the customer's session has timed out twelve times, and the on-call engineer is being paged by three different alerts about the same root cause.

The first fix every team reaches for is a step-count budget. Cap the agent at twenty iterations. Cap it at fifty. Pick a number and ship. The step budget makes the incident reports stop, but it does not make the underlying problem go away — and once you understand the mechanism, you can see why a step budget is the agent equivalent of a household fuse: it blows after the damage has been done, the fuse box itself is now a maintenance burden, and the next time something melts, your reflex is to swap in a higher-rated fuse rather than ask what is actually shorting.

Agent Memory Is a Compliance Surface: The Records-Management System You Didn't Sign Up to Build

· 12 min read
Tian Pan
Software Engineer

The first compliance escalation against your agent memory layer almost never arrives as a regulator's letter. It arrives as a Jira ticket from your enterprise sales engineer that says "the customer's privacy team is blocking the contract — they want to know what 'forget my user' actually means in your system, and they want a written answer by Friday." That ticket lands six to twelve months after the memory layer shipped, and the engineering team that built it discovers, in the time it takes to read the question, that they accidentally built a records-management system without any of the primitives a records-management system is supposed to have.

This is the structural problem with long-term memory in agentic products. The team building it optimizes for the things memory is sold to do — retrieval quality, latency, storage cost, the felt-personalization that makes the assistant feel like it knows the user. Nobody in the design review prices the parallel system being built at the same time: a per-user, per-tenant, multi-region data store with retention obligations, deletion semantics, audit export requirements, and a regulator's clock that starts the moment the first user's data lands in it. Memory is not a feature. It is the operational surface that every privacy regime, every enterprise procurement questionnaire, and every right-to-erasure request will eventually find.