320 posts tagged with "ai-agents"

When No One Answers the Escalation: Human-in-the-Loop Is a Staffing Problem

· 10 min read
Tian Pan
Software Engineer

Every agent architecture diagram has a box labeled "escalate to human." It is drawn with a clean arrow, it satisfies the reviewer, and it makes the system feel safe. What the diagram never shows is the person on the other end of that arrow — whether they exist, whether they are awake, and whether they will answer before the agent's patience runs out.

Human-in-the-loop is sold as a design pattern. In production it behaves like a staffing problem. The pattern assumes a human is standing by; the staffing reality is that escalations do not arrive when humans are available. They arrive on their own schedule. A burst at 2 a.m. when an overnight batch job trips a guardrail. A long tail through lunch when half the reviewers are away from their desks. A steady drip that quietly outgrows the two-person team that looked sufficient during the demo, when the agent handled ten requests a day instead of ten thousand.

The gap between "we have an escalation path" and "escalations get answered" is where agentic systems fail in ways no eval catches. The eval measures whether the agent escalates correctly. It never measures whether anyone was there.

Who Gets Paged When the Agent Is Wrong: On-Call for Non-Deterministic Systems

· 9 min read
Tian Pan
Software Engineer

The on-call rotation was built around a promise: failures reproduce. An alert fires, you re-run the request, you watch the bug happen, you find the bad commit, you roll back the deploy. Every part of that loop assumes determinism. The same input produces the same output, and the output is either right or wrong in a way you can stare at.

An agent fleet quietly breaks every link in that chain. The failure happened once, at a sampling temperature you can't replay, on a context window that has since been garbage-collected. There is no bad commit, because the code never changed — the model did, or the retrieved documents did, or the user phrased the request in a way nobody anticipated. You roll back the deploy and the deploy was never the problem.

So the page goes out, an engineer picks it up, and they discover the most uncomfortable fact about operating agents in production: they have been handed a system they cannot single-step, and the runbook in front of them was written for a different kind of machine.

Who Pays for the Model's Mistake: Designing Liability Into Agent Products

· 9 min read
Tian Pan
Software Engineer

An agent books the wrong flight. It sends an apology email to the wrong customer. It writes a database migration that drops a column three services still read from. In each case the model produced a plausible-looking action, executed it, and moved on. And in each case somebody absorbed a real cost — a rebooking fee, a damaged relationship, an incident bridge at 2 a.m.

Here is the uncomfortable part: most AI products have no answer for who that somebody is. The question never comes up in the design review. It surfaces later, one ticket at a time, in a support queue where an agent improvised a $40 credit because the customer sounded angry and the rep had no policy to point at. Multiply that by a few thousand tickets a month and the unit economics quietly rot — not from a dramatic failure, but from a slow leak nobody scoped.

"The model made a mistake" is not a support escalation. It is a billing event. And the products that survive the agentic era will be the ones that designed for that event before the first angry ticket, not the ones that improvised refunds by vibes until the gross margin went negative.

Agent Memory Is a Cache With No Invalidation Policy

· 9 min read
Tian Pan
Software Engineer

Every agent framework now ships "long-term memory" as a headline feature, and every team adopts it as an unambiguous good. The agent remembers the user's preferences, prior decisions, project context, and the corrections it was given last week, so each session starts warmer than the last. The demo is irresistible: a user says "set up the project the way I like it" and the agent just does it. Nobody asks the obvious question, because the framing of the feature actively discourages it.

The question is: when does any of that stop being true?

A memory store is a cache. It holds facts about a world that does not hold still. The agent recorded "the user prefers Postgres" eight months ago, and the team has since migrated to a different database. The agent remembers "the user is on the growth team," and the user changed roles in March. The agent stored a tidy summarized conclusion from a conversation whose premises were corrected two messages later. And the memory layer surfaces all of it with exactly the same confident freshness as a fact written this morning. We have spent fifty years learning that a cache without an invalidation policy is a correctness bug. Then we built agent memory and shipped it without one.
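One way to picture an invalidation policy for agent memory is a freshness window attached to every stored fact. The sketch below is illustrative only: `MemoryRecord`, `recall`, and the per-record `ttl` field are hypothetical names, not taken from any agent framework, and a time-based expiry is just one possible policy.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class MemoryRecord:
    key: str
    value: str
    written_at: datetime
    ttl: timedelta  # hypothetical freshness window chosen per fact

    def is_stale(self, now: datetime) -> bool:
        # A record is stale once it is older than its freshness window.
        return now - self.written_at > self.ttl


def recall(store: dict[str, MemoryRecord], key: str, now: datetime) -> Optional[str]:
    """Return a remembered value only if it is still inside its freshness window."""
    record = store.get(key)
    if record is None or record.is_stale(now):
        return None  # caller should re-confirm with the user instead of trusting it
    return record.value
```

With a 90-day window, a preference recorded eight months ago comes back as `None` rather than being served with the same confidence as a fact written this morning.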

Latency-Aware Tool Selection: When 'Good Enough Now' Beats 'Best Available Later'

· 10 min read
Tian Pan
Software Engineer

The tool description in your agent's system prompt is a six-month-old eval artifact. It says search_pricing returns "fresh inventory data with structured pricing" and the planner believes it, because nothing in the prompt has updated since the day the description was tuned. The actual search_pricing endpoint has been sitting at p95 of 11 seconds for the last forty minutes because the upstream vendor is rate-limiting your account, and the cheaper search_cache tool — which the prompt describes as "may be slightly stale" — would return the same answer in 200ms. The planner picks search_pricing anyway, because the description still reads like it did during eval, and the planner has no signal about what either tool costs to call right now.

This is the structural failure of static tool descriptions. The planner is making routing decisions on a snapshot of a world that has moved on. Tool selection isn't really a capability question — most production agents have two or three tools that overlap heavily in what they can answer — it's a cost-of-waiting question, and the cost of waiting is the thing your prompt template doesn't see.
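A minimal sketch of what a cost-of-waiting signal could look like, assuming the planner is handed a recently measured p95 latency for each tool. `ToolStats`, `pick_tool`, the `quality` score, and the latency budget are all hypothetical; the tool names and latencies mirror the example above.

```python
from dataclasses import dataclass


@dataclass
class ToolStats:
    name: str
    quality: float        # hypothetical 0..1 score from an offline eval
    p95_latency_s: float  # measured recently, not at eval time


def pick_tool(tools: list[ToolStats], latency_budget_s: float) -> ToolStats:
    """Prefer the best tool that fits the latency budget; otherwise the fastest one."""
    within_budget = [t for t in tools if t.p95_latency_s <= latency_budget_s]
    if within_budget:
        return max(within_budget, key=lambda t: t.quality)
    return min(tools, key=lambda t: t.p95_latency_s)


tools = [
    ToolStats("search_pricing", quality=0.95, p95_latency_s=11.0),
    ToolStats("search_cache", quality=0.85, p95_latency_s=0.2),
]
choice = pick_tool(tools, latency_budget_s=2.0)  # selects search_cache
```

The quality scores here are made up for illustration. The point is only that the decision takes live latency as an input, which a static description cannot supply.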

The Prompt Graph Inside Your Agent: Cross-Prompt Regression Chains Nobody Mapped

· 11 min read
Tian Pan
Software Engineer

A senior engineer ships a four-word edit to the planner prompt: "if uncertain, ask first." The planner's own eval set, which grades whether plans are reasonable, moves up by half a point. They merge. Two weeks later, the verifier's eval shows a three-point pass-rate regression and nobody can reproduce it. The root cause turns out to be a chain: the planner now asks more clarifying questions, so the executor receives shorter task descriptions on the second turn, and the verifier's rubric was implicitly tuned against the previous executor's longer outputs. An edit nobody flagged as risky has shifted three downstream distributions at once.

This is what happens when you treat the prompts inside an agent as a flat folder of files instead of as a graph with edges. The prompts have owners. The edges between them have nobody.
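Treating prompts as a graph can be as simple as recording which prompt's output feeds which, then walking the edges when one changes. The sketch below assumes a hypothetical adjacency map (`edges`) for the planner, executor, and verifier described above; `downstream` is an illustrative helper, not part of any tool.

```python
from collections import deque


def downstream(edges: dict[str, list[str]], changed: str) -> list[str]:
    """Breadth-first walk: every prompt whose inputs can shift when `changed` is edited."""
    seen = {changed}
    order: list[str] = []
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


# Hypothetical edge map: each key's output is consumed by the listed prompts.
edges = {"planner": ["executor"], "executor": ["verifier"]}
affected = downstream(edges, "planner")  # ["executor", "verifier"]
```

Under this sketch, an edit to the planner would flag the executor and verifier eval sets for a re-run, not just the planner's own.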

The Second-Draft Agent Pattern: Why Explore-Then-Commit Beats Self-Critique

· 12 min read
Tian Pan
Software Engineer

When a single-pass agent stops being good enough, the default move is to wrap it in a self-critique loop. Generate, critique, revise, repeat. Most teams I talk to assume the eval lift will be roughly linear with the number of revision rounds and stop there. The numbers rarely cooperate. By the third round of self-critique, accuracy is up two or three points and token cost is up 3–4x, and the failure modes that didn't get caught in round one mostly don't get caught in round three either — because the same context that produced the wrong answer is the one being asked to spot the wrongness.

A different shape works better and costs less: let the first pass be wasteful exploration, throw it away, and run a second pass from a clean context with just the lessons learned. Call it the second-draft pattern, or explore-then-commit. The first draft is permitted to be sloppy, to take dead ends, to dump scratch artifacts, to chase hypotheses that turn out to be wrong. The second draft is constrained — it gets the distilled findings and produces a clean execution. On the kinds of tasks where self-critique is tempting (multi-step reasoning, code that touches several files, research syntheses), this two-pass shape often beats n-of-k self-critique on both quality and cost.
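The two-pass shape can be sketched as three model calls, assuming the model is exposed as a plain function from prompt to text. `Model`, `second_draft`, and the prompt wording are hypothetical; the one property the sketch tries to show is that the commit pass sees the task and the distilled lessons, never the raw exploration.

```python
from typing import Callable

Model = Callable[[str], str]  # hypothetical interface: prompt in, text out


def second_draft(task: str, model: Model) -> str:
    # Pass 1: exploration, allowed to be sloppy and wasteful.
    exploration = model(f"Explore freely, dead ends allowed:\n{task}")
    # Distill: keep only the findings, not the scratch work.
    lessons = model(f"List only the lessons learned from this attempt:\n{exploration}")
    # Pass 2: clean context containing the task and lessons, not the exploration.
    return model(
        f"Task:\n{task}\n\n"
        f"Lessons from a discarded first draft:\n{lessons}\n\n"
        "Produce the final answer."
    )
```

In a real system each call would presumably be a separate conversation, so nothing from the first draft leaks into the second except what the distillation step chose to keep.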

Agent Circuit Breakers: Why Step Budgets Are Fuses, Not Breakers

· 12 min read
Tian Pan
Software Engineer

Every team that ships agents to production eventually wakes up to the same kind of incident. An agent enters a state it cannot exit. It re-calls the same tool with cosmetically different arguments for six hours. It oscillates between two plans whose preconditions reject each other. It retries a transient 429 every two hundred milliseconds until morning. It generates a million-token plan it never executes. By the time anyone notices, the token bill is four figures, the downstream API is rate-limited, the customer's session has timed out twelve times, and the on-call engineer is being paged by three different alerts about the same root cause.

The first fix every team reaches for is a step-count budget. Cap the agent at twenty iterations. Cap it at fifty. Pick a number and ship. The step budget makes the incident reports stop, but it does not make the underlying problem go away — and once you understand the mechanism, you can see why a step budget is the agent equivalent of a household fuse: it blows after the damage has been done, the fuse box itself is now a maintenance burden, and the next time something melts, your reflex is to swap in a higher-rated fuse rather than ask what is actually shorting.
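To make the contrast concrete: a step budget counts iterations, while a breaker watches for the fault itself. The sketch below covers only one of the failure modes listed above, the same tool re-called with cosmetically different arguments. `RepeatCallBreaker`, its thresholds, and the normalization rule are hypothetical, and arguments are assumed to be JSON-serializable.

```python
import json
from collections import Counter, deque


class RepeatCallBreaker:
    """Trips when one normalized tool call recurs too often in a sliding window."""

    def __init__(self, window: int = 10, max_repeats: int = 3):
        self.recent: deque = deque(maxlen=window)
        self.max_repeats = max_repeats

    @staticmethod
    def signature(tool: str, args: dict) -> str:
        # Strip cosmetic differences (case, surrounding whitespace, key order).
        normalized = {
            k: v.strip().lower() if isinstance(v, str) else v
            for k, v in args.items()
        }
        return tool + ":" + json.dumps(normalized, sort_keys=True)

    def record(self, tool: str, args: dict) -> bool:
        """Record a call and return True if the breaker has tripped."""
        sig = self.signature(tool, args)
        self.recent.append(sig)
        return Counter(self.recent)[sig] >= self.max_repeats
```

Unlike a cap of twenty or fifty steps, this trips on the third near-identical call, and it leaves a signature behind that says what was looping.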

Agent Memory Is a Compliance Surface: The Records-Management System You Didn't Sign Up to Build

· 12 min read
Tian Pan
Software Engineer

The first compliance escalation against your agent memory layer almost never arrives as a regulator's letter. It arrives as a Jira ticket from your enterprise sales engineer that says "the customer's privacy team is blocking the contract — they want to know what 'forget my user' actually means in your system, and they want a written answer by Friday." That ticket lands six to twelve months after the memory layer shipped, and the engineering team that built it discovers, in the time it takes to read the question, that they accidentally built a records-management system without any of the primitives a records-management system is supposed to have.

This is the structural problem with long-term memory in agentic products. The team building it optimizes for the things memory is sold to do — retrieval quality, latency, storage cost, the felt-personalization that makes the assistant feel like it knows the user. Nobody in the design review prices the parallel system being built at the same time: a per-user, per-tenant, multi-region data store with retention obligations, deletion semantics, audit export requirements, and a regulator's clock that starts the moment the first user's data lands in it. Memory is not a feature. It is the operational surface that every privacy regime, every enterprise procurement questionnaire, and every right-to-erasure request will eventually find.

Background Agents and the Notification Budget: Why Proactive AI Hits a Hard Ceiling at User Attention

· 10 min read
Tian Pan
Software Engineer

The first generation of AI assistants waited politely. You typed, they answered. The second generation does not wait. It watches your calendar, scans your inbox, reads your repo activity, and surfaces "you should know about this" interruptions before you have asked for anything. The pitch is compelling and the demos are mesmerizing. The retention curves, once these features ship, are not.

There is a number nobody puts on the launch slide: the user has a daily ceiling on unsolicited AI updates, and it is roughly three to five across all sources combined. The proactive agent that ships its tenth notification of the week is the same agent the user mutes by Friday and uninstalls the following month. This is not a UX polish problem. It is the architectural blind spot of the entire proactive-AI category, and it deserves a name: the notification budget.
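A notification budget can be sketched as a per-user daily counter that every proactive feature has to draw from. `NotificationBudget` and `try_send` are hypothetical names; the default limit of three is the low end of the range quoted above, not a measured constant.

```python
from collections import defaultdict
from datetime import date


class NotificationBudget:
    """Shared per-user daily cap on unsolicited notifications, across all sources."""

    def __init__(self, daily_limit: int = 3):
        self.daily_limit = daily_limit
        self.sent: dict = defaultdict(int)  # (user_id, day) -> count sent so far

    def try_send(self, user_id: str, day: date) -> bool:
        """Spend one unit of budget if any remains; return whether sending is allowed."""
        key = (user_id, day)
        if self.sent[key] >= self.daily_limit:
            return False  # budget exhausted: drop, batch, or defer the notification
        self.sent[key] += 1
        return True
```

The design choice that matters is that the counter is shared: calendar, inbox, and repo watchers would all call the same `try_send`, so no single feature can spend the whole day's attention on its own.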

The MCP Capability Disclosure Tax: When Every Connected Server Bills Your Context Window

· 11 min read
Tian Pan
Software Engineer

Connect a single GitHub MCP server to your agent and you've already spent twelve to forty thousand tokens before the user types a word. Connect a filesystem server, a calendar, a database, an internal CRM, and a third-party tool catalog, and a heavy desktop configuration has been measured at sixty-six thousand tokens of pure tool disclosure — nearly a third of Claude Sonnet's 200K window, paid every single planning turn. The agent hasn't done anything yet. The user hasn't asked anything yet. The bill is already running.
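The arithmetic behind "nearly a third" is easy to check. The snippet below uses only the figures quoted above (66,000 tokens of disclosure, a 200K window, and the 12,000 to 40,000 range for a single server); `disclosure_share` is a hypothetical helper name.

```python
def disclosure_share(tool_tokens: int, context_window: int) -> float:
    """Fraction of the context window consumed by tool descriptions alone."""
    return tool_tokens / context_window


WINDOW = 200_000

heavy = disclosure_share(66_000, WINDOW)
single_low = disclosure_share(12_000, WINDOW)
single_high = disclosure_share(40_000, WINDOW)

print(f"heavy desktop configuration: {heavy:.0%}")                  # 33%
print(f"single server: {single_low:.0%} to {single_high:.0%}")      # 6% to 20%
```

Because the descriptions are resent on every planning turn, that share is a recurring cost per turn rather than a one-time setup cost.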

This is the disclosure tax, and it is the most underpriced line item in agentic systems shipping right now. Teams add MCP servers the way teams once added microservices — each integration looks like a free composition primitive, the procurement story writes itself ("more tools = more capability"), and the unit economics dashboard never surfaces the per-server cost because the cost lives inside a token bucket nobody attributes back to the connector. The result is an agent that gets slower, dumber, and more expensive every time someone adds another integration, and a team that explains the regression by re-tuning prompts and chasing the model vendor for a new version.

The Agentic Stamp: When Marketing Names It and Engineering Pays the Operational Bill

· 10 min read
Tian Pan
Software Engineer

A product marketing manager writes "AI agent" in a launch brief. The press release goes out describing autonomous decision-making. Six weeks later, engineering is staring at a Jira board full of "agent observability" tickets they never scoped for a system that is, in fact, a single prompt followed by a hardcoded tool dispatch. Nobody lied. Nobody made a technical error. The team just learned that the word "agent" is not a description — it is a stamp, and the stamp carries operational implications that engineering inherits whether or not the implementation justifies them.

This is the internal version of what Gartner now calls "agent washing." The external version — vendors rebranding chatbots as agents to ride the hype cycle — gets the press coverage. The internal version is quieter and more expensive, because the bill falls on people who can't push back at the moment the term gets approved.