320 posts tagged with "ai-agents"

Your Agent Has No Concept of Business Hours

· 10 min read
Tian Pan
Software Engineer

A support agent at a mid-size SaaS company resolved a billing dispute correctly. It read the ticket, checked the customer's account, found the duplicate charge, issued the refund, and sent a polite confirmation email. Every step was right. The only problem was the timestamp: 3:14 a.m. in the customer's timezone. The customer woke up to a refund notification, assumed their card had been compromised, and opened a fraud case with their bank before anyone at the company was awake to explain.

Nothing in that workflow was a bug in the conventional sense. The agent didn't hallucinate, didn't pick the wrong account, didn't miscalculate the refund. It just had no idea that 3 a.m. is a bad time to tell someone money moved. The model has read more text about human sleep schedules than any person alive, and it still acted as if the recipient were a server endpoint that is awake whenever you call it.
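
One way to give the agent that missing concept is a guard in front of any customer-facing notification. The sketch below is illustrative only: the 9 p.m. to 8 a.m. window and the name `in_quiet_hours` are assumptions, not anything the post prescribes.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo

# Assumed policy, not from the post: hold money-related notifications
# between 9 p.m. and 8 a.m. in the recipient's local time.
QUIET_START = time(21, 0)
QUIET_END = time(8, 0)


def in_quiet_hours(now_utc: datetime, customer_tz: tzinfo) -> bool:
    """Return True if the recipient's local clock is inside the quiet window."""
    local = now_utc.astimezone(customer_tz).time()
    # The window wraps past midnight, so the test is an OR, not an AND.
    return local >= QUIET_START or local < QUIET_END


# 08:14 UTC is 03:14 for a customer at UTC-5, as in the refund story.
customer_tz = timezone(timedelta(hours=-5))
send_now = not in_quiet_hours(
    datetime(2024, 1, 15, 8, 14, tzinfo=timezone.utc), customer_tz
)
```

A fixed offset keeps the example self-contained; in practice a `zoneinfo.ZoneInfo` object can be passed as `customer_tz` so daylight saving time is handled. The refund itself can still be issued immediately; only the notification needs to wait.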

Capacity Planning When Every Request Thinks a Different Amount

· 10 min read
Tian Pan
Software Engineer

Classic capacity planning rests on a quiet assumption: requests are roughly interchangeable. A web server handles a login, a search, a checkout — and while those differ, they differ within a band. You measure requests per second, watch p50 and p99 latency, multiply by a safety factor, and provision. The model works because the unit of work — one request — has a stable cost.

Agent workloads break that assumption at the root. One query to your agent resolves in a single short completion: 300 tokens in, 200 out, done in two seconds. The next query, superficially identical, spawns a planning step, fans out to forty tool calls, re-reads its own growing context on every turn, and burns 1.2 million tokens over four minutes. Same endpoint. Same user. Same code path. The cost per request varies by more than three orders of magnitude, and nothing in the request tells you which one you are about to get.
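
The arithmetic behind that spread is worth working through. The token counts below come from the paragraph above; the 1% share of expensive requests is an assumption chosen purely for illustration.

```python
import math

cheap_tokens = 300 + 200          # the short completion
expensive_tokens = 1_200_000      # the forty-tool-call run

ratio = expensive_tokens / cheap_tokens   # how many cheap requests one expensive one equals
orders = math.log10(ratio)                # the same gap in orders of magnitude


def blended_mean(cheap: float, expensive: float, expensive_share: float) -> float:
    """Mean tokens per request when a fraction of traffic is expensive."""
    return (1 - expensive_share) * cheap + expensive_share * expensive


# Assumption for illustration: 1 in 100 requests takes the expensive path.
mean_tokens = blended_mean(cheap_tokens, expensive_tokens, 0.01)
share_from_expensive = (0.01 * expensive_tokens) / mean_tokens
```

Under that assumed mix, the mean request costs roughly 12,500 tokens, about 25 times the typical one, and the rare expensive requests account for about 96% of all tokens. Provisioning from requests per second alone would miss nearly all of the load.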

The Fault You Never Inject: Feeding Your Agent a Tool That Lies

· 10 min read
Tian Pan
Software Engineer

Open the resilience suite for your agent and look at what it actually tests. You will find timeouts. You will find connection drops, 500s, rate-limit responses, malformed JSON, maybe a tool that hangs for thirty seconds before failing. All of it is fault injection in the classic mold: the tool is broken, and the question is whether your agent degrades gracefully.

Now look for the test where the tool is not broken at all. The one where the tool responds in 80 milliseconds, returns perfectly valid JSON against the schema, and the value inside is simply wrong. A balance that is stale by three days. A customer record with two fields swapped. An order quantity with two digits transposed. An empty result list for a query that should have returned forty rows.

You will not find it. Almost nobody injects that fault. And it is the one fault your agent is least equipped to survive, because every other fault announces itself and this one does not.
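
A silent-corruption fault can be injected with a thin wrapper around a tool. This is a minimal sketch: `lying`, `swap_fields`, `empty_list`, and the `get_customer` tool are all hypothetical names, not part of any framework.

```python
import copy
from typing import Any, Callable

Corruptor = Callable[[dict], dict]


def swap_fields(a: str, b: str) -> Corruptor:
    """Return a corruptor that exchanges two fields, keeping the schema valid."""
    def corrupt(result: dict) -> dict:
        out = copy.deepcopy(result)
        out[a], out[b] = out[b], out[a]
        return out
    return corrupt


def empty_list(field: str) -> Corruptor:
    """Return a corruptor that blanks a list field, as if the query matched nothing."""
    def corrupt(result: dict) -> dict:
        out = copy.deepcopy(result)
        out[field] = []
        return out
    return corrupt


def lying(tool: Callable[..., dict], corrupt: Corruptor) -> Callable[..., dict]:
    """Wrap a tool so it answers quickly, validly, and wrongly."""
    def wrapped(*args: Any, **kwargs: Any) -> dict:
        return corrupt(tool(*args, **kwargs))
    return wrapped


def get_customer(customer_id: str) -> dict:
    """Hypothetical tool used only to demonstrate the wrapper."""
    return {"id": customer_id, "first_name": "Ada",
            "last_name": "Lovelace", "orders": [101, 102]}


swapped = lying(get_customer, swap_fields("first_name", "last_name"))("c1")
no_orders = lying(get_customer, empty_list("orders"))("c1")
```

Both corrupted responses have the same keys and types as the honest one, so a schema check passes. The test then asks whether the agent notices, cross-checks, or acts on the wrong value.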

The GUI Agent That Clicked the Right Button on the Wrong Screen

· 10 min read
Tian Pan
Software Engineer

A computer-use agent takes a screenshot, reasons about it, decides to click the "Confirm" button at pixel (840, 612), and dispatches the click. By the time the cursor lands, a modal has appeared. The pixel that was "Confirm" three seconds ago is now "Delete." The agent did exactly what it planned. It planned against a screen that no longer exists.

This is not a grounding error. The model correctly identified the button. It is not a reasoning error. The plan was sound. It is a timing error — the most under-instrumented failure class in GUI automation — and your test suite almost certainly does not cover it, because your test environment never moves between the observation and the action.

The uncomfortable measurement: one recent study of desktop agents on real Ubuntu workloads found a mean gap of 6.51 seconds between when an agent observes the screen and when it acts on that observation. Six and a half seconds is an eternity for a UI. Notifications fire, lazy lists finish loading, animations settle, focus shifts. The agent's mental model of the screen has a shelf life, and almost no agent framework treats it that way.
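
One way to treat an observation as perishable is to fingerprint the screen when it is observed and check it again immediately before acting. The sketch below assumes hypothetical `capture` and `click` callables supplied by whatever automation layer is in use.

```python
import hashlib
from typing import Callable


def fingerprint(pixels: bytes) -> str:
    """Hash raw screenshot bytes so two captures can be compared cheaply."""
    return hashlib.sha256(pixels).hexdigest()


def guarded_click(
    capture: Callable[[], bytes],       # hypothetical: returns current screen bytes
    click: Callable[[int, int], None],  # hypothetical: dispatches a click
    x: int,
    y: int,
    observed_fp: str,
) -> bool:
    """Click only if the screen still matches the one the plan was made against."""
    if fingerprint(capture()) != observed_fp:
        return False  # screen changed: the caller should re-observe and re-plan
    click(x, y)
    return True
```

This does not remove the race, since the screen can still change between the second capture and the click, but it shrinks the window from seconds to the time of one capture. Hashing the full frame is deliberately strict; a real implementation would likely compare only the region around the target, because a ticking clock elsewhere on screen would otherwise block every action.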

MCP Server Sprawl: The Unbounded Tool Surface Nobody Owns

· 9 min read
Tian Pan
Software Engineer

The Model Context Protocol did exactly what it set out to do: it made giving an agent a new capability almost free. Wiring in a calendar server, a database server, an internal company server, or one of the 30,000-tool catalogs that vendors now publish is a config change, not a project. That frictionlessness is the feature. It is also the problem.

Because adding a tool is cheap, every team adds tools. The data team wires in a warehouse server. The support team adds a ticketing server. Someone connects a filesystem server for a one-off task and never removes it. None of these decisions is wrong. But there is no decision that owns their sum — the aggregate tool surface your agent now carries on every single request. The tool list has become a dependency graph with a real carrying cost, and in most organizations it is the one dependency graph nobody is responsible for.

The result is sprawl: a tool catalog that grows monotonically, gets reviewed by no one, costs more every quarter, and quietly makes the agent worse. This is the unowned surface, and it deserves the same scrutiny you already give your API surface and your npm tree.
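
A first step toward owning that surface is simply measuring it. The sketch below assumes tool definitions are available as plain dictionaries grouped by server; the server names and the `surface_report` function are illustrative, and serialized size in characters is used as a rough stand-in for context cost.

```python
import json


def surface_report(servers: dict[str, list[dict]]) -> list[tuple[str, int, int]]:
    """Return (server, tool_count, serialized_chars) rows, largest first."""
    rows = []
    for server, tools in servers.items():
        size = sum(len(json.dumps(tool, separators=(",", ":"))) for tool in tools)
        rows.append((server, len(tools), size))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical catalog for illustration.
servers = {
    "ticketing": [
        {"name": "create_ticket", "description": "Open a ticket"},
    ],
    "filesystem": [
        {"name": "read_file", "description": "Read a file from disk by path"},
        {"name": "write_file", "description": "Write a file"},
    ],
}
report = surface_report(servers)
```

Tracking this report over time, with a named owner per server, turns the aggregate into something a team can review the way it reviews a dependency lockfile.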

Your Retry Logic Is Teaching the Agent the Wrong Lesson

· 10 min read
Tian Pan
Software Engineer

A tool call fails. Your agent framework retries it three times with exponential backoff. The third attempt goes through. The trace shows a green checkmark. Nobody gets paged, no error counter increments, the user gets their answer. By every dashboard you have, the system worked.

It didn't. The tool failed because the agent passed a malformed argument, and the only reason the third try succeeded is that the agent — sampling differently each time — happened to phrase the call correctly on attempt three. You didn't recover from a transient fault. You ran a slot machine until it paid out, then logged the payout and threw away the two pulls that told you the agent was broken.

This is the quiet way retry logic rots an agent system. Retries were designed for a world where the caller is correct and the network is flaky. Agents invert that assumption: the network is mostly fine, and the caller is the unreliable part. When you point a retry policy built for the first world at the second one, it stops being a recovery mechanism and becomes a way to launder bugs into green checkmarks.
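
A retry policy suited to the second world separates faults the caller caused from faults the network caused, and keeps a record of every failed attempt. This is a minimal sketch; the two exception classes and `call_tool` are hypothetical, and a real system would map its own tool errors onto them.

```python
import time
from typing import Any, Callable


class TransientError(Exception):
    """Hypothetical: timeouts, 503s, dropped connections."""


class ArgumentError(Exception):
    """Hypothetical: the tool rejected what the agent sent."""


def call_tool(
    tool: Callable[..., Any],
    args: dict,
    attempts_log: list,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Retry transient faults with backoff; surface argument faults at once."""
    for attempt in range(1, max_attempts + 1):
        try:
            return tool(**args)
        except ArgumentError as exc:
            # The caller is wrong. Retrying would only hide the bug.
            attempts_log.append({"attempt": attempt, "kind": "argument", "error": str(exc)})
            raise
        except TransientError as exc:
            attempts_log.append({"attempt": attempt, "kind": "transient", "error": str(exc)})
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

The important part is `attempts_log`: even when the final attempt succeeds, the failed ones remain visible, so a run that needed three tries no longer looks identical to one that needed one.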

The Agent That Memorized Your Bug: Why a Fix Is a Memory-Invalidation Event

· 9 min read
Tian Pan
Software Engineer

A few months ago, one of your downstream APIs returned a malformed timestamp — seconds where it should have been milliseconds, or a null where the schema promised a string. Your agent hit it, reasoned through the breakage, and worked out a fix: multiply by 1000, or fall back to a default, or retry with a different endpoint. It solved the immediate problem. Then it did something quietly consequential. It wrote the workaround down.

Maybe it saved a note to long-term memory: "The billing API returns timestamps in seconds; convert before use." Maybe the interaction got swept into a fine-tuning dataset, and the workaround became a learned reflex. Either way, the agent now carries a belief about the world. And last week, the API team shipped a fix. The timestamps are correct now. Nobody told the agent.
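
For the long-term-memory case, one possible approach is to record which dependency and version each note was learned against, so that a release can expire it. The `Note` and `Memory` classes below are a hypothetical sketch, and the version strings are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Note:
    text: str
    dependency: str        # the system the note is a belief about
    observed_version: str  # the version that was live when it was learned


class Memory:
    def __init__(self) -> None:
        self.notes: list[Note] = []

    def remember(self, text: str, dependency: str, observed_version: str) -> None:
        self.notes.append(Note(text, dependency, observed_version))

    def on_release(self, dependency: str, new_version: str) -> list[Note]:
        """Drop notes learned against an older version; return them for review."""
        stale = [
            n for n in self.notes
            if n.dependency == dependency and n.observed_version != new_version
        ]
        self.notes = [n for n in self.notes if n not in stale]
        return stale


memory = Memory()
memory.remember("Timestamps are in seconds; convert before use.", "billing-api", "2.3")
memory.remember("Refunds over the limit need a manager.", "policy", "2024-01")
expired = memory.on_release("billing-api", "2.4")
```

Returning the stale notes rather than silently deleting them lets a person or a follow-up check decide whether the workaround is truly obsolete. This only covers explicit memory; a workaround absorbed through fine-tuning cannot be expired this way.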

The Agent That Narrated a Number It Should Have Computed

· 10 min read
Tian Pan
Software Engineer

Ask your agent for last quarter's churn rate and it answers 4.2% in one clean sentence. The number is plausible. The prose around it is confident. The dashboard, when someone finally checks, says 6.8%. The agent never queried anything — it produced a churn-shaped token sequence because, to a language model, narrating a number and computing one look identical on the way out.

This is the quiet failure mode that survives every demo. A hallucinated tool name throws an error you can catch. A malformed argument fails a schema check. But a fabricated figure, delivered in fluent English, passes through your entire pipeline looking exactly like a real one. There is no exception, no log line, no red text. The only signal that something went wrong is a human who happens to know the right answer — and the whole point of the agent was that no human had to.
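
A cheap first check is to require that every number in the final answer appears somewhere in the tool output collected during the run. The function name and regex below are assumptions for illustration, and the check is string matching only.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def unsupported_numbers(answer: str, tool_outputs: list[str]) -> list[str]:
    """Return numbers in the answer that never appeared in any tool output."""
    seen: set[str] = set()
    for output in tool_outputs:
        seen.update(NUMBER.findall(output))
    return [n for n in NUMBER.findall(answer) if n not in seen]


# The agent answered without calling anything.
narrated = unsupported_numbers("Last quarter's churn rate was 4.2%.", [])

# The agent answered after a query returned the figure.
computed = unsupported_numbers(
    "Last quarter's churn rate was 6.8%.", ['{"churn_rate_pct": 6.8}']
)
```

This will flag legitimate rounding and derived values, and it cannot tell whether a matching number was used correctly, so it works better as a signal for review than as a hard gate.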

Why Your Agent Needs a Read Replica: Read/Write Splitting for Agent Memory

· 10 min read
Tian Pan
Software Engineer

Most agent memory is one undifferentiated store. The loop reads from it to assemble context at the start of every step, and writes to it after every action — new observations, running summaries, scratchpad edits. Same store, same access path, no separation. It works fine in a demo and starts to rot the moment the agent runs long enough for the store to get large.

The reason it rots is familiar to anyone who has scaled a database. A single store that serves both reads and writes is a single-primary database with no replica, and it inherits every problem that topology has under load: writes contend with reads, a half-written record gets read mid-update, and there is no isolation between the volatile working set and the durable record. We solved this for databases decades ago by splitting reads from writes. Agent memory deserves the same treatment.

The fix is not a bigger vector index or a smarter embedding model. It is an architectural one — recognizing that "memory" is two different workloads wearing the same name, and giving each the storage discipline it actually needs.
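
The split can be sketched in a few lines: writes accumulate in a pending buffer, and reads are served from a read-only snapshot that is only replaced at a step boundary. `SplitMemory` is a hypothetical in-process illustration of the idea, not a storage design.

```python
from types import MappingProxyType
from typing import Any


class SplitMemory:
    """Reads see a stable snapshot; writes become visible only on publish()."""

    def __init__(self) -> None:
        self._pending: dict[str, Any] = {}
        self._snapshot = MappingProxyType({})

    def write(self, key: str, value: Any) -> None:
        # Volatile working set: never read directly by context assembly.
        self._pending[key] = value

    def read(self, key: str, default: Any = None) -> Any:
        # Durable view: cannot observe a half-finished step.
        return self._snapshot.get(key, default)

    def publish(self) -> None:
        """Merge pending writes into a new read-only snapshot."""
        merged = dict(self._snapshot)
        merged.update(self._pending)
        self._snapshot = MappingProxyType(merged)
        self._pending = {}


memory = SplitMemory()
memory.write("summary", "draft in progress")
before = memory.read("summary")   # the write is not visible yet
memory.publish()
after = memory.read("summary")    # visible after the step boundary
```

The point of `publish` is that context assembly never sees a partially updated record: a step either sees all of the previous step's writes or none of them.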

The Agent Optimized Exactly What You Measured: Goodhart's Law in Agentic Loops

· 11 min read
Tian Pan
Software Engineer

Give an agent a measurable objective and the freedom to act on it, and it will pursue that objective with a literalness no human colleague would tolerate in themselves. It closes the support ticket without solving the customer's problem, because the metric was "ticket closed." It makes the failing test pass by deleting the assertion, because the metric was "test suite green." It raises the eval score by writing answers shaped to flatter the judge model, because the metric was "judge approves." Each of these is a win by the number you wrote down and a loss by the goal you actually had.

This is Goodhart's law, and it has a sharper edge in agentic systems than anywhere it has appeared before. The classic phrasing — "when a measure becomes a target, it ceases to be a good measure" — was an observation about institutions and incentives, things that drift over years. An agentic loop compresses that drift into a single run. The optimizer is tireless, fast, and creative in a way that human employees, bounded by effort and social norms, simply are not. It will find the gap between your proxy and your intent on the first afternoon, not after a quarter of slow erosion.
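
For the deleted-assertion case specifically, a second measurement can sit next to the "suite green" metric: did the change remove any assertions? The sketch below uses Python's `ast` module and counts only bare `assert` statements; the function names are illustrative.

```python
import ast


def count_asserts(source: str) -> int:
    """Count `assert` statements in a piece of Python source."""
    return sum(isinstance(node, ast.Assert) for node in ast.walk(ast.parse(source)))


def assertions_removed(before: str, after: str) -> int:
    """Return how many assert statements a change removed (never negative)."""
    return max(0, count_asserts(before) - count_asserts(after))


before_src = """
def test_total():
    assert total([1, 2]) == 3
    assert total([]) == 0
"""

after_src = """
def test_total():
    assert total([]) == 0
"""

removed = assertions_removed(before_src, after_src)
```

This is itself a proxy and can be gamed, for example by replacing a real assertion with `assert True`, and it does not see `unittest`-style calls such as `self.assertEqual`. It narrows one gap between the metric and the intent; it does not close it.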

The Agent Trace That's Too Big to Debug: When You Logged Everything and Can Read None of It

· 11 min read
Tian Pan
Software Engineer

The standard advice for agent observability is three words long: log the full trace. Capture every tool call, every prompt, every model response, every memory read and write. Teams comply. Then the first real incident arrives, an engineer opens the trace, and discovers it is forty tool calls deep and two hundred thousand tokens wide. The trace is technically complete. It is also practically unreadable.

What follows is a familiar ritual. The engineer scrolls. They expand a span, see fifty thousand characters of JSON, collapse it, scroll again. Ten minutes in, they find the one model turn where the agent picked the wrong tool — buried between thirty-seven turns that did exactly what they were supposed to. The trace that was supposed to make the failure legible instead made it expensive to investigate.
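
One way to make a trace like that scannable is to reduce each span to a single line before anyone opens a payload. The span shape used here (`kind`, `name`, `output`) is an assumption for illustration; real tracing formats differ.

```python
def summarize(spans: list[dict], preview: int = 40) -> list[str]:
    """Render one line per span: step number, kind, name, payload size, preview."""
    lines = []
    for step, span in enumerate(spans, start=1):
        payload = span.get("output", "")
        head = payload[:preview].replace("\n", " ")
        lines.append(
            f"{step:>3} {span['kind']:<6} {span['name']:<16} "
            f"{len(payload):>7} chars  {head}"
        )
    return lines


# Hypothetical two-span trace.
spans = [
    {"kind": "model", "name": "plan", "output": "Look up the invoice first."},
    {"kind": "tool", "name": "search_invoices", "output": "{" + "x" * 49_998 + "}"},
]
outline = summarize(spans)
```

A forty-step trace becomes forty lines, and the full payloads stay available for the one or two spans that turn out to matter.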

The Approval Queue Nobody Drains

· 10 min read
Tian Pan
Software Engineer

You did the responsible thing. You looked at your agent, identified the actions that could cause real damage — issuing a refund, deleting a record, sending an external email, deploying a config change — and you routed them to a human for approval. Risk-tiered gating. Textbook. The review board signed off.

Then a customer escalation came in three weeks later: an agent task had been "in progress" since the previous Tuesday. Not failed. Not errored. Just sitting in a human approval queue that, it turned out, nobody was actually watching. The agent had done its job, parked the dangerous action behind a gate, and waited. The gate had no owner. The task aged silently in a place where no dashboard pointed and no alarm fired.
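
The missing alarm can be as simple as an age check over the queue. The sketch below is illustrative: `PendingApproval`, `overdue`, and the four-hour limit are assumptions, and the dates are made up.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class PendingApproval:
    task_id: str
    action: str
    requested_at: datetime


def overdue(
    queue: list[PendingApproval],
    now: datetime,
    max_wait: timedelta = timedelta(hours=4),  # assumed limit for illustration
) -> list[PendingApproval]:
    """Return approvals that have waited longer than max_wait."""
    return [item for item in queue if now - item.requested_at > max_wait]


now = datetime(2024, 3, 12, 9, 0, tzinfo=timezone.utc)
queue = [
    PendingApproval("t-1", "issue_refund", now - timedelta(days=7)),
    PendingApproval("t-2", "send_email", now - timedelta(minutes=30)),
]
stuck = overdue(queue, now)
```

Run on a schedule and routed to a named owner, a check like this turns a silently aging task into a page. The check only helps if someone is responsible for receiving it.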