311 posts tagged with "ai-agents"

The Feature Store Your Agent Reinvented Badly

· 10 min read
Tian Pan
Software Engineer

Watch a support agent handle one conversation, and count how many times it computes "churn risk." First when it triages the ticket. Again when it decides whether to offer a discount. A third time when it drafts the escalation summary. Each time, it re-reads the raw orders table, re-runs an inline aggregation, and produces a number. The three numbers don't match. Nobody notices, because they were never written down next to each other.

This is feature engineering. The agent is doing it on every turn, in prose, and doing it worse than a pipeline you would have laughed out of code review a decade ago.

The machine learning world already solved this. The solution is called a feature store, and the discipline it enforces — compute a feature once, name it, version it, serve it consistently — is exactly the discipline an agent throws away the moment you hand it a database tool. Your agent didn't avoid building a feature pipeline. It built one. It just built the worst one in the building.
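
To make "compute once, name it, version it, serve it consistently" concrete, here is a minimal in-memory sketch. Everything in it is an assumption for illustration: the `FeatureStore` class, the `ORDERS` data, and the `churn_risk_v1` formula are hypothetical, and a real feature store adds persistence, freshness rules, and backfills.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class FeatureStore:
    """Toy store: a feature is named, versioned, and computed once per entity."""
    _definitions: Dict[Tuple[str, int], Callable[[str], float]] = field(default_factory=dict)
    _cache: Dict[Tuple[str, int, str], float] = field(default_factory=dict)
    compute_calls: int = 0  # counts how often a definition actually ran

    def register(self, name: str, version: int, fn: Callable[[str], float]) -> None:
        self._definitions[(name, version)] = fn

    def get(self, name: str, version: int, entity_id: str) -> float:
        key = (name, version, entity_id)
        if key not in self._cache:
            # First request computes the value; later requests reuse it.
            self.compute_calls += 1
            self._cache[key] = self._definitions[(name, version)](entity_id)
        return self._cache[key]


# Hypothetical raw data standing in for the orders table.
ORDERS = {"cust-1": [120.0, 0.0, 0.0]}


def churn_risk_v1(customer_id: str) -> float:
    """Illustrative definition only: share of periods with no spend."""
    orders = ORDERS[customer_id]
    inactive = sum(1 for amount in orders if amount == 0.0)
    return inactive / len(orders)


store = FeatureStore()
store.register("churn_risk", 1, churn_risk_v1)

# The three moments from the conversation all read the same served value.
triage = store.get("churn_risk", 1, "cust-1")
discount = store.get("churn_risk", 1, "cust-1")
summary = store.get("churn_risk", 1, "cust-1")
```

The design point is that the agent's tool surface exposes `get("churn_risk", 1, ...)` rather than the raw table, so the three numbers cannot drift apart within a conversation.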

Your Happy Path Is Your Expensive Path: The Agent That Costs More When It Wins

· 10 min read
Tian Pan
Software Engineer

A failed agent run is cheap. It misroutes a query, hits a dead end, returns "I couldn't help with that," and burns maybe a few hundred tokens doing it. A successful run is the disaster. It retrieves context, reflects on it, calls three tools, reflects again, and stitches together a confident multi-paragraph answer — fifty times the token spend of the failure that cost you nothing.

This is the uncomfortable shape of agent economics: your happy path is your expensive path. The outcome you are selling, the one your marketing page promises, the one users thank you for, is the single most costly thing your system can do. And if you priced the product the way SaaS has been priced for fifteen years — a flat monthly fee per seat — then every time the agent does its job well, it quietly erodes your margin.

Most teams discover this backwards. They watch cost dashboards, see failures are cheap, and conclude that reliability work will save money. It won't. Raising your success rate raises your bill.
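
The arithmetic is easy to check. The sketch below assumes a failed run costs 300 tokens (one reading of "a few hundred") and a successful run costs fifty times that; both figures are illustrative, not measurements.

```python
FAILURE_TOKENS = 300                   # assumption: "a few hundred tokens"
SUCCESS_TOKENS = 50 * FAILURE_TOKENS   # "fifty times the token spend"


def expected_tokens(success_rate: float) -> float:
    """Expected tokens per run for a given success rate in [0, 1]."""
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be between 0 and 1")
    return success_rate * SUCCESS_TOKENS + (1.0 - success_rate) * FAILURE_TOKENS


at_70 = expected_tokens(0.70)
at_90 = expected_tokens(0.90)
increase = at_90 / at_70  # how much the per-run bill grows as reliability improves
```

Under these assumptions, moving the success rate from 70% to 90% raises expected spend per run by roughly a quarter, with no change in traffic.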

Your Agent Endpoint Is a Distributed System Pretending to Be a Function Call

· 9 min read
Tian Pan
Software Engineer

The most dangerous line of code in a modern AI application looks completely innocent:

result = await agent.run(user_query)

It reads like a function call. It has a name, it takes an argument, it returns a value. Your IDE autocompletes it. Your type checker is satisfied. And that single await is hiding a remote, multi-hop, partially-failing distributed system behind the syntax of a local procedure. The gap between what the code looks like and what it actually does is where most production agent incidents live.
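
One way to stop pretending is to give the call the things any remote call gets: a deadline and an explicit failure value. This is a sketch under assumptions, not a prescription. `FakeAgent`, `AgentOutcome`, and `run_with_deadline` are hypothetical names; only `asyncio.wait_for` is a real standard-library call.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgentOutcome:
    """Explicit result type so callers must handle the failure branch."""
    ok: bool
    value: Optional[str] = None
    error: Optional[str] = None


class FakeAgent:
    """Hypothetical stand-in for an agent client; `delay` simulates slow hops."""

    def __init__(self, delay: float) -> None:
        self.delay = delay

    async def run(self, user_query: str) -> str:
        await asyncio.sleep(self.delay)
        return f"answer to: {user_query}"


async def run_with_deadline(agent: FakeAgent, user_query: str, timeout_s: float) -> AgentOutcome:
    try:
        # wait_for cancels the inner call if the deadline passes.
        value = await asyncio.wait_for(agent.run(user_query), timeout=timeout_s)
        return AgentOutcome(ok=True, value=value)
    except asyncio.TimeoutError:
        return AgentOutcome(ok=False, error="deadline exceeded")


fast = asyncio.run(run_with_deadline(FakeAgent(0.0), "refund status", 0.5))
slow = asyncio.run(run_with_deadline(FakeAgent(0.2), "refund status", 0.05))
```

A deadline covers only one of the failure modes a multi-hop system has, but it makes the call site admit that there are failure modes at all.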

Your Agent Has No Concept of Business Hours

· 10 min read
Tian Pan
Software Engineer

A support agent at a mid-size SaaS company resolved a billing dispute correctly. It read the ticket, checked the customer's account, found the duplicate charge, issued the refund, and sent a polite confirmation email. Every step was right. The only problem was the timestamp: 3:14 a.m. in the customer's timezone. The customer woke up to a refund notification, assumed their card had been compromised, and opened a fraud case with their bank before anyone at the company was awake to explain.

Nothing in that workflow was a bug in the conventional sense. The agent didn't hallucinate, didn't pick the wrong account, didn't miscalculate the refund. It just had no idea that 3 a.m. is a bad time to tell someone money moved. The model has read more text about human sleep schedules than any person alive, and it still acted as if the recipient were a server endpoint that is awake whenever you call it.
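
A guard for this can be very small. The sketch below assumes an 08:00 to 20:00 contact window and a fixed UTC-5 offset for the customer; both are illustrative choices, as is the date. A production version would more likely resolve the recipient's zone with `zoneinfo.ZoneInfo` so that daylight saving shifts are handled.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo


def is_reasonable_hour(
    now_utc: datetime,
    recipient_tz: tzinfo,
    start: time = time(8, 0),   # assumed start of the contact window
    end: time = time(20, 0),    # assumed end of the contact window
) -> bool:
    """True if `now_utc` falls inside the recipient's local contact window."""
    local = now_utc.astimezone(recipient_tz)
    return start <= local.time() < end


# Hypothetical customer at a fixed UTC-5 offset.
CUSTOMER_TZ = timezone(timedelta(hours=-5))

refund_ready_at = datetime(2024, 1, 15, 8, 14, tzinfo=timezone.utc)  # 03:14 local
mid_morning = datetime(2024, 1, 15, 15, 0, tzinfo=timezone.utc)      # 10:00 local

send_now = is_reasonable_hour(refund_ready_at, CUSTOMER_TZ)
send_later = is_reasonable_hour(mid_morning, CUSTOMER_TZ)
```

The refund itself can still be issued immediately; the check gates only the notification, which is the step that has a human on the other end.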

Capacity Planning When Every Request Thinks a Different Amount

· 10 min read
Tian Pan
Software Engineer

Classic capacity planning rests on a quiet assumption: requests are roughly interchangeable. A web server handles a login, a search, a checkout — and while those differ, they differ within a band. You measure requests per second, watch p50 and p99 latency, multiply by a safety factor, and provision. The model works because the unit of work — one request — has a stable cost.

Agent workloads break that assumption at the root. One query to your agent resolves in a single short completion: 300 tokens in, 200 out, done in two seconds. The next query, superficially identical, spawns a planning step, fans out to forty tool calls, re-reads its own growing context on every turn, and burns 1.2 million tokens over four minutes. Same endpoint. Same user. Same code path. The cost per request varied by three orders of magnitude, and nothing in the request told you which one you were about to get.
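
The "three orders of magnitude" claim follows directly from the two figures in the paragraph above, as this short check shows.

```python
import math

cheap_tokens = 300 + 200       # single short completion: 300 in, 200 out
expensive_tokens = 1_200_000   # the fan-out run from the same endpoint

ratio = expensive_tokens / cheap_tokens
orders_of_magnitude = math.log10(ratio)
```

The ratio is 2,400 to 1, a little over three orders of magnitude, between two requests that a requests-per-second dashboard counts as one unit each.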

The Fault You Never Inject: Feeding Your Agent a Tool That Lies

· 10 min read
Tian Pan
Software Engineer

Open the resilience suite for your agent and look at what it actually tests. You will find timeouts. You will find connection drops, 500s, rate-limit responses, malformed JSON, maybe a tool that hangs for thirty seconds before failing. All of it is fault injection in the classic mold: the tool is broken, and the question is whether your agent degrades gracefully.

Now look for the test where the tool is not broken at all. The one where the tool responds in 80 milliseconds, returns perfectly valid JSON against the schema, and the value inside is simply wrong. A balance that is stale by three days. A customer record with two fields swapped. An order quantity with two digits transposed. An empty result list for a query that should have returned forty rows.

You will not find it. Almost nobody injects that fault. And it is the one fault your agent is least equipped to survive, because every other fault announces itself and this one does not.
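
Injecting this fault does not require much machinery. Here is a minimal sketch, assuming a hypothetical `get_order` tool and a hand-written schema check; the wrapper returns a fast, well-formed response whose quantity has its digits transposed.

```python
from typing import Any, Callable, Dict

Tool = Callable[[str], Dict[str, Any]]


def get_order(order_id: str) -> Dict[str, Any]:
    """Hypothetical tool returning an order record."""
    return {"order_id": order_id, "quantity": 42, "status": "shipped"}


def transpose_digits(tool: Tool, field: str) -> Tool:
    """Wrap a tool so one integer field comes back with first and last digits swapped."""

    def lying_tool(arg: str) -> Dict[str, Any]:
        result = dict(tool(arg))  # copy so the honest result is untouched
        digits = str(result[field])
        if len(digits) >= 2:
            result[field] = int(digits[-1] + digits[1:-1] + digits[0])
        return result

    return lying_tool


def matches_schema(payload: Dict[str, Any]) -> bool:
    """Illustrative schema check: right keys, right types, nothing more."""
    return (
        isinstance(payload.get("order_id"), str)
        and isinstance(payload.get("quantity"), int)
        and isinstance(payload.get("status"), str)
    )


honest = get_order("A-1")
corrupted = transpose_digits(get_order, "quantity")("A-1")
```

The corrupted payload passes the same schema check as the honest one, which is the point: the test then asserts on what the agent does next, not on whether an error was raised.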

The GUI Agent That Clicked the Right Button on the Wrong Screen

· 10 min read
Tian Pan
Software Engineer

A computer-use agent takes a screenshot, reasons about it, decides to click the "Confirm" button at pixel (840, 612), and dispatches the click. By the time the cursor lands, a modal has appeared. The pixel that was "Confirm" three seconds ago is now "Delete." The agent did exactly what it planned. It planned against a screen that no longer exists.

This is not a grounding error. The model correctly identified the button. It is not a reasoning error. The plan was sound. It is a timing error — the most under-instrumented failure class in GUI automation — and your test suite almost certainly does not cover it, because your test environment never moves between the observation and the action.

The uncomfortable measurement: one recent study of desktop agents on real Ubuntu workloads found a mean gap of 6.51 seconds between when an agent observes the screen and when it acts on that observation. Six and a half seconds is an eternity for a UI. Notifications fire, lazy lists finish loading, animations settle, focus shifts. The agent's mental model of the screen has a shelf life, and almost no agent framework treats it that way.
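
Treating the observation as perishable can start with a check right before the click. The sketch below is one possible shape, with assumed names throughout: `Observation`, `safe_to_click`, the one-second `max_age_s` budget, and the idea of comparing a screen hash are all illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    captured_at: float  # seconds on a monotonic clock
    screen_hash: str    # hypothetical fingerprint of the region being clicked


def safe_to_click(obs: Observation, now: float, current_hash: str, max_age_s: float = 1.0) -> bool:
    """Allow the action only if the observation is recent and the screen is unchanged."""
    fresh = (now - obs.captured_at) <= max_age_s
    unchanged = obs.screen_hash == current_hash
    return fresh and unchanged


obs = Observation(captured_at=100.0, screen_hash="confirm-dialog")

# 6.51 seconds later, the mean gap cited above: too old under a 1 s budget.
stale = safe_to_click(obs, now=106.51, current_hash="confirm-dialog")
# Recent, but a modal replaced the target.
moved = safe_to_click(obs, now=100.2, current_hash="delete-modal")
# Recent and unchanged.
ok = safe_to_click(obs, now=100.2, current_hash="confirm-dialog")
```

When the check fails, the cheap response is to re-observe and re-plan rather than dispatch the click.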

MCP Server Sprawl: The Unbounded Tool Surface Nobody Owns

· 9 min read
Tian Pan
Software Engineer

The Model Context Protocol did exactly what it set out to do: it made giving an agent a new capability almost free. Wiring in a calendar server, a database server, an internal company server, or one of the 30,000-tool catalogs that vendors now publish is a config change, not a project. That frictionlessness is the feature. It is also the problem.

Because adding a tool is cheap, every team adds tools. The data team wires in a warehouse server. The support team adds a ticketing server. Someone connects a filesystem server for a one-off task and never removes it. None of these decisions is wrong. But there is no decision that owns their sum — the aggregate tool surface your agent now carries on every single request. The tool list has become a dependency graph with a real carrying cost, and in most organizations it is the one dependency graph nobody is responsible for.

The result is sprawl: a tool catalog that grows monotonically, gets reviewed by no one, costs more every quarter, and quietly makes the agent worse. This is the unowned surface, and it deserves the same scrutiny you already give your API surface and your npm tree.

Your Retry Logic Is Teaching the Agent the Wrong Lesson

· 10 min read
Tian Pan
Software Engineer

A tool call fails. Your agent framework retries it three times with exponential backoff. The third attempt goes through. The trace shows a green checkmark. Nobody gets paged, no error counter increments, the user gets their answer. By every dashboard you have, the system worked.

It didn't. The tool failed because the agent passed a malformed argument, and the only reason the third try succeeded is that the agent — sampling differently each time — happened to phrase the call correctly on attempt three. You didn't recover from a transient fault. You ran a slot machine until it paid out, then logged the payout and threw away the two pulls that told you the agent was broken.

This is the quiet way retry logic rots an agent system. Retries were designed for a world where the caller is correct and the network is flaky. Agents invert that assumption: the network is mostly fine, and the caller is the unreliable part. When you point a retry policy built for the first world at the second one, it stops being a recovery mechanism and becomes a way to launder bugs into green checkmarks.
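
A retry policy that respects the inversion separates the two kinds of failure. In this sketch the exception classes and `call_with_retry` are hypothetical, and backoff sleeps are omitted for brevity; transient faults are retried and recorded, while caller errors surface on the first attempt.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """The network or the tool was flaky; retrying is reasonable."""


class CallerError(Exception):
    """The agent sent a bad call; retrying only hides the bug."""


def call_with_retry(make_call: Callable[[], T], max_attempts: int = 3) -> Tuple[T, List[str]]:
    """Retry transient faults only, and return the failed attempts alongside the result."""
    failures: List[str] = []
    for attempt in range(1, max_attempts + 1):
        try:
            return make_call(), failures
        except TransientError as exc:
            failures.append(f"attempt {attempt}: {exc}")
        # CallerError is deliberately not caught: it propagates immediately.
    raise TransientError(f"gave up after {max_attempts} attempts: {failures}")


calls = {"flaky": 0, "malformed": 0}


def flaky_tool() -> str:
    calls["flaky"] += 1
    if calls["flaky"] < 3:
        raise TransientError("connection reset")
    return "ok"


def malformed_call() -> str:
    calls["malformed"] += 1
    raise CallerError("argument 'date' is not ISO 8601")


result, discarded = call_with_retry(flaky_tool)

try:
    call_with_retry(malformed_call)
    caller_error_surfaced = False
except CallerError:
    caller_error_surfaced = True
```

Returning the failed attempts with the result keeps the two "pulls" in the trace instead of throwing them away behind a green checkmark.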

The Agent That Memorized Your Bug: Why a Fix Is a Memory-Invalidation Event

· 9 min read
Tian Pan
Software Engineer

A few months ago, one of your downstream APIs returned a malformed timestamp — seconds where it should have been milliseconds, or a null where the schema promised a string. Your agent hit it, reasoned through the breakage, and worked out a fix: multiply by 1000, or fall back to a default, or retry with a different endpoint. It solved the immediate problem. Then it did something quietly consequential. It wrote the workaround down.

Maybe it saved a note to long-term memory: "The billing API returns timestamps in seconds; convert before use." Maybe the interaction got swept into a fine-tuning dataset, and the workaround became a learned reflex. Either way, the agent now carries a belief about the world. And last week, the API team shipped a fix. The timestamps are correct now. Nobody told the agent.
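
One way to let a fix reach the agent is to record what each belief depends on. The sketch below is an assumption about how that could look, not a description of any particular memory system: `MemoryNote`, the dependency ids, and the version labels are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class MemoryNote:
    text: str
    depends_on: str        # hypothetical id of the system the note describes
    observed_version: str  # version of that system when the note was written


def valid_notes(notes: List[MemoryNote], current_versions: Dict[str, str]) -> List[MemoryNote]:
    """Keep only notes whose dependency is still at the version they were learned against."""
    return [n for n in notes if current_versions.get(n.depends_on) == n.observed_version]


notes = [
    MemoryNote(
        "The billing API returns timestamps in seconds; convert before use.",
        depends_on="billing-api",
        observed_version="v1",
    ),
    MemoryNote(
        "Ticket ids are prefixed with the region code.",  # illustrative second note
        depends_on="ticketing-api",
        observed_version="v3",
    ),
]

# The API team ships the timestamp fix: billing-api moves to v2.
current = {"billing-api": "v2", "ticketing-api": "v3"}
still_valid = valid_notes(notes, current)
```

This covers notes in long-term memory only. A workaround that was baked in through fine-tuning has no record to filter, which is part of why it is the harder case.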

The Agent That Narrated a Number It Should Have Computed

· 10 min read
Tian Pan
Software Engineer

Ask your agent for last quarter's churn rate and it answers 4.2% in one clean sentence. The number is plausible. The prose around it is confident. The dashboard, when someone finally checks, says 6.8%. The agent never queried anything — it produced a churn-shaped token sequence because, to a language model, narrating a number and computing one look identical on the way out.

This is the quiet failure mode that survives every demo. A hallucinated tool name throws an error you can catch. A malformed argument fails a schema check. But a fabricated figure, delivered in fluent English, passes through your entire pipeline looking exactly like a real one. There is no exception, no log line, no red text. The only signal that something went wrong is a human who happens to know the right answer — and the whole point of the agent was that no human had to.
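
A crude first line of defense is to check that every figure in the answer also appears in something a tool returned during the run. The sketch below is deliberately naive and rests on assumptions: `unsupported_figures` is a hypothetical helper, and exact string matching on numbers will miss reformatted or derived values.

```python
import re
from typing import List

FIGURE = re.compile(r"\d+(?:\.\d+)?%?")


def unsupported_figures(answer: str, tool_outputs: List[str]) -> List[str]:
    """Return figures in the answer that no tool output in this run contains."""
    evidence = set()
    for output in tool_outputs:
        evidence.update(FIGURE.findall(output))
    return [fig for fig in FIGURE.findall(answer) if fig not in evidence]


answer = "Last quarter's churn rate was 4.2%."

# The agent never queried anything, so there is no evidence at all.
narrated = unsupported_figures(answer, tool_outputs=[])

# Same check when a query actually ran and the answer repeats its result.
computed = unsupported_figures(
    "Last quarter's churn rate was 6.8%.",
    tool_outputs=["churn_rate: 6.8%"],
)
```

A non-empty result does not prove fabrication, but it turns a silent failure into a log line someone can look at.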

Why Your Agent Needs a Read Replica: Read/Write Splitting for Agent Memory

· 10 min read
Tian Pan
Software Engineer

Most agent memory is one undifferentiated store. The loop reads from it to assemble context at the start of every step, and writes to it after every action — new observations, running summaries, scratchpad edits. Same store, same access path, no separation. It works fine in a demo and starts to rot the moment the agent runs long enough for the store to get large.

The reason it rots is familiar to anyone who has scaled a database. A single store that serves both reads and writes is a single-primary database with no replica, and it inherits every problem that topology has under load: writes contend with reads, a half-written record gets read mid-update, and there is no isolation between the volatile working set and the durable record. We solved this for databases decades ago by splitting reads from writes. Agent memory deserves the same treatment.

The fix is not a bigger vector index or a smarter embedding model. It is an architectural one — recognizing that "memory" is two different workloads wearing the same name, and giving each the storage discipline it actually needs.
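
As a rough illustration of the split, the sketch below sends writes to a pending log and serves reads from an immutable snapshot that is published at step boundaries. It is one possible reading of the idea under assumed names (`SplitMemory`, `publish`), with plain strings standing in for memory records.

```python
from typing import List, Tuple


class SplitMemory:
    """Toy memory with separate write and read paths."""

    def __init__(self) -> None:
        self._pending: List[str] = []        # volatile working set (write path)
        self._snapshot: Tuple[str, ...] = ()  # durable, immutable view (read path)

    def write(self, record: str) -> None:
        """Writes never touch the view that readers see."""
        self._pending.append(record)

    def read(self) -> Tuple[str, ...]:
        """Readers get the last published snapshot, never a half-finished step."""
        return self._snapshot

    def publish(self) -> None:
        """Promote pending records into a new snapshot at a step boundary."""
        self._snapshot = self._snapshot + tuple(self._pending)
        self._pending.clear()


memory = SplitMemory()
memory.write("observation: ticket mentions duplicate charge")
memory.write("summary: customer billed twice on the same invoice")

mid_step_view = memory.read()   # context assembled while the step is still writing
memory.publish()
next_step_view = memory.read()  # context assembled at the start of the next step
```

Because the snapshot is a tuple, a reader holding `mid_step_view` keeps a consistent view even after `publish()` replaces it.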