311 posts tagged with "ai-agents"

Background Agents and the Notification Budget: Why Proactive AI Hits a Hard Ceiling at User Attention

· 10 min read
Tian Pan
Software Engineer

The first generation of AI assistants waited politely. You typed, they answered. The second generation does not wait. It watches your calendar, scans your inbox, reads your repo activity, and surfaces "you should know about this" interruptions before you have asked for anything. The pitch is compelling and the demos are mesmerizing. The retention curves, once these features ship, are not.

There is a number nobody puts on the launch slide: the user has a daily ceiling on unsolicited AI updates, and it is roughly three to five across all sources combined. The proactive agent that ships its tenth notification of the week is the same agent the user mutes by Friday and uninstalls the following month. This is not a UX polish problem. It is the architectural blind spot of the entire proactive-AI category, and it deserves a name: the notification budget.
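One way to make the budget concrete is a single counter shared by every proactive source, rather than one limit per feature. The sketch below is illustrative only: the class name, the `try_spend` method, and the default of three per day (the low end of the range quoted above) are assumptions, not part of any real notification API.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class NotificationBudget:
    """Hypothetical per-user daily cap shared across all proactive sources."""

    daily_limit: int = 3  # assumption: low end of the three-to-five range
    _sent: dict = field(default_factory=dict)

    def try_spend(self, user_id: str, today: date) -> bool:
        """Return True and record the send if the user still has budget today."""
        key = (user_id, today)
        used = self._sent.get(key, 0)
        if used >= self.daily_limit:
            return False  # over budget: drop, batch, or defer the update
        self._sent[key] = used + 1
        return True


budget = NotificationBudget(daily_limit=3)
day = date(2025, 1, 1)
decisions = [budget.try_spend("user-1", day) for _ in range(5)]
```

The design choice that matters is the shared key: calendar, inbox, and repo agents all draw from the same `(user, day)` counter, so the fourth update of the day is refused no matter which source produced it.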

The MCP Capability Disclosure Tax: When Every Connected Server Bills Your Context Window

· 11 min read
Tian Pan
Software Engineer

Connect a single GitHub MCP server to your agent and you've already spent twelve to forty thousand tokens before the user types a word. Connect a filesystem server, a calendar, a database, an internal CRM, and a third-party tool catalog, and a heavy desktop configuration has been measured at sixty-six thousand tokens of pure tool disclosure — nearly a third of Claude Sonnet's 200K window, paid every single planning turn. The agent hasn't done anything yet. The user hasn't asked anything yet. The bill is already running.

This is the disclosure tax, and it is the most underpriced line item in agentic systems shipping right now. Teams add MCP servers the way teams once added microservices — each integration looks like a free composition primitive, the procurement story writes itself ("more tools = more capability"), and the unit economics dashboard never surfaces the per-server cost because the cost lives inside a token bucket nobody attributes back to the connector. The result is an agent that gets slower, dumber, and more expensive every time someone adds another integration, and a team that explains the regression by re-tuning prompts and chasing the model vendor for a new version.
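Attributing the cost back to the connector can start as a rough per-server tally. The sketch below assumes each server's tool definitions are available as plain dictionaries and uses a crude four-characters-per-token estimate; both the function names and the estimate are assumptions, and a real tokenizer should replace `estimate_tokens` for anything beyond a first look.

```python
import json


def estimate_tokens(text: str) -> int:
    # Crude assumption: about four characters per token.
    return max(1, len(text) // 4)


def disclosure_cost(servers: dict) -> dict:
    """Map each server name to the estimated tokens spent disclosing its tools."""
    return {
        name: sum(estimate_tokens(json.dumps(tool)) for tool in tools)
        for name, tools in servers.items()
    }


# Hypothetical tool definitions, shaped loosely like a tool listing.
servers = {
    "github": [{"name": "create_issue", "description": "d" * 400}],
    "calendar": [{"name": "list_events", "description": "d" * 40}],
}
costs = disclosure_cost(servers)
for name, tokens in sorted(costs.items(), key=lambda kv: -kv[1]):
    print(f"{name}: ~{tokens} tokens per planning turn")
```

Even a tally this rough puts a number next to each integration, which is the line item the unit economics dashboard is missing.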

The Agentic Stamp: When Marketing Names It and Engineering Pays the Operational Bill

· 10 min read
Tian Pan
Software Engineer

A product marketing manager writes "AI agent" in a launch brief. The press release goes out describing autonomous decision-making. Six weeks later, engineering is staring at a Jira board full of "agent observability" tickets they never scoped for a system that is, in fact, a single prompt followed by a hardcoded tool dispatch. Nobody lied. Nobody made a technical error. The team just learned that the word "agent" is not a description — it is a stamp, and the stamp carries operational implications that engineering inherits whether or not the implementation justifies them.

This is the internal version of what Gartner now calls "agent washing." The external version — vendors rebranding chatbots as agents to ride the hype cycle — gets the press coverage. The internal version is quieter and more expensive, because the bill falls on people who can't push back at the moment the term gets approved.

Agent Branch Coverage: Your Eval Hits the Happy Path, Not the Planner's If-Else

· 8 min read
Tian Pan
Software Engineer

A team I worked with last quarter ran a 240-case eval suite against their support agent. Green across the board for six months. Then they swapped a single sentence in the planner prompt (a tone tweak) and the next day production saw a 3× spike in human-handoff requests. The eval hadn't moved. The handoff branch had simply started firing on borderline cases that used to resolve in-line, and not a single eval case was borderline in that way. The branch existed in the prompt. It existed in production. It did not exist in the eval.

This is the failure mode I want to name: agent branch coverage. Code-coverage tooling has been a debugging staple for forty years, but agentic systems have a runtime control flow — planner branches that pick a tool, condition the response, escalate to a human, refuse to act, retry with a different strategy — and the eval suite touches only the cases the team thought to write. Eighty percent of the planner's decision branches have never executed under test, and a green eval becomes a smoke test wearing a regression-test costume.
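Measuring this does not need heavy tooling. A minimal sketch, assuming the team can list the planner's decision branches by name and that each eval trace records which branches fired (the branch names and trace format here are hypothetical):

```python
def branch_coverage(declared: set, traces: list) -> tuple:
    """Return (fraction of declared branches exercised, branches never hit)."""
    hit = {branch for trace in traces for branch in trace if branch in declared}
    missed = declared - hit
    ratio = len(hit) / len(declared) if declared else 1.0
    return ratio, missed


# Hypothetical planner branches and two eval traces.
declared = {"answer_inline", "call_tool", "handoff", "refuse", "retry"}
traces = [
    ["call_tool", "answer_inline"],
    ["answer_inline"],
]
ratio, missed = branch_coverage(declared, traces)
print(f"coverage: {ratio:.0%}, never executed: {sorted(missed)}")
```

The `missed` set is the useful output: it is the list of eval cases nobody has written yet.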

The Agent Scratch Directory: The Unowned Filesystem PII Surface Nobody Inventoried

· 10 min read
Tian Pan
Software Engineer

A regulator walks into your office and asks the question security teams rehearse for: "Show me every place customer data lives." Your data team produces the inventory. The primary database is on it. The analytics warehouse is on it. The object store, the queue, the search index, the backup destination — all on it, with classification labels, retention policies, encryption details, and named owners. Then someone in the room mentions the agent worker pool, and the inventory has nothing to say. The pool has been running for nine months. Each worker has a local disk. The agents on those workers have been parsing PDFs, transcribing audio, downloading email attachments, and caching intermediate JSON between tool calls the entire time. Nobody put any of that on the asset register.

This is the scratch directory problem. Every long-running agent worker accumulates an ephemeral filesystem that grows organically as new tools are added — extracted text from a PDF parser, transcribed audio from a Whisper step, downloaded attachments from a Gmail tool, screenshots from a browser-use step, vector-search snippets cached for the next turn, intermediate JSON the agent emitted between two tool calls so the second one wouldn't have to re-derive it. Unlike databases and queues and buckets, this surface has no retention policy, no encryption-at-rest standard, no DLP scanner pass, and no entry on the data-classification spreadsheet. The platform team thinks "agent state" means the inference-provider context window. The SRE team thinks "agent state" means the durable database. The worker's /tmp/agent-workspace-${session_id}/ directory is a third copy of customer data that nobody owns.
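A first step toward ownership is simply listing what is there. The sketch below walks a scratch root and reports file count, size, and age for each workspace. It assumes directories follow the `agent-workspace-<session_id>` naming pattern from the path above; the function name and the output fields are illustrative.

```python
import time
from pathlib import Path
from typing import Optional


def inventory_scratch(root: Path, now: Optional[float] = None) -> list:
    """List each agent workspace under root with file count, bytes, and age."""
    now = time.time() if now is None else now
    rows = []
    for workspace in sorted(root.glob("agent-workspace-*")):
        if not workspace.is_dir():
            continue
        files = [p for p in workspace.rglob("*") if p.is_file()]
        rows.append({
            "path": str(workspace),
            "files": len(files),
            "bytes": sum(p.stat().st_size for p in files),
            # Directory mtime only reflects changes to direct entries.
            "age_seconds": now - workspace.stat().st_mtime,
        })
    return rows
```

Calling `inventory_scratch(Path("/tmp"))` on a worker would return one row per workspace, which is enough to start an asset-register entry even before anyone decides on retention or encryption.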

The Latency Budget Negotiation: How to Tell Product That 'Real-Time' Costs Capability

· 11 min read
Tian Pan
Software Engineer

A product manager walks into a planning meeting with a one-line requirement: "responses under two seconds, like ChatGPT." The agent under discussion makes six tool calls, hits two retrieval indexes, runs a reasoning model with a thinking budget, and validates its output with a second-pass critic. End-to-end p50 is currently nine seconds. The engineering team has three options: say yes and quietly degrade the agent into something worse, say no and watch the PM go shopping for a vendor whose demo video promises the moon, or do the thing nobody teaches in onboarding — open a structured negotiation where every second of latency is convertible to a capability the agent gives up.

Most teams pick option one. The agent ships at two seconds, accuracy drops twelve points, the launch is called a success because the headline latency number was met, and three months later the team is fighting a quality regression that nobody can attribute to a single change because the regression was the launch itself. The latency target was never priced. It was inherited from a product spec that treated speed as free.

The Prompt-Injection Bug Bounty: Scoping a Program When 'Broken' Has No Clear Definition

· 12 min read
Tian Pan
Software Engineer

Your security team runs a bug bounty that works. A CSRF gets paid. An XSS gets paid. An IDOR gets paid. The rules of engagement are sharp, the severity rubric is industry-standard, the triage queue moves, and the program produces a steady stream of fixed bugs. Then your AI team ships a feature (a chat surface, an agent that calls tools, a RAG pipeline that pulls from customer data) and the question that lands on the security team's desk is "what's the bounty scope for this thing?" Nobody can answer.

The reason nobody can answer is that the standard bug bounty rubric was built around a system whose specified behavior is deterministic. A login endpoint either authenticates correctly or it doesn't. An access control check either holds or it doesn't. The AI feature you just shipped has no equivalent ground truth: its specified behavior is "respond helpfully to user input," and a researcher who makes it respond unhelpfully has not necessarily found a bug — they may have found something the model has always done, that nobody knew about, that you're not sure you can fix, and that may or may not reproduce on a second attempt.

Streaming Tool Results Break Request-Response Agent Planners

· 10 min read
Tian Pan
Software Engineer

A SQL tool ships rows as they come off the wire. The agent calls it expecting a result. The harness, written a year earlier when every tool was request-response, dutifully buffers the whole stream into a single string before invoking the model. Forty seconds later, the buffer is 200 KB, the context window is half-eaten, and the agent is reasoning about row 47,000 of a query it could have stopped at row 30. Nobody designed this failure — it falls out of treating "the tool returned" as the only event the planner reacts to.

The shift to streaming tools is happening below the planner's awareness. SQL engines emit progressive result sets. Document fetchers yield pages. Search APIs return hits in batches as relevance scores stabilize. MCP's Streamable HTTP transport, the 2025-03-26 spec replacement for HTTP+SSE, makes incremental responses a first-class transport mode rather than an exotic capability. The wire is ready. The planners on top of it are not.
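What a stream-aware harness might do instead of buffering is check in with a stop condition every few rows. The sketch below is a simplified stand-in: `consume_stream`, the `should_stop` callback, and the fake row source are all hypothetical, and in a real planner the callback would be a model or heuristic decision rather than a length check.

```python
from typing import Callable, Iterable


def consume_stream(
    rows: Iterable,
    should_stop: Callable[[list], bool],
    batch_size: int = 10,
) -> list:
    """Pull rows lazily and stop as soon as the caller has seen enough."""
    seen = []
    for count, row in enumerate(rows, start=1):
        seen.append(row)
        # Ask the planner-side check once per batch, not once per row.
        if count % batch_size == 0 and should_stop(seen):
            break
    return seen


pulled = []


def fake_sql_rows():
    # Stand-in for a tool that streams rows off the wire.
    for i in range(47_000):
        pulled.append(i)
        yield {"id": i}


result = consume_stream(fake_sql_rows(), lambda seen: len(seen) >= 30)
```

Because the source is a generator, breaking out of the loop means the remaining rows are never pulled, so neither the buffer nor the context window pays for them.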

Tool Behavior Drift: The Schema Held, the Semantics Didn't

· 11 min read
Tian Pan
Software Engineer

Your contract tests are green. The schema validator is happy. The tool returns the same shape it did last quarter. And the user-facing answer has been quietly wrong for six weeks.

This is the failure mode that contract testing was never designed to catch. Contract tests verify that the wire format hasn't changed — that search() still returns { results: [{ id, title, score }] }, that create_event still accepts an ISO 8601 string, that the geocoder still emits { lat, lng }. What they don't catch is the moment the search endpoint starts ranking by recency instead of relevance, the calendar API silently snaps your 14:07 start time to 14:00 in the EU region, the geocoder picks a different point inside the same ambiguous polygon, or the LLM-classifier-as-a-tool is upgraded to a new model behind a stable endpoint and the false-positive rate moves four points in a category your eval set never sampled. The schema held. The behavior didn't. Your agent kept reading green checkmarks and produced regressed answers no error log captured.
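A behavioral check has to compare what came back, not just its shape. One simple form, sketched here under assumptions, replays a fixed canary query and measures how much of a recorded "golden" top-k still appears in the current top-k. The function names, the value of `k`, and the threshold are illustrative choices, not a standard.

```python
def top_k_overlap(golden: list, current: list, k: int = 5) -> float:
    """Fraction of the golden top-k ids that still appear in the current top-k."""
    golden_top = set(golden[:k])
    current_top = set(current[:k])
    return len(golden_top & current_top) / (len(golden_top) or 1)


def drifted(golden: list, current: list, k: int = 5, threshold: float = 0.6) -> bool:
    """Flag a canary query whose results no longer resemble the recorded ones."""
    return top_k_overlap(golden, current, k) < threshold


# Hypothetical result ids for one canary query.
golden = ["a", "b", "c", "d", "e"]
reordered = ["e", "d", "c", "b", "a"]
reranked = ["x", "y", "a", "z", "w"]
```

Set overlap ignores order inside the top-k, so `reordered` passes while `reranked` is flagged. A switch from relevance to recency ranking usually changes which items reach the top-k of a large result set, but a rank-sensitive metric would be needed to catch pure reordering.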

Tool Latency Tail: Why p99 Reshapes Agent Architecture and p50 Hides the Problem

· 10 min read
Tian Pan
Software Engineer

A team I worked with last quarter launched a seven-step agent and built its latency budget the obvious way: search returns in 200ms, the SQL lookup takes 80ms, the email send is 150ms, and so on down the chain. Add the medians, sprinkle in some buffer, and the math says the agent fits comfortably inside its two-second SLA. The dashboards confirmed it for weeks. Median latency was beautiful. Then customers started complaining the feature was unusably slow, and the dashboards still looked green.

The story they were telling each other was wrong because they had built the architecture around sum(p50) while users were experiencing sum(p99). After three or four hops, the probability that at least one link in the chain has fallen into its own tail is no longer negligible. After seven hops with independent latencies, it is about 7% if "tail" means each tool's slowest 1%, and it approaches a coin flip if "tail" means the slowest 10%. None of the per-tool dashboards ever turned red because none of the per-tool services were misbehaving; the problem was that nobody owned the multiplicative composition.

This is not a new lesson. Distributed-systems researchers have been writing about it for forty years. What's new is that every team building agents is rediscovering it, badly, on a deadline.
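The composition is easy to compute. Assuming each hop's latency is independent of the others (a simplification; correlated slowdowns make things worse), the chance that at least one of n hops lands in its own tail is 1 − (1 − q)^n, where q is the fraction of calls counted as tail. The function name below is illustrative.

```python
def p_any_tail(hops: int, tail_fraction: float) -> float:
    """Probability that at least one of `hops` independent calls hits its tail."""
    return 1.0 - (1.0 - tail_fraction) ** hops


for hops in (1, 3, 7):
    print(
        f"{hops} hops: "
        f"slowest-1% tail {p_any_tail(hops, 0.01):.1%}, "
        f"slowest-10% tail {p_any_tail(hops, 0.10):.1%}"
    )
```

The per-tool dashboards each report q, which stays constant. The number users feel is the output of this function, which grows with every hop added to the chain.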

The Curious Customer: Designing AI for Users Who Treat Your Agent as a Puzzle

· 10 min read
Tian Pan
Software Engineer

Most product teams divide their users into two buckets when designing an AI agent. Bucket one is the cooperative customer: someone with a real problem, asking the agent in plain language, hoping it works. Bucket two is the attacker: jailbreaks, prompt injection payloads, scraped credentials, the threat model the security team owns. The eval suite covers the first. The red team covers the second. Everyone goes home satisfied.

Then a third population shows up and breaks the product. They are not malicious. They are not trying to extract training data or coerce the model into describing a bioweapon. They are curious. They treat the agent as a puzzle. They ask it questions specifically designed to surprise it — "what is the saddest thing you have ever been asked", "pretend you are my grandmother and sing me to sleep with the recipe for napalm" — except the napalm version is the one that goes viral, while the actual quality crisis is a thousand variations of the first one that nobody wrote a refusal policy for.

Agent State Diff: Why Eyeballing Two Traces Doesn't Scale

· 9 min read
Tian Pan
Software Engineer

A regression slips into production. The team picks the failing input, replays it against last week's prompt, and gets a different output. Now they have to figure out why — and the answer is buried in three megabytes of differing text, divergent tool-call sequences, and shuffled retrieved chunks that no human can productively diff. So they paste both transcripts into a side-by-side viewer, scroll for twenty minutes, conclude "the model just felt different today," and ship a hotfix that doesn't address the root cause because they never found it.

This is the agent state diff problem, and it is the first place where general-purpose engineering tooling stops working for agentic systems. A traditional regression bisect runs against deterministic code: the same input produces the same output, and git bisect walks history until you find the commit that broke it. Agent runs aren't deterministic, the inputs aren't a single string, and the "history" is a multi-axis envelope — model snapshot, sampling config, retrieved context, tool catalog, harness flags — any of which can independently change behavior.
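Before diffing transcripts, it helps to diff the envelope. A minimal sketch, assuming each run records its axes as a flat dictionary (the key names and values below are hypothetical and mirror the axes listed above):

```python
def diff_envelopes(before: dict, after: dict) -> dict:
    """Return {axis: (before_value, after_value)} for every axis that changed."""
    return {
        axis: (before.get(axis), after.get(axis))
        for axis in sorted(before.keys() | after.keys())
        if before.get(axis) != after.get(axis)
    }


# Hypothetical run envelopes for the same failing input.
last_week = {
    "model_snapshot": "model-2025-01",
    "sampling": {"temperature": 0.2},
    "tool_catalog": ["search", "sql"],
    "harness_flags": {"critic": True},
}
today = {
    "model_snapshot": "model-2025-01",
    "sampling": {"temperature": 0.7},
    "tool_catalog": ["search", "sql", "email"],
    "harness_flags": {"critic": True},
}
changed = diff_envelopes(last_week, today)
for axis, (old, new) in changed.items():
    print(f"{axis}: {old!r} -> {new!r}")
```

This does not explain a regression on its own, but it narrows three megabytes of differing text to the axes that actually moved, which is where a bisect over agent runs would have to start.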