161 posts tagged with "agents"

Your stop_reason Is Lying: Building the Real Stop Taxonomy Production Triage Needs

April 27, 2026 · 12 min read

Software Engineer

The on-call engineer pulls up a trace. The model returned, the span closed clean, the API call shows stop_reason: end_turn. By every signal the platform offers, this was a successful generation. Three minutes later a customer reports that the agent confidently wrote half a config file, declared the operation complete, and moved on. The trace had no warning sign because the warning sign isn't in the API contract — the provider's stop reason has four to seven buckets, and the question your incident demands an answer to lives in the gap between them.

Stop reasons are the field engineers reach for first during triage and the field that lies most cleanly when it does. The values are designed for a runtime that needs to decide what to do next: was this turn complete, did a tool get requested, did a budget get exceeded, did safety intervene. They are not designed for a human reconstructing why an answer went wrong, and the difference between those two purposes is where production teams burn entire afternoons.

Structured Concurrency for Parallel Tool Fanout: Who Owns Partial Failure?

April 27, 2026 · 11 min read

Tian Pan

Software Engineer

The moment your agent fans out five parallel tool calls — search across three indexes, query two databases, hit one external API — you have crossed an invisible line. You are no longer writing prompt-and-response code. You are writing a concurrent program. Most agent frameworks pretend you are not, and the bill arrives at 2 AM.

The pretense is comfortable. The planner emits a list of tool calls, the runtime fires them off, the runtime collects whatever comes back, the planner consumes the aggregate. From a thousand feet up it looks like a fan-out / fan-in pipeline, and most teams treat it that way until production teaches them otherwise. The problem is that twenty years of concurrent-programming research — partial-failure semantics, structured cancellation, backpressure, deterministic error attribution — already solved the failure modes you are about to rediscover. Your agent framework, by default, did not import any of it.

Token Amplification: The Prompt-Injection Attack That Burns Your Bill

April 27, 2026 · 10 min read

Tian Pan

Software Engineer

A user submits a $0.01 request. Your agent reads a webpage. Forty seconds later, the inference bill for that single turn is $42. The query was technically successful — the agent returned a reasonable answer. It just took three nested sub-agents, a 200K-token document fetch, and a recursive plan refinement loop to get there. None of that fanout was the user's idea. It was a sentence buried in the page the agent read.

This is token amplification: a prompt-injection class that does not exfiltrate data, does not call unauthorized tools, and does not leave a clean security signature. It just sets your bill on fire. The cloud bill is the payload, and the user's request is the carrier.

Your Provider's 99.9% SLA Is Measured at the Wrong Boundary for Your Agent

April 27, 2026 · 11 min read

Tian Pan

Software Engineer

A model provider publishes a 99.9% availability SLA. The procurement team frames it as "three nines, four hours of downtime per year, acceptable for a non-tier-zero workload." Six months later the agent feature ships and the on-call dashboard shows a user-perceived task-success rate around 98% — a number nobody wrote into a contract, nobody can find on the provider's status page, and nobody owns. The provider is meeting their SLA. The product is missing its SLO. Both are true at the same time, and the gap is not a bug — it is arithmetic.

The arithmetic is the part most teams skip. A provider's 99.9% is measured against a synchronous-request workload — one user, one prompt, one response, one billing event. An agent does not generate that workload. A single user-perceived task fans out into 8 to 20 inference calls, retries on transient errors, hedges on slow ones, and aggregates partial outputs. Each of those calls is an independent draw against the provider's failure distribution, and the task fails if any essential call fails. The boundary the SLA covers and the boundary the user feels are not the same boundary.

Your Agent's Outbox Is Your Next Deliverability Incident

April 26, 2026 · 11 min read

Tian Pan

Software Engineer

The first time it happens, the on-call engineer is staring at a Gmail Postmaster dashboard that has gone solid red, the support inbox is on fire because customer password resets are landing in spam, and the agent that did this is still running. It sent eighty thousand "personalized follow-ups" between 4 a.m. and 9 a.m. local time, all from the company's primary sending domain, all signed with the same DKIM key the billing system uses. By the time anyone notices, the domain reputation that took three years to build is gone, and so are the next six weeks of inbox placement on every transactional message the company depends on.

Sending email from an agent looks like a one-line tool call. send_email(to, subject, body) is the canonical demo, and every framework ships it as a starter integration. But email is not like other tools. A bad database query rolls back. A bad API call returns an error. A bad batch of email lowers the deliverability of every other email your company sends, for weeks, and there is no transaction to roll back because the messages are already in flight to recipient mailservers that are now writing your domain's reputation history.

Your APIs Assumed One Human at a Time. Parallel Agents Broke the Contract.

April 26, 2026 · 12 min read

Tian Pan

Software Engineer

A backend engineer I know spent a Tuesday afternoon staring at a Datadog graph that had never spiked before: the per-user 429 counter on their internal calendar service. The customer complaining had not changed their behavior. They had simply turned on the assistant feature, which now spawned eight planning threads in parallel against the same calendar API every time the user said "find me time next week." The rate limiter — a perfectly reasonable 60 requests per minute per user, written years ago against a UI that physically could not click that fast — was firing within the first three seconds of every request and silently corrupting half the assistant's responses.

The rate limit was not the bug. The contract was the bug. That backend, like most internal services written before 2024, had a quietly enforced assumption baked into every layer: one user means one stream of activity, paced by a human's reaction time, with one cookie jar, one CSRF token, and one set of credentials that could be re-prompted if anything went wrong. Agents shred all five of those assumptions at once, and the failures show up as a constellation of unrelated incidents — 429 storms, last-write-wins corruption, audit logs you can't subpoena, re-auth loops that hang headless workers — that nobody connects until the pattern is named.

The shorthand I have been using with platform teams is this: every backend you own has an undocumented contract with its callers, and that contract was negotiated with humans. Agents are now showing up to renegotiate. You can either do the renegotiation deliberately, in code review, or you can do it during your next incident.

Persona Drift: When Your Agent Forgets Who It's Supposed to Be

April 26, 2026 · 11 min read

Tian Pan

Software Engineer

The system prompt says "you are a financial analyst — be conservative, never give specific buy/sell advice, always disclose uncertainty." For the first twenty turns, the agent behaves like a financial analyst. By turn fifty, it is recommending specific stocks, mirroring the user's casual tone, and hedging less than it did in turn three. Nobody changed the system prompt. Nobody injected anything malicious. The persona simply eroded under the weight of the conversation, the way a riverbank does when nothing crosses the threshold of "attack" but the water never stops moving.

This is persona drift, and it is the regression your eval suite is not catching. Capability evals measure whether the model can do the task. Identity evals — whether the model is still doing the task the way the system prompt said to do it — barely exist outside of research papers. The result is a class of production failures that look correct turn-by-turn and look wrong only when you read the transcript end to end.

Trace Sampling for Agents: Which of 10 Million Daily Spans Are Worth Keeping

April 24, 2026 · 11 min read

Tian Pan

Software Engineer

A web service request produces five spans on a busy day. A modern agent session produces fifty, sometimes a thousand if the planner decides to recurse. The uniform 1% sampler your platform team copy-pasted from the microservices era will, by definition, drop the rare failure you actually care about — because the failure is rare, and uniform sampling has no opinion about rarity.

The honest version of "we have full observability on our agents" sounds different than the marketing version. It sounds like: we keep the traces that matter, drop the ones that don't, and we know in advance which is which. Every word in that sentence is load-bearing, and the platform teams that ignored sampling design until the bill arrived are now learning the discipline backwards — under cost pressure, after a quarter of incidents that were "in the data" but evicted before anyone looked.

Cold-Start Evaluation: How to Ship an AI Feature With Zero Production Traces

April 23, 2026 · 10 min read

Tian Pan

Software Engineer

Every AI feature launch has the same quiet moment before the first user sees it: someone on the team asks "how do we know this is good?" and the honest answer is "we don't, yet." You have no traces because you have no users. You have no users because you haven't shipped. The loop is real, and the two failure modes it produces are both fatal — ship blind and let the first week of escalations be your eval dataset, or wait for "real data" and watch the roadmap slide for a quarter while a competitor publishes a demo.

The way out is not to pretend cold-start evaluation is the same problem as post-launch evaluation with a smaller sample size. It isn't. You are not sampling a distribution; you are constructing a prior. Every day-1 signal is an artifact of a choice you made about what to measure, whose behavior to simulate, and which failures to care about. Teams that ship AI features well treat the pre-launch eval stack as a first-class deliverable — not a spreadsheet hacked together the night before the gate review, but a layered system of dogfooding, simulation, expert annotation, and adversarial probes, each contributing a different kind of signal and each weighted with an explicit story about what it can and cannot tell you.

Conversation History Is a Liability Your Prompt Never Admits

April 23, 2026 · 10 min read

Tian Pan

Software Engineer

Read your product's analytics the next time a user says "the AI got dumber today." Filter to sessions over twenty turns. You will find the same U-shape every time: early turns score well, middle turns score well, late turns fall off a cliff. The prompt hasn't changed. The model hasn't changed. What changed is that every one of those late turns is carrying a payload of user typos, false starts, model hedges, corrections that were later reversed, tool outputs nobody re-read, and the fossilized remains of a goal that the user abandoned on turn four. Your prompt template treats this sediment as signal. The model does too. It shouldn't.

Chat history is not free context. It is a liability you are paying to re-send on every turn, and the dirtier it gets, the more it corrupts the answer you are billing the user for. The chat metaphor is the source of the confusion. Chat interfaces habituate users and engineers to treat the transcript as sacred — scrollable, append-only, never reset. That habit is imported wholesale into LLM applications even though it has no physical basis in how models process context. The model is stateless. The transcript is just a string you chose to grow. You can shrink it. You often should.

Durable Agents: Why Async Queues Break for Long-Running AI Workflows

April 23, 2026 · 11 min read

Tian Pan

Software Engineer

An agent that works 95% of the time per step is not a 95% reliable agent. Chain twenty steps together and the end-to-end completion rate drops to 36%. This is the arithmetic most teams discover only after their agent hits production, and it is the reason so many "working" prototypes stall the moment real traffic arrives. The fix is not better prompts or bigger models. It is a boring piece of distributed systems infrastructure most AI teams try to avoid until the third outage forces their hand.

The infrastructure is durable execution — the discipline of making a multi-step workflow survive crashes, restarts, and partial failures without losing its place. It is not a new idea. Temporal, Restate, DBOS, Inngest, and Azure Durable Task have been selling it for years. What is new in 2026 is that every serious agent framework has quietly admitted durable execution is table stakes: LangGraph now ships with a PostgresSaver checkpointer, the OpenAI Agents SDK exposes a resume primitive, Anthropic's Managed Agents runs on an internal durable substrate. If your agent architecture still rests on a Celery queue and optimism, you are solving in 2026 a problem the rest of the industry stopped pretending to ignore in 2024.

This post is about the architectural seam between a stateless LLM and the stateful workflow engine that has to wrap it. The seam is where reliability lives, and it is where most teams are currently writing bugs.

Human-in-the-Loop Is a Queue, and Queues Have Dynamics

April 23, 2026 · 11 min read

Tian Pan

Software Engineer

Teams add human approval to an AI workflow the same way they add if (isDangerous) requireHumanApproval() to a codebase: as a binary switch, checked once at design time, then forgotten. The metric on the architecture diagram is a green checkmark next to "human oversight." The metric that actually matters — how long the human took, whether they read anything, whether the item was still relevant by the time they clicked approve — rarely has a dashboard.

Treat the human approver as a binary switch and you have built a queue without knowing it. And queues have dynamics: backlog that grows faster than you staff, staleness that makes yesterday's decision meaningless, fatigue that turns review into rubber-stamping, and priority inversion that parks the one decision that mattered behind three hundred that didn't. None of this is visible in the architecture diagram. All of it shows up in the incident retro.

About Tian Pan