
130 posts tagged with "agents"


Clarification Budgets: When Your Agent Should Ask Instead of Guess

· 10 min read
Tian Pan
Software Engineer

The two worst agent failure modes feel like opposites, but they originate from the same broken policy. The first agent asks four follow-up questions before doing anything and trains its users to abandon it. The second agent never asks, confidently produces output the user has to redo, and trains its users to mistrust it. Same policy, different settings of one missing parameter: the cost of a question relative to the cost of a wrong answer.

Most agents do not have a policy at all. The model is asked to "be helpful" and is left to negotiate ambiguity on its own. Because next-token prediction rewards committing to an answer, the agent leans toward guessing. Because RLHF rewards politeness, the agent occasionally over-corrects and asks a question for safety. The result is unprincipled behavior that varies from session to session, with no team-level intuition about when the agent will pause and when it will charge ahead.

A clarification budget is the missing parameter. It is a per-task allowance for how much friction the agent is permitted to impose, paired with a decision rule for when a question is worth spending that budget on. Think of it as the conversational analog of a latency budget — every product has one, even if no one wrote it down, and the team that writes it down stops shipping confused agents.
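
A rough sketch of the decision rule, with made-up names and numbers rather than anything from the full post: ask only when the expected cost of guessing wrong exceeds the friction of one question, and only while budget remains.

```python
from dataclasses import dataclass

@dataclass
class ClarificationBudget:
    """Per-task allowance for user-facing questions (hypothetical shape)."""
    questions_remaining: int      # how many questions the agent may still ask
    cost_of_question: float       # friction of one question, in arbitrary cost units
    cost_of_wrong_answer: float   # cost of shipping a wrong result, same units

def should_ask(budget: ClarificationBudget, p_wrong_if_guess: float) -> bool:
    """Ask only when the expected cost of guessing exceeds the cost of asking
    and there is budget left to spend."""
    if budget.questions_remaining <= 0:
        return False
    expected_cost_of_guess = p_wrong_if_guess * budget.cost_of_wrong_answer
    return expected_cost_of_guess > budget.cost_of_question

# Illustration: a wrong guess forces a redo (cost 10) versus a quick follow-up
# question (cost 1). A 30% chance of guessing wrong makes the question worth
# asking; a 5% chance does not.
budget = ClarificationBudget(questions_remaining=2, cost_of_question=1.0,
                             cost_of_wrong_answer=10.0)
print(should_ask(budget, p_wrong_if_guess=0.30))  # True
print(should_ask(budget, p_wrong_if_guess=0.05))  # False
```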

Latency Budgets for Multi-Step Agents: Why P50 Lies and P99 Is What Users Feel

· 10 min read
Tian Pan
Software Engineer

The dashboard said the agent was fast. P50 sat at 1.2 seconds, the team had a meeting to celebrate, and then the abandonment rate kept climbing. Nobody was looking at the graph the user actually lives on.

This is the reliable failure mode of multi-step agents in production: the median is the metric you can hit, the tail is the metric your users feel, and the gap between the two grows non-linearly with every sub-call you bolt onto the pipeline. A four-step agent where each step is "fast at the median" routinely produces a P99 that is six or eight times worse than any single step. Users do not experience the median. They experience the worst step in their particular trip.

If your team optimizes the wrong percentile, you will ship a system that benchmarks well, demos beautifully, and bleeds users in the long tail you never instrumented.
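
The compounding is easy to see in a quick simulation. The distribution below is invented for illustration, not measured: each of four steps is fast at the median but hits a slow path a small percentage of the time.

```python
import random

random.seed(0)

def step_latency() -> float:
    """One sub-call: fast at the median, occasionally very slow (illustrative numbers)."""
    base = random.lognormvariate(-1.2, 0.4)   # median around 0.3s
    if random.random() < 0.02:                # 2% of calls hit a retry or cold path
        base += random.uniform(3.0, 8.0)
    return base

def percentile(values, p):
    values = sorted(values)
    return values[int(p / 100 * (len(values) - 1))]

runs = [[step_latency() for _ in range(4)] for _ in range(50_000)]
per_step = [latency for run in runs for latency in run]
end_to_end = [sum(run) for run in runs]

print(f"per-step   P50={percentile(per_step, 50):.2f}s  P99={percentile(per_step, 99):.2f}s")
print(f"end-to-end P50={percentile(end_to_end, 50):.2f}s  P99={percentile(end_to_end, 99):.2f}s")
# The end-to-end P99 is driven by the chance that *any* of the four steps hits
# its slow path, so the tail degrades much faster than the median does.
```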

Tool Call Ordering Is a Partial Order, Not a Set

· 10 min read
Tian Pan
Software Engineer

A "create then notify" sequence works in dev. A "notify then create" sequence emits a webhook for an entity that doesn't exist yet, the consumer 404s, and your team spends a week debugging what looks like a flaky integration test. The flake isn't flaky. It's deterministic given a hidden ordering invariant your tool set has and your planner doesn't know about.

This is the shape of most tool-call-ordering bugs in production agents: a tool set that secretly composes as a partial order — some operations must happen before others, while the rest can run in any order — being treated by the planner as an unordered set of capabilities. The model picks an order that worked yesterday. A prompt edit, a model upgrade, or even a different temperature sample picks a different order tomorrow. Both look reasonable to anyone reading the trace. Only one is correct.

The team that doesn't declare the order is shipping a bug surface that the model's prompt sensitivity will eventually find.
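
One way to make the invariant explicit is to declare the "must happen before" edges as data and check the planner's proposed order against them before any tool runs. The tool names and edges below are hypothetical.

```python
# Declared ordering invariants: (earlier, later) pairs the tool set requires.
PRECEDES = {
    ("create_entity", "notify_webhook"),   # the webhook must reference an existing entity
    ("create_entity", "index_entity"),
    ("index_entity", "send_digest"),
}

def violations(plan: list[str]) -> list[tuple[str, str]]:
    """Return every declared 'before' edge the proposed tool order breaks."""
    position = {tool: i for i, tool in enumerate(plan)}
    broken = []
    for earlier, later in PRECEDES:
        if earlier in position and later in position and position[earlier] > position[later]:
            broken.append((earlier, later))
    return broken

# Validate the planner's proposed order before executing anything.
print(violations(["create_entity", "index_entity", "notify_webhook", "send_digest"]))  # []
print(violations(["notify_webhook", "create_entity"]))  # [('create_entity', 'notify_webhook')]
```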

Abstention as a Routing Decision: Why 'I Don't Know' Belongs in the Router, Not the Prompt

· 10 min read
Tian Pan
Software Engineer

Most teams handle abstention with a single sentence in the system prompt: "If you are not confident, say you don't know." The model occasionally honors it, frequently doesn't, and the failure mode is asymmetric. A confidently wrong answer ships at full velocity — it lands in the user's hands, gets quoted in a Slack thread, gets cited in a downstream summary. An honest abstention triggers a customer-success escalation because the user expected the agent to handle the request and now somebody has to explain why it didn't. Six months in, the team has learned which kind of failure costs less to ship, and the system prompt edit that nominally controls abstention has been quietly tuned for compliance, not for honesty.

The discipline that fixes this isn't a better wording. It's recognizing that abstention is a routing decision, not a prompt pattern. It deserves a first-class output channel, its own SLO, its own evaluation harness, and its own place in the system topology — somewhere outside the prompt, where it can be tested, owned, and scaled.
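
A minimal sketch of what a first-class output channel can mean in code, with invented signals and thresholds: the router, not a sentence in the prompt, decides whether the answer ships, escalates, or abstains.

```python
from dataclasses import dataclass
from enum import Enum

class Route(Enum):
    ANSWER = "answer"        # ship the model's output
    ABSTAIN = "abstain"      # first-class outcome: decline rather than guess
    ESCALATE = "escalate"    # hand off to a human queue

@dataclass
class RoutingDecision:
    route: Route
    reason: str

def route_response(confidence: float, retrieval_hits: int,
                   abstain_threshold: float = 0.55) -> RoutingDecision:
    """Abstention decided outside the prompt, where the threshold can be
    versioned, evaluated, and owned like any other SLO knob."""
    if retrieval_hits == 0:
        return RoutingDecision(Route.ABSTAIN, "no supporting documents retrieved")
    if confidence < abstain_threshold:
        return RoutingDecision(Route.ESCALATE, f"confidence {confidence:.2f} below threshold")
    return RoutingDecision(Route.ANSWER, "confident and grounded")

print(route_response(confidence=0.82, retrieval_hits=3).route)  # Route.ANSWER
print(route_response(confidence=0.40, retrieval_hits=2).route)  # Route.ESCALATE
print(route_response(confidence=0.90, retrieval_hits=0).route)  # Route.ABSTAIN
```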

The Agent Degraded-Mode Spec Is the Document You Didn't Write

· 11 min read
Tian Pan
Software Engineer

When the search index goes stale, the vendor API throttles you, the database read replica falls behind, or a downstream microservice starts returning 503s, your agent has to decide what to do. In most production agent systems, that decision was never made. It was inherited — silently — from whatever the engineer who wrote the tool wrapper happened to type at 4 PM on a Tuesday in week three of the project.

The result is what your customers eventually write for you: a Reddit thread, a support transcript, a quote in a press article. "The assistant told me my balance was $0 when my account was actually fine — turns out their lookup service was down." That paragraph is the degraded-mode spec your team didn't write. It is now public, it is now the customer's, and it is the version your engineering org will spend the next quarter responding to.
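
One possible shape for the spec, written down as data instead of inherited from a tool wrapper: for each dependency the agent reads from, declare what the agent is allowed to say when that dependency degrades. Dependency names and policies below are invented for illustration.

```python
# Hypothetical degraded-mode spec: per-dependency behavior, decided on purpose.
DEGRADED_MODE_SPEC = {
    "balance_lookup": {
        "on_timeout":    "refuse",      # never report a number we could not fetch
        "on_stale_data": "disclose",    # answer, but say how old the data is
        "max_staleness_s": 300,
    },
    "search_index": {
        "on_timeout":    "fallback",    # degrade to keyword search
        "on_stale_data": "disclose",
        "max_staleness_s": 86_400,
    },
}

def degraded_reply(dependency: str, failure: str) -> str:
    policy = DEGRADED_MODE_SPEC[dependency][failure]
    if policy == "refuse":
        return "I can't retrieve that right now, so I won't guess at it."
    if policy == "disclose":
        return "Here's the most recent data I have, but it may be out of date."
    return "Using a fallback source; results may be less complete."

print(degraded_reply("balance_lookup", "on_timeout"))
```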

Agent Disaster Recovery: When Working Memory Dies With the Region

· 12 min read
Tian Pan
Software Engineer

The DR runbook your team rehearses every quarter was written for a stack you no longer fully run. It says: promote the replica, repoint DNS, drain the queue. It assumes state lives in databases, queues, and object storage — places the SRE org has owned, named, and tested for a decade. Then last quarter you shipped an agent. Working memory now lives in the inference provider's session cache, scratchpad files on a worker's local disk, in-flight tool results that haven't been written back, and a partial plan-and-act trace that exists only in the prompt history of one model call. None of that is on the asset register. None of it is in the runbook.

When the region drops, the agent doesn't fail cleanly. It half-completes. The user sees a workflow that started but the failover region cannot resume, the customer's invoice gets sent twice or not at all because the idempotency key lived on the dead worker, and the on-call engineer reads a Slack thread that begins "the orchestrator is up, but..." and ends six hours later with a credit-card chargeback queue.

This is the gap nobody named: agentic features have a state model the existing DR plan doesn't describe. The team that hasn't written that state surface down is one regional outage away from learning what their runbook's silence costs.
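
A sketch of what writing that state surface down might look like: an inventory of where working memory actually lives and whether a failover region can recover it. The entries below are illustrative, not a complete list.

```python
# Hypothetical state-surface register for an agentic feature.
AGENT_STATE_SURFACE = [
    {"state": "plan-and-act trace",     "lives_in": "prompt history of in-flight call", "durable": False},
    {"state": "scratchpad files",       "lives_in": "worker local disk",                "durable": False},
    {"state": "in-flight tool results", "lives_in": "worker process memory",            "durable": False},
    {"state": "idempotency keys",       "lives_in": "regional cache",                   "durable": False},
    {"state": "completed step log",     "lives_in": "replicated database",              "durable": True},
]

def dr_gaps() -> list[str]:
    """Everything the failover region cannot resume from after the region drops."""
    return [entry["state"] for entry in AGENT_STATE_SURFACE if not entry["durable"]]

print(dr_gaps())
```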

Agent Incident Forensics: Capture Before You Need It

· 11 min read
Tian Pan
Software Engineer

The customer sends a screenshot to support on a Tuesday. Their account shows a refund posted six days ago that they never asked for. Your CRO forwards the screenshot with one question: "What produced this?" You know an agent did it — the audit log says actor: refund-agent-v3. But the prompt has been edited four times since. The model id rotated last Thursday when finance switched providers to chase a 12% cost cut. The system prompt is templated from three retrieved documents, and the retrieval index was reindexed Monday. The conversation history was trimmed by the runtime to fit a smaller context window.

You can tell the CRO the agent did it. You cannot tell them why. That gap — between knowing an action happened and being able to reconstruct the inputs that caused it — is the gap most agent teams discover the first time someone outside engineering asks a real forensic question.
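
One way to close that gap is to snapshot the inputs at the moment the agent acts, in an append-only record that later prompt edits, reindexes, and provider switches cannot touch. The fields below are a guess at a reasonable minimum, not a standard.

```python
import hashlib
import json
import time

def decision_record(action: str, model_id: str, system_prompt: str,
                    retrieved_doc_ids: list[str], conversation: str,
                    tool_args: dict) -> dict:
    """Capture, at decision time, the inputs that produced the action, so the
    question 'what produced this?' can still be answered after everything
    upstream has been edited, rotated, or reindexed."""
    return {
        "action": action,
        "timestamp": time.time(),
        "model_id": model_id,
        "system_prompt_sha256": hashlib.sha256(system_prompt.encode()).hexdigest(),
        "retrieved_doc_ids": retrieved_doc_ids,
        "conversation_sha256": hashlib.sha256(conversation.encode()).hexdigest(),
        "tool_args": tool_args,
    }

record = decision_record(
    action="issue_refund",
    model_id="provider-x/model-v3",
    system_prompt="...templated prompt exactly as sent...",
    retrieved_doc_ids=["policy-142", "faq-881", "account-note-9"],
    conversation="...trimmed history exactly as the model saw it...",
    tool_args={"account_id": "A-1029", "amount": 48.00},
)
print(json.dumps(record, indent=2))
```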

Agent Trace Sampling: When 'Log Everything' Costs $80K and Still Misses the Regression

· 10 min read
Tian Pan
Software Engineer

The bill arrived in March. Eighty-one thousand dollars on traces alone, up from twelve in November. The team had turned on full agent tracing in October on the theory that more visibility was always better. By Q1 the observability line was running ahead of the inference line — and when an actual regression hit production, the trace that contained the failure was buried under twenty million successful spans nobody needed.

The mistake was not the decision to instrument. The mistake was importing a request-tracing mental model into a workload that does not behave like requests.

A typical web request produces a span tree with a handful of children: handler, database call, cache lookup, downstream service. An agent request produces a tree with five LLM calls, three tool invocations, two vector lookups, intermediate scratchpads, and a planner that reconsiders three of those steps. The same sampling policy that worked for the API gateway — head-sample 1%, keep everything else representative — produces a trace store where the median trace is a 200-span monster, the long tail is the only thing that matters, and the rate at which you discover incidents is uncorrelated with the rate at which you spend money.
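
The alternative is to decide what to keep after the run finishes rather than before it starts. A sketch with invented thresholds: keep every failure and every long-tail trace, and thin-sample the healthy bulk.

```python
import random

def keep_trace(outcome: str, duration_s: float, span_count: int,
               success_sample_rate: float = 0.005) -> bool:
    """Tail-based keep/drop decision made after the agent run completes,
    instead of a head-sampled 1% chosen before anything interesting happened.
    Thresholds here are illustrative, not recommendations."""
    if outcome != "success":
        return True                    # keep every error, abstention, and guardrail trip
    if duration_s > 30 or span_count > 400:
        return True                    # keep the long tail users actually feel
    return random.random() < success_sample_rate   # thin sample of the healthy bulk

print(keep_trace("tool_error", duration_s=4.1, span_count=37))   # True
print(keep_trace("success", duration_s=2.2, span_count=120))     # almost always False
```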

The AI Risk Register: What Your CRO Will Demand the Morning After

· 12 min read
Tian Pan
Software Engineer

The morning after the first six-figure agent incident, the directors will not ask whether the model was state-of-the-art. They will ask to see the row in the risk register that named this scenario, the owner who signed off, and the date the board last reviewed it. If your enterprise risk register has lines for cyber, vendor, regulatory, and operational risk, but no row for "an autonomous agent took an action under our credentials that produced a customer-visible loss," you are about to spend a board meeting explaining why the artifact every other category of risk merits did not exist for the one that just lost you money.

This is not a hypothetical anymore. Gartner projects that more than a thousand legal claims for harm caused by AI agents will be filed against enterprises by the end of 2026. AI-related risk has moved from tenth to second on the Allianz Risk Barometer in a single year. Insurers are now asking, in D&O renewal questionnaires, how the board has integrated AI into the corporate risk register and how third-party agentic exposures are being tracked. The line items below are what a defensible answer looks like, and the cadence the AI feature owner has to defend them on.

Argument Hallucination Is a Drift Signal, Not a Model Bug

· 10 min read
Tian Pan
Software Engineer

The ticket says "model hallucinated a user ID." The triage label is model-quality. The fix is one more sentence in the system prompt. Six weeks later a different tool starts hallucinating a date format, and the loop runs again. After a year of this, the prompt has grown into a 4,000-token apology for the entire backend, and the team is convinced the model is just unreliable on tool arguments.

The model isn't unreliable. The model is a contract-conformance machine reading the contract you gave it — and the contract you gave it has been quietly drifting away from the contract on the other side of the wire. Most production "argument hallucinations" are not model failures. They are integration tests your tool description is silently failing, surfacing as model output because that is the only place in the stack where the divergence becomes visible.
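
Treated as an integration test, the check is small: diff the contract the model is shown against the contract the backend actually enforces, and alert on drift instead of filing a model-quality ticket. Field names below are hypothetical.

```python
TOOL_DESCRIPTION_SCHEMA = {          # what the prompt tells the model
    "user_id": "string",
    "start_date": "YYYY-MM-DD",
}

BACKEND_SCHEMA = {                   # what the API validates today
    "user_uuid": "string",
    "start_date": "ISO-8601 datetime",
}

def contract_drift(described: dict, enforced: dict) -> list[str]:
    """Report every way the described contract has drifted from the enforced one."""
    drift = []
    for field in described.keys() | enforced.keys():
        if field not in enforced:
            drift.append(f"model is told about '{field}', backend no longer accepts it")
        elif field not in described:
            drift.append(f"backend requires '{field}', model is never told about it")
        elif described[field] != enforced[field]:
            drift.append(f"'{field}' format drifted: {described[field]!r} vs {enforced[field]!r}")
    return drift

for finding in contract_drift(TOOL_DESCRIPTION_SCHEMA, BACKEND_SCHEMA):
    print(finding)
```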

The Idle Agent Tax: What Your AI Session Costs While the User Is in a Meeting

· 11 min read
Tian Pan
Software Engineer

A developer opens their IDE copilot at 9:00, asks it three questions before standup, and then sits in meetings until 11:30. The chat panel is still open. The conversation is still scrollable. The model hasn't generated a token in two and a half hours. And yet that session — sitting there, attended by nobody — has been quietly accruing cost the entire morning. KV cache pinned. Prompt cache being kept warm by a periodic ping. Conversation state held in a hot store. Trace pipeline writing one row per heartbeat. Concurrency slot reserved on the model provider. Multiply by ten thousand seats and the bill is real.

This is the idle agent tax. It is the part of your inference budget that pays for capacity your users are not using, and it is invisible to most engineering dashboards because the dashboards were built for stateless APIs. A request comes in, a response goes out, the box closes. Done. Agentic products broke that model two years ago and most teams have not yet repriced their architecture around it.
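
Back-of-envelope arithmetic makes the tax visible. Every rate below is an assumption for illustration, not a real price sheet.

```python
# Hypothetical per-session-hour costs of an attended-by-nobody session.
IDLE_COST_PER_SESSION_HOUR = {
    "kv_cache_gb_hours":      0.8 * 0.05,    # 0.8 GB pinned at $0.05 per GB-hour
    "prompt_cache_keepalive": 12 * 0.001,    # 12 keep-warm pings per hour
    "hot_state_store":        0.002,         # conversation state in a hot store
    "trace_heartbeats":       60 * 0.00005,  # one trace row per minute
    "reserved_concurrency":   0.01,          # provider-side slot reservation
}

idle_hours_per_seat_per_day = 2.5   # the morning of meetings in the example above
seats = 10_000

hourly = sum(IDLE_COST_PER_SESSION_HOUR.values())
monthly = hourly * idle_hours_per_seat_per_day * seats * 22   # ~22 working days

print(f"idle cost per session-hour: ${hourly:.4f}")
print(f"idle tax per month across the fleet: ${monthly:,.0f}")
```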

Retries Aren't Free: The FinOps Math of LLM Retry Policies

· 11 min read
Tian Pan
Software Engineer

A team I talked to last quarter found a $4,200 line item on their inference invoice that nobody could explain. The dashboard showed normal traffic. The latency graphs were flat. The cause turned out to be a single agent stuck in a polite retry loop for six hours, replaying a 40k-token tool chain with exponential backoff that capped out at thirty seconds and then started over. The retry policy was lifted verbatim from an internal SRE handbook written in 2019 for a JSON-over-HTTP service. It worked perfectly. It worked perfectly for the wrong system.

This is the bill that does not show up in capacity-planning spreadsheets. The retry-policy patterns the industry standardized on for stateless REST APIs assume three things that LLM workloads quietly violate: failures are transient, the cost of one extra attempt is bounded, and a retry has a meaningful chance of succeeding. Each assumption was load-bearing. Each one is now wrong, and the variance the cost model never captured is sitting at the bottom of every monthly invoice.

The teams that have not rebuilt their retry policy for token economics are paying a hidden tax that scales with the difficulty of the queries they were already most worried about — the long ones, the agentic ones, the ones with deep tool chains. The retry budget that classical resilience engineering hands you back as a safety net is, in an LLM stack, the rope.
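
A sketch of what a token-denominated retry budget could look like, with invented limits: classical backoff bounds the time between attempts, while this also bounds the spend and refuses to replay failures that are deterministic.

```python
def should_retry(error_kind: str, attempt: int, tokens_per_attempt: int,
                 tokens_spent: int, max_retry_tokens: int = 100_000) -> bool:
    """Retry policy priced in tokens, not just attempts (illustrative numbers)."""
    if error_kind in {"invalid_request", "context_length_exceeded", "content_filter"}:
        return False                 # deterministic failures: a retry replays the same outcome
    if attempt >= 3:
        return False                 # bounded attempts, as before
    return tokens_spent + tokens_per_attempt <= max_retry_tokens   # new: bounded spend

# The 40k-token tool chain from the example above gets two replays before the
# token budget closes the loop, instead of backoff resetting and running all night.
spent, attempt = 0, 0
while should_retry("rate_limited", attempt, tokens_per_attempt=40_000, tokens_spent=spent):
    attempt += 1
    spent += 40_000
print(f"retries allowed: {attempt}, tokens spent on retries: {spent:,}")
```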