Skip to main content

161 posts tagged with "agents"

View all tags

Where the 30 Seconds Went: Latency Attribution Inside an Agent Step Your APM Can't See

· 11 min read
Tian Pan
Software Engineer

The dashboard says agent.run = 28s at p95. Users say the feature feels broken. The on-call engineer opens the trace, sees a single fat bar with no children worth investigating, and starts guessing. By the time someone has rebuilt enough mental model to know whether the bottleneck is the model, the retriever, or a tool call that nobody added a span to, the incident has aged into a backlog ticket and the user has given up.

This is the failure mode at the heart of agent operations in 2026: classical APM treats an agent step as a black box, and "agent latency" is not a metric — it is the sum of seven metrics that decompose the wall-clock time differently depending on what the agent decided to do that turn. A team that doesn't expose those seven numbers ships a feature whose slowness everyone can feel and nobody can fix.

AI Cyber Insurance: The Coverage Gap Your Agent Will Find First

· 11 min read
Tian Pan
Software Engineer

A coding agent merges a change at 2 a.m. that takes a customer's production database offline for ninety minutes. A customer-support agent fans out and sends fourteen thousand misworded refund-denial emails before the loop is killed. An autonomous reconciliation workflow charges 2,800 cards twice. The damages are real, the audit trail names your company, and your finance team files the claim against the cyber policy that was renewed six weeks ago. The carrier's response is a polite letter explaining that the policy covers "unauthorized access by malicious third parties" and "social engineering of an employee" — and the agent was authenticated, the action was authorized, and no employee was deceived. Coverage denied. The loss sits on your balance sheet.

This is not a hypothetical edge case. It is the modal claim profile for the next eighteen months, and the insurance industry knows it. Cyber, E&O, and D&O policy language was calibrated against a threat model where breach severity is a function of records exfiltrated and incident response is a function of forensic hours billed. Agentic AI does not produce that shape of incident. It produces a shape the underwriter has no actuarial baseline for, and the carrier's first instinct — when the actuarial baseline is missing — is to write the exposure out of the policy entirely.

Calibrated Abstention: The Capability Every Layer of Your LLM Stack Punishes

· 11 min read
Tian Pan
Software Engineer

There is a capability your model could have that would, on the days it mattered, be worth more than any other behavioral upgrade you could ship: the ability to say "I don't have a reliable answer to this" and mean it. Not the keyword-matched safety refusal. Not the hedging tic the model picked up from RLHF on controversial topics. The real thing — a calibrated abstention that fires when, and only when, the model's internal evidence does not support a confident response.

You will never get it by accident. Every default in the LLM stack pushes the other way.

Cancel-Safe Agents: The Side Effects Your Stop Button Already Shipped

· 11 min read
Tian Pan
Software Engineer

A user clicks Stop because the agent misunderstood the request. The UI flashes "stopped." By the time the spinner disappears, the agent has already sent two emails, scheduled a Tuesday meeting on the user's calendar, opened a draft pull request against the wrong branch, and queued a Slack message that is racing the cancellation signal through the tool layer. The model has obediently stopped generating tokens. The world has not stopped reacting to the tokens it generated thirty seconds ago.

This is the failure mode nobody covered in the agent demo. Cancellation in synchronous code was already a hard problem with a generation of cooperative-cancellation theory behind it: Go contexts, Python's asyncio.cancel, structured concurrency with task groups, the whole grammar of "ask politely, escalate carefully, don't leave resources behind." Agents take that already-hard problem and add a layer on top: the planner does not know that the user revoked authorization between step 4 and step 5, and the tools it kicked off in step 4 do not get a memo when step 5 is cancelled. Stop is a UI affordance. The system underneath stop has to be designed.

Chat History Is a Database. Stop Treating It Like Scrollback.

· 11 min read
Tian Pan
Software Engineer

The most common production complaint about agentic products is some version of "it forgot what we said." The complaint shows up at turn eight, or fifteen, or thirty — never at turn two — and the team's first instinct is always the same: bigger context window. Which is the wrong instinct, because the bug is not in the model. The bug is that the team is treating conversation history as scrollback in a terminal — append a line, render the tail, truncate when full — when what they actually built, without realizing it, is a read-heavy database with append-only writes, a hot working set, an eviction policy hiding inside their truncation rule, and a query pattern that depends on the kind of question being asked. Once you accept that, the entire shape of the problem changes.

The scrollback model is so seductive because the chat UI looks like a transcript. Messages flow downward, the user reads them top-to-bottom, and the natural way to feed the model is to splice the latest N turns into the prompt. The data structure feels free. There's no schema, no index, no query — just append, render, repeat. And for the first few turns, every architecture works. The model has the whole conversation in its context, the bill is small, and the demo is delightful.

Determinism Budgets: Treat Randomness as a Per-Surface Allocation, Not a Global Knob

· 11 min read
Tian Pan
Software Engineer

The temperature debate is the most religious argument in AI engineering, and one of the least productive. Two camps form on every team: the determinists who want temperature pinned at zero everywhere because they cannot debug a flaky system, and the creatives who want it cranked up because the outputs feel more "alive." Both are wrong, because both are answering the question at the wrong level. Temperature is not a global setting. It is a budget — and like any budget, it should be allocated, not declared.

The productive frame is simple: every model call in your system has a purpose, and randomness either earns its keep at that surface or it does not. A planner deciding which tool to call next has nothing to gain from variation; an off-by-one tool selection is a debugging nightmare and there is no creative upside. A response-synthesis surface that summarizes a search result for ten thousand users gets robotic in a hurry if every user sees the same phrasing — and the SEO team will eventually flag the boilerplate. A brainstorming surface where the model proposes alternatives for a human to pick from is worse at temperature 0; the diversity is the feature.

If you cannot articulate what randomness is for at a given call site, you should not be paying for it.

The Wiki Has a Second Tenant: Why Docs for AI Agents Are Different from Docs for Humans

· 10 min read
Tian Pan
Software Engineer

A senior engineer at a mid-sized SaaS company spent two days last quarter chasing a deployment bug that turned out to be the agent's fault. The agent had read a runbook last updated in 2023, faithfully followed step three, and ran a command that no longer existed in the deploy tooling. The runbook still rendered fine in the wiki — the screenshots were even still legible — but it had silently become hostile to a reader who couldn't tell that the surrounding context was stale. The human authors had no idea the doc was now a load-bearing input for every new hire's AI assistant.

This is the quiet shift that has happened in most engineering orgs over the past eighteen months: the internal wiki has accumulated a second audience. The same Confluence pages, the same architecture diagrams, the same "how we deploy" gists are now being read by two distinct consumers — the engineers themselves and the AI assistants their engineers use. The two readers consume the same words under entirely different constraints and produce systematically different failure modes when the docs were written with only the first one in mind.

The Inverted Agent: When the User Is the Planner and the Model Is the Step-Executor

· 12 min read
Tian Pan
Software Engineer

Most agent products today implement a simple bargain: the model decides what to do, the user clicks "approve." This is the right shape for low-stakes consumer chat — booking a restaurant, summarizing an inbox, drafting a casual reply. It is catastrophically wrong for legal drafting, financial advisory, medical triage, and incident response, where the user holds the accountability the model never can, and where the cost of the wrong plan dwarfs the cost of any individual step.

The inverted agent flips the polarity. The user composes the plan as a sequence of named, reorderable steps. The model executes each step on demand — with full context, with tool access, with reasoning — but never decides what step comes next. The model can suggest, but suggestions are advisory, not autonomous. This is not a worse autonomous agent; it is a different product, with a strictly worse cost-and-latency profile and a strictly better trust profile, aimed at users who would otherwise decline to adopt the autonomous version at all.

The mistake teams keep making is treating "autonomy" as a default to push toward. It is a UX axis you choose per-surface. Get the polarity wrong and you ship a feature your highest-stakes users will quietly refuse to touch.

Load Shedding Was Built for Humans. Agents Amplify the Storm You're Shedding

· 12 min read
Tian Pan
Software Engineer

A 503 to a human is a "try again later" page and a coffee break. A 503 to an agent is a 250-millisecond setback before retry one of seven, and the planner is already asking the LLM whether a different tool can sneak around the failed dependency. The first behavior gives an overloaded service room to recover. The second behavior is what an overloaded service has nightmares about: thousands of correlated retries, each one cheaper and faster than a human's, half of them fanning out into the next dependency over because the planner decided that was a creative workaround.

Load shedding — the discipline of dropping low-priority work to keep the high-priority path alive — was designed in an era when the principal sending traffic was a human at a keyboard or a well-behaved service with a hand-tuned retry policy. Both of those assumptions break the moment a fleet of agents shows up. The agent retries faster, retries from more places at once, replans around the failure, and treats your 503 as a load-balancing hint instead of as the cooperative back-pressure signal you meant it to be.

This piece is about why the standard load-shedding playbook doesn't survive contact with agentic clients, what primitives the upstream service needs in order to actually shed agent traffic, and what the agent itself has to do — at the tool layer and at the planner — to stop being the hostile traffic in someone else's incident report.

Reachability Analysis for Agent Action Spaces: Eval Coverage for the Branches You Never Tested

· 12 min read
Tian Pan
Software Engineer

The first time anyone on your team learned that the agent could call revoke_api_key was the morning a well-meaning user typed "this token feels old, can you rotate it for me?" The tool had been registered six months earlier as part of a batch import from the auth team's MCP server. It had passed schema validation, appeared in the catalog enumeration, and then sat. No eval ever invoked it. No production trace ever touched it. Then one prompt, one planner decision, and the incident channel learned the tool existed.

This is the failure mode that hides inside every agent with a non-trivial tool catalog. Forty registered functions and a planner that can compose them produce a reachable graph of plans whose long tail you have never observed. The assumption that "we tested the common paths" papers over the fact that the dangerous branch is, almost by definition, the one you never saw.

Replan, Don't Retry: Why Most Agent Errors Aren't Transient

· 10 min read
Tian Pan
Software Engineer

A calendar-write returns 409 Conflict. The framework's default error handler kicks in: backoff 200ms, retry. Same conflict. Backoff 400ms, retry. Same conflict. Backoff 800ms, retry. By the time the agent gives up and tells the user "I couldn't book the meeting," it has burned three seconds of latency budget proving something the very first response already told it: the slot is taken. The world has not changed. It will not change in 800 milliseconds. Retrying was never going to work, because nothing about this error was transient.

This is the most common error-handling bug in agent systems, and it is hiding in plain sight inside almost every framework that ships today. The retry-with-exponential-backoff pattern was imported wholesale from stateless HTTP clients — where it is exactly correct — into stateful planning loops where it is actively wrong. The right default for a tool error in an agent is not retry. It is replan.

Sampling Parameter Inheritance: When Temperature 0.7 Leaks From the Planner Into the Verifier

· 10 min read
Tian Pan
Software Engineer

A verifier that flips its own answer eight percent of the time is not a flaky model. It is a sampling configuration bug that reached production because the framework defaulted to inheritance. The planner needed temperature=0.7 to brainstorm subtask decompositions. The verifier — the role whose entire job is to give a low-variance yes-or-no on whether the answer satisfies the rubric — was instantiated through the same harness call, and silently picked up the same temperature. Nobody set it that way on purpose. Nobody set it at all.

This is the most expensive parameter in your stack that nobody owns. It compounds across the call tree: the summarizer above the verifier, the structured-output extractor below it, and the retry loop wrapping the whole thing all consume the planner's "be creative" knob as if it were a global. The bill arrives in three places at once — eval flakiness, token spend, and the half-day a senior engineer spends bisecting a regression that turns out to be no regression at all.