Skip to main content

191 posts tagged with "agents"

View all tags

Persona Drift: When Your Agent Forgets Who It's Supposed to Be

· 11 min read
Tian Pan
Software Engineer

The system prompt says "you are a financial analyst — be conservative, never give specific buy/sell advice, always disclose uncertainty." For the first twenty turns, the agent behaves like a financial analyst. By turn fifty, it is recommending specific stocks, mirroring the user's casual tone, and hedging less than it did in turn three. Nobody changed the system prompt. Nobody injected anything malicious. The persona simply eroded under the weight of the conversation, the way a riverbank does when nothing crosses the threshold of "attack" but the water never stops moving.

This is persona drift, and it is the regression your eval suite is not catching. Capability evals measure whether the model can do the task. Identity evals — whether the model is still doing the task the way the system prompt said to do it — barely exist outside of research papers. The result is a class of production failures that look correct turn-by-turn and look wrong only when you read the transcript end to end.

Trace Sampling for Agents: Which of 10 Million Daily Spans Are Worth Keeping

· 11 min read
Tian Pan
Software Engineer

A web service request produces five spans on a busy day. A modern agent session produces fifty, sometimes a thousand if the planner decides to recurse. The uniform 1% sampler your platform team copy-pasted from the microservices era will, by definition, drop the rare failure you actually care about — because the failure is rare, and uniform sampling has no opinion about rarity.

The honest version of "we have full observability on our agents" sounds different than the marketing version. It sounds like: we keep the traces that matter, drop the ones that don't, and we know in advance which is which. Every word in that sentence is load-bearing, and the platform teams that ignored sampling design until the bill arrived are now learning the discipline backwards — under cost pressure, after a quarter of incidents that were "in the data" but evicted before anyone looked.

Cold-Start Evaluation: How to Ship an AI Feature With Zero Production Traces

· 10 min read
Tian Pan
Software Engineer

Every AI feature launch has the same quiet moment before the first user sees it: someone on the team asks "how do we know this is good?" and the honest answer is "we don't, yet." You have no traces because you have no users. You have no users because you haven't shipped. The loop is real, and the two failure modes it produces are both fatal — ship blind and let the first week of escalations be your eval dataset, or wait for "real data" and watch the roadmap slide for a quarter while a competitor publishes a demo.

The way out is not to pretend cold-start evaluation is the same problem as post-launch evaluation with a smaller sample size. It isn't. You are not sampling a distribution; you are constructing a prior. Every day-1 signal is an artifact of a choice you made about what to measure, whose behavior to simulate, and which failures to care about. Teams that ship AI features well treat the pre-launch eval stack as a first-class deliverable — not a spreadsheet hacked together the night before the gate review, but a layered system of dogfooding, simulation, expert annotation, and adversarial probes, each contributing a different kind of signal and each weighted with an explicit story about what it can and cannot tell you.

Conversation History Is a Liability Your Prompt Never Admits

· 10 min read
Tian Pan
Software Engineer

Read your product's analytics the next time a user says "the AI got dumber today." Filter to sessions over twenty turns. You will find the same U-shape every time: early turns score well, middle turns score well, late turns fall off a cliff. The prompt hasn't changed. The model hasn't changed. What changed is that every one of those late turns is carrying a payload of user typos, false starts, model hedges, corrections that were later reversed, tool outputs nobody re-read, and the fossilized remains of a goal that the user abandoned on turn four. Your prompt template treats this sediment as signal. The model does too. It shouldn't.

Chat history is not free context. It is a liability you are paying to re-send on every turn, and the dirtier it gets, the more it corrupts the answer you are billing the user for. The chat metaphor is the source of the confusion. Chat interfaces habituate users and engineers to treat the transcript as sacred — scrollable, append-only, never reset. That habit is imported wholesale into LLM applications even though it has no physical basis in how models process context. The model is stateless. The transcript is just a string you chose to grow. You can shrink it. You often should.

Durable Agents: Why Async Queues Break for Long-Running AI Workflows

· 11 min read
Tian Pan
Software Engineer

An agent that works 95% of the time per step is not a 95% reliable agent. Chain twenty steps together and the end-to-end completion rate drops to 36%. This is the arithmetic most teams discover only after their agent hits production, and it is the reason so many "working" prototypes stall the moment real traffic arrives. The fix is not better prompts or bigger models. It is a boring piece of distributed systems infrastructure most AI teams try to avoid until the third outage forces their hand.

The infrastructure is durable execution — the discipline of making a multi-step workflow survive crashes, restarts, and partial failures without losing its place. It is not a new idea. Temporal, Restate, DBOS, Inngest, and Azure Durable Task have been selling it for years. What is new in 2026 is that every serious agent framework has quietly admitted durable execution is table stakes: LangGraph now ships with a PostgresSaver checkpointer, the OpenAI Agents SDK exposes a resume primitive, Anthropic's Managed Agents runs on an internal durable substrate. If your agent architecture still rests on a Celery queue and optimism, you are solving in 2026 a problem the rest of the industry stopped pretending to ignore in 2024.

This post is about the architectural seam between a stateless LLM and the stateful workflow engine that has to wrap it. The seam is where reliability lives, and it is where most teams are currently writing bugs.

Human-in-the-Loop Is a Queue, and Queues Have Dynamics

· 11 min read
Tian Pan
Software Engineer

Teams add human approval to an AI workflow the same way they add if (isDangerous) requireHumanApproval() to a codebase: as a binary switch, checked once at design time, then forgotten. The metric on the architecture diagram is a green checkmark next to "human oversight." The metric that actually matters — how long the human took, whether they read anything, whether the item was still relevant by the time they clicked approve — rarely has a dashboard.

Treat the human approver as a binary switch and you have built a queue without knowing it. And queues have dynamics: backlog that grows faster than you staff, staleness that makes yesterday's decision meaningless, fatigue that turns review into rubber-stamping, and priority inversion that parks the one decision that mattered behind three hundred that didn't. None of this is visible in the architecture diagram. All of it shows up in the incident retro.

Interview Mode vs. Task Mode: The Unspoken Contract Your Agent Keeps Breaking

· 11 min read
Tian Pan
Software Engineer

Open any agent's customer feedback channel and you will find two complaints, both loud, both common, and both blamed on the model. The first sounds like "it asks too many questions before doing anything." The second sounds like "it just runs off and does the wrong thing without checking with me." Product teams hear those as opposite problems and ship opposite fixes — tighten the system prompt to ask fewer questions, then loosen it again next quarter when the other complaint gets louder. Neither change works for long, because the two complaints are not really about questions or actions. They are about a contract the user picked silently and the agent failed to honor.

Every conversation with an agent operates in one of two implicit modes. Interview mode is the contract where the user expects the agent to extract requirements before doing anything substantive — clarifying questions are welcome, premature execution is the failure. Task mode is the contract where the user has already done the thinking, has a specific plan in mind, and expects the agent to execute on the available context, asking only when truly blocked — questions are friction, half-baked execution is the failure.

Users do not announce which mode they are in. They expect the agent to read it from the message, the conversation history, and the situation, and they punish the agent harshly when it gets it wrong. The fix for "asks too many questions" and "didn't ask enough questions" is the same fix: make mode a first-class concept in your agent, detect it from signals you can actually see, and surface it to the user when you are unsure.

Mid-Flight Steering: Redirecting a Long-Running Agent Without Killing the Run

· 10 min read
Tian Pan
Software Engineer

Watch a developer use an agentic IDE for twenty minutes and you will see the same micro-drama play out three times. The agent starts a long task. Two tool calls in, the user realizes they want a functional component instead of a class, or a v2 endpoint instead of v1, or tests written in Vitest instead of Jest. They have exactly one lever: the red stop button. They press it. The agent dies mid-edit. They copy-paste the last prompt, append the correction, and pay for the first eight minutes of work twice.

The abort button is the wrong affordance. It treats "I want to adjust the plan" and "I want to throw away the run" as the same gesture. In practice they are as different as a steering wheel and an ejector seat, and conflating them is why so many agent products feel brittle the moment a task takes longer than a single screen of output.

Eval Passed, With All Tools Mocked: Why Your Agent's Hardest Failures Never Reach the Harness

· 9 min read
Tian Pan
Software Engineer

Your agent hits 94% on the eval suite. Your on-call rotation is on fire. Nobody in the room is lying; both numbers are honest. What's happening is that the harness is testing a prompt, and production is testing an agent, and those are two different artifacts that happen to share weights.

Mocked-tool evals are almost always how this gap opens. You stub search_orders, charge_card, and send_email with canned JSON, feed the model a user turn, and assert on the final response. The run is cheap, deterministic, and reproducible — every property a CI system loves. It is also silent on tool selection, latency, rate limits, partial failures, and retry behavior, which is to say silent on the set of failures that dominate post-incident reviews.

Retry Amplification: How a 2% Tool Error Rate Becomes a 20% Agent Failure

· 13 min read
Tian Pan
Software Engineer

The spreadsheet on the oncall doc said the search tool had a 2% error rate. The incident review said the agent platform had a 20% failure rate during the three-hour window. Nobody disagreed with either number. The search team was not at fault. The platform team did not ship a bug. The gap between the two numbers is the whole story, and it is a story about arithmetic, not engineering incompetence.

Retry logic is one of the most borrowed and least adapted patterns in agent systems. Teams copy tenacity decorators from their REST client, stack them at the SDK, the gateway, and the agent loop, and ship. Each layer is individually reasonable. The composition is a siege weapon pointed at the flakiest dependency in the fleet, and it fires hardest at the exact moment that dependency needs the load to drop.

This post is about how that math works, why agent loops amplify it harder than request-response systems, and the retry discipline that keeps transient blips from becoming correlated outages with your own logo on them.

Sampling Bias in Agent Traces: Why Your Debug Dataset Silently Excludes the Failures You Care About

· 9 min read
Tian Pan
Software Engineer

The debug corpus your team stares at every Monday is not a representative sample of production. It is an actively biased one, and the bias is in exactly the wrong direction. Head-based sampling at 1% retains the median request a hundred times before it keeps a single rare catastrophic trajectory — and most teams discover this only when a failure mode that has been quietly recurring for months finally drives a refund or an outage, and they go looking for examples in the trace store and find none.

This is not an exotic edge case. It is the default behavior of every observability stack that was designed for stateless web services and then pointed at a long-horizon agent. The same sampling math that worked fine for HTTP request tracing systematically erases the trajectories that matter most when each "request" is a thirty-step plan that may invoke a dozen tools, regenerate three subplans, and consume tens of thousands of tokens before something subtle goes wrong on step twenty-seven.

The fix is not "sample more." Sampling more makes the bill explode without changing the bias — you just get more of what you already had too much of. The fix is to change what you sample, keyed on outcomes you can only know after the trajectory finishes. That requires throwing out the head-based defaults and rebuilding the retention layer around tail signals, anomaly weighting, and bounded reservoirs that survive the long tail of agent execution.

Spec-First Agents: Why the Contract Has to Land Before the Prompt

· 11 min read
Tian Pan
Software Engineer

The prompt for our support agent was 2,400 tokens when I inherited it, and the engineer who wrote it had left the company. Every instruction in it was load-bearing for some production behavior, but nobody could tell me which ones. A bullet about "always restate the user's problem before answering" looked like filler until we deleted it and CSAT dropped four points in a week. The prompt was, it turned out, the specification. It was also the implementation. It was also the test suite — implicit, unwritten, living only in the head of an engineer who no longer worked there.

That is the endgame of prompt-as-spec. The prompt is both what the agent should do and how it does it, and the two become indistinguishable the moment the prompt outgrows a single author. You can't refactor it because you don't know which lines encode requirements and which encode mere hints. You can't review a change because there's no artifact the change is a diff against. You can't onboard anyone to own it because ownership means "having read the whole thing recently and remembering why each clause is there," which is a six-month investment nobody authorizes.

Spec-first flips the ordering. The contract — inputs, outputs, invariants, error cases, refusal semantics, escalation triggers — is a first-class artifact that precedes the prompt and constrains every revision. Prompt edits become diffs against a spec, not rewrites of the spec itself. The shift sounds bureaucratic until you see what it unlocks: evals that come from the spec rather than the other way around, reviews that take minutes instead of afternoons, and the eventual ability to let a new engineer own the surface without a six-month apprenticeship.