Skip to main content

191 posts tagged with "agents"

View all tags

The Process Your Agent Quietly Owns Without Documentation

· 10 min read
Tian Pan
Software Engineer

Six months ago, your team shipped a support agent that handles refunds. There was a one-page Notion doc describing what it should do. Today, the doc still says what it said, but the agent does not. The prompt has 47 edits in its history. Three tools were added — one of them quietly bypasses a finance check that the doc still asserts exists. The model was swapped twice. A retry policy was hardened after an incident nobody wrote up. And when somebody on the data team asks "what are the actual rules for issuing a refund here," the honest answer is: read the system prompt and the tool registry, because that is the spec now.

This is the quiet failure mode of agentic systems in production: the agent's behavior IS the runbook nobody wrote. The prompt got treated as a configuration value — a string in a YAML file, edited by whoever owned the feature, reviewed like a copy change — when it was actually the most authoritative description of a multi-step business process in the company. The org accumulated process logic the way legacy codebases accumulate behavior: through edits, not design. And the people who would historically own that process — a product manager, a compliance lead, an ops director — never realized they had lost the artifact, because there was never a document to lose.

The Token Budget You Cannot See Until You Hit It

· 10 min read
Tian Pan
Software Engineer

Your team negotiated a monthly token allocation with your inference provider. The contract specifies the cap. The dashboard in the provider portal shows yesterday's usage with a one-day lag. The API itself returns per-minute rate-limit headers — anthropic-ratelimit-tokens-remaining, x-ratelimit-remaining-requests — and nothing about the monthly bucket you actually have to plan against. And your agent fleet has no mechanism to slow down as the budget depletes, because the only signal that arrives in real time is the 429 — which arrives after the budget is already gone, dressed up as the same transient error your retry logic was tuned to ignore.

This is a different shape of problem than rate limiting. Rate limits are a fast-moving throttle the consumer must react to within seconds; the headers tell you the bucket has a thousand tokens left and refills in forty seconds, and a well-written client backs off and tries again. Monthly quota is a slow-moving budget the consumer must plan against over weeks. The two get confused because they share the failure code and sometimes share the dashboard, but they require different controls — and the gap between what the provider exposes and what the consumer needs is where the worst incident of the month lives.

Why Your Agent Works in Dev and Panics in Prod

· 10 min read
Tian Pan
Software Engineer

The agent demo always works. Three customers in the table, one matching record, twelve documents in the vector index, an empty calendar with infinite open slots. The agent picks the right row, retrieves the right document, books the right meeting. Ship it.

Then production hands the same agent ten million customers with three "John Smith"s in the same city, a filter that returns four thousand rows because the agent confidently wrote status != 'closed' when it meant status = 'active', a vector query that returns seven plausible documents the agent has never had to choose between, and a calendar where every slot is a negotiation. The capability that looked correct in dev is qualitatively different in prod — not slightly worse, not flakier, but solving a different problem the dev environment never made it solve.

This is the gap that "it worked locally" hides. For deterministic code, that phrase is already a lie about edge cases. For agents, it is a stronger lie, because the agent's behavior is a function of input distribution, and the input distribution shifts from "trivial" to "ambiguous" the moment you cross the prod boundary.

The Pointer Your Agent Mistook for a Value: Reference vs Value in Tool Outputs

· 11 min read
Tian Pan
Software Engineer

A search tool returns ten document IDs. An asset tool returns an S3 presigned URL. A database tool returns a row handle. A file tool returns a path. Each of those returns is, formally, a pointer — a small string that names a value the agent does not yet possess. The model's downstream behavior depends entirely on whether it knows that and dereferences before reasoning, or whether it treats the pointer as if it were already the thing.

The failure mode is invisible from the trace. The tool call succeeded. The return is well-formed. The model emitted plausible-looking output. Nothing in the log says "the agent reasoned about a filename and called it a document." The pointer-vs-value confusion sits underneath the visible behavior, in a layer your tool schema never named.

Multimodal Traces: When Modalities Must Share an ID

· 11 min read
Tian Pan
Software Engineer

A user called your support agent. They talked, the agent listened, the user uploaded a screenshot of the error mid-call, the agent reasoned over the image and the transcript, and the conversation ended with a follow-up email summarizing the fix. Three days later the user files a complaint: the fix did not work, and the email never arrived. You open your observability stack and you find three separate traces in three separate UIs. The voice pipeline shows you an ASR trace. The vision pipeline shows a span over the image upload. The LLM call shows a chat trace with a token count and a tool call. Nothing in any of these dashboards tells you they were the same conversation.

This is the postmortem nobody wants to write. Not because the data is missing — every individual modality logged what it was supposed to — but because the join across modalities was never built. Each pipeline grew its own tracing convention from whatever its model vendor shipped by default, and the conversational turn that bound them together exists only in the head of the engineer who designed the agent.

The Agent That Retried Its Way Past Your Rate Limit

· 10 min read
Tian Pan
Software Engineer

Your gateway enforces a clean 100 requests per second per tenant. The dashboard shows every tenant comfortably under that ceiling. The bill from your model provider says you blew through the spend cap anyway. Nobody on the rollout call has a clean story for why.

The answer is that the rate limiter and the bill are measuring different things. The limiter sees one "user request" when a customer clicks a button. The provider sees a planner call, three tool-result reflections, a format-correction retry triggered by a stricter JSON schema, and a final synthesis — each with its own internal retry budget that fires when a transient 429 or 500 comes back. A single click can fan out into thirty model calls. The limiter counts one. The bucket leaks at thirty times the rate it was sized for.

Rate-limiting an agentic system at the HTTP boundary is enforcing speed limits at the highway entrance while the cars inside multiply. Until the limiter understands the loop, the loop will route around it.

The CI Agent With Merge Rights at 3 AM

· 12 min read
Tian Pan
Software Engineer

A flaky test gets quarantined at 3:17 AM. The on-call rotation does not page, because nothing failed — the agent decided the failure was noise, opened a small PR labeled chore: quarantine flaky test, marked the change as a self-merge under the ci-bot service account, and went back to watching the queue. Six days later a customer reports that a feature has been broken since Tuesday. The test was not flaky. It was the only thing standing between a real regression and production, and the agent's confidence threshold was set high enough to make a decision but low enough to be wrong.

This is the part of agentic CI that the marketing decks skip. Wiring an agent into your pipeline to triage failures, downgrade dependencies on security alerts, and propose dependency bumps is straightforward in 2026 — the tools exist, the integrations are one config file away, and the productivity story is real. The part that nobody writes a runbook for is the new operational class you just created: an actor with merge rights that runs at 3 AM with no human in the synchronous loop, and an SRE handbook that assumed humans were the source of intent.

The Streaming Response That Committed Before the User Said Yes

· 12 min read
Tian Pan
Software Engineer

The user is reading the agent's reasoning as it streams in. Around token 1200, the model decides to call send_email, then create_ticket, then kick_off_deploy. The user, watching the partial output and realizing the agent has misread the request, hits the stop button half a second too late. The email is already sent. The ticket is already filed. The deploy is already running. The stop button cancelled the next token, not the consequences of the last one.

The bug is not in the cancel handler. The bug is the assumption — borrowed from every other streaming UI on the team's roadmap — that an incrementally rendered output is an incrementally reversible one. Tool calls do not honor that contract. They are point-in-time commits that the streaming layer happily fires while the rest of the response is still being generated, and the cancel button has no way to chase them down the wire.

This is one of those failure modes that nobody owns because it lives in the seam between two teams that each shipped their half cleanly. The UX team shipped streaming because it tested better in user studies. The platform team shipped tool calls because the framework supports them. Neither team had a meeting where someone asked: what is "stop" supposed to mean when the response has already left the building?

Your Agent's Audit Log Records Everything Except the Reason

· 11 min read
Tian Pan
Software Engineer

Compliance forwards you a ticket. A customer was denied a refund by your support agent three weeks ago, they have escalated, and now someone needs to explain the decision. You feel calm about this, because you instrumented everything. Every prompt, every tool call, every retrieved chunk, every token count, every latency number — it is all in the trace, and you can pull it up in seconds.

You pull it up. You can see the agent received the refund request. You can see it called get_order_history, then check_return_window, then lookup_policy. You can see the exact policy text it retrieved. You can see the final message it sent: refund denied. The trace is complete. Every span is green. And you still cannot answer the question, because the trace shows you that the agent denied the refund and shows you everything it looked at, but it does not show you why those inputs added up to no. The reason lived in how the model weighed the context, and that weighing was never an artifact. It was never written down anywhere.

This is the gap between a trace and an explanation, and almost every team that says "we have full observability" has not noticed they only built the first half.

Streamed Tokens Are a Promise You Can't Take Back

· 9 min read
Tian Pan
Software Engineer

The model has streamed seventy percent of a confident-sounding answer to the user's screen. Then the tool call it was about to make returns an error, or no rows, or a 429. You now get to pick between two losses: let the model finish gracefully by inventing the rest, or stop mid-sentence with no clean way to walk it back. Neither is a recovery — both are damage.

This is the part of streaming UX that nobody priced when they turned the feature on. Streaming was framed as a perceived-latency win: time-to-first-token is the metric, the user starts reading sooner, the app feels alive. What the framing leaves out is that every token you stream is a commitment. You have published a draft of an answer that you do not yet know is correct, and the back half of your system has not yet finished running. When it finishes and disagrees, your UI has no native way to retract what it already showed.

The Agent That Read Last Week's Slack Like It Was Yesterday

· 10 min read
Tian Pan
Software Engineer

Your operations agent answers a question about the upcoming launch by quoting a Slack message that says "we'll ship tomorrow." The agent treats that as a present-tense plan and starts writing comms. The message was posted six weeks ago. The ship happened. The retrieval pipeline pulled the right chunk by every metric you measure — semantic similarity to "launch date," top-1 confidence above your threshold, source channel matching the project — and the agent built a plan on a sentence that meant something only inside the meeting where it was written.

The bug is not in the model. The bug is that tomorrow is not a date. It is a pointer to a clock, and the clock the message was written against is not the clock the agent is reading it on. Your retrieval pipeline indexed the body of the message and discarded the frame.

The Agent That Wouldn't Stop: Scope Creep as a Runtime Failure Mode

· 9 min read
Tian Pan
Software Engineer

You asked the agent to fix a flaky test. At minute three, the test passes. At minute four, the agent is reading neighbouring files. At minute nine, it has "improved" a helper that the test never touched, renamed an unrelated parameter for clarity, and started a refactor of the fixture builder. The diff that lands is twelve files and four hundred lines. The original bug is fixed. So is some other code that wasn't broken.

This is not a model getting confused. This is a model doing exactly what its instructions left room for. The task said "fix the bug." It did not say "stop after the bug is fixed." Most agent loops have a defined start and a defined success criterion, and a very fuzzy answer to the third question: when are you done? In a chat session, "done" is whatever the user accepts. In an autonomous loop, "done" is whatever the stopping condition says, and if you didn't write one, the stopping condition is "the model lost interest." That isn't a failure mode you can debug. It's a failure mode you have to design out.