311 posts tagged with "ai-agents"

The Postmortem Template With No Row for the Model's Inference

· 11 min read
Tian Pan
Software Engineer

The first time an agent caused a real outage on my team, the postmortem author opened the template, scrolled past the timeline, stared at the "Root Cause" field for a long minute, and typed: "The runbook for queue-stuck recovery was incorrect." The runbook was fine. The agent had read the runbook, decided the queue's symptoms matched a different scenario, and run a recovery script for that other scenario instead. The action items that came out of that document — "tighten the runbook wording," "add a confirmation prompt to the recovery script" — were entirely useless against the actual failure mode, which was that an inferential system had inferred wrong and there was no field in the template that knew how to say so.

I've watched this exact failure repeat across teams since. The template is calibrated for deterministic systems. Code did the wrong thing, so you fix the code. Config was misset, so you fix the config. The schema of the postmortem document is the schema of the team's theory of failure, and when that theory cannot represent "the agent's plan was wrong," the document flattens the actual failure into the closest thing the template can represent — usually a documentation gap or a missing guardrail — and the action items chase a deterministic fix for a probabilistic failure. The same incident class then recurs, and the team writes it up the same way the next time.

The Rate Limit You Set for Humans an Agent Saturates in Three Seconds

· 10 min read
Tian Pan
Software Engineer

The rate limit was never a fairness primitive. It was a sales-engineering quote that grew up — a number a solutions engineer typed into a docs page during onboarding three years ago, copied into a tier definition, and never revisited because no one ever hit it. The limit said "100 requests per minute" and it meant "more than any sane integration will ever need," because every integration on the platform was a backend service driven by a human at a keyboard, and humans do not type a hundred times a minute.

Then a paying tenant pointed an agent at the endpoint. The agent did not type. It did not pause to read responses. It did not have a UI to render between requests. It executed a planning loop that called the API once per reasoning step, and one reasoning step took the model about thirty milliseconds of wall time to formulate. The agent hit the per-minute ceiling in three seconds, the per-hour ceiling in three minutes, and the daily quota before the on-call engineer's coffee had cooled. The support escalation landed before the throttle dashboard had updated.
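The arithmetic behind that three-second figure is worth spelling out. A minimal sketch, using only the numbers in the paragraph above (100 requests per minute, about thirty milliseconds per reasoning step). The function name is illustrative, and the loop is assumed to issue exactly one request per step with no pauses:

```python
def seconds_to_exhaust(limit: int, step_ms: int) -> float:
    """Seconds for a loop issuing one request per step to use up `limit` requests."""
    return limit * step_ms / 1000


PER_MINUTE_LIMIT = 100  # from the text: "100 requests per minute"
STEP_MS = 30            # from the text: about thirty milliseconds per reasoning step

time_to_ceiling = seconds_to_exhaust(PER_MINUTE_LIMIT, STEP_MS)
requests_per_second = 1000 / STEP_MS

print(f"ceiling reached after {time_to_ceiling:.1f} s")
print(f"sustained rate of about {requests_per_second:.0f} requests per second")
```

Under those assumptions the agent sustains per second roughly a third of what the tier allows per minute, which is why a limit sized for human-driven traffic disappears almost immediately.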

The Recurring Task Your Agent Scheduled With Nobody To Inherit

· 9 min read
Tian Pan
Software Engineer

A user types "remind me every Tuesday to check that integration." The agent creates a cron entry, returns a polite confirmation, and the session closes. Six months later the user has changed teams. The integration was deprecated last quarter. The cron is still firing, hitting an API key that was rotated in April, into a Slack channel that was archived in May, charged to a project budget that nobody reviews. The agent did exactly what was asked. The asking is what aged badly.

This is not a bug in any particular agent. It is the shape of a category. The moment we gave agents the ability to schedule durable side effects — cron jobs, webhooks, polling loops, workflow triggers, calendar invites, recurring queries — we created a class of infrastructure that is born without a lifecycle. The create primitive is loud and easy. The delete primitive, the audit primitive, the inheritance primitive — they don't exist on equal footing, so they don't get used.

The cost is invisible until you go looking, which is exactly when nobody is looking.
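One way to picture the missing lifecycle, as a sketch rather than a description of any real scheduler: every scheduled task carries an owner and an expiry from the moment it is created, and a periodic sweep flags anything whose owner is gone or whose expiry has passed. The `ScheduledTask` fields and the `tasks_needing_review` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Set


@dataclass
class ScheduledTask:
    task_id: str
    owner: str            # hypothetical: who answers for this task
    created_at: datetime
    ttl: timedelta        # hypothetical: how long before it must be renewed

    @property
    def expires_at(self) -> datetime:
        return self.created_at + self.ttl


def needs_review(task: ScheduledTask, now: datetime, active_owners: Set[str]) -> bool:
    """A task needs review if it has outlived its TTL or its owner is no longer active."""
    return now >= task.expires_at or task.owner not in active_owners


def tasks_needing_review(
    tasks: Iterable[ScheduledTask], now: datetime, active_owners: Set[str]
) -> List[ScheduledTask]:
    return [task for task in tasks if needs_review(task, now, active_owners)]


# Example: a weekly reminder created in January, checked in July.
reminder = ScheduledTask(
    task_id="check-integration",
    owner="alice",
    created_at=datetime(2025, 1, 7, tzinfo=timezone.utc),
    ttl=timedelta(days=90),
)
stale = tasks_needing_review(
    [reminder], now=datetime(2025, 7, 8, tzinfo=timezone.utc), active_owners={"alice"}
)
print([task.task_id for task in stale])
```

The design choice here is that expiry is the default and renewal is the explicit act, so a task nobody claims stops quietly instead of firing forever.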

Your Scheduled Agent Has Four Clocks, and You Are Trusting the Wrong One

· 12 min read
Tian Pan
Software Engineer

A daily standup summary is scheduled for 09:00 UTC. The cron fires on time. A worker pod spins up two seconds later. The LLM call takes another forty seconds round-trip. The model writes its summary believing it is February of last year, because that is the last thing its training data confidently knew. The tool layer dispatches the Slack message against the wall clock at 09:00:42 UTC, on a date the model never mentions because nobody asked it to. The message lands in the right channel, with yesterday's standup notes summarized as "today's," and nobody notices for three weeks.

This is not a bug in any one component. It is a contract that nobody wrote between four different clocks that all believe they know what "now" is.
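A first clause of that contract could look like the sketch below: the harness states the time explicitly in the prompt instead of letting the model fall back on its training data. The function name, the header wording, and the choice to anchor "today" on the scheduled time are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def build_time_header(scheduled_for: datetime, worker_started: datetime) -> str:
    """Render an explicit statement of 'now' to prepend to the model's prompt.

    'Today' is anchored to the scheduled time, not to whenever the worker or
    the model call happens to run.
    """
    today = scheduled_for.date()
    yesterday = today - timedelta(days=1)
    return (
        f"Scheduled run time (UTC): {scheduled_for.isoformat()}\n"
        f"Worker start time (UTC): {worker_started.isoformat()}\n"
        f"Treat {today.isoformat()} as 'today' and {yesterday.isoformat()} as 'yesterday'."
    )


scheduled = datetime(2025, 3, 4, 9, 0, 0, tzinfo=timezone.utc)
started = scheduled + timedelta(seconds=2)  # from the text: pod spins up two seconds later
header = build_time_header(scheduled, started)
print(header)
```

The tool layer would then stamp the outgoing message with the same anchored date, so the model, the worker, and the dispatcher all agree on which day the summary covers.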

The Abstention Tax You Didn't Budget For

· 11 min read
Tian Pan
Software Engineer

You taught the agent to say "I don't know" when the context was thin and called it a safety win. The OpenAI bill went down. Everyone agreed it was the responsible move. Three months later your VP of Support is asking why headcount projections are off by 40% and nobody in the AI org has an answer, because the metric you tracked was abstention rate and the metric that moved was tickets-per-week — and nobody owned the line that summed them.

This is the abstention tax. It's not a model cost. It doesn't show up on the inference invoice. It shows up downstream, in the queue depth of the human team that catches every "I cannot answer," in the second model call that runs against the enriched context the human had to assemble, in the customer who churned during the wait. The model-only cost frame quietly hides it. And the org seam where the AI team owns abstention and the ops team owns the queue means nobody is incentivized to see it.

The Agent That Burned an Hour Re-Trying the Same 400 Error

· 10 min read
Tian Pan
Software Engineer

An agent calls a tool. The tool returns 400 Bad Request with a perfectly structured error body: {"error": "missing required field", "field": "email"}. The agent's reasoning chain quotes the error back word for word, says "I need to include the email field," and then emits the next tool call with the exact same payload as before. The loop's retry policy, written years ago for a world of flaky HTTPS connections, treats the 400 as transient and fires it again. Three times. Eight. Fifteen. An hour later, the agent has burned through context, money, and a rate-limit budget without ever sending a request the tool could accept.

The failure looks like a model problem. It is not. The model read the error correctly. The harness around it never gave the correction anywhere to land.
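A minimal sketch of what "somewhere to land" might mean in a harness, under stated assumptions: the set of statuses treated as transient is a common convention, not something the text specifies, and `RepeatGuard` is a hypothetical helper. The idea is that a non-transient 4xx is never retried as-is, and an identical payload is refused until the agent changes it.

```python
import json

# Assumption: statuses that are usually worth retrying unchanged.
TRANSIENT_STATUSES = {408, 429, 500, 502, 503, 504}


def is_transient(status: int) -> bool:
    return status in TRANSIENT_STATUSES


class RepeatGuard:
    """Remembers payloads a tool rejected with a non-transient client error."""

    def __init__(self):
        self._rejected = set()

    @staticmethod
    def _key(tool: str, payload: dict) -> str:
        # Canonical JSON so key order does not make identical payloads look different.
        return tool + ":" + json.dumps(payload, sort_keys=True)

    def record(self, tool: str, payload: dict, status: int) -> None:
        if 400 <= status < 500 and not is_transient(status):
            self._rejected.add(self._key(tool, payload))

    def allows(self, tool: str, payload: dict) -> bool:
        return self._key(tool, payload) not in self._rejected


guard = RepeatGuard()
first_attempt = {"name": "Sarah"}
guard.record("create_user", first_attempt, 400)  # tool said: missing field "email"

print(guard.allows("create_user", first_attempt))
print(guard.allows("create_user", {"name": "Sarah", "email": "sarah@example.com"}))
```

When `allows` returns false, the harness can hand the structured error body back to the model as the next observation instead of firing the same request again.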

The Agent That Could Not Say Wait

· 10 min read
Tian Pan
Software Engineer

Pick any production agent built in the last two years and inventory the things it can actually do on a given turn. The list is short: emit a tool call, return a final answer, or ask the user a clarifying question. That is the entire action vocabulary. Notice what is missing. There is no verb for "I would like more time before deciding." There is no verb for "I am uncertain enough that I want to pause and reconsider without committing." There is no verb for "I want to dwell on this for a moment before I do anything." The agent literally cannot say wait. The grammar does not contain the word.

This is not a polish problem. It is a structural one. The moment the agent's only outputs are actions, every internal state has to be expressed through an action. Hesitation becomes a redundant tool call. Doubt becomes a confident commitment. The team that designed only the action verbs has shipped an agent whose only language is doing, and then they wonder why it never seems to think.

The AI Standup Where Yesterday's Status Is a Lie

· 9 min read
Tian Pan
Software Engineer

The team meets at 10am. The first engineer reports what their agents finished overnight. Except the eval suite that kicked off at 7am hasn't returned, the PR the agent opened at 3am is waiting on a review from another agent whose queue depth is unknown, and the long-running refactor agent is on hour eleven of an estimated four-hour run with no signal that it's stuck and no signal that it's healthy. Yesterday's status is not "done" and not "in progress." Yesterday's status is unknowable from inside the room.

The standup was a synchronous ritual built for synchronous human work. Each person did a thing, finished it, slept on it, and reported it the next morning. The unit of work was a workday. The unit of reporting was a person. The cadence matched the substrate. None of that holds anymore. The unit of work is now an agent run that started before you went to bed and may finish during the meeting or three hours after. The unit of reporting is a fleet, not a person. And the cadence — a 9- to 15-minute round-robin at 10am sharp — is a frequency the substrate doesn't produce events on.

The ChatOps Bot That Mistook Silence for Consent

· 10 min read
Tian Pan
Software Engineer

Your deploy bot has been live for nine months. The dashboard says message volume is up and to the right. The thumbs-down rate is stable below two percent. The team that ships it interprets this as adoption. Then a staff engineer mentions, almost in passing, that everyone on his squad muted the channel back in February — they trust the bot's hourly digest about as much as they trust a vendor newsletter, and they got tired of the buzz. The bot is talking to an empty room and the metric calls that traction.

This is the failure mode most chatops teams hit and almost none of them measure. When a bot in Slack or Teams stops getting replies, the easy read is "the agent has reached a steady state — users don't need to argue with it anymore." The honest read is usually the opposite: users are routing around it, muting it, or learning that ignoring the prompt is cheaper than reading it. The engagement chart can't tell you which. The instrumentation has to be redesigned around the assumption that silence is the default and that interpreting it correctly is the whole job.

The Evidence Locker Your Agent Doesn't Keep

· 9 min read
Tian Pan
Software Engineer

Your traces log every token. They log every tool call, every retry, every retrieval latency, every model id. They look exhaustive. Then a regulator, a customer, or your own incident channel asks the one question that should be easy: what did the model actually see at the moment it decided? And you discover that your traces recorded the calls the model made but not the content it was looking at when it answered.

The retrieved chunks have rotated out of the vector store because the corpus was reindexed last Tuesday. The tool response was a streamed payload you stored only the final-state summary of, because storing the full stream tripled your bill. The system prompt was assembled at runtime from a feature flag that has since flipped twice, and your flag service does not retain historical values by timestamp. You have full observability over what happened — the call graph, the token counts, the latencies. You have nothing about what the model was answering against. That gap is the difference between a trace and a decision record, and most teams have not noticed they only built one of the two.
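To make the distinction concrete, here is a sketch of what a decision record could capture at the moment of the call: the resolved inputs themselves, plus a digest so later tampering or drift is detectable. The function name and field names are hypothetical, and a real system would have to weigh the storage cost the paragraph mentions.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_decision_record(system_prompt, retrieved_chunks, tool_responses, flag_values, decided_at):
    """Snapshot what the model was answering against, not just the call graph."""
    inputs = {
        "system_prompt": system_prompt,        # as assembled at runtime
        "retrieved_chunks": retrieved_chunks,  # full text, not just chunk ids
        "tool_responses": tool_responses,      # full payloads, not summaries
        "flag_values": flag_values,            # flag state at decision time
    }
    canonical = json.dumps(inputs, sort_keys=True, ensure_ascii=False)
    return {
        "decided_at": decided_at.isoformat(),
        "inputs": inputs,
        "inputs_sha256": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
    }


record = build_decision_record(
    system_prompt="You are a support assistant.",
    retrieved_chunks=["Refunds are processed within 5 business days."],
    tool_responses=[{"tool": "lookup_order", "body": {"status": "shipped"}}],
    flag_values={"new_prompt_header": True},
    decided_at=datetime(2025, 3, 4, 9, 0, 42, tzinfo=timezone.utc),
)
print(record["inputs_sha256"])
```

Because the record stores the resolved values, a later reindex or flag flip does not erase what the model saw.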

The Planner That Treated Every Tool as O(1)

· 9 min read
Tian Pan
Software Engineer

Your planner emits five tool calls. On paper, it reads like a clean solution: lookup_user, search_documents, call_external_api, spawn_sub_agent, request_human_approval. The trace looks elegant, the logic is sound, the agent will arrive at the right answer. In production, those five steps take 12 milliseconds, 800 milliseconds, 4 seconds, 2 minutes, and 6 hours respectively. The planner never noticed that its five-step plan spans more than six orders of magnitude in latency.

![](https://opengraph-image.blockeden.xyz/api/og-tianpan-co?title=The%20Planner%20That%20Treated%20Every%20Tool%20as%20O(1%29)

This is not a hallucination. The model picked the right tools. It picked them in a sensible order. What it could not do — what the tool schema gave it no way to do — was reason about the fact that the last step in its plan is qualitatively different from the first one. To the planner, a tool is a tool. Every node in the plan graph has weight one.
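A sketch of what giving the planner that information might look like, assuming a hypothetical `expected_latency_s` field on each tool's schema. The tool names and latencies come from the example above. The field, the threshold, and the helper functions are illustrative.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ToolSpec:
    name: str
    expected_latency_s: float  # hypothetical schema field, not part of any real tool API


TOOLS: Dict[str, ToolSpec] = {
    spec.name: spec
    for spec in [
        ToolSpec("lookup_user", 0.012),
        ToolSpec("search_documents", 0.8),
        ToolSpec("call_external_api", 4.0),
        ToolSpec("spawn_sub_agent", 120.0),
        ToolSpec("request_human_approval", 6 * 60 * 60),
    ]
}


def estimated_duration_s(plan: List[str]) -> float:
    """Sum of expected latencies for a sequential plan."""
    return sum(TOOLS[name].expected_latency_s for name in plan)


def slow_steps(plan: List[str], threshold_s: float = 60.0) -> List[str]:
    """Steps that should be surfaced to the planner as expensive."""
    return [name for name in plan if TOOLS[name].expected_latency_s >= threshold_s]


PLAN = [
    "lookup_user",
    "search_documents",
    "call_external_api",
    "spawn_sub_agent",
    "request_human_approval",
]
print(f"estimated duration: {estimated_duration_s(PLAN):.0f} s")
print("slow steps:", slow_steps(PLAN))
```

With weights attached, the plan graph stops being uniform: almost all of the estimated duration sits in the final step, which is exactly the fact the planner had no way to see.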

The Redaction Layer Your Agent Cannot Reason Through

· 9 min read
Tian Pan
Software Engineer

A privacy review approves your redaction layer. Names, emails, account numbers, phone numbers — all scrubbed before the prompt reaches the model. Your single-turn classifier still hits 94% accuracy. Six weeks later your multi-step agent starts giving confidently wrong answers to questions like "is the email Sarah used to log in the same as the one on her billing record?" and nobody can reproduce it in dev.

The redaction layer did exactly what infosec asked it to do. It also quietly destroyed the property your agent's reasoning depended on: that two mentions of the same entity in different turns refer to the same thing. The agent isn't hallucinating. It's reading a transcript where Sarah has become three different people and the "same" email address has become two distinct placeholders.
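One common way to preserve that property, sketched here under assumptions, is to make placeholders consistent within a conversation: the same value always maps to the same token, so the agent can still tell whether two mentions match. The `ConsistentRedactor` class, the placeholder format, and the simplified email pattern are all illustrative, and only emails are handled.

```python
import re

# Simplified pattern for illustration; real email matching is more involved.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class ConsistentRedactor:
    """Maps each distinct email to one stable placeholder for the life of the object."""

    def __init__(self):
        self._placeholders = {}

    def redact(self, text: str) -> str:
        def replace(match):
            value = match.group(0).lower()  # treat case variants as the same address
            if value not in self._placeholders:
                self._placeholders[value] = f"<EMAIL_{len(self._placeholders) + 1}>"
            return self._placeholders[value]

        return EMAIL_RE.sub(replace, text)


redactor = ConsistentRedactor()  # one instance per conversation
turn_one = redactor.redact("Login email: sarah@example.com")
turn_two = redactor.redact("Billing email: Sarah@Example.com")
turn_three = redactor.redact("Backup email: s.jones@example.org")
print(turn_one)
print(turn_two)
print(turn_three)
```

The raw address never reaches the model, but "is this the same email?" remains answerable from the placeholders alone. Whether that mapping itself is acceptable is a question for the same privacy review that approved the original layer.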