Skip to main content

320 posts tagged with "ai-agents"

View all tags

Your Scheduled Agent Has Four Clocks, and You Are Trusting the Wrong One

· 12 min read
Tian Pan
Software Engineer

A daily standup summary is scheduled for 09:00 UTC. The cron fires on time. A worker pod spins up two seconds later. The LLM call takes another forty seconds round-trip. The model writes its summary believing it is February of last year, because that is the last thing its training data confidently knew. The tool layer dispatches the Slack message against the wall clock at 09:00:42 UTC, on a date the model never mentions because nobody asked it to. The message lands in the right channel, with yesterday's standup notes summarized as "today's," and nobody notices for three weeks.

This is not a bug in any one component. It is a contract that nobody wrote between four different clocks that all believe they know what "now" is.

The Abstention Tax You Didn't Budget For

· 11 min read
Tian Pan
Software Engineer

You taught the agent to say "I don't know" when the context was thin and called it a safety win. The OpenAI bill went down. Everyone agreed it was the responsible move. Three months later your VP of Support is asking why headcount projections are off by 40% and nobody in the AI org has an answer, because the metric you tracked was abstention rate and the metric that moved was tickets-per-week — and nobody owned the line that summed them.

This is the abstention tax. It's not a model cost. It doesn't show up on the inference invoice. It shows up downstream, in the queue depth of the human team that catches every "I cannot answer," in the second model call that runs against the enriched context the human had to assemble, in the customer who churned during the wait. The model-only cost frame quietly hides it. And the org seam where the AI team owns abstention and the ops team owns the queue means nobody is incentivized to see it.

The Agent That Burned an Hour Re-Trying the Same 400 Error

· 10 min read
Tian Pan
Software Engineer

An agent calls a tool. The tool returns 400 Bad Request with a perfectly structured error body: {"error": "missing required field", "field": "email"}. The agent's reasoning chain quotes the error back word for word, says "I need to include the email field," and then emits the next tool call with the exact same payload as before. The loop's retry policy, written years ago for a world of flaky HTTPS connections, treats the 400 as transient and fires it again. Three times. Eight. Fifteen. An hour later, the agent has burned through context, money, and a rate-limit budget without ever sending a request the tool could accept.

The failure looks like a model problem. It is not. The model read the error correctly. The harness around it never gave the correction anywhere to land.

The Agent That Could Not Say Wait

· 10 min read
Tian Pan
Software Engineer

Pick any production agent built in the last two years and inventory the things it can actually do on a given turn. The list is short: emit a tool call, return a final answer, or ask the user a clarifying question. That is the entire action vocabulary. Notice what is missing. There is no verb for "I would like more time before deciding." There is no verb for "I am uncertain enough that I want to pause and reconsider without committing." There is no verb for "I want to dwell on this for a moment before I do anything." The agent literally cannot say wait. The grammar does not contain the word.

This is not a polish problem. It is a structural one. The moment the agent's only outputs are actions, every internal state has to be expressed through an action. Hesitation becomes a redundant tool call. Doubt becomes a confident commitment. The team that designed only the action verbs has shipped an agent whose only language is doing, and then they wonder why it never seems to think.

The AI Standup Where Yesterday's Status Is a Lie

· 9 min read
Tian Pan
Software Engineer

The team meets at 10am. The first engineer reports what their agents finished overnight. Except the eval suite that kicked off at 7am hasn't returned, the PR the agent opened at 3am is waiting on a review from another agent whose queue depth is unknown, and the long-running refactor agent is on hour eleven of an estimated four-hour run with no signal that it's stuck and no signal that it's healthy. Yesterday's status is not "done" and not "in progress." Yesterday's status is unknowable from inside the room.

The standup was a synchronous ritual built for synchronous human work. Each person did a thing, finished it, slept on it, and reported it the next morning. The unit of work was a workday. The unit of reporting was a person. The cadence matched the substrate. None of that holds anymore. The unit of work is now an agent run that started before you went to bed and may finish during the meeting or three hours after. The unit of reporting is a fleet, not a person. And the cadence — a 9- to 15-minute round-robin at 10am sharp — is a frequency the substrate doesn't produce events on.

The ChatOps Bot That Mistook Silence for Consent

· 10 min read
Tian Pan
Software Engineer

Your deploy bot has been live for nine months. The dashboard says message volume is up and to the right. The thumbs-down rate is stable below two percent. The team that ships it interprets this as adoption. Then a staff engineer mentions, almost in passing, that everyone on his squad muted the channel back in February — they trust the bot's hourly digest about as much as they trust a vendor newsletter, and they got tired of the buzz. The bot is talking to an empty room and the metric calls that traction.

This is the failure mode most chatops teams hit and almost none of them measure. When a bot in Slack or Teams stops getting replies, the easy read is "the agent has reached a steady state — users don't need to argue with it anymore." The honest read is usually the opposite: users are routing around it, muting it, or learning that ignoring the prompt is cheaper than reading it. The engagement chart can't tell you which. The instrumentation has to be redesigned around the assumption that silence is the default and that interpreting it correctly is the whole job.

The Evidence Locker Your Agent Doesn't Keep

· 9 min read
Tian Pan
Software Engineer

Your trace logs every token. They log every tool call, every retry, every retrieval latency, every model id. They look exhaustive. Then a regulator, a customer, or your own incident channel asks the one question that should be easy: what did the model actually see at the moment it decided? And you discover that your trace recorded the questions but not the answers the model was looking at when it answered.

The retrieved chunks have rotated out of the vector store because the corpus was reindexed last Tuesday. The tool response was a streamed payload you stored only the final-state summary of, because storing the full stream tripled your bill. The system prompt was assembled at runtime from a feature flag that has since flipped twice, and your flag service does not retain historical values by timestamp. You have full observability over what happened — the call graph, the token counts, the latencies. You have nothing about what the model was answering against. That gap is the difference between a trace and a decision record, and most teams have not noticed they only built one of the two.

The Planner That Treated Every Tool as O(1)

· 9 min read
Tian Pan
Software Engineer

Your planner emits five tool calls. On paper, it reads like a clean solution: lookup_user, search_documents, call_external_api, spawn_sub_agent, request_human_approval. The trace looks elegant, the logic is sound, the agent will arrive at the right answer. In production, those five steps take 12 milliseconds, 800 milliseconds, 4 seconds, 2 minutes, and 6 hours respectively. The planner never noticed that its five-step plan spans nine orders of magnitude in cost.

![](https://opengraph-image.blockeden.xyz/api/og-tianpan-co?title=The%20Planner%20That%20Treated%20Every%20Tool%20as%20O(1%29)

This is not a hallucination. The model picked the right tools. It picked them in a sensible order. What it could not do — what the tool schema gave it no way to do — was reason about the fact that the last step in its plan is qualitatively different from the first one. To the planner, a tool is a tool. Every node in the plan graph has weight one.

The Redaction Layer Your Agent Cannot Reason Through

· 9 min read
Tian Pan
Software Engineer

A privacy review approves your redaction layer. Names, emails, account numbers, phone numbers — all scrubbed before the prompt reaches the model. Your single-turn classifier still hits 94% accuracy. Six weeks later your multi-step agent starts giving confidently wrong answers to questions like "is the email Sarah used to log in the same as the one on her billing record?" and nobody can reproduce it in dev.

The redaction layer did exactly what infosec asked it to do. It also quietly destroyed the property your agent's reasoning depended on: that two mentions of the same entity in different turns refer to the same thing. The agent isn't hallucinating. It's reading a transcript where Sarah has become three different people and the "same" email address has become two distinct placeholders.

The Sandbox Your Agent Didn't Notice Was Real

· 10 min read
Tian Pan
Software Engineer

A team I know has a textbook staging setup. Read-only replicas of the production database. A mock Stripe account that pretends to charge cards. Synthetic users with fake email addresses on a domain nobody owns. The agent is asked to walk through an "account delinquent" escalation flow in staging, end to end, as part of a release rehearsal. The trace looks clean. The agent does what it is supposed to do.

Three minutes later, a real customer — a paying one, who churned six months ago and was still in a dormant export the developer had used to seed a test fixture — replies to a politely-worded payment-overdue email. The "send_email" tool, registered next to a dozen other tools that all terminate in mocks, was wired to the production Mailgun key. The developer who set it up two sprints earlier had been iterating fast on email templates and the sandbox tier capped them at five emails an hour, which broke the inner loop, so they swapped in the real key "just for the afternoon" and forgot. Nobody re-checked. The agent had no way to know.

The Typo Your Agent Learned to Honor

· 10 min read
Tian Pan
Software Engineer

An insurance carrier fine-tuned a support model on a year of chat transcripts. Within a week of launch, a compliance reviewer flagged something odd: the bot kept writing "deductable" instead of "deductible." Not occasionally — consistently, in roughly the same one-in-eight messages where the word appeared. The model had not invented the misspelling. It had inherited it. A handful of tier-1 reps had been typing it that way for two years, and the corpus reflected what they typed, not what the dictionary said.

This is the unsettling thing about supervised fine-tuning on operational data: the model is not learning your domain. It is learning your corpus. Those two things overlap, but they are not the same, and the gap is where every preventable behavioral defect lives. Frequency in your training data is not a signal of correctness. It is a signal of what your team happened to do enough times for the model to mimic it.

The misspelling is the easy case to spot. The hard cases are the ones nobody bothered to write down as rules, because everyone assumed the model would learn the "professional" version of the work rather than the actual work as performed.

The Verifier Loop That Couldn't Converge

· 11 min read
Tian Pan
Software Engineer

The most expensive bug in an agent system is the one with no error message. Worker proposes a draft. Verifier rejects it with a paragraph of feedback. Worker revises. Verifier rejects again. The loop keeps spinning, the trace keeps growing, the bill keeps climbing, and from the outside the system looks like it is working — diligently, in fact, because both models are doing their assigned job. What nobody priced in is that the verifier's acceptance criteria are not fixed across calls. The target the worker is chasing is moving, and the loop has no convergence guarantee.

You shipped "iterate until satisfied," and you shipped a search through a space whose extrema may not exist.