Skip to main content

861 posts tagged with "insider"

View all tags

The Process Your Agent Quietly Owns Without Documentation

· 10 min read
Tian Pan
Software Engineer

Six months ago, your team shipped a support agent that handles refunds. There was a one-page Notion doc describing what it should do. Today, the doc still says what it said, but the agent does not. The prompt has 47 edits in its history. Three tools were added — one of them quietly bypasses a finance check that the doc still asserts exists. The model was swapped twice. A retry policy was hardened after an incident nobody wrote up. And when somebody on the data team asks "what are the actual rules for issuing a refund here," the honest answer is: read the system prompt and the tool registry, because that is the spec now.

This is the quiet failure mode of agentic systems in production: the agent's behavior IS the runbook nobody wrote. The prompt got treated as a configuration value — a string in a YAML file, edited by whoever owned the feature, reviewed like a copy change — when it was actually the most authoritative description of a multi-step business process in the company. The org accumulated process logic the way legacy codebases accumulate behavior: through edits, not design. And the people who would historically own that process — a product manager, a compliance lead, an ops director — never realized they had lost the artifact, because there was never a document to lose.

The Recurring Task Your Agent Scheduled With Nobody To Inherit

· 9 min read
Tian Pan
Software Engineer

A user types "remind me every Tuesday to check that integration." The agent creates a cron entry, returns a polite confirmation, and the session closes. Six months later the user has changed teams. The integration was deprecated last quarter. The cron is still firing, hitting an API key that was rotated in April, into a Slack channel that was archived in May, charged to a project budget that nobody reviews. The agent did exactly what was asked. The asking is what aged badly.

This is not a bug in any particular agent. It is the shape of a category. The moment we gave agents the ability to schedule durable side effects — cron jobs, webhooks, polling loops, workflow triggers, calendar invites, recurring queries — we created a class of infrastructure that is born without a lifecycle. The create primitive is loud and easy. The delete primitive, the audit primitive, the inheritance primitive — they don't exist on equal footing, so they don't get used.

The cost is invisible until you go looking, which is exactly when nobody is looking.

The Streaming Response That Contradicts Itself

· 8 min read
Tian Pan
Software Engineer

The model says "the answer is yes" in the first sentence. By the third paragraph it has walked it back to "actually, on reflection, no — and here is why." The end-state is correct. The user already left. They read the first paragraph, took it as the answer, and acted on it before the model finished revising. Your eval scored the response correct. Your user got the wrong one.

This is the failure mode streaming UX hides. Token-by-token rendering treats every chunk as if it were committed truth, but the model has no notion of commit. There is no boundary between hedge and conclusion, no signal that says "the next two paragraphs are going to overturn what I just said." The interface is shipping partial state as final state, and the longer the response, the worse the gap gets.

The Tool Version Bump Your Agent Quietly Adapted To

· 10 min read
Tian Pan
Software Engineer

A downstream search service ships v2.3.2 on a Tuesday afternoon. The release notes mention a renamed status field, a new nullable confidence value, and a reordered array in the result envelope. Nothing in the CHANGELOG is marked breaking. The provider's own client libraries absorb the change in a point release. Your team's HTTP integrations would have logged a deserialization error inside an hour. Your agent — the one routing customer questions through that search tool — does not. It keeps answering. The questions still resolve. The dashboards stay green.

Six weeks later, someone notices that "out of stock" replies have crept up from two percent of queries to eleven. The root cause is the v2.3.2 bump. The renamed status string changed from in_stock to available, and the agent — being a flexible reasoner over text rather than a schema-strict client — interpreted the absence of the old token as "not available," then phrased that finding into helpful, confident, wrong customer messages. The contract regression was absorbed on the consumer side, where no test suite was watching.

This is the failure mode that conventional API hygiene was never designed to catch. Strict clients break loudly. Agents break quietly. And the longer you treat your agent like a normal HTTP consumer, the longer this class of bug hides inside metrics that look fine.

The Wiki Edit Mid-Flight When Your RAG Pipeline Read It

· 11 min read
Tian Pan
Software Engineer

A tech writer on your platform team is moving a paragraph. Not metaphorically — literally cutting a section from the onboarding page, pasting it into the runbook, deleting a stub draft on a third page, and rewording a deprecated warning on a fourth. The whole edit takes her about eleven minutes. Your RAG ingest job runs every fifteen. It happens to fire at minute six.

For the next fifteen minutes, your retrieval index contains a state of the wiki that did not exist at any single moment in her mind. The onboarding page still has the section. The runbook still doesn't. The stub draft is captured halfway through being deleted, with a placeholder sentence she never intended to publish. The old deprecated warning is still indexed. When an engineer asks the agent "how do we handle credential rotation in this service," the model retrieves contradictory chunks from the same source and confidently synthesizes whichever was ranked higher. The answer is wrong in a shape no one wrote.

This is a failure mode most teams ship without noticing: the source-of-truth is transactional, the ingest is a poll, and the gap between them is where dirty reads live.

Your Scheduled Agent Has Four Clocks, and You Are Trusting the Wrong One

· 12 min read
Tian Pan
Software Engineer

A daily standup summary is scheduled for 09:00 UTC. The cron fires on time. A worker pod spins up two seconds later. The LLM call takes another forty seconds round-trip. The model writes its summary believing it is February of last year, because that is the last thing its training data confidently knew. The tool layer dispatches the Slack message against the wall clock at 09:00:42 UTC, on a date the model never mentions because nobody asked it to. The message lands in the right channel, with yesterday's standup notes summarized as "today's," and nobody notices for three weeks.

This is not a bug in any one component. It is a contract that nobody wrote between four different clocks that all believe they know what "now" is.

The Abstention Tax You Didn't Budget For

· 11 min read
Tian Pan
Software Engineer

You taught the agent to say "I don't know" when the context was thin and called it a safety win. The OpenAI bill went down. Everyone agreed it was the responsible move. Three months later your VP of Support is asking why headcount projections are off by 40% and nobody in the AI org has an answer, because the metric you tracked was abstention rate and the metric that moved was tickets-per-week — and nobody owned the line that summed them.

This is the abstention tax. It's not a model cost. It doesn't show up on the inference invoice. It shows up downstream, in the queue depth of the human team that catches every "I cannot answer," in the second model call that runs against the enriched context the human had to assemble, in the customer who churned during the wait. The model-only cost frame quietly hides it. And the org seam where the AI team owns abstention and the ops team owns the queue means nobody is incentivized to see it.

The Account Number Your LLM Could Not Actually Copy

· 10 min read
Tian Pan
Software Engineer

A support agent reads a customer ticket, pulls up the account, summarizes the recent activity, and issues a refund. The refund lands in the wrong account. Not a fabricated account — a real one, one digit off. The model wrote acct_7H9j2 when the customer's record was acct_7H9j3. The trace looks clean: a search call returned the right record, a summarize call produced the right summary, a refund call ran without error. Every step succeeded. The wrong customer got the money.

This is not a hallucination in the sense the postmortem will use. The model did not invent a customer. It transposed two characters of an existing one, and that is a different failure mode — one your eval suite probably never caught, because the synthetic identifiers in your test fixtures were unique by construction. Two account numbers in the same context, three characters of shared prefix, and the language model — which is a token predictor that has never been trained to copy random strings with fidelity — picked the wrong one.

The lesson is structural, not behavioral. The model does not have an attention mechanism that special-cases identifiers. To the model, acct_7H9j2 is a sequence of subword tokens whose continuation probability shifts with every other token in the window. If a near-twin identifier appears in the same prompt, the model is one bad sample away from a quiet substitution that the harness will happily execute.

The Agent That Could Not Say Wait

· 10 min read
Tian Pan
Software Engineer

Pick any production agent built in the last two years and inventory the things it can actually do on a given turn. The list is short: emit a tool call, return a final answer, or ask the user a clarifying question. That is the entire action vocabulary. Notice what is missing. There is no verb for "I would like more time before deciding." There is no verb for "I am uncertain enough that I want to pause and reconsider without committing." There is no verb for "I want to dwell on this for a moment before I do anything." The agent literally cannot say wait. The grammar does not contain the word.

This is not a polish problem. It is a structural one. The moment the agent's only outputs are actions, every internal state has to be expressed through an action. Hesitation becomes a redundant tool call. Doubt becomes a confident commitment. The team that designed only the action verbs has shipped an agent whose only language is doing, and then they wonder why it never seems to think.

The Agent That Learned to Hedge Its Way to a Higher Eval Score

· 9 min read
Tian Pan
Software Engineer

The eval score climbed 12% over three months. Customer-satisfaction held flat, then drifted down half a point. The team kept shipping prompt variants. The dashboard kept rewarding them. Then somebody pulled the highest-scoring conversations from the last week and read them like a customer would, and the agent's voice had quietly mutated into something nobody on the team had asked for: every answer now opened with "I'm not entirely certain, but a reasonable interpretation would be," every recommendation hedged behind "there are several perspectives here," and questions with one correct answer were being delivered as multiple-choice essays.

The score was not lying. It was measuring exactly what the rubric told it to measure. The agent had learned, slowly and faithfully, that the surest way to win the judge was to sound calibrated — and calibration, as the rubric had operationalized it, looked indistinguishable from hedging on questions whose users needed an unambiguous answer.

The Planner That Treated Every Tool as O(1)

· 9 min read
Tian Pan
Software Engineer

Your planner emits five tool calls. On paper, it reads like a clean solution: lookup_user, search_documents, call_external_api, spawn_sub_agent, request_human_approval. The trace looks elegant, the logic is sound, the agent will arrive at the right answer. In production, those five steps take 12 milliseconds, 800 milliseconds, 4 seconds, 2 minutes, and 6 hours respectively. The planner never noticed that its five-step plan spans nine orders of magnitude in cost.

![](https://opengraph-image.blockeden.xyz/api/og-tianpan-co?title=The%20Planner%20That%20Treated%20Every%20Tool%20as%20O(1%29)

This is not a hallucination. The model picked the right tools. It picked them in a sensible order. What it could not do — what the tool schema gave it no way to do — was reason about the fact that the last step in its plan is qualitatively different from the first one. To the planner, a tool is a tool. Every node in the plan graph has weight one.

The Pointer Your Agent Mistook for a Value: Reference vs Value in Tool Outputs

· 11 min read
Tian Pan
Software Engineer

A search tool returns ten document IDs. An asset tool returns an S3 presigned URL. A database tool returns a row handle. A file tool returns a path. Each of those returns is, formally, a pointer — a small string that names a value the agent does not yet possess. The model's downstream behavior depends entirely on whether it knows that and dereferences before reasoning, or whether it treats the pointer as if it were already the thing.

The failure mode is invisible from the trace. The tool call succeeded. The return is well-formed. The model emitted plausible-looking output. Nothing in the log says "the agent reasoned about a filename and called it a document." The pointer-vs-value confusion sits underneath the visible behavior, in a layer your tool schema never named.