907 posts tagged with "insider"

The Streaming Response That Contradicts Itself

· 8 min read
Tian Pan
Software Engineer

The model says "the answer is yes" in the first sentence. By the third paragraph it has walked it back to "actually, on reflection, no — and here is why." The end-state is correct. The user already left. They read the first paragraph, took it as the answer, and acted on it before the model finished revising. Your eval scored the response correct. Your user got the wrong one.

This is the failure mode streaming UX hides. Token-by-token rendering treats every chunk as if it were committed truth, but the model has no notion of commit. There is no boundary between hedge and conclusion, no signal that says "the next two paragraphs are going to overturn what I just said." The interface is shipping partial state as final state, and the longer the response, the worse the gap gets.

The Tool Version Bump Your Agent Quietly Adapted To

· 10 min read
Tian Pan
Software Engineer

A downstream search service ships v2.3.2 on a Tuesday afternoon. The release notes mention a renamed status field, a new nullable confidence value, and a reordered array in the result envelope. Nothing in the CHANGELOG is marked breaking. The provider's own client libraries absorb the change in a point release. Your team's HTTP integrations would have logged a deserialization error inside an hour. Your agent — the one routing customer questions through that search tool — does not. It keeps answering. The questions still resolve. The dashboards stay green.

Six weeks later, someone notices that "out of stock" replies have crept up from two percent of queries to eleven. The root cause is the v2.3.2 bump. The renamed status string changed from in_stock to available, and the agent — being a flexible reasoner over text rather than a schema-strict client — interpreted the absence of the old token as "not available," then phrased that finding into helpful, confident, wrong customer messages. The contract regression was absorbed on the consumer side, where no test suite was watching.

This is the failure mode that conventional API hygiene was never designed to catch. Strict clients break loudly. Agents break quietly. And the longer you treat your agent like a normal HTTP consumer, the longer this class of bug hides inside metrics that look fine.
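One way to make the agent fail as loudly as a strict client is to validate tool output against a pinned contract before the model sees it. The sketch below is a minimal illustration, not the search provider's API: the `sku` field, the `out_of_stock` value, and the `ToolContractError` and `parse_search_result` names are assumptions made for the example. Only `in_stock` and `available` come from the scenario above.

```python
from dataclasses import dataclass
from typing import Any, Mapping

# Hypothetical contract: the status values this integration was built against.
KNOWN_STATUSES = frozenset({"in_stock", "out_of_stock"})


class ToolContractError(ValueError):
    """Raised when a tool response drifts from the pinned contract."""


@dataclass(frozen=True)
class SearchResult:
    sku: str
    status: str


def parse_search_result(payload: Mapping[str, Any]) -> SearchResult:
    """Validate a raw tool payload before it is handed to the agent."""
    missing = {"sku", "status"} - set(payload)
    if missing:
        raise ToolContractError(f"missing fields: {sorted(missing)}")
    status = payload["status"]
    if status not in KNOWN_STATUSES:
        # An unknown enum value is treated as a contract break, not as text
        # for the model to interpret.
        raise ToolContractError(f"unknown status value: {status!r}")
    return SearchResult(sku=str(payload["sku"]), status=status)
```

With a gate like this, the renamed `available` value raises on the first call after the bump instead of surfacing weeks later as a drift in reply mix.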

The Wiki Edit Mid-Flight When Your RAG Pipeline Read It

· 11 min read
Tian Pan
Software Engineer

A tech writer on your platform team is moving a paragraph. Not metaphorically — literally cutting a section from the onboarding page, pasting it into the runbook, deleting a stub draft on a third page, and rewording a deprecated warning on a fourth. The whole edit takes her about eleven minutes. Your RAG ingest job runs every fifteen. It happens to fire at minute six.

For the next fifteen minutes, your retrieval index contains a state of the wiki that did not exist at any single moment in her mind. The onboarding page still has the section. The runbook still doesn't. The stub draft is captured halfway through being deleted, with a placeholder sentence she never intended to publish. The old deprecated warning is still indexed. When an engineer asks the agent "how do we handle credential rotation in this service," the model retrieves contradictory chunks from the same source and confidently synthesizes whichever was ranked higher. The answer is wrong in a shape no one wrote.

This is a failure mode most teams ship without noticing: the source-of-truth is transactional, the ingest is a poll, and the gap between them is where dirty reads live.
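One possible mitigation is a quiescence check: the poller defers the whole batch if any page was edited recently, on the theory that a multi-page edit may still be in flight. The sketch below assumes the wiki exposes a timezone-aware last-modified timestamp per page. The `Page` type, the `pages_safe_to_ingest` function, and the 20-minute default are hypothetical choices for illustration, not a recommendation from the post.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass(frozen=True)
class Page:
    page_id: str
    last_modified: datetime  # timezone-aware, as reported by the wiki (assumed)


def pages_safe_to_ingest(
    pages: Iterable[Page],
    now: datetime,
    quiet_period: timedelta = timedelta(minutes=20),  # hypothetical default
) -> List[Page]:
    """Return the batch only if no page changed within `quiet_period`."""
    batch = list(pages)
    cutoff = now - quiet_period
    if any(p.last_modified > cutoff for p in batch):
        # Someone may be mid-edit across several pages; skip this poll
        # and keep serving the previous, internally consistent index.
        return []
    return batch
```

Deferring the whole batch rather than just the hot page is deliberate here: the inconsistency in the scenario spans four pages, so a per-page check would still index a state nobody wrote.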

Your Scheduled Agent Has Four Clocks, and You Are Trusting the Wrong One

· 12 min read
Tian Pan
Software Engineer

A daily standup summary is scheduled for 09:00 UTC. The cron fires on time. A worker pod spins up two seconds later. The LLM call takes another forty seconds round-trip. The model writes its summary believing it is February of last year, because that is the last thing its training data confidently knew. The tool layer dispatches the Slack message against the wall clock at 09:00:42 UTC, on a date the model never mentions because nobody asked it to. The message lands in the right channel, with yesterday's standup notes summarized as "today's," and nobody notices for three weeks.

This is not a bug in any one component. It is a contract that nobody wrote between four different clocks that all believe they know what "now" is.
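A small piece of that missing contract can be written down as data: the run carries its own timestamps and renders them into the prompt, so the model does not have to guess the date from its training data. This is a sketch under assumptions. The `RunClock` name, its fields, and the header wording are hypothetical, and the example date is arbitrary.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class RunClock:
    scheduled_for: datetime  # when the cron entry was meant to fire
    started_at: datetime     # when the worker actually began the run

    def prompt_header(self) -> str:
        """Render the run's timestamps as explicit context for the model."""
        return (
            f"Current time (UTC): {self.started_at.isoformat()}\n"
            f"This run was scheduled for: {self.scheduled_for.isoformat()}\n"
            "Treat these timestamps as authoritative for 'today' and 'yesterday'."
        )


# Example: cron at 09:00:00 UTC, worker up two seconds later.
clock = RunClock(
    scheduled_for=datetime(2025, 3, 4, 9, 0, 0, tzinfo=timezone.utc),
    started_at=datetime(2025, 3, 4, 9, 0, 2, tzinfo=timezone.utc),
)
header = clock.prompt_header()
```

Keeping both timestamps makes it possible to detect late or replayed runs as well, since a large gap between them is visible to the harness and to the model.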

The Abstention Tax You Didn't Budget For

· 11 min read
Tian Pan
Software Engineer

You taught the agent to say "I don't know" when the context was thin and called it a safety win. The OpenAI bill went down. Everyone agreed it was the responsible move. Three months later your VP of Support is asking why headcount projections are off by 40% and nobody in the AI org has an answer, because the metric you tracked was abstention rate and the metric that moved was tickets-per-week — and nobody owned the line that summed them.

This is the abstention tax. It's not a model cost. It doesn't show up on the inference invoice. It shows up downstream, in the queue depth of the human team that catches every "I cannot answer," in the second model call that runs against the enriched context the human had to assemble, in the customer who churned during the wait. The model-only cost frame quietly hides it. And the org seam where the AI team owns abstention and the ops team owns the queue means nobody is incentivized to see it.

The Account Number Your LLM Could Not Actually Copy

· 10 min read
Tian Pan
Software Engineer

A support agent reads a customer ticket, pulls up the account, summarizes the recent activity, and issues a refund. The refund lands in the wrong account. Not a fabricated account — a real one, one digit off. The model wrote acct_7H9j2 when the customer's record was acct_7H9j3. The trace looks clean: a search call returned the right record, a summarize call produced the right summary, a refund call ran without error. Every step succeeded. The wrong customer got the money.

This is not a hallucination in the sense the postmortem will use. The model did not invent a customer. It substituted one character of an existing identifier, and that is a different failure mode, one your eval suite probably never caught, because the synthetic identifiers in your test fixtures were unique by construction. Two account numbers in the same context, identical up to the final character, and the language model (a token predictor that has never been trained to copy random strings with fidelity) picked the wrong one.

The lesson is structural, not behavioral. The model does not have an attention mechanism that special-cases identifiers. To the model, acct_7H9j2 is a sequence of subword tokens whose continuation probability shifts with every other token in the window. If a near-twin identifier appears in the same prompt, the model is one bad sample away from a quiet substitution that the harness will happily execute.
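A structural fix keeps the identifier out of the model's hands: the harness pins the ID from the lookup result and refuses any tool call whose ID differs. The sketch below is illustrative only. `guard_refund_call`, `IdentifierMismatch`, and the `account_id` argument name are hypothetical, and the identifiers are the ones from the example above.

```python
from typing import Any, Dict


class IdentifierMismatch(ValueError):
    """Raised when a model-emitted identifier differs from the pinned one."""


def guard_refund_call(
    tool_args: Dict[str, Any], pinned_account_id: str
) -> Dict[str, Any]:
    """Allow the refund only if the emitted ID matches the looked-up record."""
    emitted = tool_args.get("account_id")  # hypothetical argument name
    if emitted != pinned_account_id:
        raise IdentifierMismatch(
            f"model emitted {emitted!r}, lookup returned {pinned_account_id!r}"
        )
    return tool_args


# The harness, not the model, remembers which record the search returned.
pinned = "acct_7H9j3"
ok_args = guard_refund_call({"account_id": "acct_7H9j3", "amount": 25}, pinned)
```

The comparison is exact string equality on purpose. Any fuzzy matching would reintroduce the near-twin problem the guard exists to catch.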

The Agent That Could Not Say Wait

· 10 min read
Tian Pan
Software Engineer

Pick any production agent built in the last two years and inventory the things it can actually do on a given turn. The list is short: emit a tool call, return a final answer, or ask the user a clarifying question. That is the entire action vocabulary. Notice what is missing. There is no verb for "I would like more time before deciding." There is no verb for "I am uncertain enough that I want to pause and reconsider without committing." There is no verb for "I want to dwell on this for a moment before I do anything." The agent literally cannot say wait. The grammar does not contain the word.

This is not a polish problem. It is a structural one. The moment the agent's only outputs are actions, every internal state has to be expressed through an action. Hesitation becomes a redundant tool call. Doubt becomes a confident commitment. The team that designed only the action verbs has shipped an agent whose only language is doing, and then they wonder why it never seems to think.

The Agent That Learned to Hedge Its Way to a Higher Eval Score

· 9 min read
Tian Pan
Software Engineer

The eval score climbed 12% over three months. Customer satisfaction held flat, then drifted down half a point. The team kept shipping prompt variants. The dashboard kept rewarding them. Then somebody pulled the highest-scoring conversations from the last week and read them like a customer would, and the agent's voice had quietly mutated into something nobody on the team had asked for: every answer now opened with "I'm not entirely certain, but a reasonable interpretation would be," every recommendation hedged behind "there are several perspectives here," and questions with one correct answer were being delivered as multiple-choice essays.

The score was not lying. It was measuring exactly what the rubric told it to measure. The agent had learned, slowly and faithfully, that the surest way to win the judge was to sound calibrated — and calibration, as the rubric had operationalized it, looked indistinguishable from hedging on questions whose users needed an unambiguous answer.

The Planner That Treated Every Tool as O(1)

· 9 min read
Tian Pan
Software Engineer

Your planner emits five tool calls. On paper, it reads like a clean solution: lookup_user, search_documents, call_external_api, spawn_sub_agent, request_human_approval. The trace looks elegant, the logic is sound, the agent will arrive at the right answer. In production, those five steps take 12 milliseconds, 800 milliseconds, 4 seconds, 2 minutes, and 6 hours respectively. The planner never noticed that its five-step plan spans more than six orders of magnitude in cost.

This is not a hallucination. The model picked the right tools. It picked them in a sensible order. What it could not do — what the tool schema gave it no way to do — was reason about the fact that the last step in its plan is qualitatively different from the first one. To the planner, a tool is a tool. Every node in the plan graph has weight one.
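To give the planner something to reason with, the tool registry can carry a typical latency per tool. The sketch below uses the five tool names and timings quoted above. The `TYPICAL_LATENCY_S` table and both helper functions are hypothetical, and a real registry would more likely hold measured percentiles than single numbers.

```python
import math
from typing import Sequence

# Typical latency per tool in seconds, using the figures from the text.
TYPICAL_LATENCY_S = {
    "lookup_user": 0.012,
    "search_documents": 0.8,
    "call_external_api": 4.0,
    "spawn_sub_agent": 120.0,
    "request_human_approval": 6 * 3600.0,
}


def plan_latency_s(plan: Sequence[str]) -> float:
    """Total expected latency if the steps run sequentially."""
    return sum(TYPICAL_LATENCY_S[tool] for tool in plan)


def latency_spread_orders(plan: Sequence[str]) -> float:
    """Orders of magnitude between the fastest and slowest step."""
    latencies = [TYPICAL_LATENCY_S[tool] for tool in plan]
    return math.log10(max(latencies) / min(latencies))


plan = [
    "lookup_user",
    "search_documents",
    "call_external_api",
    "spawn_sub_agent",
    "request_human_approval",
]
total = plan_latency_s(plan)           # dominated by the human approval step
spread = latency_spread_orders(plan)   # roughly 6.3 for these figures
```

Even this crude weighting shows that one node accounts for almost all of the plan's wall-clock time, which is the signal the unweighted plan graph hides.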

The Pointer Your Agent Mistook for a Value: Reference vs Value in Tool Outputs

· 11 min read
Tian Pan
Software Engineer

A search tool returns ten document IDs. An asset tool returns an S3 presigned URL. A database tool returns a row handle. A file tool returns a path. Each of those returns is, formally, a pointer — a small string that names a value the agent does not yet possess. The model's downstream behavior depends entirely on whether it knows that and dereferences before reasoning, or whether it treats the pointer as if it were already the thing.

The failure mode is invisible from the trace. The tool call succeeded. The return is well-formed. The model emitted plausible-looking output. Nothing in the log says "the agent reasoned about a filename and called it a document." The pointer-vs-value confusion sits underneath the visible behavior, in a layer your tool schema never named.
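Naming that layer can be as simple as tagging every tool return as either a reference or a value, and forcing a dereference step before content reaches the model. The following is a minimal sketch. `Ref`, `Value`, `dereference`, and the `fetch` callback are hypothetical names, and the fake store stands in for whatever actually resolves an ID, URL, or path.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass(frozen=True)
class Ref:
    kind: str      # e.g. "document_id", "s3_url", "row_handle", "path"
    locator: str   # the small string that names the value


@dataclass(frozen=True)
class Value:
    content: str   # the actual material the model may reason over


ToolOutput = Union[Ref, Value]


def dereference(output: ToolOutput, fetch: Callable[[Ref], str]) -> Value:
    """Resolve a reference to its content; pass values through unchanged."""
    if isinstance(output, Value):
        return output
    return Value(content=fetch(output))


# Stand-in resolver for the example.
_fake_store = {"doc-17": "Rotate credentials every 90 days."}
resolved = dereference(Ref("document_id", "doc-17"), lambda r: _fake_store[r.locator])
```

Because the two cases are distinct types, a log line or type checker can show when a `Ref` was passed where a `Value` was expected, which is exactly what the untyped trace cannot say.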

The PR-Bot That Never Sleeps: When Your Reviewers Become the Rate Limiter

· 11 min read
Tian Pan
Software Engineer

For two decades the bottleneck in software engineering was writing code. We optimized IDEs, autocompletion, refactoring tools, and frameworks to make typing cheaper. We won. Now the bottleneck has moved one step downstream: writing is cheap, and reading is expensive. The PR-bot can spin up ten implementation attempts in parallel and open ten pull requests against your repo before you finish your morning coffee. Your reviewers cannot.

The rate limiter for AI-assisted software delivery is no longer the model's tokens per second. It is the number of human eyes you can put on a diff per day. And when those eyes get overwhelmed, you do not get a graceful degradation — you get rubber stamps. Code merges with LGTM 🚀 on top of code that nobody actually read. A senior engineer approves an AI-written patch that another AI tool already reviewed, and three weeks later a data-inconsistency bug eats forty hours of someone's life. Surface correctness is not systemic correctness, and a green pipeline is not understanding.

The PR Description Your Coding Agent Cannot Write

· 10 min read
Tian Pan
Software Engineer

Your coding agent finished the task. The diff is small, the tests are green, the lint is clean, and the PR body says, in its entirety, "Fixes the bug in module X." A reviewer six time zones away opens the page, reads the diff in isolation, sees nothing wrong with it, and approves a technically correct change that solves the wrong problem. The change ships. Two days later a customer asks why the workaround they had been relying on stopped working, and you discover that the bug your agent fixed was not the bug the ticket was about.

The code was fine. The reviewer was conscientious. The agent did exactly what it was asked. The artifact between them — the pull request — was empty of everything that would have caught the mistake.