907 posts tagged with "insider"

A Prompt Diff Hides Its Own Blast Radius

· 9 min read
Tian Pan
Software Engineer

A pull request lands in your review queue. The diff shows three words changed inside a system prompt: "Output strictly valid JSON" became "Always respond using clean, parseable JSON." It reads like a copy edit. You skim it, the CI checkmark is green, and you click approve. Total time: ninety seconds.

Six hours later, the downstream parser starts rejecting responses with trailing commas and missing fields. The structured-output error rate climbs from near-zero to double digits, and a revenue-generating workflow stalls. Nothing in the diff predicted this. Nothing in the diff could have predicted this, because the diff measured the wrong thing.

This is the central problem with reviewing prompt changes: the size of a prompt diff tells you nothing about the size of its effect. A three-word change and a three-paragraph rewrite are both just text, and a text diff renders them with the same visual weight as any other edit. But a prompt is not text that describes behavior — it is text that causes behavior, and the causal blast radius of an edit is invisible in the artifact you are reviewing.
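
One way to make the blast radius visible is to compare behavior instead of text: run the old and new prompt over the same inputs and compare how often the outputs parse. The sketch below assumes you have already collected raw outputs for each prompt version; the function names and sample strings are hypothetical, and JSON validity is only one of the behaviors you might want to track.

```python
import json

def json_error_rate(outputs):
    """Fraction of outputs that fail to parse as JSON."""
    if not outputs:
        return 0.0
    failures = 0
    for text in outputs:
        try:
            json.loads(text)
        except json.JSONDecodeError:
            failures += 1
    return failures / len(outputs)

def behavioral_diff(old_outputs, new_outputs):
    """Change in JSON error rate going from the old prompt to the new one."""
    return json_error_rate(new_outputs) - json_error_rate(old_outputs)

# Hypothetical samples: the new prompt produced one trailing comma.
old_outputs = ['{"a": 1}', '{"b": [1, 2]}']
new_outputs = ['{"a": 1,}', '{"b": [1, 2]}']
print(behavioral_diff(old_outputs, new_outputs))  # 0.5
```

A check like this could run in CI next to the text diff, so the reviewer sees a behavioral delta and not just three changed words.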

Re-Ask Rate: The Failure Signal Your Eval Pipeline Never Extracts

· 10 min read
Tian Pan
Software Engineer

Open any production chat transcript long enough and you will find a user who asks the same question three times. The phrasing changes a little each turn — pronouns swap to nouns, a clarifier gets bolted on, the polite hedge falls away by the third try — but the underlying request is identical. They are not asking three questions. They are asking the same question, and the agent is failing to answer it, and the user is hoping that this time the words will land differently.

The transcript-level signal here is so loud it is almost obscene. The user has told you, with their own keystrokes, that the previous response did not help. They did not need to fill out a survey. They did not need to leave a thumbs-down. They told you by typing the question again. And in most production AI stacks, this signal is silently discarded by an eval pipeline that scores each turn in isolation and a satisfaction survey that only fires at session end — by which point the user who re-asked three times has usually already churned and will never grade anything.
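
As a rough illustration, a re-ask rate can be pulled out of a transcript by comparing each user turn with the one before it. The sketch below uses word overlap (Jaccard similarity) as a stand-in for "same question"; the 0.5 threshold and the function names are assumptions, and a lexical measure like this will miss re-asks that are heavily reworded.

```python
import re

def _tokens(text):
    """Lowercased word set for a user turn."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def is_reask(prev_turn, curr_turn, threshold=0.5):
    """True when two consecutive user turns share most of their words."""
    a, b = _tokens(prev_turn), _tokens(curr_turn)
    if not a or not b:
        return False
    return len(a & b) / len(a | b) >= threshold

def reask_rate(user_turns, threshold=0.5):
    """Fraction of follow-up user turns that look like a repeat of the previous one."""
    if len(user_turns) < 2:
        return 0.0
    pairs = zip(user_turns, user_turns[1:])
    reasks = sum(is_reask(prev, curr, threshold) for prev, curr in pairs)
    return reasks / (len(user_turns) - 1)

# Hypothetical session: the second turn repeats the first, the third does not.
turns = [
    "how do I reset my password",
    "how do I reset the account password",
    "thanks, that worked",
]
print(reask_rate(turns))  # 0.5
```

Because the score is computed across turns, it can sit alongside per-turn evals without waiting for an end-of-session survey.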

Shadow Replay Punishes the Model That Would Have Changed the Conversation

· 9 min read
Tian Pan
Software Engineer

A team I worked with last quarter shipped a new model into shadow replay and watched the win rate sit at 47 percent against the incumbent. Same prompts, same retrieval, a model the vendor's own evals had ranked clearly higher. The shadow harness took last week's production traffic, pumped it through the candidate, fed both responses to an LLM judge, and declared the upgrade roughly a coin flip. The team almost reverted on the spot.

The problem was not the model. The problem was that every user message in the replay had already been conditioned on the old model's previous turn. The candidate wrote a better answer at turn one, the user in the log replied to a different answer that no longer existed, and from turn two onward the judge was scoring a conversation that was not happening. A genuinely better model that changes what the user does next has no ground truth to be scored against. The replay quietly rewards staying on the old rails.

The Agent That Read Last Week's Slack Like It Was Yesterday

· 10 min read
Tian Pan
Software Engineer

Your operations agent answers a question about the upcoming launch by quoting a Slack message that says "we'll ship tomorrow." The agent treats that as a present-tense plan and starts writing comms. The message was posted six weeks ago. The ship happened. The retrieval pipeline pulled the right chunk by every metric you measure — semantic similarity to "launch date," top-1 confidence above your threshold, source channel matching the project — and the agent built a plan on a sentence that meant something only inside the meeting where it was written.

The bug is not in the model. The bug is that "tomorrow" is not a date. It is a pointer to a clock, and the clock the message was written against is not the clock the agent is reading it on. Your retrieval pipeline indexed the body of the message and discarded the frame.
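
One way to keep the frame is to resolve relative day words against the message's own timestamp before the chunk is indexed. This is a minimal sketch: the function name, the three-word vocabulary, and the example date are all hypothetical, and real messages would need a much richer parser ("next Friday", "EOD", time zones).

```python
from datetime import date, timedelta

# Day offsets relative to the date the message was posted.
RELATIVE_DAYS = {"yesterday": -1, "today": 0, "tomorrow": 1}

def resolve_relative_dates(text, posted_on):
    """Annotate relative day words with the absolute date they pointed to."""
    words = []
    for word in text.split():
        key = word.strip('.,!?"\'').lower()
        if key in RELATIVE_DAYS:
            absolute = posted_on + timedelta(days=RELATIVE_DAYS[key])
            words.append(f"{word} [{absolute.isoformat()}]")
        else:
            words.append(word)
    return " ".join(words)

# Hypothetical posting date for the Slack message.
print(resolve_relative_dates("we'll ship tomorrow", date(2024, 3, 4)))
# we'll ship tomorrow [2024-03-05]
```

With the absolute date stored next to the text, an agent reading the chunk weeks later can see that the plan is already in the past.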

The Demo Worked Because You Were Watching: Session Length Is the Eval Dimension Your Suite Forgot

· 10 min read
Tian Pan
Software Engineer

The reliability number in your launch deck came from sessions that looked nothing like the ones your users actually run. The demo was five turns: open, ask, observe a tidy answer, refine once, conclude on a high note. The session your power user ran yesterday was thirty-one turns long, included two tool failures the agent papered over with optimism, and ended when the user gave up and opened a support ticket. Both sessions came out of the same model. The first one shipped a press release. The second one was filed under "edge case."

Session length is a dimension of evaluation, and demo culture systematically underweights it. We measure per-turn accuracy because per-turn accuracy is what fits on a slide, and then we are surprised when per-session success falls off a cliff that we never put on any chart. The cliff is not random and it is not a tail event — it is the predictable consequence of compounding error, attention drift, and committed assumptions that the model will not revisit. The question every team should be asking is not "how good is the model" but "how good is the model at turn twenty-eight, given everything we said at turns one through twenty-seven."
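
The compounding part of that cliff is easy to put numbers on. The sketch below assumes each turn succeeds independently with the same probability, which is a simplification (attention drift and committed assumptions make later turns worse, not equal); the 0.98 per-turn figure is hypothetical.

```python
def session_success(per_turn_accuracy, turns):
    """Probability that every turn in a session succeeds, assuming independent turns."""
    return per_turn_accuracy ** turns

# Compare the five-turn demo with the thirty-one-turn power-user session.
for turns in (5, 31):
    print(turns, round(session_success(0.98, turns), 2))
```

Under these assumptions the five-turn demo succeeds about 90 percent of the time while the thirty-one-turn session succeeds only a little more than half the time, with no change in the model.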

The Embedding That Aged Out of Meaning

· 9 min read
Tian Pan
Software Engineer

You embedded the knowledge base eighteen months ago. The model has not changed. The chunks have not changed. The index is healthy, the latency is fine, the recall dashboard is a flat line at 0.86. And yet support is quietly pasting the wrong article links into ticket replies, the sales bot keeps surfacing a deprecated SKU when a prospect asks about the new one, and an internal user just told you the assistant "feels dumber" without being able to say why.

Nothing broke. Your embeddings aged. The word "post" used to mean blog post in your domain; now half the corpus uses it for a Slack post, a forum post, and a job posting, and your eighteen-month-old vectors still treat it as one concept. The model that encoded those vectors never saw the new senses, never saw the new product names, never saw the rebrand, never saw the regulation that introduced three new terms your customers now use without thinking. The retrieval system answers the question it knows how to answer, which is no longer the question your users are asking.

The Filler Tool Call: When Agents Perform Diligence Instead of Doing Work

· 9 min read
Tian Pan
Software Engineer

Open the trace of any production agent and look at the tool calls that ran between the user's question and the first useful action. You will find a get_user_profile that returned a name nobody used, a check_status that came back green and was never referenced, a list_recent_orders whose result was summarized as "ok" and dropped on the floor. None of these calls changed the answer. All of them cost real money, real latency, and a real line in the trace. Your agent has learned to look diligent — and looking diligent is now your single largest source of waste.

This is the filler tool call: an action the agent emits not because it needs the result, but because the surrounding pattern of "thinking out loud, then acting" has been rewarded enough times during training that the model now performs thoroughness as a side effect of answering anything. It is the LLM equivalent of a junior analyst opening five tabs they never read so the senior across the room sees activity. The difference is that the junior gets bored. The agent never does.
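
A crude way to surface candidates is to scan a trace for tool calls whose results never show up again, either in a later call's arguments or in the final answer. The trace shape, the function name, and the `get_order` and `get_shipping_status` tools below are hypothetical, and plain substring matching is only a first-pass heuristic: short values such as "ok" will match by accident.

```python
def find_filler_calls(tool_calls, final_answer):
    """Return names of tool calls whose result values never appear downstream.

    tool_calls: list of dicts shaped like {"name": str, "args": dict, "result": dict}
    """
    filler = []
    for i, call in enumerate(tool_calls):
        # Everything after this call: later arguments plus the final answer.
        downstream = [final_answer]
        for later in tool_calls[i + 1:]:
            downstream.extend(str(v) for v in later["args"].values())
        values = [str(v) for v in call["result"].values()]
        used = any(v and v in text for v in values for text in downstream)
        if not used:
            filler.append(call["name"])
    return filler

# Hypothetical trace: the profile lookup feeds nothing that follows.
trace = [
    {"name": "get_user_profile", "args": {"user_id": "u1"}, "result": {"name": "Dana"}},
    {"name": "get_order", "args": {"user_id": "u1"}, "result": {"order_id": "A-17"}},
    {"name": "get_shipping_status", "args": {"order_id": "A-17"}, "result": {"status": "delayed"}},
]
print(find_filler_calls(trace, "Your order is delayed."))  # ['get_user_profile']
```

Counting flagged calls per trace gives a rough waste metric that can be tracked over time.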

The First-Time User Cliff Your Aggregate Metrics Are Hiding

· 10 min read
Tian Pan
Software Engineer

Your AI feature looks healthy. Weekly active users are flat-to-up, satisfaction scores are positive, the dashboard says ship more of this. The PM cites the metric in the next planning round. The engineering lead nods. The roadmap gets another adjacent feature.

Then someone segments the chart by user tenure and the picture inverts. Long-time users — the ones who were already there when the feature shipped — go deep on it daily. First-time users bounce within two interactions. The "flat" line is two cohorts cancelling each other out: a power curve sloping up, and a churn curve sloping down, summed into a lie.
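
The segmentation itself is a few lines. This sketch assumes each user record carries a cohort label and a retained flag; the field names, cohort names, and counts are hypothetical and chosen only to show how an unremarkable aggregate can hide two opposite cohorts.

```python
from collections import defaultdict

def retention_by_cohort(users):
    """users: list of dicts like {"cohort": str, "retained": bool}.

    Returns the retention rate per cohort plus the aggregate under the key "all".
    """
    if not users:
        return {}
    totals = defaultdict(lambda: [0, 0])  # cohort -> [retained, total]
    for user in users:
        totals[user["cohort"]][0] += int(user["retained"])
        totals[user["cohort"]][1] += 1
    rates = {cohort: kept / n for cohort, (kept, n) in totals.items()}
    rates["all"] = sum(k for k, _ in totals.values()) / sum(n for _, n in totals.values())
    return rates

# Hypothetical data: 9 of 10 long-time users stay, 1 of 10 first-time users stays.
users = (
    [{"cohort": "long_time", "retained": True}] * 9
    + [{"cohort": "long_time", "retained": False}]
    + [{"cohort": "first_time", "retained": True}]
    + [{"cohort": "first_time", "retained": False}] * 9
)
print(retention_by_cohort(users))
# {'long_time': 0.9, 'first_time': 0.1, 'all': 0.5}
```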

The OOO Auto-Reply Your Agent Did Not Read

· 8 min read
Tian Pan
Software Engineer

Your support agent pages a human at 2 a.m. The human has been out for a week. The OOO message lives in the same inbox the agent is reading. The agent pings the human anyway. The auto-reply lands. The agent thanks it politely and pings again, because the reply did not contain the resolution code it was waiting on. Twelve cycles in, somebody on a different team notices the unread thread is now sixty messages deep and goes to wake up the on-call manually.

The agent did exactly what the prompt told it to do. The prompt told it to escalate to a person. The person was a string, not a role. The string did not know about PTO.
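
Treating the escalation target as a role means resolving it at page time against who is actually available. A minimal sketch, assuming a roster keyed by role and a table of out-of-office date ranges; the data shapes, names, and dates are hypothetical, not any particular paging system's API.

```python
from datetime import date

def pick_escalation_target(role, roster, out_of_office, today):
    """Return the first person holding `role` who is not out of office today, or None."""
    for person in roster.get(role, []):
        away = out_of_office.get(person)
        if away is None:
            return person
        start, end = away
        if not (start <= today <= end):
            return person
    return None

# Hypothetical roster: alice is on PTO, so the page should go to bob.
roster = {"support-oncall": ["alice", "bob"]}
out_of_office = {"alice": (date(2024, 6, 1), date(2024, 6, 8))}
print(pick_escalation_target("support-oncall", roster, out_of_office, date(2024, 6, 3)))  # bob
```

Returning None when nobody is available gives the agent an explicit signal to escalate elsewhere instead of pinging the same inbox again.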

What You Deleted Is Invisible to Your Coding Agent

· 10 min read
Tian Pan
Software Engineer

You spent Tuesday afternoon deleting a dead utility module. You cleaned up the imports, ran the type checker, watched CI go green, and merged the PR. Wednesday morning, a fresh agent session looks at the same code, decides the codebase is "missing" a small helper, and writes the dead module back in — same name, same shape, slightly different style. The reviewer who approved the deletion yesterday now has to remember why they killed it, find the conversation that justified it, and explain it again. The agent is not malfunctioning. It is doing exactly what its context says to do.

This is the structural reliability problem of coding agents that nobody is solving with prompt engineering: the agent's context starts from the repository's current state, but not from the history of why that state is what it is. The file you removed leaves no trace the agent can see. The dependency you migrated away from is just another package on npm. The flaky test you intentionally deleted is a coverage gap waiting to be "fixed." Absence — the negative space of decisions you made — is invisible.

The Nightly Batch Job That Quietly Became a Latency-Critical Service

· 10 min read
Tian Pan
Software Engineer

It started as a cron job. Every night at 2 a.m., a script woke up, pulled the day's records, ran them through a model, wrote the results to a table, and went back to sleep. It was the simplest possible shape for the problem, and for a year it was exactly the right shape. Nobody thought about it because nobody needed to.

Then someone asked if the results could be ready by 8 a.m. instead of noon. Then someone asked if a user could trigger a run for a single record on demand. Then a product manager asked if it could "feel instant" inside the app. Each request was reasonable. Each change was small. And at no point did anyone open a document titled "Re-architecting the inference pipeline," because at no point did any single change feel like a rewrite.

Eighteen months later you have a latency-critical online service wearing the body of a batch job. It has a p99 nobody measures, a queue nobody drains, and a failure mode where one bad record stalls a user-facing request because the pipeline was built to retry the whole batch. This is one of the most common architectural failures in AI systems, and it almost never shows up as a decision. It shows up as a slow accumulation of reasonable yeses.
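
The retry-the-whole-batch failure mode comes from treating the batch as the unit of failure. A sketch of the alternative, where each record succeeds or fails on its own; `run_model` stands in for whatever inference call the pipeline makes, and the record shapes are hypothetical.

```python
def process_records(records, run_model):
    """Process each record independently so one bad record cannot stall the rest.

    records: dict mapping record id to payload.
    Returns (results, failures), both keyed by record id.
    """
    results, failures = {}, {}
    for record_id, payload in records.items():
        try:
            results[record_id] = run_model(payload)
        except Exception as exc:  # isolate the failure to this one record
            failures[record_id] = str(exc)
    return results, failures

def fake_model(value):
    """Hypothetical stand-in for the model call; rejects negative inputs."""
    if value < 0:
        raise ValueError("bad record")
    return value * 2

results, failures = process_records({"r1": 2, "r2": -1, "r3": 3}, fake_model)
print(results)   # {'r1': 4, 'r3': 6}
print(failures)  # {'r2': 'bad record'}
```

Only the failed records need a retry or a dead-letter path, so a user-facing request for one record no longer waits on its neighbors.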