Skip to main content

861 posts tagged with "insider"

View all tags

The Cached Prompt Prefix That Grew Arms and Legs

· 11 min read
Tian Pan
Software Engineer

Six months ago your prompt prefix was 4,000 tokens. It was stable, cache-warm, and amortized to almost nothing — the per-call surcharge for system instructions was a rounding error against the per-call cost of the response. Today that prefix is 11,000 tokens, your cache hit rate has slid from 92% to 31%, and your inference bill is up 4x. Nobody on the team can point to the PR that did it. There is no commit message saying "increase prompt tokens by 7,000." Every change was small, every change was defended, every change shipped clean.

The prefix grew arms and legs the way a basement collects boxes. One team needed the user's tier injected so the agent could explain plan limits. Another needed today's date in the user's timezone for "remind me tomorrow" to work. A third stapled in the active A/B variant name so eval traces could be sliced. Marketing added the current promo banner so the agent could mention it on prompt. Compliance added a feature-flag manifest so the model could refuse beta features for users not in the rollout. Each was a one-line addition. Each was defensible in isolation. The aggregate destroyed your cache.

The Streaming Response That Committed Before the User Said Yes

· 12 min read
Tian Pan
Software Engineer

The user is reading the agent's reasoning as it streams in. Around token 1200, the model decides to call send_email, then create_ticket, then kick_off_deploy. The user, watching the partial output and realizing the agent has misread the request, hits the stop button half a second too late. The email is already sent. The ticket is already filed. The deploy is already running. The stop button cancelled the next token, not the consequences of the last one.

The bug is not in the cancel handler. The bug is the assumption — borrowed from every other streaming UI on the team's roadmap — that an incrementally rendered output is an incrementally reversible one. Tool calls do not honor that contract. They are point-in-time commits that the streaming layer happily fires while the rest of the response is still being generated, and the cancel button has no way to chase them down the wire.

This is one of those failure modes that nobody owns because it lives in the seam between two teams that each shipped their half cleanly. The UX team shipped streaming because it tested better in user studies. The platform team shipped tool calls because the framework supports them. Neither team had a meeting where someone asked: what is "stop" supposed to mean when the response has already left the building?

A Prompt Diff Hides Its Own Blast Radius

· 9 min read
Tian Pan
Software Engineer

A pull request lands in your review queue. The diff shows three words changed inside a system prompt: Output strictly valid JSON became Always respond using clean, parseable JSON. It reads like a copy edit. You skim it, the CI checkmark is green, and you click approve. Total time: ninety seconds.

Six hours later, the downstream parser starts rejecting responses with trailing commas and missing fields. The structured-output error rate climbs from near-zero to double digits, and a revenue-generating workflow stalls. Nothing in the diff predicted this. Nothing in the diff could have predicted this, because the diff measured the wrong thing.

This is the central problem with reviewing prompt changes: the size of a prompt diff tells you nothing about the size of its effect. A three-word change and a three-paragraph rewrite are both just text, and a text diff renders them with the same visual weight as any other edit. But a prompt is not text that describes behavior — it is text that causes behavior, and the causal blast radius of an edit is invisible in the artifact you are reviewing.

Re-Ask Rate: The Failure Signal Your Eval Pipeline Never Extracts

· 10 min read
Tian Pan
Software Engineer

Open any production chat transcript long enough and you will find a user who asks the same question three times. The phrasing changes a little each turn — pronouns swap to nouns, a clarifier gets bolted on, the polite hedge falls away by the third try — but the underlying request is identical. They are not asking three questions. They are asking the same question, and the agent is failing to answer it, and the user is hoping that this time the words will land differently.

The transcript-level signal here is so loud it is almost obscene. The user has told you, with their own keystrokes, that the previous response did not help. They did not need to fill out a survey. They did not need to leave a thumbs-down. They told you by typing the question again. And in most production AI stacks, this signal is silently discarded by an eval pipeline that scores each turn in isolation and a satisfaction survey that only fires at session end — by which point the user who re-asked three times has usually already churned and will never grade anything.

Shadow Replay Punishes the Model That Would Have Changed the Conversation

· 9 min read
Tian Pan
Software Engineer

A team I worked with last quarter shipped a new model into shadow replay and watched the win rate sit at 47 percent against the incumbent. Same prompts, same retrieval, a model the vendor's own evals had ranked clearly higher. The shadow harness took last week's production traffic, pumped it through the candidate, fed both responses to an LLM judge, and declared the upgrade roughly a coin flip. The team almost reverted on the spot.

The problem was not the model. The problem was that every user message in the replay had already been conditioned on the old model's previous turn. The candidate wrote a better answer at turn one, the user in the log replied to a different answer that no longer existed, and from turn two onward the judge was scoring a conversation that was not happening. A genuinely better model that changes what the user does next has no ground truth to be scored against. The replay quietly rewards staying on the old rails.

The Agent That Read Last Week's Slack Like It Was Yesterday

· 10 min read
Tian Pan
Software Engineer

Your operations agent answers a question about the upcoming launch by quoting a Slack message that says "we'll ship tomorrow." The agent treats that as a present-tense plan and starts writing comms. The message was posted six weeks ago. The ship happened. The retrieval pipeline pulled the right chunk by every metric you measure — semantic similarity to "launch date," top-1 confidence above your threshold, source channel matching the project — and the agent built a plan on a sentence that meant something only inside the meeting where it was written.

The bug is not in the model. The bug is that tomorrow is not a date. It is a pointer to a clock, and the clock the message was written against is not the clock the agent is reading it on. Your retrieval pipeline indexed the body of the message and discarded the frame.

The Demo Worked Because You Were Watching: Session Length Is the Eval Dimension Your Suite Forgot

· 10 min read
Tian Pan
Software Engineer

The reliability number in your launch deck came from sessions that looked nothing like the ones your users actually run. The demo was five turns: open, ask, observe a tidy answer, refine once, conclude on a high note. The session your power user ran yesterday was thirty-one turns long, included two tool failures the agent papered over with optimism, and ended when the user gave up and opened a support ticket. Both sessions came out of the same model. The first one shipped a press release. The second one was filed under "edge case."

Session length is a dimension of evaluation, and demo culture systematically underweights it. We measure per-turn accuracy because per-turn accuracy is what fits on a slide, and then we are surprised when per-session success falls off a cliff that we never put on any chart. The cliff is not random and it is not a tail event — it is the predictable consequence of compounding error, attention drift, and committed assumptions that the model will not revisit. The question every team should be asking is not "how good is the model" but "how good is the model at turn twenty-eight, given everything we said at turns one through twenty-seven."

The Embedding That Aged Out of Meaning

· 9 min read
Tian Pan
Software Engineer

You embedded the knowledge base eighteen months ago. The model has not changed. The chunks have not changed. The index is healthy, the latency is fine, the recall dashboard is a flat line at 0.86. And yet support is quietly pasting the wrong article links into ticket replies, the sales bot keeps surfacing a deprecated SKU when a prospect asks about the new one, and an internal user just told you the assistant "feels dumber" without being able to say why.

Nothing broke. Your embeddings aged. The word post used to mean blog post in your domain; now half the corpus uses it for a Slack post, a forum post, and a job posting, and your eighteen-month-old vectors still treat it as one concept. The model that encoded those vectors never saw the new senses, never saw the new product names, never saw the rebrand, never saw the regulation that introduced three new terms your customers now use without thinking. The retrieval system answers the question it knows how to answer, which is no longer the question your users are asking.

The Filler Tool Call: When Agents Perform Diligence Instead of Doing Work

· 9 min read
Tian Pan
Software Engineer

Open the trace of any production agent and look at the tool calls that ran between the user's question and the first useful action. You will find a get_user_profile that returned a name nobody used, a check_status that came back green and was never referenced, a list_recent_orders whose result was summarized as "ok" and dropped on the floor. None of these calls changed the answer. All of them cost real money, real latency, and a real line in the trace. Your agent has learned to look diligent — and looking diligent is now your single largest source of waste.

This is the filler tool call: an action the agent emits not because it needs the result, but because the surrounding pattern of "thinking out loud, then acting" has been rewarded enough times during training that the model now performs thoroughness as a side effect of answering anything. It is the LLM equivalent of a junior analyst opening five tabs they never read so the senior across the room sees activity. The difference is that the junior gets bored. The agent never does.

The First-Time User Cliff Your Aggregate Metrics Are Hiding

· 10 min read
Tian Pan
Software Engineer

Your AI feature looks healthy. Weekly active is flat-to-up, satisfaction scores are positive, the dashboard says ship more of this. The PM cites the metric in the next planning round. The engineering lead nods. The roadmap gets another adjacent feature.

Then someone segments the chart by user tenure and the picture inverts. Long-time users — the ones who were already there when the feature shipped — go deep on it daily. First-time users bounce within two interactions. The "flat" line is two cohorts cancelling each other out: a power curve sloping up, and a churn curve sloping down, summed into a lie.

The OOO Auto-Reply Your Agent Did Not Read

· 8 min read
Tian Pan
Software Engineer

Your support agent pages a human at 2 a.m. The human has been out for a week. The OOO message lives in the same inbox the agent is reading. The agent pings the human anyway. The auto-reply lands. The agent thanks it politely and pings again, because the reply did not contain the resolution code it was waiting on. Twelve cycles in, somebody on a different team notices the unread thread is now sixty messages deep and goes manually wake up the on-call.

The agent did exactly what the prompt told it to do. The prompt told it to escalate to a person. The person was a string, not a role. The string did not know about PTO.