
702 posts tagged with "llm"


The Trace That Stops at the Provider Boundary

· 11 min read
Tian Pan
Software Engineer

You did the tracing work. Retrieval has a span. Tool calls have spans. The orchestration loop has a span. A trace ID rides through every internal hop on W3C traceparent headers, just like the SRE playbook says. Then the request hits messages.create, the SDK records a single span called llm.call, and the next 2.8 seconds of your pipeline turn into a black rectangle on the flame graph with no internal structure. The 800 milliseconds before the first token shows up: opaque. The 2 seconds of decode after that: opaque. The share of the wall clock that was network, queue wait, prefill, or per-token decode: unknowable from your trace.

When a customer reports "the assistant felt slow today," your dashboard can confirm the slowness. It cannot localize it. The most expensive minute of your pipeline — measured in dollars, in p95, in user-visible lag — lives inside a vendor's data center, and the contract you accepted when you signed up gives you almost no visibility into it. You are on call for a black box.
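The one boundary you can still measure from your side is the arrival of the first streamed chunk. Here is a minimal sketch of that split, assuming the response arrives as an iterable of chunks. The `timed_stream` helper and its field names are hypothetical, not part of any SDK. It cannot separate network, queue wait, and prefill from one another; it only divides the black rectangle into "before the first token" and "after it".

```python
import time
from typing import Callable, Iterable


def timed_stream(chunks: Iterable[str],
                 clock: Callable[[], float] = time.monotonic) -> dict:
    """Consume a streamed response and split wall-clock time at the first chunk.

    Hypothetical helper: `chunks` stands in for whatever streaming iterator
    your SDK returns. Only client-side timing is observed here.
    """
    start = clock()
    first_at = None
    count = 0
    for _ in chunks:
        now = clock()
        if first_at is None:
            first_at = now  # first chunk seen: everything before this is TTFT
        count += 1
    end = clock()

    if first_at is None:
        # The stream ended without producing anything.
        return {"ttft_s": None, "decode_s": None,
                "total_s": end - start, "chunks": 0}
    return {
        "ttft_s": first_at - start,   # network + queue + prefill, undivided
        "decode_s": end - first_at,   # time spent receiving the remaining chunks
        "total_s": end - start,
        "chunks": count,
    }
```

Recording these values as attributes on the existing `llm.call` span would at least let a dashboard tell a slow first token from a slow decode.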

The Slow Turn That Wasn't Yours: KV Cache Eviction Mid-Conversation

· 10 min read
Tian Pan
Software Engineer

A conversation has been moving along on a single Claude session for forty minutes. Eleven turns, each averaging 800ms time-to-first-token, each cheap because the 28,000-token prefix is hitting the prompt cache. Turn twelve arrives and TTFT is 3.4 seconds. The transcript hasn't changed shape. The model didn't switch. The network is fine. Cached input tokens drop from 27,800 to 0. The next turn's prefill bill is paid in full, from the first token.

You go looking for the cause in your traces and find nothing that names it. There is no event in your logs labeled "another tenant's burst evicted you." The only honest reading of the spike is that some other customer's prompt, somewhere on the same GPU pool, made the scheduler decide your warm prefix was the cheapest thing to drop. You cannot replay the turn. You cannot prove the eviction. The cache state at that moment was a function of strangers' traffic, and that traffic is not in your trace because it was never yours to see.
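You cannot prove the eviction, but you can at least flag the turn where it showed up. The sketch below assumes you already log per-turn token usage, including a cached-token count taken from the provider's usage report. The `TurnUsage` record, its field names, and the two thresholds are illustrative assumptions, not a provider API.

```python
from dataclasses import dataclass


@dataclass
class TurnUsage:
    # Hypothetical log record; populate it from whatever usage data you store.
    turn: int
    input_tokens: int         # total prompt tokens for the turn
    cached_input_tokens: int  # prompt tokens served from the cache
    ttft_ms: float


def find_cache_collapses(turns: list[TurnUsage],
                         warm_ratio: float = 0.8,
                         cold_ratio: float = 0.1) -> list[int]:
    """Return turn numbers where a warm cache went cold between adjacent turns."""
    collapsed = []
    for prev, cur in zip(turns, turns[1:]):
        prev_ratio = prev.cached_input_tokens / max(prev.input_tokens, 1)
        cur_ratio = cur.cached_input_tokens / max(cur.input_tokens, 1)
        # Warm on the previous turn, cold on this one: a likely eviction or expiry.
        if prev_ratio >= warm_ratio and cur_ratio <= cold_ratio:
            collapsed.append(cur.turn)
    return collapsed
```

A flagged turn does not name the cause. It only lets you label the latency spike as a cache miss rather than leaving it unexplained in the trace.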

The Agent That Could Not Say Wait

· 10 min read
Tian Pan
Software Engineer

Pick any production agent built in the last two years and inventory the things it can actually do on a given turn. The list is short: emit a tool call, return a final answer, or ask the user a clarifying question. That is the entire action vocabulary. Notice what is missing. There is no verb for "I would like more time before deciding." There is no verb for "I am uncertain enough that I want to pause and reconsider without committing." There is no verb for "I want to dwell on this for a moment before I do anything." The agent literally cannot say wait. The grammar does not contain the word.

This is not a polish problem. It is a structural one. The moment the agent's only outputs are actions, every internal state has to be expressed through an action. Hesitation becomes a redundant tool call. Doubt becomes a confident commitment. The team that designed only the action verbs has shipped an agent whose only language is doing, and then they wonder why it never seems to think.

The Planner That Treated Every Tool as O(1)

· 9 min read
Tian Pan
Software Engineer

Your planner emits five tool calls. On paper, it reads like a clean solution: lookup_user, search_documents, call_external_api, spawn_sub_agent, request_human_approval. The trace looks elegant, the logic is sound, the agent will arrive at the right answer. In production, those five steps take 12 milliseconds, 800 milliseconds, 4 seconds, 2 minutes, and 6 hours, respectively. The planner never noticed that its five-step plan spans more than six orders of magnitude in latency.


This is not a hallucination. The model picked the right tools. It picked them in a sensible order. What it could not do — what the tool schema gave it no way to do — was reason about the fact that the last step in its plan is qualitatively different from the first one. To the planner, a tool is a tool. Every node in the plan graph has weight one.
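One way to give plan nodes a weight other than one is to attach an expected latency to each tool and let the planner, or a check that runs after it, see the totals. This is a minimal sketch under that assumption. The `ToolCost` type, the `blocks_on_human` flag, and the 10-second interactive budget are hypothetical; the tool names and latencies are the ones from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCost:
    # Hypothetical cost annotation; a real schema might carry p50/p95 instead.
    name: str
    expected_latency_s: float
    blocks_on_human: bool = False


PLAN = [
    ToolCost("lookup_user", 0.012),
    ToolCost("search_documents", 0.8),
    ToolCost("call_external_api", 4.0),
    ToolCost("spawn_sub_agent", 120.0),
    ToolCost("request_human_approval", 6 * 3600.0, blocks_on_human=True),
]


def summarize_plan(plan: list[ToolCost], interactive_budget_s: float = 10.0) -> dict:
    """Total the expected latency and name the steps that break an interactive budget."""
    total = sum(t.expected_latency_s for t in plan)
    slowest = max(plan, key=lambda t: t.expected_latency_s)
    over_budget = [t.name for t in plan if t.expected_latency_s > interactive_budget_s]
    return {
        "total_s": total,
        "slowest": slowest.name,
        "over_budget": over_budget,
        "needs_human": any(t.blocks_on_human for t in plan),
    }
```

Calling `summarize_plan(PLAN)` reports that two of the five steps exceed the budget and that one of them waits on a person, which is the information the plan graph itself never carried.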

The Verifier Loop That Couldn't Converge

· 11 min read
Tian Pan
Software Engineer

The most expensive bug in an agent system is the one with no error message. Worker proposes a draft. Verifier rejects it with a paragraph of feedback. Worker revises. Verifier rejects again. The loop keeps spinning, the trace keeps growing, the bill keeps climbing, and from the outside the system looks like it is working — diligently, in fact, because both models are doing their assigned job. What nobody priced in is that the verifier's acceptance criteria are not fixed across calls. The target the worker is chasing is moving, and the loop has no convergence guarantee.

You shipped "iterate until satisfied," and you shipped a search through a space whose extrema may not exist.
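A loop with no convergence guarantee needs at least an explicit stopping rule. The sketch below is one possible guard, not a fix for the moving target itself: it caps the number of rounds and stops early when the verifier repeats feedback it has already given. The `refine` function and the `propose` and `verify` callables are hypothetical stand-ins for your worker and verifier.

```python
from typing import Callable, Optional


def refine(propose: Callable[[Optional[str]], str],
           verify: Callable[[str], tuple[bool, str]],
           max_rounds: int = 5) -> dict:
    """Run a worker/verifier loop with a round budget and a stall check.

    `propose(feedback)` returns a draft; `verify(draft)` returns
    (accepted, feedback). Both are assumed, caller-supplied callables.
    """
    feedback: Optional[str] = None
    seen_feedback: set[str] = set()
    draft = ""
    for round_no in range(1, max_rounds + 1):
        draft = propose(feedback)
        accepted, feedback = verify(draft)
        if accepted:
            return {"status": "accepted", "rounds": round_no, "draft": draft}
        if feedback in seen_feedback:
            # Same objection as an earlier round: the loop is cycling, not converging.
            return {"status": "stalled", "rounds": round_no, "draft": draft}
        seen_feedback.add(feedback)
    return {"status": "budget_exhausted", "rounds": max_rounds, "draft": draft}
```

Exact-match stall detection is deliberately crude. A verifier that rephrases the same objection each round will slip past it, so the round budget remains the backstop.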

The Cached Prompt Prefix That Grew Arms and Legs

· 11 min read
Tian Pan
Software Engineer

Six months ago your prompt prefix was 4,000 tokens. It was stable, cache-warm, and amortized to almost nothing — the per-call surcharge for system instructions was a rounding error against the per-call cost of the response. Today that prefix is 11,000 tokens, your cache hit rate has slid from 92% to 31%, and your inference bill is up 4x. Nobody on the team can point to the PR that did it. There is no commit message saying "increase prompt tokens by 7,000." Every change was small, every change was defended, every change shipped clean.

The prefix grew arms and legs the way a basement collects boxes. One team needed the user's tier injected so the agent could explain plan limits. Another needed today's date in the user's timezone for "remind me tomorrow" to work. A third stapled in the active A/B variant name so eval traces could be sliced. Marketing added the current promo banner so the agent could mention it when prompted. Compliance added a feature-flag manifest so the model could refuse beta features for users not in the rollout. Each was a one-line addition. Each was defensible in isolation. The aggregate destroyed your cache.
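Prefix caches generally match on the exact leading portion of the prompt, so one common mitigation is to keep per-request fields out of the cached region and to fingerprint the stable part so drift shows up in review. This is a sketch under that assumption. `STABLE_PREFIX`, `build_prompt`, and the field names are hypothetical, and whether a trailing volatile block suits your provider's caching rules is something to verify against its documentation.

```python
import hashlib

# Hypothetical stable instructions: nothing per-user or per-day belongs here.
STABLE_PREFIX = "You are a support assistant. Follow the policies below.\n"


def build_prompt(volatile: dict[str, str], user_message: str) -> tuple[str, str]:
    """Return (prompt, prefix_fingerprint).

    Volatile fields (tier, date, A/B variant, promo, flags) are appended
    after the stable prefix so they cannot change its leading bytes.
    """
    # Sort keys so the volatile block is deterministic for a given set of values.
    volatile_block = "".join(f"{key}: {value}\n" for key, value in sorted(volatile.items()))
    prompt = STABLE_PREFIX + volatile_block + user_message
    fingerprint = hashlib.sha256(STABLE_PREFIX.encode("utf-8")).hexdigest()[:12]
    return prompt, fingerprint
```

Logging the fingerprint with every call turns "nobody can point to the PR" into a query: any change to the stable prefix shows up as a new fingerprint value on a specific date.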

The Power User Who Learned Your Prompt By Trial

· 10 min read
Tian Pan
Software Engineer

There is a user in your product right now who is having a much better experience than the median. Not because they pay more, not because they have a different tier, not because they were rolled into a different cohort. They have figured out, through patient probing, that the AI feature responds beautifully if you ask in a certain way. They know which verbs trigger the structured output. They know that a one-word follow-up gives them the terse version and a complete sentence gives them the expansive one. They know that the assistant gets defensive about certain topics unless you frame the question as a hypothetical. None of this is written down anywhere on your site. They reverse-engineered it.

The interesting thing is not that this user exists. It is that this user is now your documentation. Your AI feature has a contract with its users — an undocumented one, encoded entirely in the system prompt — and the only way anyone learns the contract is by trial. A small fraction of users have the patience to run those trials. Everyone else gets a worse product.

The Production Logs Your Agent Cannot Read

· 9 min read
Tian Pan
Software Engineer

You wired your incident-response agent into Splunk. You gave it the query syntax in the system prompt, a tool to execute SPL, and a fresh API token. The first time it triaged a real page, it pulled the wrong logs, summarized the wrong service, and confidently named the wrong customer. The integration was perfect. The agent was useless.

Here is what you forgot. Fifteen years of log conventions, undocumented field names, severity strings that drifted from ERR to error to ERROR across three reorgs, and team-specific suffixes that turn customer_id into cust_id_v2_actual on the auth service and tenant.user.id on billing — none of that is in the prompt. You gave the agent access to the API. You did not give it access to the institutional knowledge that makes the API useful.

The shape of this failure is bigger than Splunk. It applies to any agent integration where the tool exposes a query language over a corpus the team has been shaping by hand for a decade. The agent has the verbs. It does not have the nouns.
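Handing the agent the nouns can start as something as plain as a lookup table it must consult before writing a query. Here is a minimal sketch. The mapping structure and helper names are hypothetical; the field and severity values are the ones mentioned above.

```python
# Hypothetical glossary: canonical concept -> per-service field name.
CANONICAL_FIELDS = {
    "customer_id": {
        "auth": "cust_id_v2_actual",
        "billing": "tenant.user.id",
    },
}

# Severity spellings that drifted over time, folded to one canonical value.
SEVERITY_ALIASES = {"err": "ERROR", "error": "ERROR"}


def resolve_field(canonical: str, service: str) -> str:
    """Return the concrete field name for a concept on a given service.

    Raises KeyError rather than guessing, so an unmapped field becomes a
    visible gap instead of a confidently wrong query.
    """
    return CANONICAL_FIELDS[canonical][service]


def normalize_severity(raw: str) -> str:
    """Fold known severity variants to a canonical upper-case form."""
    cleaned = raw.strip()
    return SEVERITY_ALIASES.get(cleaned.lower(), cleaned.upper())
```

The table will be incomplete on day one. The point of failing loudly on a missing entry is that each failure tells you which piece of institutional knowledge to write down next.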

A Prompt Diff Hides Its Own Blast Radius

· 9 min read
Tian Pan
Software Engineer

A pull request lands in your review queue. The diff shows three words changed inside a system prompt: Output strictly valid JSON became Always respond using clean, parseable JSON. It reads like a copy edit. You skim it, the CI checkmark is green, and you click approve. Total time: ninety seconds.

Six hours later, the downstream parser starts rejecting responses with trailing commas and missing fields. The structured-output error rate climbs from near-zero to double digits, and a revenue-generating workflow stalls. Nothing in the diff predicted this. Nothing in the diff could have predicted this, because the diff measured the wrong thing.

This is the central problem with reviewing prompt changes: the size of a prompt diff tells you nothing about the size of its effect. A three-word change and a three-paragraph rewrite are both just text, and a text diff renders them with the same visual weight as any other edit. But a prompt is not text that describes behavior — it is text that causes behavior, and the causal blast radius of an edit is invisible in the artifact you are reviewing.
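If the text diff measures the wrong thing, the review needs a second number: how outputs change on a fixed set of inputs. The sketch below assumes you can collect model outputs for the same inputs under the old and new prompt. The function names and required keys are hypothetical; it uses only Python's standard `json` module, which rejects trailing commas.

```python
import json


def parse_failure_rate(outputs: list[str], required_keys: set[str]) -> float:
    """Fraction of outputs that are not JSON objects containing the required keys."""
    if not outputs:
        return 0.0
    failures = 0
    for raw in outputs:
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            failures += 1  # e.g. trailing commas, unquoted keys, truncated output
            continue
        if not isinstance(parsed, dict) or not required_keys.issubset(parsed.keys()):
            failures += 1  # valid JSON, but missing fields the parser depends on
    return failures / len(outputs)


def behavioral_delta(before: list[str], after: list[str], required_keys: set[str]) -> float:
    """Change in failure rate between outputs under the old and new prompt."""
    return parse_failure_rate(after, required_keys) - parse_failure_rate(before, required_keys)
```

A delta near zero does not prove a prompt edit is safe, since the sample may miss the inputs that break. A large delta on a small sample is still a far better review signal than the word count of the diff.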

The Confidence Score Your Users Learned to Ignore

· 11 min read
Tian Pan
Software Engineer

You wanted to be honest. You put a little "92%" next to every answer your agent gave. After the third time the agent was confidently wrong at 92%, your users stopped reading the number. They did not get angry about it. They just learned, the way humans always learn around a misbehaving signal, that the gauge on the dashboard is not connected to the engine. The number is still there. It costs you tokens to produce it. It informs no decision anyone makes.

This is the failure mode that calibration UX research keeps rediscovering: surfacing a probability is a trust commitment, and the commitment goes one direction. The moment the number turns out to be uncorrelated with correctness in the user's lived experience, the score is dead — and the trust you spent putting it there is dead with it. You cannot un-ring that bell by fixing the number later. The number is now decoration.
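Whether the number is connected to the engine is checkable before users find out. A rough sketch follows, assuming you have logged pairs of stated confidence and observed correctness. The `calibration_table` function and its bucketing scheme are illustrative, not a standard library API.

```python
def calibration_table(records: list[tuple[float, bool]], n_bins: int = 10) -> list[dict]:
    """Bucket (confidence, was_correct) pairs and compare stated vs. observed accuracy.

    Confidence is expected in [0, 1]. Empty buckets are omitted.
    """
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for confidence, correct in records:
        # A confidence of exactly 1.0 goes in the top bucket.
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, correct))

    table = []
    for index, items in enumerate(bins):
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        table.append({
            "bin": index,
            "n": len(items),
            "mean_conf": mean_conf,
            "accuracy": accuracy,
            "gap": mean_conf - accuracy,  # positive means overconfident
        })
    return table
```

If the bucket around 0.92 shows accuracy far below 0.92, the score is already decoration, and the table says so before the third confidently wrong answer does.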

The Filler Tool Call: When Agents Perform Diligence Instead of Doing Work

· 9 min read
Tian Pan
Software Engineer

Open the trace of any production agent and look at the tool calls that ran between the user's question and the first useful action. You will find a get_user_profile that returned a name nobody used, a check_status that came back green and was never referenced, a list_recent_orders whose result was summarized as "ok" and dropped on the floor. None of these calls changed the answer. All of them cost real money, real latency, and a real line in the trace. Your agent has learned to look diligent — and looking diligent is now your single largest source of waste.

This is the filler tool call: an action the agent emits not because it needs the result, but because the surrounding pattern of "thinking out loud, then acting" has been rewarded enough times during training that the model now performs thoroughness as a side effect of answering anything. It is the LLM equivalent of a junior analyst opening five tabs they never read so the senior across the room sees activity. The difference is that the junior gets bored. The agent never does.
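Finding filler calls in a trace can be sketched as a reachability check: which tool results were ever consumed by a later call or by the final answer? The version below assumes a trace format that most systems do not record out of the box, where each call has a `result_id` and lists the earlier results it used in `input_refs`. Those keys and the `find_filler_calls` helper are hypothetical.

```python
def find_filler_calls(tool_calls: list[dict], final_answer_refs: set[str]) -> list[str]:
    """Return names of tool calls whose results nothing downstream referenced.

    Each call is a dict with "name", "result_id", and an optional
    "input_refs" list of earlier result ids (an assumed trace format).
    """
    referenced = set(final_answer_refs)
    for call in tool_calls:
        referenced.update(call.get("input_refs", []))
    # One pass only: a result consumed solely by another filler call still counts as used.
    return [call["name"] for call in tool_calls if call["result_id"] not in referenced]
```

For example, with the three calls named above and one hypothetical `lookup_order` call that feeds the answer:

```python
trace = [
    {"name": "get_user_profile", "result_id": "r1"},
    {"name": "check_status", "result_id": "r2"},
    {"name": "list_recent_orders", "result_id": "r3"},
    {"name": "lookup_order", "result_id": "r4"},
]
filler = find_filler_calls(trace, final_answer_refs={"r4"})
```

Here `filler` contains the three calls whose results were dropped on the floor, which gives you a per-trace waste count to track over time.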