Skip to main content

299 posts tagged with "observability"

View all tags

The Provider Failover That Swapped Your Safety Policy Mid-Conversation

· 11 min read
Tian Pan
Software Engineer

A user is twelve turns into a careful conversation with your assistant about prescribing patterns for a controlled substance. The model has been measured, asking clarifying questions, citing guidance, declining to extrapolate beyond the literature. On turn thirteen, the user asks a follow-up that should land the same way the prior twelve did. Instead, they get a flat refusal: "I can't help with that." The conversation is over. They write to support furious — they were not asking anything different, the assistant was just helping them, what changed.

Your logs explain what changed. Halfway through turn thirteen, your primary provider returned a 503 in the middle of the stream. Your gateway, doing exactly what it was configured to do, failed over to the secondary provider for the remainder of the request. The secondary provider's refusal threshold for that class of query is calibrated more conservatively than the primary's. The user did not ask anything different — they asked the same question to a different model under the same brand, and the new model said no.

The Provider Quota Reset on a Timezone Your Global Traffic Never Picked

· 8 min read
Tian Pan
Software Engineer

Your monthly token quota resets at 00:00 UTC. Your largest customer is in Tokyo and hits peak load at 21:00 UTC — 6:00 AM their next morning. By the time the reset arrives, the Tokyo workday has already chewed through the last six hours of the cycle on quota-exhaustion fallback. The 429s look "occasional" because the UTC calendar axis on your dashboard hides the daily reset boundary inside an ordinary timestamp.

This is not a rate limit bug. It is a calendar bug. The provider chose a reset clock for their bookkeeping convenience, and the geography of your traffic decided which customers got the empty end of the cycle. The team that priced the quota as a uniform resource is rationing it on a calendar the user never sees.

The Reasoning Tokens Your Product View Never Surfaces

· 10 min read
Tian Pan
Software Engineer

A customer emails support. The assistant told them to file their tax return in the wrong jurisdiction, and they are angry, and they want to know how the assistant arrived at that answer. Your support agent opens the issue queue and sees the final response: confident, plausible, wrong. They do not see the five thousand reasoning tokens the model produced before it emitted that response, even though those tokens exist, and your engineering team can pull them up on a different screen in under thirty seconds. The receipts are in the building. The wrong people are holding them.

This is the gap that opens the moment a team enables extended thinking on a production agent. Reasoning becomes a first-class artifact of every call, and your organization has not decided who sees it, when, at what fidelity, or for how long. The default decisions are made by whichever team owns whichever surface, and they all make different defaults, and the seams are exactly where customer escalations land.

The Retry Budget Your Agent Learned to Plan Against

· 10 min read
Tian Pan
Software Engineer

The most uncomfortable lesson from running agents in production isn't that they fail — it's that they learn. Not in any deep sense; the weights aren't moving. But within a session, within a trajectory, the policy implied by the model adapts to the substrate it runs on. And if your substrate quietly absorbs failure on the agent's behalf, the agent eventually notices, and starts planning as if that absorption were free compute.

The cleanest example is the retry layer. You added it for reliability — the SDK retries failed tool calls three times before surfacing an error, your middleware wraps each step in exponential backoff, your loop catches malformed JSON and re-prompts the model to fix it. None of this was wrong. But every one of those mechanisms is a side effect the agent can observe, generalize from, and exploit. Once it does, your reliability layer stops being a safety net and starts being a planning primitive.

The Streaming Response Your Backend Infrastructure Was Not Built For

· 12 min read
Tian Pan
Software Engineer

Streaming was a product decision. Somebody on the design team watched a competitor's chat UI tick out tokens like a typewriter, watched a user's shoulders relax when the first character appeared two hundred milliseconds in instead of after a four-second blank stare, and the decision was made: we stream. The pull request changed three files in the API gateway. The model output now flushes incrementally over Server-Sent Events. The launch went out on a Tuesday and the satisfaction score moved up by a measurable amount on a Wednesday. Nobody opened a ticket against infrastructure.

A month later the on-call engineer is staring at three dashboards that no longer agree with each other. The autoscaler is provisioning twice as many pods as the CPU graphs say it should need. The p99 latency dashboard is broken — not malfunctioning, but uninterpretable, because the histogram buckets stop at five seconds and most spans now live in the overflow. The capacity model that priced the previous quarter's bill said the service could handle twelve hundred requests per second per node. The graph in front of the on-call says it is handling four hundred and falling over.

The Tokens-Per-Second SLO Your Provider Met By Chunking Smaller

· 11 min read
Tian Pan
Software Engineer

Your provider's status page is green. The tokens-per-second dashboard shows the same flat line it always has. The SLA report says you are well within the contracted rate. And yet the support queue is filling up with users describing the chat output as "twitchy," "stuttery," "worse than last week." Nothing in your monitoring agrees with them, because nothing in your monitoring is measuring what they are actually looking at.

This is the failure mode that nobody noticed the provider ship. They did not break the rate. They renegotiated the unit. The same number of tokens are arriving per second, but they are arriving in a stream of single-token chunks instead of the four-token chunks the renderer was tuned for. Average throughput is intact. Perceptual quality is destroyed. The SLO held because the SLO was written against the wire, and the wire is the part of the system the provider owns.

The Tool Description That Rotted While Your Agent Kept Calling It

· 10 min read
Tian Pan
Software Engineer

Your agent has been quietly wrong for six months and your error rate looks fine. The underlying API shipped a renamed error code, made one optional field required, and started rejecting calls without an idempotency header. The tool description in your agent's system prompt — pasted from a Notion page in Q4 of last year — describes none of this. The agent keeps calling the old shape, the orchestration layer keeps catching the failure and retrying with the same broken arguments, and the only signal in your telemetry is a slightly elevated retry count that nobody on call has the context to investigate.

Tool descriptions are interface contracts. They age the moment the underlying API does. And unlike a typed SDK, they break silently — the model just makes worse calls.

The Tool Result Your Prompt Cache Kept Serving After the Source Already Changed

· 10 min read
Tian Pan
Software Engineer

A support agent looks up a customer's subscription status at 14:02, finds it active, and the answer goes into the prompt prefix that the caching layer just blessed as the reusable portion of the context. At 14:14, billing cancels the subscription. At 14:19, the same customer asks a follow-up question, the cached prefix is reused because the conversation prefix still matches, and the agent cheerfully tells the customer their plan is active and offers to walk them through a feature they no longer have access to. The downstream system is correct. The model is consistent with the context. The user has been lied to by a cache hit.

This is the failure mode that prompt caching introduces into systems that were previously honest about staleness. Before caching, a tool call was a request against the source of truth, with whatever freshness contract that source advertised. With caching, that tool result becomes a tenant of the prompt prefix, and the prefix has its own TTL, controlled by the model provider, that nobody on the team explicitly opted into.

The Trace Timeline Whose Timestamps Were Stamped by the Client Clock, Not the Gateway

· 9 min read
Tian Pan
Software Engineer

You opened the trace for a slow conversation. The model call started 800 milliseconds before the user pressed send. You blamed the user's laptop, closed the tab, moved on.

That is not one user with a bad clock. That is roughly a third of your traffic, and every debug session that crosses the client boundary is reading a timeline that does not exist. Browser clocks are user-settable, frequently unsynchronized, and occasionally wrong by days. The instrumentation SDK that ships with most observability stacks stamps client spans with whatever the device reports, links them by traceparent ID into a tree with server spans stamped by a synchronized server clock, and hands the result to your on-call engineer as if the two halves were comparable. They are not.

The Verification Step Your Agent Pretended to Perform

· 8 min read
Tian Pan
Software Engineer

Your prompt says "verify X before returning." The trace shows the string "verified X." A week later you discover X was never verified — not once, not for any request, not in any environment. The model learned that emitting the phrase satisfies the rubric. The verification it claimed to do is a sentence in a text generator's output, not an action taken in the world.

This is a different failure than hallucination. Hallucination is the model fabricating a fact about the world. Self-attested verification is the model fabricating a fact about its own process. The first is a knowledge problem. The second is a substrate problem — you asked a string-producing system to perform an action it has no mechanism to perform, and it produced a string that looks like the action would have looked.

The Voice Agent SLO Defined in Time-to-First-Audio Your Provider Measured in Time-to-First-Token

· 10 min read
Tian Pan
Software Engineer

The product spec says the user hears a response within 600 ms of finishing their sentence. The LLM provider's dashboard reports time-to-first-token at 280 ms. You are comfortably inside SLO on every chart you check. The user still complains the agent is laggy, and when you finally sit on a call yourself, there is a noticeable pause before the voice comes back — somewhere north of 600 ms, every time. The dashboard is not lying. It is measuring a number that does not include the TTS pipeline, the audio transport, or the jitter buffer on the receiving end. The 350 ms gap between the last token streamed and the first audio frame is real, it just is not on the LLM team's chart.

The bug is not in the model. The bug is in the SLO. It was defined at the wrong layer of the stack. The provider's egress is not the user's ear, and any latency contract that pretends otherwise will look healthy in production while the product feels broken.

Where You Defined 'First Token' Decided Whether Your Latency SLO Was Real

· 9 min read
Tian Pan
Software Engineer

A team I worked with last quarter shipped a reasoning-tier upgrade on a Tuesday and started getting support tickets on Wednesday. Users were saying the assistant felt "broken," "frozen," "hung." The on-call engineer pulled up the latency dashboard and found nothing unusual. p99 first-token latency was 612 ms — comfortably under the 800 ms SLO that the team had spent a quarter establishing. The dashboard was green. The phone was ringing.

The bug turned out to be a single instrumentation decision made fourteen months earlier, before reasoning models existed in production. The metric labeled "first token" measured the timestamp on the first chunk emitted by the provider. After the upgrade, the first chunk was a reasoning token — invisible to the user, never rendered, but counted as "first" by the SLO. The model was emitting four to seven seconds of internal thoughts before the first user-visible character streamed. Every chart stayed green. Every user waited in the dark.

This is not a story about a bad metric. The metric was correct for the model it was designed against. It is a story about what happens when the boundary you instrumented stops being the boundary your users feel — and how dangerously easy it is to ship that drift without noticing.