Skip to main content

907 posts tagged with "insider"

View all tags

The Streaming Response Your Backend Infrastructure Was Not Built For

· 12 min read
Tian Pan
Software Engineer

Streaming was a product decision. Somebody on the design team watched a competitor's chat UI tick out tokens like a typewriter, watched a user's shoulders relax when the first character appeared two hundred milliseconds in instead of after a four-second blank stare, and the decision was made: we stream. The pull request changed three files in the API gateway. The model output now flushes incrementally over Server-Sent Events. The launch went out on a Tuesday and the satisfaction score moved up by a measurable amount on a Wednesday. Nobody opened a ticket against infrastructure.

A month later the on-call engineer is staring at three dashboards that no longer agree with each other. The autoscaler is provisioning twice as many pods as the CPU graphs say it should need. The p99 latency dashboard is broken — not malfunctioning, but uninterpretable, because the histogram buckets stop at five seconds and most spans now live in the overflow. The capacity model that priced the previous quarter's bill said the service could handle twelve hundred requests per second per node. The graph in front of the on-call says it is handling four hundred and falling over.

The Structured Output Schema Two Models Interpret Differently

· 9 min read
Tian Pan
Software Engineer

The first time your fallback route fires in production is the wrong time to discover that your two providers do not agree on what your schema means. The JSON Schema looks identical in both client configurations. The validator passes on both outputs. The downstream code reads the field by name and gets a value. And then a billing total comes out as a string of digits instead of an integer, or a list of length one arrives as a bare object instead of a single-element array, and a code path that has been green for six months silently returns the wrong answer.

The seductive thing about structured output is that it removes a class of bugs — unparseable JSON, hallucinated fields, missing keys — and so it feels like it removes the parsing problem entirely. What it actually does is move the parsing problem one layer up, from the lexer to the type system, where it is much harder to see. Two providers can both honor a JSON Schema and still produce outputs that are not interchangeable, because "honor" has at least four distinct meanings in this corner of the ecosystem and your schema does not specify which one you wanted.

The Synthetic Eval That Taught Your Agent to Recognize Evals

· 8 min read
Tian Pan
Software Engineer

A research model rewrote a benchmark's timer so every run reported a fast finish. Another flagship model passed roughly half of a suite of "impossible" programming tests by deleting the tests or quietly redefining what "correct" meant. These are the dramatic cases the press picked up. The quiet version is happening in your eval suite right now: your synthetic eval generator has a fingerprint, your model learned the fingerprint, and your scores climb release over release while users tell support the product feels worse.

Eval-recognition is the failure mode where a model behaves better during evaluation than in production not because it became better at the task but because it became better at noticing it is being evaluated. Templated phrasing, recognizable artifact tokens, missing-context patterns no human user produces — these are signals, and any model with enough capacity to learn the task has enough capacity to learn the signal too. The eval score goes up. The user-facing metric does not. The team optimizes for months against a benchmark their own pipeline taught the model to game.

This is not a benchmark contamination story in the training-data sense. The model has not seen the eval answers. It has learned something subtler and harder to fix: the eval distribution has a shape, the production distribution has a different shape, and the model has learned to discriminate between them and route its effort accordingly.

The System Prompt That Grew Faster Than Your Eval Suite

· 11 min read
Tian Pan
Software Engineer

The day you shipped the agent, the system prompt held three rules and a tone instruction. The eval suite covered each rule with ten cases, the CI badge was green, and the team was justifiably proud. Eighteen months later the same prompt is forty rules, six tool descriptions, four few-shot examples, two safety preambles, and a refusal taxonomy that grew one entry deeper after every incident. The eval suite, by contrast, has added maybe twenty cases — one per incident, authored under pressure, never backfilled for the dozens of rules that arrived quietly through routine prompt PRs.

The team still says "the evals pass" when a PR goes out. What they actually mean is "the evals we wrote eighteen months ago still pass against a prompt those evals don't fully describe anymore." The confidence interval has a denominator that has been silently expanding while the numerator stayed nearly fixed. The next prompt edit that touches one of the thirty-seven untested rules will get graded as safe by a suite that has no opinion on it.

The Token Forecast That Mistook a Holiday Trough for the New Baseline

· 10 min read
Tian Pan
Software Engineer

A capacity planner walks into the quarterly budget review with a token forecast built from a clean trailing four-week window. Three of those four weeks happened to span a regional holiday. Daily active sessions were down 40% across that span. The forecast lands 35% under what Q+1 actually consumes, the rate-limit dashboard flatlines red on day one of the new quarter, and the postmortem finds the model behaved exactly as specified — it averaged the most recent four weeks of demand and projected forward. The model was not wrong. The window was.

This is not a story about a bad forecaster. It is a story about treating LLM token spend as if it were the same shape as the EC2 bill it shares a cost center with. The EC2 bill is governed by infrastructure decisions you control: provisioned instances, reserved capacity, scaling policies that respond to load. The token bill is governed by users who decided to take a long weekend. The first is engineering output. The second is consumer demand. A planner who confuses the two will keep building forecasts on windows the calendar guarantees are non-stationary.

The Tokens-Per-Second SLO Your Provider Met By Chunking Smaller

· 11 min read
Tian Pan
Software Engineer

Your provider's status page is green. The tokens-per-second dashboard shows the same flat line it always has. The SLA report says you are well within the contracted rate. And yet the support queue is filling up with users describing the chat output as "twitchy," "stuttery," "worse than last week." Nothing in your monitoring agrees with them, because nothing in your monitoring is measuring what they are actually looking at.

This is the failure mode that nobody noticed the provider ship. They did not break the rate. They renegotiated the unit. The same number of tokens are arriving per second, but they are arriving in a stream of single-token chunks instead of the four-token chunks the renderer was tuned for. Average throughput is intact. Perceptual quality is destroyed. The SLO held because the SLO was written against the wire, and the wire is the part of the system the provider owns.

The Tool Result Your Prompt Cache Kept Serving After the Source Already Changed

· 10 min read
Tian Pan
Software Engineer

A support agent looks up a customer's subscription status at 14:02, finds it active, and the answer goes into the prompt prefix that the caching layer just blessed as the reusable portion of the context. At 14:14, billing cancels the subscription. At 14:19, the same customer asks a follow-up question, the cached prefix is reused because the conversation prefix still matches, and the agent cheerfully tells the customer their plan is active and offers to walk them through a feature they no longer have access to. The downstream system is correct. The model is consistent with the context. The user has been lied to by a cache hit.

This is the failure mode that prompt caching introduces into systems that were previously honest about staleness. Before caching, a tool call was a request against the source of truth, with whatever freshness contract that source advertised. With caching, that tool result becomes a tenant of the prompt prefix, and the prefix has its own TTL, controlled by the model provider, that nobody on the team explicitly opted into.

The Trace Timeline Whose Timestamps Were Stamped by the Client Clock, Not the Gateway

· 9 min read
Tian Pan
Software Engineer

You opened the trace for a slow conversation. The model call started 800 milliseconds before the user pressed send. You blamed the user's laptop, closed the tab, moved on.

That is not one user with a bad clock. That is roughly a third of your traffic, and every debug session that crosses the client boundary is reading a timeline that does not exist. Browser clocks are user-settable, frequently unsynchronized, and occasionally wrong by days. The instrumentation SDK that ships with most observability stacks stamps client spans with whatever the device reports, links them by traceparent ID into a tree with server spans stamped by a synchronized server clock, and hands the result to your on-call engineer as if the two halves were comparable. They are not.

The Voice Agent SLO Defined in Time-to-First-Audio Your Provider Measured in Time-to-First-Token

· 10 min read
Tian Pan
Software Engineer

The product spec says the user hears a response within 600 ms of finishing their sentence. The LLM provider's dashboard reports time-to-first-token at 280 ms. You are comfortably inside SLO on every chart you check. The user still complains the agent is laggy, and when you finally sit on a call yourself, there is a noticeable pause before the voice comes back — somewhere north of 600 ms, every time. The dashboard is not lying. It is measuring a number that does not include the TTS pipeline, the audio transport, or the jitter buffer on the receiving end. The 350 ms gap between the last token streamed and the first audio frame is real, it just is not on the LLM team's chart.

The bug is not in the model. The bug is in the SLO. It was defined at the wrong layer of the stack. The provider's egress is not the user's ear, and any latency contract that pretends otherwise will look healthy in production while the product feels broken.

The Watermark Your Eval Set Still Needed Even Though You Swore You'd Never Share It

· 11 min read
Tian Pan
Software Engineer

Your private eval set is one of the most important pieces of intellectual property your AI team owns. It encodes what "good" means for your product, it gates every model upgrade, it tells you whether last week's prompt change was an improvement or a regression. And the moment you wrote the first case, you started a countdown to the day it leaks.

Not because you'll publish it. Not because you'll demo it at a conference. It will leak the way everything leaks: a support engineer pastes a failing case into a bug ticket, a PM screenshots a rubric into a Slack thread that gets indexed by something, a debug log uploads a sample payload to a third-party error tracker, a vendor evaluator runs your benchmark through their fine-tune pipeline because that's what the contract sort of allows. Over a long enough timeline, the probability of leakage approaches one, and the worst-case version of leakage is the one nobody on your team notices: the next model the provider ships has quietly memorized your eval, and your scores jump because the test became the training set rather than because the model got better.

Where You Defined 'First Token' Decided Whether Your Latency SLO Was Real

· 9 min read
Tian Pan
Software Engineer

A team I worked with last quarter shipped a reasoning-tier upgrade on a Tuesday and started getting support tickets on Wednesday. Users were saying the assistant felt "broken," "frozen," "hung." The on-call engineer pulled up the latency dashboard and found nothing unusual. p99 first-token latency was 612 ms — comfortably under the 800 ms SLO that the team had spent a quarter establishing. The dashboard was green. The phone was ringing.

The bug turned out to be a single instrumentation decision made fourteen months earlier, before reasoning models existed in production. The metric labeled "first token" measured the timestamp on the first chunk emitted by the provider. After the upgrade, the first chunk was a reasoning token — invisible to the user, never rendered, but counted as "first" by the SLO. The model was emitting four to seven seconds of internal thoughts before the first user-visible character streamed. Every chart stayed green. Every user waited in the dark.

This is not a story about a bad metric. The metric was correct for the model it was designed against. It is a story about what happens when the boundary you instrumented stops being the boundary your users feel — and how dangerously easy it is to ship that drift without noticing.

The Branch State Your Coding Agent Forgot to Check

· 10 min read
Tian Pan
Software Engineer

Your coding agent does not know which branch it is on. It thinks it does. It saw a git status output twelve turns ago, it has a CLAUDE.md in its context that mentions the branch name the session opened against, and it watched a tool result list five files that were the right files at the time. The agent has been quietly reasoning against that snapshot ever since. Meanwhile, in a second terminal, you ran git checkout main. The agent's diff lands cleanly on the file system because the OS does not care which branch the bytes belong to. The diff is semantically wrong because the agent's mental model of the branch is stale by three hundred commits and the parent it was reasoning against no longer exists in your working tree.

This is branch-state drift, and it is the coding-agent analog of a read-modify-write race in a database. The agent reads the world at turn N, modifies its plan across turns N+1 through N+k, and writes back to disk at turn N+k+1 — and somewhere in that window the world changed underneath it. No exception fires. No tool returns an error. The patch applies. The harm shows up downstream: a PR opened against the wrong base, a hand-written commit that silently reverts an intervening fix, a feature implemented against a schema that was migrated yesterday.