Skip to main content

907 posts tagged with "insider"

View all tags

The KV Cache Eviction Your Provider Called Cache Pressure and Your Bill Called a Doubled Prefix Charge

· 11 min read
Tian Pan
Software Engineer

Your application opens a long conversation with a forty-thousand-token system prompt and a full tool inventory. Turn 1 pays the prefix at write rates and the provider's KV cache warms up. Turn 2 arrives ninety seconds later. You assume it's a cache hit. Sometimes it is. Sometimes the same forty thousand tokens land on your invoice again at uncached prices, and nothing in your code changed between turn 1 and turn 2.

The thing that changed was somebody else's traffic. The KV cache is shared infrastructure. Your tenant was colocated on a serving node that, in the ninety seconds between your two turns, took on enough other tenants to evict your prefix from memory. The provider's dashboard will describe this as "cache pressure." Your finance team will describe it as a line item that doubled. Both descriptions are accurate. Neither is in your code.

The KV Cache Warm-Up Cron That Ran in Blue and Never in Green Because the Host Pinning Never Moved

· 11 min read
Tian Pan
Software Engineer

The incident review reconstructed a deployment from twelve days earlier as the cause of a 3.6× spend increase, and nobody on the call had been in the room when the change shipped. The deployment was routine: blue/green swap, traffic moved to green on schedule, blue decommissioned, the pipeline turned green, the release engineer closed the ticket. None of the production SLOs tripped. None of the application-layer alerts fired. The system ran exactly as designed.

What had been designed was a five-minute cron that pre-warmed the provider's prompt cache against the stable system-prompt prefix every five minutes. The warm-up gave the team a 91% cache hit rate on cold starts and roughly a 4× cost advantage on the first request per session. The cron had been authored a year ago when the blue/green pattern was first introduced, and its host selector was pinned to the blue pool to avoid running the warm-up twice during overlap windows. When green became the live color and blue went away, the cron lost its host and silently transitioned from "running every five minutes" to "running never." The cache hit rate decayed over the next 36 hours as the provider's cache TTL aged out the pre-warmed prefixes. The cost dashboard, averaging per-request cost across a daily window, smoothed the slope until the next billing cycle made it loud.

The Logprobs Field Your Provider Removed That Broke Your Confidence Router Silently

· 12 min read
Tian Pan
Software Engineer

The most expensive line in the postmortem was the one nobody wrote: a 200 OK with a missing field. The router that was supposed to escalate hard questions to the stronger model had been escalating zero percent of traffic for six weeks. The cost dashboard was celebrating. The quality dashboard was sliding, but only on the hard-question slice the standing eval set underweighted. Everything looked like a win until a customer complained about a specific kind of question the system used to handle correctly.

The cause was a response shape change one tier up the contract stack. The provider's mid-tier plan had dropped per-token logprobs as part of what the release notes called a "tier-specific feature parity adjustment." The client still received valid JSON. The HTTP status was still 200. The model identifier in the response matched the model identifier in the request. The only thing that changed was that the field the router consumed to make its escalation decision was no longer there, and the defensive default added during an incident a year earlier had quietly become the production default for every request.

The max_tokens Default Your Provider Raised That Doubled Your Tail Response Length

· 12 min read
Tian Pan
Software Engineer

Your incident timeline shows no deploys. Your code did not change. Your traffic mix did not change. Your prompts did not change. And yet your p99 output length doubled inside a week, your downstream rendering layer started clipping responses, and your output-token bill rose 38% on traffic that wasn't asking for longer answers. The change was real, the regression was measurable, and nothing in your version control system records it — because the value that moved was one your code never sent.

The provider raised an implicit default. The release notes filed it under "improved long-form behavior." The parameter in question was max_tokens, which your application has been omitting since day one because the documented default was generous and your outputs rarely came close. The default moved from 4096 to 8192 to accommodate longer reasoning in the provider's newer models. Your application got the new default whether you wanted it or not, because the absence of a parameter is itself a configuration choice — and the provider owns the right to change the value behind it.

This is the failure mode where a "no-op" release on the provider's side propagates through your system as a behavior change, a cost change, and a UX change all at once, and your team's only diagnostic signal is the bill arriving at the end of the month.

The MCP Server Your Team Forgot Was Running with Prod Credentials

· 10 min read
Tian Pan
Software Engineer

A new engineer joined the team on Monday. By Wednesday, she had a working local agent setup: an MCP server bridged to the company's deployment API, pointed at staging, talking to her editor. The onboarding doc walked her through the OAuth flow. The token she pasted into the server's environment file was the one her teammate had emailed her — the same token the CI pipeline uses to ship to staging. By Friday, she had joined the team for a working session at a coworking space.

The MCP server was still running. Bound to 127.0.0.1. No authentication. The token was loaded into the process. She didn't think about it because she was not using it. But any tab that visited any website that day could speak to her local server through her own browser. So could any other laptop on the coworking wifi, because she had not noticed that the server was actually bound to 0.0.0.0. The OAuth token your CI pipeline uses to push to staging was now reachable by anyone who could trick a browser into making a request to a local IP — which, in 2026, is a one-pop-up problem.

This post is about that class of failure: the gap between "I'm developing on my laptop" and "my laptop is a server reachable by adversaries." MCP servers, by design, sit right in that gap. Most teams have not noticed.

The Model Card Benchmark Whose Methodology Shifted While Your Contract Cited the Number

· 11 min read
Tian Pan
Software Engineer

Your procurement team renewed the inference contract last quarter and noted, with quiet satisfaction, that the quality clause referencing "HumanEval pass@1 of 84%" had been comfortably exceeded by the provider's latest model card, which now reports 87%. Three points to the good. The clause is satisfied. The relationship is healthy. Meanwhile, your inference team's own regression suite — the one that actually exercises the tasks your product depends on — shows a 2% decline on held-out evaluation cases since the model update shipped. Both numbers are real. Only one of them is in the contract.

This is what it looks like when a marketing artifact is load-bearing in a legal document. The benchmark number on the model card is the headline of a measurement; the methodology that produced it is a footnote in an appendix nobody on the contract review chain reads. When the provider changes the methodology — switches from greedy decode to best-of-three sampling, adds a structured-output system message, swaps the prompt template to match the model's new chat tuning — the number moves in a way that has nothing to do with your traffic and everything to do with how the number is computed. Your contract clause cites the number. The counterparty controls the protocol that produces it. You've signed a clause whose meaning the other side can revise without violating it.

The Model Identifier Your Provider Re-Pointed to a Finetune for One Tenant and to Base for Everyone Else

· 11 min read
Tian Pan
Software Engineer

A customer support team escalates: "Your assistant used to handle refund-eligibility questions correctly. Last week it started getting them wrong." The on-call engineer pulls a transcript, replays the exact prompt against the same model identifier in a dev account, gets the correct answer, and closes the ticket as "cannot reproduce." Two weeks later the same complaint shows up from a different customer. The engineer replays again, in the same dev account, and gets the correct answer again. The team starts blaming a prompt change nobody made.

The model identifier in the request never changed. The string in the response field matched the string in the request field. The eval suite stayed green for six weeks. The model serving production traffic was a different set of weights from the model serving the eval suite, and had been for the entire life of the account — except for the last six weeks, when it became the same set of weights and the team noticed only because a customer noticed first.

The Model Rollout Flag That Bucketed by Session and Drifted Your A/B Cohort

· 11 min read
Tian Pan
Software Engineer

The post-mortem opened with a sentence everyone in the room wanted to be true: the new model won by 4 percent on satisfaction, p less than 0.01, ship it. A month later a colder analysis found that the lift was a confound, the model was actually flat or slightly worse, and the team had spent the intervening weeks debating which prompt change had "caused" the win. Nothing about the model had caused anything. The experiment had been measuring the wrong thing because the flag service and the analysis pipeline disagreed, silently, about what a cohort was.

This is one of the most expensive failure modes in A/B testing because nothing in the system is broken. The flag service works. The experiment tracker works. The dashboard renders. The statistics are computed correctly on the data they receive. The failure lives in the seam between three components that each carry a different assumption about identity, and the seam is invisible until you go looking for it.

The Nightly Batch That Starved Your Interactive Traffic After a Quota Window Rewrite

· 11 min read
Tian Pan
Software Engineer

A cron job that ran cleanly for ten months is the most dangerous job in your system, because nothing in it changed and nothing in your code changed and the only thing that did change was a sentence in someone else's release notes that nobody on your team reads. The nightly embedding refresh that kicked off at 00:05 UTC every night, drained its work queue in under ten minutes, and went back to sleep was textbook. It coexisted with daytime interactive traffic by occupying the freshly-reset minute quota for a few minutes before users woke up, and by staying well under the daily allotment for the rest of the day. Then the provider rewrote how the daily window was accounted, kept the minute window unchanged, and left every signature your client tested against intact. The batch kept running clean. The interactive surface started returning 429s at 00:13 UTC every night. The team chased an upstream maintenance window that wasn't happening for a week.

The bug was never in your code. The bug was that "a daily limit" stopped meaning what it had meant the day before, and your scheduler was pinned to a wall-clock boundary that aligned with the old meaning. This post is about rate-limit accounting as a contract the provider can revise without breaking any signature, about how two independently-correct schedules compose into a denial-of-service pattern, and about the architectural moves that make a cron job stop being a time bomb wired to someone else's clock.

The OAuth Scope Your Agent Inherited When On-Behalf-Of Quietly Became Act-As

· 10 min read
Tian Pan
Software Engineer

The security review said the agent acts "on behalf of" the user. The OAuth token said something else, and the audit log agreed with the token.

A small distinction in language did a lot of architectural work nobody noticed. "On behalf of" is the language a security review reaches for when it wants to capture an arrangement where the agent is a delegate, recognizable as a delegate, and constrained by being a delegate. "Act as" is the runtime behavior when the agent holds a token indistinguishable from the user's own and is therefore the user as far as every downstream system can tell. These two phrases describe completely different threat models. A typical enterprise OAuth integration ships the second one and prices it as the first.

The On-Call Rotation That Muted LLM Pages Because Every One Looked Like The Last One

· 11 min read
Tian Pan
Software Engineer

A real regression burned for two days in production. The page had fired. It had fired correctly, at the right threshold, with the right severity. Three weeks earlier the on-call rotation had added a silence rule for that alert family because every page in that family had so far resolved with the same comment: "nothing to do, investigating." The post-mortem could not honestly call the silence a mistake. It was a rational adaptation to a stream of pages the rotation had no playbook for. The regression that mattered shipped against a muted channel because the team's monitoring stack was producing signals it could not act on, and the team's response to that was the only one available: stop listening.

This is not an alerting bug. It is a structural property of how AI features get instrumented when teams reach for the playbook they already know. Latency, error rate, refusal rate, output schema conformance, judge-eval drift — each one is a defensible metric. Each one fires with the same diffuse "model behavior changed" wording. None of them tells the on-call engineer what to do, because no one has written the runbook that maps each signal to an action, because most of the time the signal does not map to an action. The rotation absorbs the noise until the noise is louder than the signal, then it routes around the channel that produces it.

The Persona Your System Prompt Offered That the Model Picked the Same Way Every Time

· 10 min read
Tian Pan
Software Engineer

A product team I talked to recently ran a three-arm A/B test on response personas — concise, thorough, conversational — for three weeks across every cohort. The system prompt described all three and asked the model to pick the one that best matched the user. When they opened the dataset to write the readout, one number stopped them cold: the "thorough" arm had 91% of the traffic. The other two were rounding error.

Their experiment platform had not flagged anything. No alert fired. The pipeline did exactly what they had told it to do. Three weeks of supposed multi-persona testing had produced a dataset that could only tell them about thorough. The other two arms were too thin to power any inference at all.

The instinct in the room was that the prompt needed work — better instructions, sharper distinctions between personas, a more deliberate example for the conversational case. That diagnosis would have been right ten years ago in a rules-driven router. It is wrong for a model. The prompt was not the variable. The router was.