907 posts tagged with "insider"

The Token Count Your Client Estimated And Your Provider Invoiced

· 12 min read
Tian Pan
Software Engineer

Your application counted tokens locally with a tokenizer library matching what you believed the provider used. The SDK reported "estimated 4,200 tokens" before each call. Your budget logic admitted the request. Then the provider's invoice came back at 6,800 tokens for the same payload. Multiply that 60% gap by a few million calls a month and the line item your finance team cannot reconcile against your own logs starts to look like an architectural mistake rather than a rounding error.

The mistake is not that the local tokenizer was wrong. The mistake is treating the local tokenizer as a contract instead of a guess. Tokenization is something the provider does inside their serving stack — your library is a model of that process, not the process itself, and the two drift in ways that are small per call and structural across the population of calls you actually make.
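One way to treat the local count as a guess rather than a contract is to record the ratio between what was estimated and what the provider later reported, then inflate future estimates by a high quantile of that ratio. The sketch below is illustrative only: the `TokenReconciler` class, its method names, and the 0.95 quantile are assumptions, and the billed count is assumed to come from whatever usage field your provider returns.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class TokenReconciler:
    """Hypothetical helper: learns how far local estimates run below billed counts."""

    ratios: List[float] = field(default_factory=list)

    def record(self, estimated: int, billed: int) -> None:
        # Store billed/estimated for every completed call with a usable estimate.
        if estimated > 0:
            self.ratios.append(billed / estimated)

    def margin(self, quantile: float = 0.95) -> float:
        # With no history, fall back to trusting the estimate as-is.
        if not self.ratios:
            return 1.0
        ordered = sorted(self.ratios)
        index = min(len(ordered) - 1, int(quantile * len(ordered)))
        # Never shrink an estimate; only inflate it.
        return max(1.0, ordered[index])

    def budgeted_tokens(self, estimated: int) -> int:
        # The number the budget logic should admit against.
        return math.ceil(estimated * self.margin())
```

Budget logic would then admit requests against `budgeted_tokens(estimate)` instead of the raw estimate, and the stored ratios double as a drift signal to alert on.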

The Tokenizer Upgrade That Invalidated Every Prompt Cache Prefix

· 9 min read
Tian Pan
Software Engineer

The release notes were two lines long. "Improved multilingual tokenization. No breaking changes to model outputs." Nine words. Your evals confirmed it: same prompts, same completions, same scores. Your platform team signed off on the upgrade Friday afternoon. By Tuesday morning your cache hit rate had collapsed from 80% to 4%, your daily inference bill had quadrupled, and the on-call engineer who paged you at 6am could not find a single line of your code that had changed.

Nothing in your code had changed. The provider had shipped a new tokenizer whose handling of a single Unicode glyph differed from the old one by one byte. Every cached prefix in your system was now fingerprinted against a token sequence that no longer existed. The model behaved identically; that much was true. The cache layer, which the release notes did not mention, paid the bill in full.
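Because nothing in your own code changes in this scenario, the collapse only shows up in usage data. A minimal sketch of a hit-rate check follows, assuming your provider reports cached and total prompt tokens per call; the function names and the 50% relative-drop threshold are hypothetical.

```python
def cache_hit_rate(cached_tokens: int, prompt_tokens: int) -> float:
    """Fraction of prompt tokens served from the provider's prompt cache."""
    if prompt_tokens <= 0:
        return 0.0
    return cached_tokens / prompt_tokens


def hit_rate_collapsed(baseline: float, current: float,
                       max_relative_drop: float = 0.5) -> bool:
    """True when the current hit rate fell more than max_relative_drop below baseline."""
    return current < baseline * (1.0 - max_relative_drop)
```

With the figures from the excerpt, `hit_rate_collapsed(0.80, 0.04)` is true. An alert on this check would page on the cache layer directly rather than on the inference bill that follows it.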

The Tool Description That Drifted Out of Sync With the Tool It Described

· 12 min read
Tian Pan
Software Engineer

A backend engineer renames a parameter from user_id to account_id because the two stopped being the same thing six months ago, and a support ticket finally made the ambiguity intolerable. The JSON schema for the tool gets updated in the pull request that ships the rename. The tool's prose description — the one paragraph the model actually reads to decide whether to call the tool and how — lives in a different repository, owned by a different team, updated through a ticket queue, and still reads "pass the user_id to look up the account." Nobody flags it. The model dutifully calls the tool with the right schema, fills the right field, and gets the right answer on every single happy-path query. The bug is invisible until the day a user types something where their authenticated user_id and the account_id they were asking about are two different entities, and the agent confidently returns somebody else's data.

The traceparent Header Your Gateway Dropped Between LLM Call and Tool Execution

· 11 min read
Tian Pan
Software Engineer

A user reports that the agent answered correctly but the database update never happened. You open your observability tool, search for the trace ID stamped on the user-facing conversation, and find a clean tree — five LLM calls, four tool decisions, a final response. No errors. Then you search for the tool service that owns the database write, and you find another trace, with the same wall-clock window but a different trace ID, a different root span, and no link back. You search the gateway logs. Three more orphan traces. The agent run that looked like a single coherent interaction in the chat UI fragmented, in your tracing backend, into a forest.

The header that should have stitched it together is traceparent. It is a 55-byte W3C-standard string that every span in a distributed system uses to identify its parent. It is also, in most production LLM agent stacks, dropped at least once between the user's request and the side effect the user actually wanted.
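For reference, the 55 bytes break down as a 2-character version, a 32-character trace ID, a 16-character parent span ID, and 2-character flags, joined by three hyphens. The sketch below parses an incoming header and builds the value a hop should forward so the trace ID survives. The function names are hypothetical, and starting a fresh trace with flags `00` when the header is missing is an assumption about local sampling policy.

```python
import re
import secrets
from typing import Optional, Tuple

# version-traceid-parentid-flags, all lowercase hex (W3C Trace Context).
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str) -> Optional[Tuple[str, str, str, str]]:
    """Return (version, trace_id, parent_id, flags), or None if invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the specification.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return version, trace_id, parent_id, flags


def child_traceparent(incoming: Optional[str]) -> str:
    """Build the header for an outgoing call, keeping the trace ID when possible."""
    parsed = parse_traceparent(incoming) if incoming else None
    if parsed is None:
        # Assumption: a missing or invalid header starts a new, unsampled trace.
        trace_id, flags = secrets.token_hex(16), "00"
    else:
        _, trace_id, _, flags = parsed
    span_id = secrets.token_hex(8)  # new 16-hex-character ID for this hop
    return f"00-{trace_id}-{span_id}-{flags}"
```

In practice an OpenTelemetry propagator would do this for you; the point of the sketch is that every hop, including the gateway, has to forward the value for the tree to stay in one piece.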

The Transcription Confidence Score Your Agent Trusted After the Vendor's Recalibration

· 10 min read
Tian Pan
Software Engineer

The voice agent had a gate. Anything above 0.85 transcription confidence went straight to the planning step; anything below got routed to a human. The threshold had been tuned six months earlier against a labeled corpus of real customer calls, frozen into a config file, and forgotten. For six months it did exactly what it was supposed to do. Then the transcription provider shipped a model upgrade — same API, same response shape, same latency band, same documented accuracy — and over the next two weeks the agent started authorizing wire transfers to the wrong people.

"Transfer $50 to mom" became "transfer $5,000 to Tom." The new transcript came back with a confidence of 0.91, well above the gate. The downstream planner saw a confident transcript and acted on it. The customer's appeal eventually surfaced the bug, but by then the support queue had filtered out a week's worth of similar incidents as fraud disputes. The post-mortem traced the gap to a single decision the team had never made explicitly: that 0.85 from the old model and 0.85 from the new model were the same number.
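One way to make that decision explicit is to re-derive the gate from labeled data every time the transcription model changes, instead of carrying the old number forward. The sketch below finds the lowest threshold that still meets a target precision on a labeled sample. The function name and the sample format of `(confidence, is_correct)` pairs are assumptions, and the numbers in the usage note are invented for illustration.

```python
from typing import Iterable, Optional, Tuple


def min_threshold_for_precision(
    samples: Iterable[Tuple[float, bool]], target_precision: float
) -> Optional[float]:
    """Lowest confidence t such that transcripts scoring >= t meet the target precision.

    Returns None when no threshold on this sample reaches the target.
    """
    ordered = sorted(samples, key=lambda s: s[0], reverse=True)
    best = None
    correct = 0
    for seen, (confidence, is_correct) in enumerate(ordered, start=1):
        correct += 1 if is_correct else 0
        # Only evaluate once all samples sharing this confidence are included.
        if seen < len(ordered) and ordered[seen][0] == confidence:
            continue
        if correct / seen >= target_precision:
            best = confidence
    return best
```

For example, on the invented sample `[(0.97, True), (0.93, True), (0.91, False), (0.70, False)]` with a target of 1.0, the function returns 0.93: a gate of 0.85 tuned on the previous model would have admitted the incorrect 0.91 transcript.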

Your Latency SLO Is a Function of Other Teams' Prompt Sizes

· 10 min read
Tian Pan
Software Engineer

Your chat product has been running quietly at a 1.5-second p99 latency SLO for months. The request rate is flat, the prompt sizes are flat, the model has not changed. Then, on a Tuesday afternoon, p99 jumps to 4.8 seconds and stays there. The on-call investigation finds no anomaly in the chat path: same requests-per-minute, same median prompt of around 800 tokens, same retry behavior on the SDK. The deploy log for the chat service is empty for the day. The breach lasts six hours.

The cause is in another team's repo. That morning, a long-document summarization feature shipped on the same organization key, with average prompts of 12,000 tokens. Their request rate is modest — a few hundred per minute — but each call burns through the shared tokens-per-minute budget fifteen times faster than yours. The provider's throttle fires on the chat path because the chat path was holding the same bucket the summarization team just emptied. Nobody changed your code, nobody breached anyone's planned capacity, and your SLO is now a function of a workload your team has never read.
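The arithmetic is worth making concrete. The sketch below computes how much of a shared tokens-per-minute limit remains once every workload on the key is counted. Only the 800-token and 12,000-token prompt sizes come from the scenario above; the request rates and the 4,000,000 TPM limit are hypothetical.

```python
from typing import Dict, Tuple


def tokens_per_minute(requests_per_minute: int, avg_prompt_tokens: int) -> int:
    """Prompt tokens a workload draws from the shared bucket each minute."""
    return requests_per_minute * avg_prompt_tokens


def shared_headroom(limit_tpm: int, workloads: Dict[str, Tuple[int, int]]) -> int:
    """Remaining TPM after all workloads on the key; negative means throttling."""
    used = sum(tokens_per_minute(rpm, tokens) for rpm, tokens in workloads.values())
    return limit_tpm - used


# Hypothetical request rates and limit; prompt sizes from the scenario above.
workloads = {
    "chat": (1000, 800),              # 800,000 TPM
    "summarization": (300, 12_000),   # 3,600,000 TPM
}
headroom = shared_headroom(4_000_000, workloads)  # -400,000: the bucket is overdrawn
```

Under these assumed numbers, chat alone uses a fifth of the limit, and the summarization feature at under a third of chat's request rate pushes the shared key past it.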

Retrieval Pipeline Residency: The Embedding That Crossed the Border Your LLM Call Didn't

· 9 min read
Tian Pan
Software Engineer

The team that ships "AI for EU customers" usually ships exactly one residency control: an inference endpoint pinned to an EU region. The procurement team gets a DPA, the architecture diagram gets a green checkmark next to "model hosted in Frankfurt," and the launch proceeds. What the diagram doesn't show is that the customer's verbatim query gets vectorized by a US-hosted embedding API on its way to the model, that the vector store the query is matched against has its operational plane in us-east-1, that the rerank model is a third-party SaaS deployed wherever the vendor chose, that the prompt cache is keyed regionally on hits and globally on misses, and that the trace store logging the retrieved chunks has a 30-day retention bucket that replicates cross-region for redundancy.

The inference layer respects residency. The retrieval pipeline doesn't even know it's a participant.

This is the gap where most "GDPR-compliant" RAG deployments fail an audit the team didn't realize was coming. The fix isn't another control on the model call — it's recognizing that data residency is a property of every component the customer's bytes touch, and that the team owning "the LLM" owns at most one of the six surfaces involved.

The 429 Whose Body Said OK And Your Client Believed The Body

· 9 min read
Tian Pan
Software Engineer

The outage started at 14:03 with a 429 from the provider and a JSON body that said {"status": "ok", "data": null}. The client library was written in a hurry six months ago by someone who had been burned twice before — once by a gateway that returned HTTP 200 with an error field, and once by a provider that returned HTTP 500 on a request that had actually succeeded. So the library learned to trust the body, not the status. The status said throttle. The body said proceed. The client believed the body, fired the next request, got another 429 with another ok, fired again, and by 14:11 the provider's circuit breaker had blacklisted the account for the rest of the hour.

The provider hadn't lied, exactly. The 429 was real. But somewhere in the response pipeline a default envelope had been merged over the rate-limit payload — a generic {"status": "ok"} from a wrapper service that filled missing fields, applied on top of an error the wrapper didn't recognize. The status code was correct, the headers were correct, the body was wrong, and the body was the part the client read.
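A client can stay defensive about bodies that hide errors without letting the body override a throttle. The sketch below is one possible ordering, not the library from the incident: the status code decides first, and the body is consulted only on a 2xx. The function names, the decision labels, and the 1-second default backoff are assumptions, and only the delta-seconds form of `Retry-After` is handled.

```python
from typing import Any, Mapping, Optional, Tuple


def _retry_after_seconds(headers: Mapping[str, str], default: float) -> float:
    """Read Retry-After as seconds; fall back to default if absent or not numeric."""
    try:
        return max(0.0, float(headers.get("retry-after", "")))
    except (TypeError, ValueError):
        return default


def classify_response(
    status_code: int, headers: Mapping[str, str], body: Any
) -> Tuple[str, Optional[float]]:
    """Return (decision, wait_seconds). The status code outranks the body."""
    normalized = {key.lower(): value for key, value in headers.items()}
    if status_code == 429:
        # Throttled: back off no matter what the body claims.
        return ("backoff", _retry_after_seconds(normalized, default=1.0))
    if status_code >= 500:
        return ("retry", None)
    if status_code >= 400:
        return ("fail", None)
    # Only on success do we let the body downgrade the result.
    if isinstance(body, Mapping) and body.get("error"):
        return ("fail", None)
    return ("ok", None)
```

Given the response from 14:03, `classify_response(429, {"Retry-After": "30"}, {"status": "ok", "data": None})` yields `("backoff", 30.0)`, so the merged envelope never gets a vote.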

The A/B Test Powered by Token Counts Instead of Outcomes

· 13 min read
Tian Pan
Software Engineer

A team I worked with shipped a prompt change that reduced output tokens by 22%. The experiment dashboard lit up green — variance was tight, the p-value was clean, and the cost savings extrapolated to six figures a year. Two weeks later, a product analyst poking at conversion funnels flagged that the downstream task completion rate had dropped 11% in the same window. The shorter outputs were leaving out a clarifying step that users had been quietly relying on to know what to click next.

The experiment platform had not lied. It had reported the exact metric the team configured as primary, and that metric had moved in the right direction. The problem was that the metric measured something the team did not actually care about. Tokens were cheap to count, the experiment infra had a turnkey integration for them, and outcomes were hard to instrument — so the team picked what the platform made easy. The result was a clean win on the dashboard and a regression in the product.

The Agent Budget That Approved Cost-Per-Call and Never Measured Cost-Per-Resolved-Task

· 10 min read
Tian Pan
Software Engineer

A quarter into the rollout, the AI team reported a 25% reduction in average cost-per-API-call. The support team reported that average handle time on AI-routed tickets had drifted from four turns to seven. Both numbers were correct. Both teams were measuring the system they had been told to optimize. The finance team, sitting between them, could not reconcile the dashboards because neither one was denominated in the thing the customer was actually paying for: a resolved ticket. The cost-per-call had gone down. The cost-per-resolved-task had gone up 40%. Nobody owned that number, so nobody was watching it move.

This is the most common unit-economics failure I see in agentic deployments, and it is not a measurement bug. It is a definitional one. The vendor's pricing page exposes cost-per-call because that is the unit they bill. The spreadsheet line item inherits that unit because it fits in a cell. The engineering team optimizes against the unit they were given. By the time the gap between API economics and business economics becomes visible, it has been compounding for a quarter, and the agent has been quietly trained on the wrong loss function the entire time.
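The two units can be related with a small calculation. The sketch below is a simplification that assumes one billed call per conversation turn and normalizes the starting cost-per-call to 1.0; the function name and the `resolution_rate` parameter are hypothetical.

```python
def cost_per_resolved_task(
    cost_per_call: float, calls_per_task: float, resolution_rate: float
) -> float:
    """Spend per ticket actually resolved, not per API call."""
    if resolution_rate <= 0:
        raise ValueError("resolution_rate must be positive")
    return cost_per_call * calls_per_task / resolution_rate


# Assumption: one call per turn, resolution rate held constant at 1.0.
before = cost_per_resolved_task(cost_per_call=1.00, calls_per_task=4, resolution_rate=1.0)
after = cost_per_resolved_task(cost_per_call=0.75, calls_per_task=7, resolution_rate=1.0)
relative_change = after / before - 1  # 0.3125
```

Under these assumptions, the 25% cheaper call and the drift from four turns to seven already imply a rise of about 31% per resolved task. The 40% figure above would also include effects this sketch holds constant, such as a change in the share of tickets that get resolved.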

The Agent Rollout Cadence Your Customer Success Team Could Not Absorb

· 11 min read
Tian Pan
Software Engineer

The customer pasted the agent's answer into a support chat and asked the human rep to confirm it. The rep, looking at the same product, said the opposite. The customer did not lose trust in the agent that day. They lost trust in the company, because two parts of it told them two different things in the same hour.

Nothing was broken. The AI team had shipped a prompt change on Tuesday behind a feature flag, ramped it to 100% by Thursday, and moved on. The customer success team's enablement cycle is monthly — that is how every other product feature has always landed, and nobody re-negotiated the contract for AI. The macro in the CS rep's queue and the FAQ doc on the public site still described the previous behavior. The agent was correct. The rep was correct against the documentation they had. The company was incoherent.

The AI Feature Your CTO Funded That Your Security Team Will Not Let You Ship

· 11 min read
Tian Pan
Software Engineer

The post-mortem says "we found security too late." The actual finding is that security found you on time. Your process found security too late.

This is the AI feature that cleared the budget gate in January because the CTO and the CFO agreed the company needed an AI moment. It cleared a light legal review in March because it was a prototype. Engineering built against the agreed spec through Q2. In late July, the launch-readiness security review opened, and on day one the threat model came back with blockers on the auth scopes, the data-exfiltration paths, the model provider's residency story, and the prompt-injection surface. The team's quarter is now spent rebuilding to address findings that should have shaped the original spec. Two quarters of slip, an executive memo about "process improvements," and a quiet decision next planning cycle to "deprioritize AI deep-integrations."

The launch did not fail because security was slow. It failed because security entered after the shape of the feature had already been frozen.