
722 posts tagged with "insider"


The Difficulty Concentrator: AI Support Deflection Burns Out the Humans Left Behind

· 9 min read
Tian Pan
Software Engineer

The dashboard says everything is going well. Deflection up to 65 percent. Ticket volume down. Cost-per-contact halved. Then the support team starts quitting, and the exit interviews say something the dashboard has no column for: "every shift is the bad one."

This is the hidden mechanic of AI-augmented support. The deflection rate is not a measure of difficulty removed. It is a measure of difficulty concentrated. The cases that reach a human are no longer a representative sample of customer reality — they are the residue, the cases the AI couldn't close. And the residue is heavier than the average.
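
A toy calculation makes the concentration effect concrete. This is a minimal sketch under loud assumptions: the difficulty scores are invented, and the AI is assumed to close exactly the easiest cases first, which real deflection only approximates.

```python
def residual_mean_difficulty(difficulties, deflection_rate):
    """Mean difficulty of the cases left for humans.

    Assumption (hypothetical): the AI deflects the easiest cases first.
    """
    ordered = sorted(difficulties)
    n_deflected = round(len(ordered) * deflection_rate)
    residue = ordered[n_deflected:]
    if not residue:
        raise ValueError("nothing left for humans at this deflection rate")
    return sum(residue) / len(residue)


# Invented difficulty scores 1..100, one case each.
difficulties = list(range(1, 101))
overall_mean = sum(difficulties) / len(difficulties)
human_mean = residual_mean_difficulty(difficulties, 0.65)
print(overall_mean, human_mean)
```

Under these assumptions the queue shrinks by 65 percent while the average case a human sees gets markedly harder, which is the gap the dashboard has no column for.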

Browser Agent Session Bleed: When One Profile Serves Many Tenants

· 10 min read
Tian Pan
Software Engineer

A computer-use agent finishes a task on a customer's CRM, the worker pool returns the browser to its idle ring, the next request lands a few hundred milliseconds later, and the navigation to the dashboard succeeds — except it succeeds as the wrong user. The OAuth cookie from the previous session was still on the profile. The trace shows navigation succeeded, screenshot captured, action performed. Nothing in the run log says the agent was acting as someone who never asked it to.

This is the failure class that browser agents inherit silently from the libraries they're built on. Headless browser frameworks were designed for one user per profile because that's how a browser has worked for thirty years. When a worker pool reuses profiles to amortize the eight-second cold start of a fresh Chromium instance, that one-user assumption breaks, and the breakage is invisible to every layer of telemetry the team usually trusts.
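
One way to restore the one-user assumption is to make the pool itself responsible for wiping state. The sketch below is a pure-Python stand-in, not any browser library's API: `Profile` and `ProfilePool` are hypothetical names, and a real implementation would also have to clear storage, caches, and service workers.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Profile:
    """Hypothetical stand-in for the per-user state a browser profile holds."""
    cookies: dict = field(default_factory=dict)
    last_tenant: Optional[str] = None

    def wipe(self) -> None:
        self.cookies.clear()
        self.last_tenant = None


class ProfilePool:
    """Reuses profiles, but never hands one out with leftover state."""

    def __init__(self, size: int) -> None:
        self._idle = [Profile() for _ in range(size)]

    def acquire(self, tenant_id: str) -> Profile:
        profile = self._idle.pop()
        # Fail loudly instead of silently acting as the previous user.
        if profile.cookies:
            raise RuntimeError("profile returned to pool with leftover state")
        profile.last_tenant = tenant_id
        return profile

    def release(self, profile: Profile) -> None:
        profile.wipe()  # wipe before the profile rejoins the idle ring
        self._idle.append(profile)
```

The check in `acquire` is deliberately redundant with the wipe in `release`: the failure described above is invisible to telemetry, so a second guard that raises is cheap insurance.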

The Eval Ceiling: When Your Golden Test Cases Stop Discriminating

· 10 min read
Tian Pan
Software Engineer

A year ago, your eval suite did its job beautifully. Candidate models came back with scores spread between 60 and 80, and the ranking told you something. The new fine-tune beat the baseline by six points; the cheaper model lost three. Decisions flowed from the numbers. Today, every candidate scores 95 or 96 or 97 on the same suite, and the spread has collapsed into noise. Your team is still running the eval, still reading the report, still using it to green-light migrations — but the report has stopped containing information.

This is not benchmark contamination. It is not world-drift decay. It is a measurement-instrument problem: your test cases were calibrated for a difficulty level that the candidate models have since passed. The ruler hasn't broken; the things you're measuring have outgrown it. And the team that doesn't notice keeps making model decisions with a tool whose discriminating range no longer overlaps the candidates being compared.
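
A rough way to check whether a score gap still carries information is to compare it against the sampling noise of the suite. The sketch below assumes a suite of 200 pass/fail cases (an invented size) and treats the two runs as independent binomial samples; a paired test over the same cases would be more sensitive.

```python
import math


def pass_rate_standard_error(pass_rate, n_cases):
    """Binomial standard error of a pass rate measured on n_cases."""
    return math.sqrt(pass_rate * (1 - pass_rate) / n_cases)


def distinguishable(rate_a, rate_b, n_cases, z=1.96):
    """True if the gap exceeds roughly a 95% noise band.

    Assumption: both rates come from independent runs on n_cases items.
    """
    se_a = pass_rate_standard_error(rate_a, n_cases)
    se_b = pass_rate_standard_error(rate_b, n_cases)
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    return abs(rate_a - rate_b) > z * se_diff


print(distinguishable(0.60, 0.80, n_cases=200))  # the old, spread-out regime
print(distinguishable(0.95, 0.97, n_cases=200))  # the compressed regime
```

With these assumed numbers the 60-versus-80 comparison clears the noise band and the 95-versus-97 comparison does not, even though both look like differences on the report.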

Eval Datasets Are Customer Data With a Right Answer Attached

· 12 min read
Tian Pan
Software Engineer

Your golden eval set is a privacy boundary your security team didn't know existed. It is built by sampling production traces, which means it is a curated collection of real customer queries — often containing names, emails, account numbers, transcripts of frustrated calls, half-typed credit card digits — paired with the canonical correct response on top, and then committed to whatever bucket the eval pipeline reads from.

That last part is what makes eval data uniquely dangerous. A raw production trace is sensitive because it captures what the customer said. An eval case is sensitive in a new way because it captures what the customer said plus the labeled correct answer. The label is a derivative work that someone, often an annotator or a domain expert, applied with intent. It signals "this is canonical." It gives the trace a longevity that the original log never had — log retention will eventually rotate the trace out, but the eval case is now a permanent test fixture that the team is committed to keeping green.
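
The retention mismatch can at least be made visible. The sketch below assumes each eval case records the date of the production trace it was sampled from; the field names `id` and `trace_date` and the 90-day window are hypothetical.

```python
from datetime import date, timedelta


def cases_past_retention(cases, retention_days, today):
    """Return ids of eval cases whose source trace has outlived log retention.

    Assumption: each case is a dict with hypothetical keys
    'id' and 'trace_date' (a datetime.date).
    """
    cutoff = today - timedelta(days=retention_days)
    return [case["id"] for case in cases if case["trace_date"] < cutoff]


cases = [
    {"id": "case-a", "trace_date": date(2024, 1, 1)},
    {"id": "case-b", "trace_date": date(2024, 6, 1)},
]
expired = cases_past_retention(cases, retention_days=90, today=date(2024, 7, 1))
print(expired)
```

A list like this does not decide what to do with the flagged cases; it only surfaces fixtures that are outliving the policy their source data was collected under.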

Eval Selection Bias: Why Your Test Set Goes Blind to the Failures That Drove Users Away

· 10 min read
Tian Pan
Software Engineer

There is a quiet failure mode in production-grade LLM evaluation that no leaderboard catches: your test set is built from the users who stayed, so it never asks the questions that made the others leave. Quarter over quarter the eval scores climb, the dashboards turn green, and net retention sags anyway. The team chases "is the eval gameable?" when the real story is simpler and harder. The eval distribution drifted toward survivors, and survivors are exactly the population whose feedback you least need.

This is the WWII bomber armor problem in a new costume. Abraham Wald looked at returning planes, noticed where the bullet holes clustered, and pointed out that the holes you should reinforce against are the ones on planes that didn't come back. Replace bombers with users, replace bullet holes with failed turns, and you have the central pathology of eval sets seeded from production traces.
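
A first diagnostic is simply to measure how much of the eval set comes from users who left. This is a minimal sketch; the user-id lists are hypothetical inputs that would come from your own trace metadata and churn records.

```python
def churn_coverage(eval_case_user_ids, churned_user_ids):
    """Fraction of eval cases that were sampled from users who later churned."""
    cases = list(eval_case_user_ids)
    if not cases:
        raise ValueError("eval set is empty")
    churned = set(churned_user_ids)
    return sum(user_id in churned for user_id in cases) / len(cases)


# Invented ids: one of four eval cases came from a churned user.
coverage = churn_coverage(["u1", "u2", "u3", "u4"], ["u4", "u9"])
print(coverage)
```

If this fraction sits far below the churned share of the user base, the eval set is looking at the planes that came back.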

Hyrum's Law for Streamed Reasoning: Pacing, Pauses, and Intermediate Tokens Are an Undocumented Contract

· 11 min read
Tian Pan
Software Engineer

A team upgrades from a frontier model to its faster successor. The eval suite is green. Final answers match. Tool-call schemas are identical. The structured outputs validate against the same JSON schema they always did. They ship. Within a day, support tickets pile up: "the assistant feels rushed," "it's not really thinking anymore," "something is off." The product manager pulls telemetry and finds task-completion rates unchanged. The engineering team double-checks the eval and the schema and finds nothing wrong. The complaint is real, but the contract — as the team defined it — is intact.

What changed is the texture of the stream. The old model paused for 800 milliseconds before calling a tool, emitted a "Let me check that..." preamble, and dribbled tokens at roughly 35 per second with natural-feeling clusters around clause boundaries. The new model emits tokens at 90 per second, never pauses, and skips the preamble entirely. None of that was in any documented contract. All of it was load-bearing.

This is Hyrum's law, and streaming makes its surface area enormous. Any observable behavior of your system will be depended on by somebody — and a streaming AI surface exposes far more observable behavior than the team realizes.
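
If pacing is part of the de facto contract, it can be measured like one. The sketch below summarizes a stream from per-token arrival timestamps; the function name, the 0.5-second pause threshold, and the sample timestamps are assumptions for illustration.

```python
from statistics import median


def pacing_profile(token_times, pause_threshold=0.5):
    """Summarize stream texture from token arrival times (seconds).

    Assumption: token_times is sorted and has at least two entries.
    """
    if len(token_times) < 2:
        raise ValueError("need at least two token timestamps")
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    duration = token_times[-1] - token_times[0]
    return {
        "tokens_per_second": (len(token_times) - 1) / duration,
        "median_gap": median(gaps),
        "longest_pause": max(gaps),
        "pauses": sum(gap >= pause_threshold for gap in gaps),
    }


profile = pacing_profile([0.0, 0.5, 1.0, 2.0], pause_threshold=0.75)
print(profile)
```

Recording a profile like this for the old and new model on the same prompts turns "it feels rushed" into a diff that can be reviewed before shipping.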

The MCP Cold Start Tax: How Tool-Server Overhead Compounds by Agent Step 7

· 11 min read
Tian Pan
Software Engineer

A 200-millisecond tool call looks like noise on a flame graph. Stack seven of them in an agent loop and the noise becomes the signal — the model finishes thinking in 800ms but the user waits 4.5 seconds because every tool invocation re-pays a startup cost the first call already absorbed. The cruel part is that this cost doesn't show up in any single trace as anomalous. It shows up as the difference between a snappy demo and a sluggish production agent, and most teams blame the model.

The Model Context Protocol has become the default integration surface for agent tooling, which means it has also become the default place where latency goes to die. MCP's design — JSON-RPC over stdio or streamable HTTP, capability negotiation, dynamic tool discovery — is correct for a protocol that has to bridge arbitrary clients and servers. But the per-call cost structure it implies is hostile to the access pattern that agents actually have, which is not "one tool call per session" but "seven tool calls per turn for forty turns per session."

This post is about that mismatch: where the cold start tax actually lives, why it compounds rather than amortizes in long-running agents, and the warm-pool discipline that turns a multi-second penalty into a sub-100ms one.
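
The compounding is easy to see with back-of-the-envelope arithmetic. The sketch below uses the seven-calls-per-turn, forty-turns access pattern from above; the 500 ms startup and 50 ms call costs are invented numbers, and the model assumes a warm server pays startup exactly once.

```python
def session_tool_overhead_ms(startup_ms, call_ms, calls_per_turn=7, turns=40, warm=False):
    """Total tool overhead for one session, in milliseconds.

    Assumptions (hypothetical): a cold server re-pays startup on every call,
    a warm server pays it once per session.
    """
    calls = calls_per_turn * turns
    if warm:
        return startup_ms + calls * call_ms
    return calls * (startup_ms + call_ms)


cold = session_tool_overhead_ms(startup_ms=500, call_ms=50)
warm = session_tool_overhead_ms(startup_ms=500, call_ms=50, warm=True)
print(cold, warm)
```

Under these assumed costs the cold path spends 154 seconds of a session on tool overhead and the warm path 14.5 seconds, and the gap grows linearly with every extra call.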

Multimodal Channel Disagreement: When One Model Contradicts Itself Across Vision and Text

· 11 min read
Tian Pan
Software Engineer

The image is a photograph of a red octagonal stop sign. Someone has stuck a small sticker over the word in the middle that reads "YIELD." You ask the multimodal model: "What does this sign say?" The model answers: "The sign instructs drivers to yield to oncoming traffic at the intersection." Confident, fluent, and loyal to neither the visual evidence nor the textual evidence. It is a hybrid that splits the difference between channels that disagreed about what was true.

This failure mode does not have a settled name yet. Researchers studying multimodal hallucination call it "semantic hallucination," or "cross-modal bias," or "modality dominance," depending on which subfield is writing the paper. Practitioners shipping document AI, screenshot agents, and defect inspection systems run into it every week and describe it in their incident retros as "the model just made something up." It is not made up. It is the predictable output of an architecture that fuses two channels in its final layers without any primitive for representing the case where the channels say different things.

Prompt Cache as Covert Channel: TTFT Probing Leaks Cross-Tenant Prompts

· 11 min read
Tian Pan
Software Engineer

Prompt caching is the optimization that pays for itself the moment you turn it on. A long system prompt is hashed once, the KV state lives in GPU memory, and every subsequent request that reuses the prefix skips the prefill cost. Providers report 80% latency reduction and 90% input-cost reduction on cached requests, and at scale the math is irresistible: a single shared prefix amortized across millions of calls turns a line item into a rounding error.

The mechanism that makes the savings work is a shared resource whose hit-or-miss state is observable as latency. That observability is the side channel. A cache hit and a cache miss are distinguishable from outside the network, the difference is large and deterministic, and the optimization that earned its place on the cost dashboard has a second job nobody scoped: it leaks information about what other tenants on the same provider are doing right now.
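
A team can check how exposed its own deployment is by testing whether hit and miss timings are cleanly separable. This is a minimal audit sketch over time-to-first-token samples you have already collected against your own account; the sample values are invented.

```python
def timing_separation(hit_ttfts_ms, miss_ttfts_ms):
    """Report whether cache hits and misses can be split by one threshold.

    Inputs are time-to-first-token samples in milliseconds.
    """
    slowest_hit = max(hit_ttfts_ms)
    fastest_miss = min(miss_ttfts_ms)
    return {
        "gap_ms": fastest_miss - slowest_hit,
        "separable": fastest_miss > slowest_hit,
    }


clean = timing_separation([110, 120, 130], [480, 510])
noisy = timing_separation([100, 300], [250, 400])
print(clean, noisy)
```

A positive gap means an outside observer with nothing but a stopwatch could label requests as hits or misses, which is the precondition for the leak described above.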

The Quantization Quality Cliff: When int4 Passes the Median Eval and Fails on the Long Tail

· 11 min read
Tian Pan
Software Engineer

A team swaps an fp16 model for an int4 quantization to halve serving cost. The eval suite scores within a point of the original on the curated test set. The rollout ships under the rationale "indistinguishable on the benchmark." Six weeks later, support is fielding catastrophic-failure quotes from regulated customers — code that compiles to nonsense, low-resource-language responses that drift into another script, multi-hop arithmetic that confidently returns numbers off by an order of magnitude. The benchmark didn't lie. It just measured the median, and quantization is not a uniform tax on the median. It is a non-uniform tax on the tail.

This is the quantization quality cliff: the moment your eval suite, your rollout discipline, and your cost-savings narrative all simultaneously fail because the metric you used to approve the swap had no signal on the capabilities you destroyed. Recent benchmarks make the magnitude concrete. On long-context tasks, 8-bit quantization preserves accuracy with roughly a 0.8% drop, while 4-bit methods lose up to 59% on the same workload — a regression invisible to any test set that doesn't oversample tail inputs. Median moved one point. Tail moved fifteen, or thirty, or fifty.
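
The remedy implied here is to approve the swap per capability slice rather than on one aggregate number. The sketch below assumes each eval result is tagged with a slice name; the slice names, the 2-point threshold, and the sample results are all hypothetical.

```python
from collections import defaultdict


def accuracy_by_slice(results):
    """results: iterable of (slice_name, passed) pairs."""
    totals = defaultdict(lambda: [0, 0])  # slice -> [passed, seen]
    for slice_name, passed in results:
        totals[slice_name][0] += int(passed)
        totals[slice_name][1] += 1
    return {name: passed / seen for name, (passed, seen) in totals.items()}


def slice_regressions(baseline, candidate, max_drop=0.02):
    """Slices where the candidate drops more than max_drop below baseline."""
    base = accuracy_by_slice(baseline)
    cand = accuracy_by_slice(candidate)
    drops = {name: base[name] - cand.get(name, 0.0) for name in base}
    return {name: drop for name, drop in drops.items() if drop > max_drop}


fp16 = [("common", True)] * 10 + [("long_context", True)] * 4
int4 = [("common", True)] * 10 + [("long_context", True)] * 2 + [("long_context", False)] * 2
print(slice_regressions(fp16, int4))
```

In this invented example aggregate accuracy moves from 14 of 14 to 12 of 14, which might pass review, while the long-context slice has lost half its accuracy.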

The Regional Model Rollout Lottery: When Your Product Quietly Behaves Differently by Continent

· 11 min read
Tian Pan
Software Engineer

A customer-success email lands on a Friday afternoon: "the model got worse for our German users." The team pulls up the eval dashboard. Scores are flat. Latency p95 is normal. The model name in the config is the same one shipped three weeks ago. Nothing changed. Except something did. The US endpoint quietly received the new model generation last sprint, the EU endpoint is still on the prior version because the provider hasn't completed the regional rollout yet, and the load balancer in front of both has been hiding the gap from every dashboard the team owns.

This is the regional model rollout lottery. Your "single model" abstraction is not single. It bifurcates the moment a provider stages a release across continents — which is most of the time, for most providers, in most years. The version string in your client SDK does not change when this happens. Your traces look identical. Your contract with the provider does not promise otherwise. And your eval suite, the artifact you trust to catch behavioral regressions, is almost certainly running from a CI box that lives in one region and hits whichever endpoint is geographically closest.
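
One cheap detector is to send the same probe prompts to every region and compare fingerprints of the responses. The sketch below assumes decoding is deterministic enough for exact comparison, which is often not strictly true in practice; the region names and responses are invented.

```python
import hashlib


def fingerprint(responses):
    """Stable hash over an ordered list of response strings."""
    digest = hashlib.sha256()
    for response in responses:
        digest.update(response.encode("utf-8"))
        digest.update(b"\x00")  # separator so ["ab", "c"] != ["a", "bc"]
    return digest.hexdigest()


def group_regions_by_behavior(responses_by_region):
    """Map each distinct fingerprint to the sorted regions that produced it."""
    groups = {}
    for region, responses in responses_by_region.items():
        groups.setdefault(fingerprint(responses), []).append(region)
    return {fp: sorted(regions) for fp, regions in groups.items()}


groups = group_regions_by_behavior({
    "us": ["answer A", "answer B"],
    "eu": ["answer A", "answer C"],
})
print(len(groups))
```

More than one group does not prove a version split, but it is a signal that no dashboard keyed on the model name will raise.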

Right-to-Erasure Meets Fine-Tuning: When Deletion Stops at the Snapshot

· 11 min read
Tian Pan
Software Engineer

A customer files an erasure request asking for their data to be deleted. The data engineer purges the production database, the analytics warehouse, the support ticket archive, the cold-storage backups. Every system the legal team listed in the data inventory comes back clean. Then somebody in the room asks the question that nobody wants to answer first: what about the model?

Three months ago that customer's support transcripts went into a fine-tuning run. The resulting adapter has been serving predictions to other customers ever since, with their phrasing, their account names, occasionally their literal sentences embedded in the weights. You can prove deletion in the warehouse. You cannot prove deletion in the model — and the more honest member of the team is the one who says so out loud.
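
Answering "what about the model?" at least requires knowing which artifacts ever saw the customer's data. The sketch below assumes a lineage record mapping each trained artifact to the subject ids in its training set; the artifact names and ids are hypothetical, and the lookup identifies affected artifacts without removing anything from their weights.

```python
def artifacts_containing(subject_id, lineage):
    """Names of trained artifacts whose training data included subject_id.

    Assumption: lineage maps artifact name -> set of subject ids,
    recorded at training time.
    """
    return sorted(name for name, subjects in lineage.items() if subject_id in subjects)


lineage = {
    "support-adapter-v3": {"cust-17", "cust-42"},
    "support-adapter-v4": {"cust-42"},
    "billing-adapter-v1": {"cust-99"},
}
affected = artifacts_containing("cust-42", lineage)
print(affected)
```

Without a record like this captured at training time, the honest answer to the erasure request is that nobody knows which snapshots are in scope.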