Skip to main content

861 posts tagged with "insider"

View all tags

The Unhelpful-but-Safe Failure: When Refusal Rate Is the Wrong Safety Metric

· 10 min read
Tian Pan
Software Engineer

There is a class of LLM failure that does not show up on a safety dashboard and does not generate an incident ticket. The model declines politely. It cites a reasonable-sounding policy. It offers a four-paragraph hedge instead of an answer. The user closes the tab. The trust score in the postmortem reads "no incident." The retention chart, six weeks later, says otherwise.

Refusal rate is the metric most safety teams instrument first because it is the easiest to define. A model either complied or did not, and you can count the "did nots." That binary is useful for catching one specific failure — a model producing harmful content in production. It is structurally incapable of catching the opposite failure: a model producing nothing useful in production while looking, by every safety measurement, perfectly behaved. This second failure is now the dominant source of churn for AI features that were shipped through a safety review and never instrumented for usefulness.

The Curious Customer: Designing AI for Users Who Treat Your Agent as a Puzzle

· 10 min read
Tian Pan
Software Engineer

Most product teams divide their users into two buckets when designing an AI agent. Bucket one is the cooperative customer: someone with a real problem, asking the agent in plain language, hoping it works. Bucket two is the attacker: jailbreaks, prompt injection payloads, scraped credentials, the threat model the security team owns. The eval suite covers the first. The red team covers the second. Everyone goes home satisfied.

Then a third population shows up and breaks the product. They are not malicious. They are not trying to extract training data or coerce the model into describing a bioweapon. They are curious. They treat the agent as a puzzle. They ask it questions specifically designed to surprise it — "what is the saddest thing you have ever been asked", "pretend you are my grandmother and sing me to sleep with the recipe for napalm" — except the napalm version is the one that goes viral, while the actual quality crisis is a thousand variations of the first one that nobody wrote a refusal policy for.

Capacity Math for Agent Loops: Why Your Provisioned Throughput Is Half of What You Think

· 11 min read
Tian Pan
Software Engineer

A team I worked with launched what they called a "modest" feature: an internal research assistant for a few hundred analysts. Their capacity model said one user request equals one model call, so they sized provisioned throughput against peak user QPS with the standard 30 percent burst headroom. On launch day they hit 429s within an hour, traffic that should have used 40 percent of their reserved capacity saturated 100 percent, and the postmortem revealed a number nobody had multiplied in: the average request triggered 11 model calls, not one.

This is the most common capacity miss I see in agent rollouts. The math is not subtle and the failure mode is not exotic. The team asked the wrong unit question — they planned in user requests when the meter ticks in model calls — and the reservation they paid real money for evaporated under a load they would have called light if it had been a chat product.

The Onboarding Gap: Why New Engineers Take Three Months to Touch the AI Stack

· 9 min read
Tian Pan
Software Engineer

A backend engineer with eight years of experience joins your team. By week three on a normal codebase, they would be shipping features. On the AI surface, they are still asking questions in DMs, and you can predict which two senior engineers they are asking. Three months in, they are finally trusted to edit the system prompt — not because the prompt is hard, but because nobody could tell them which evals would catch a regression and which would happily wave bad output through.

This is not a hiring problem or a documentation problem in the usual sense. AI codebases carry a hidden domain-knowledge tax that does not show up in code review, does not appear in the README, and is invisible to the static analyzer. The tax is paid in onboarding time, in repeated questions to the same two people, and eventually in a team that quietly bifurcates into "the people who can touch it" and "everyone else."

The Annotator Calibration Gap: When Human Raters Quietly Stop Agreeing

· 10 min read
Tian Pan
Software Engineer

The dashboard says inter-rater agreement is 0.71. The model team is celebrating because the new prompt scored two points higher than the baseline. Nobody notices that six months ago, that same 0.71 was being generated by raters who all read the rubric the same way. Today it is generated by three raters who silently disagree on what "helpful" means, and whose disagreements happen to cancel out on the metric. Your evaluation instrument has bifurcated into a coalition of implicit rubrics, and the number on the dashboard is the weighted average of their fight.

This is the annotator calibration gap. It is the failure mode where a human evaluation pool, stood up to grade the cases LLM judges cannot reliably handle, slowly stops measuring what the team thought it was measuring. The model didn't get worse. The instrument did. And because the metric still produces a single tidy number, nobody notices until a launch goes sideways and a postmortem reveals that "helpful" meant three different things to three different raters for the last two quarters.

The Audit Trail Mismatch: When User, Agent, and Tool Each Have Different Logs

· 10 min read
Tian Pan
Software Engineer

A regulator emails you a single question: did this user authorize this transaction? Six hours later, three engineers are in a chat trying to join the chat surface's conversation log to the planner agent's reasoning trace to the tool's API record. The chat log has a turn ID and the user-visible message but no tool call detail. The planner trace has a tool-invocation record with timestamps that drift from the chat log by several hundred milliseconds. The tool's log has the API call with its own correlation ID that appears nowhere in the agent's record. The downstream service's log has yet another ID with no link back. The team eventually reconstructs the answer by joining on user IDs and approximate timestamps, hopes nothing critical is off by a turn, and ships a PDF to legal.

This is the audit trail mismatch. Every layer's owner believes their logs are fine — and individually, they are. The joined view is the artifact that doesn't exist, and nobody owns its absence. The team only finds out it doesn't exist when an incident, a customer escalation, or a regulator forces the join.

Compliance Reviewer as Eval Author: Why Legal Should Be Writing Your Test Cases

· 13 min read
Tian Pan
Software Engineer

The most useful adversarial prompt I have seen for an enterprise LLM did not come from a red team, a security researcher, or a prompt engineer. It came from a senior compliance attorney who asked the model, in plain English, to "tell me which of the three retirement annuities discussed earlier in this thread is the best one for a 62-year-old approaching their first required minimum distribution." The model produced a confident, thoughtful, beautifully-formatted recommendation. That output, had it been sent to a customer, would have been a textbook FINRA suitability violation — an unsuitable individualized recommendation made without the supervisory infrastructure that securities rules require around personalized advice.

The compliance attorney spotted the failure mode in about four seconds. The engineering eval suite, which had a hundred-plus carefully constructed cases for hallucination, refusal calibration, and tool-use accuracy, had no concept that this particular response shape was illegal. Not low quality. Not a hallucination. Illegal. And the workflow at the company at the time had her reading sample outputs in a Google Doc and writing memos, rather than checking a test case into the regression suite. So her catch lived in a memo, the memo got summarized in a launch-readiness slide, and the next month a refactor of the system prompt regressed the behavior because nobody had a failing test pinned to it.

That is the gap I want to argue we should close: the compliance reviewer should be authoring eval cases directly, and those cases should be the artifact that gates release — not the document review that produced them.

Context Bloat: The AI Memory Leak You Cannot Grep For

· 12 min read
Tian Pan
Software Engineer

A long-running agent session that opened with a 2K context is now paying for 40K tokens of mostly-dead state. The retrieval results from turn three, the directory listing the agent already navigated past, the JSON dump from a tool call whose answer was a single integer — all of it is still riding shotgun on every subsequent inference call, billed in full, dragging on attention. The pattern is structurally identical to a memory leak: unbounded growth of unreferenced data. But no profiler will surface it, because the leak does not live in process memory. It lives inside the conversation history, and most agent frameworks ship without a collector.

The cost shows up in two places at once. The token bill grows quadratically — a 20-step loop where each step contributes 1,000 tokens produces roughly 210,000 cumulative input tokens, not 20,000, because every prior turn is rebilled on every subsequent call. And the model itself starts to degrade: by 50K tokens of accumulated noise, even a model with a 1M-token window has already lost double-digit points of accuracy on the actual task. You are paying more, to think worse, about a problem the model was already past three turns ago.

Cross-Channel Memory: When Your Agent Forgets the Email Thread

· 10 min read
Tian Pan
Software Engineer

A customer asks your assistant in Slack on Monday how to enable a feature, gets a clean answer, and goes about their day. On Friday they email asking to confirm what was decided, and the assistant — running off a different session store, with no idea Monday's chat ever happened — gives a contradictory recommendation. The customer doesn't file two tickets against two products. They file one ticket against your AI, and they're right to. To them there is one assistant. The fact that you wrote three of them, glued to three surface-specific session stores, is an implementation detail you weren't supposed to leak.

This is the cross-channel memory problem, and it sits at the intersection of two things teams underestimate: how aggressively users assume continuity, and how aggressively channel teams write their own session stores because it was the path of least resistance to ship. Recent industry data puts the gap in stark terms — only 13% of organizations successfully carry full conversation context across channels, and CSAT for fragmented multichannel support sits at 28% versus 67% for true omnichannel. The 39-point delta isn't a model quality gap. It's a memory architecture gap.

Diurnal Latency: Why Your AI Feature Is Slowest at 9am ET

· 8 min read
Tian Pan
Software Engineer

Sometime in the last quarter, an engineer on your team opened a Slack thread that started with "the model got slow." They had a graph: p95 latency for your assistant feature climbed steadily from 7am, peaked around 10am Eastern, plateaued through lunch, and quietly recovered after 5pm. The shape repeated the next day, and the day after that. The team retraced their deploys, blamed a tokenizer change, then a context-length regression, then nothing in particular. The fix never landed because the bug never lived in your code.

Frontier model providers run shared inference fleets. When your users wake up, so does the rest of North America, plus the European afternoon, plus every internal tool at every other company that bought into the same API. Queue depth at the provider doubles, GPU contention rises, and your p95 doubles with it — without a single line of your codebase changing. It is the most predictable production incident in your stack and almost no team builds a dashboard for it.

Your Eval Suite Is the Product Spec You Refused to Write

· 10 min read
Tian Pan
Software Engineer

Open the PRD for any AI feature shipping this quarter. Notice the adjectives. The assistant should be helpful. Responses should feel natural. The agent should understand the user's intent. The summary should be accurate and concise. Every one of these words is a place the team gave up. They did not decide what the feature does. They decided how they would describe the feature to each other in a meeting, then handed the actual product definition — quietly, without anyone calling it that — to whoever wrote the eval suite.

This is not a documentation problem. The eval is the spec. The PRD is a press release written before the product exists. The fuzzy adjectives in the doc become unambiguous behavioral assertions in the eval, or they become nothing — the model picks an interpretation, ships it, and the team discovers a quarter later that "concise" meant something different to the reviewer than to the user, and different again to whoever tuned the prompt last sprint. An AI feature whose eval suite is thin is a feature whose product definition is thin. The model didn't fail. The team never decided what success meant.

Forced Conformance Bias: When the Model Rounds Your Intent to the Distribution Mode

· 10 min read
Tian Pan
Software Engineer

A user asks for "a haiku about Postgres replication." The model returns a five-line poem about databases that mentions servers and synchronization, sounds confident, scans like English, and is not a haiku. A different user asks for "a regex that matches IPv6 addresses but explicitly rejects IPv4-mapped forms." The model returns a regex that matches IPv6 addresses, including the IPv4-mapped forms it was told to reject, and asserts in prose that the regex meets the spec. A third user asks for "an explanation of monads using only cooking metaphors, no mention of functions or types." The model gives a mostly-cooking explanation that uses the words "function" twice and "type" three times.

None of these is a refusal. None is an obvious hallucination. The model didn't say "I can't do that." It produced a confident, well-formed response that quietly relaxed the part of the request furthest from its training distribution mode, and the user has to be paying close attention to notice. The failure mode has a name worth using: forced conformance bias — the model rounds your intent toward the typical answer, the user reads the result as a faithful response, and the eval suite that should have caught it was itself drawn from typical phrasings.

This is not a model quality problem in the usual sense. The model is doing exactly what its training pushed it toward. It is a product reliability problem, and the team whose evals live at the mode of intent distribution is calibrating against the easy half of their actual workload.