Skip to main content

861 posts tagged with "insider"

View all tags

User Trust Half-Life: Why One Bad Session Erases Weeks of Calibration

· 10 min read
Tian Pan
Software Engineer

A user's calibration of an AI feature is one of the most expensive things you ship. It costs them weeks of attention: learning which prompts work, where the model's reliable, when to double-check, what to ignore entirely. Then a single visible failure — a wrong number in a generated report, a hallucinated citation the user pasted into a deck, a confidently-incorrect recommendation they acted on — can vaporize all of it in one session. The recovery curve isn't symmetric. The user's prior was "this is reliable," and the update doesn't land as a data point. It lands as a betrayal.

The team measuring DAU sees nothing for weeks. The user keeps opening the app out of habit, runs a few queries, doesn't act on the output, and then quietly stops. By the time engagement metrics flinch, the trust event that caused it is two months old and nobody on the team remembers shipping it.

The Vendor SLA Gap: Why Your LLM Provider's Uptime Misses the Failure Mode That Breaks Your Product

· 9 min read
Tian Pan
Software Engineer

Your LLM provider says 99.95% availability. Your status page is green. Your latency dashboard is in the SLO. Your product is broken anyway — the assistant started refusing routine requests this morning, the JSON outputs that powered the downstream parser shifted from compact to chatty, and a third of the support tickets you triage with a model are coming back with "I can't help with that." Every one of those responses returned 200 OK in under 800ms. None of them violated the SLA. The SLA covered the failure mode you do not actually have.

This is the gap nobody priced into the procurement conversation. The vendor sells availability — a request-level promise that the API answered in time — and the product team consumes capability, which is a request-level promise that the answer was usable. The two are not the same metric, and the team that confuses them is one quiet model bump away from learning the difference.

On Intelligence, Chapter by Chapter: A 2004 Book That Predicted Half of Modern AI

· 133 min read
Tian Pan
Software Engineer

A 2004 book about brains argued that intelligence is, fundamentally, prediction. Twenty-two years later, the dominant paradigm in AI is literally trained to predict the next token. That book deserves another reading.

On Intelligence by Jeff Hawkins (with Sandra Blakeslee) is one of those rare technical books whose central claim has aged well in the most awkward way possible. The framework was right about what the brain does. It was almost certainly wrong about how you should engineer a machine to do it. And it is still the cleanest mental model I know for explaining why your LLM hallucinates with such confidence.

What follows is a chapter-by-chapter summary written for an engineer who is shipping AI features in 2026, not for a neuroscience seminar. I'll resist the temptation to relitigate every claim and just give you the spine, with a working engineer's annotation where the chapter has something to say about what you're building next week.

Agent Branch Coverage: Your Eval Hits the Happy Path, Not the Planner's If-Else

· 8 min read
Tian Pan
Software Engineer

A team I worked with last quarter ran a 240-case eval suite against their support agent. Green across the board for six months. Then they swapped a single sentence in the planner prompt — a tone tweak — and the next day production saw a 3× spike in human-handoff requests. The eval hadn't moved. The handoff branch had simply started firing on borderline cases that used to resolve in-line, and not a single eval case was the kind of borderline. The branch existed in the prompt. It existed in production. It did not exist in the eval.

This is the failure mode I want to name: agent branch coverage. Code-coverage tooling has been a debugging staple for forty years, but agentic systems have a runtime control flow — planner branches that pick a tool, condition the response, escalate to a human, refuse to act, retry with a different strategy — and the eval suite touches only the cases the team thought to write. Eighty percent of the planner's decision branches have never executed under test, and a green eval becomes a smoke test wearing a regression-test costume.

Agent Memory Eviction: Why LRU Survives a Model Upgrade and Salience Doesn't

· 9 min read
Tian Pan
Software Engineer

The team that ships an agent with salience-weighted memory eviction has, without realizing it, signed up for a memory migration project at every model upgrade. The eviction policy looks like a quality lever — pick the smartest scoring approach, get the best recall — but it is secretly a versioning contract. When the scoring model changes, the agent's effective past changes too. None of the tooling teams build around prompts and evals catches it, because the artifact that drifted is not a prompt or an eval. It is a sequence of decisions about what to forget, made months ago, by a model that no longer exists.

LRU and LFU don't have this problem. They are deterministic, model-independent, and survive upgrades cleanly. They also throw away information that a thoughtful judge would have kept. That is the tradeoff most teams accept once, on day one, when a demo recall metric is the only thing being measured — and it is the tradeoff that bites quarterly for the rest of the agent's lifetime.

The Agent Scratch Directory: The Unowned Filesystem PII Surface Nobody Inventoried

· 10 min read
Tian Pan
Software Engineer

A regulator walks into your office and asks the question security teams rehearse for: "Show me every place customer data lives." Your data team produces the inventory. The primary database is on it. The analytics warehouse is on it. The object store, the queue, the search index, the backup destination — all on it, with classification labels, retention policies, encryption details, and named owners. Then someone in the room mentions the agent worker pool, and the inventory has nothing to say. The pool has been running for nine months. Each worker has a local disk. The agents on those workers have been parsing PDFs, transcribing audio, downloading email attachments, and caching intermediate JSON between tool calls the entire time. Nobody put any of that on the asset register.

This is the scratch directory problem. Every long-running agent worker accumulates an ephemeral filesystem that grows organically as new tools are added — extracted text from a PDF parser, transcribed audio from a Whisper step, downloaded attachments from a Gmail tool, screenshots from a browser-use step, vector-search snippets cached for the next turn, intermediate JSON the agent emitted between two tool calls so the second one wouldn't have to re-derive it. Unlike databases and queues and buckets, this surface has no retention policy, no encryption-at-rest standard, no DLP scanner pass, and no entry on the data-classification spreadsheet. The platform team thinks "agent state" means the inference-provider context window. The SRE team thinks "agent state" means the durable database. The worker's /tmp/agent-workspace-${session_id}/ directory is a third copy of customer data that nobody owns.

The Inter-Team Token Budget War: When Your AI Platform Team Becomes a Treasury Department

· 10 min read
Tian Pan
Software Engineer

The team that built your internal LLM gateway scoped it for "rate limiting and audit." Eighteen months later, the same team is running a quarterly allocation meeting, mediating a quota dispute between two product groups, and discovering that the architecture they shipped to solve a capacity problem now functions as the company's internal AI treasury. Nobody chartered them for this role. Nobody removed it from their plate either.

This is the trajectory every AI platform team is on, and most of them get to the political economy phase before they have a policy, a sponsor, or even the telemetry to defend a decision. The technical work — request routing, key management, retries — is the easy half. The hard half is that finite provider quota plus three product teams with launch deadlines is a budget allocation system, and the team running the gateway is the one being asked to allocate.

The Difficulty Concentrator: AI Support Deflection Burns Out the Humans Left Behind

· 9 min read
Tian Pan
Software Engineer

The dashboard says everything is going well. Deflection up to 65 percent. Ticket volume down. Cost-per-contact halved. Then the support team starts quitting, and the exit interviews say something the dashboard has no column for: "every shift is the bad one."

This is the hidden mechanic of AI-augmented support. The deflection rate is not a measure of difficulty removed. It is a measure of difficulty concentrated. The cases that reach a human are no longer a representative sample of customer reality — they are the residue, the cases the AI couldn't close. And the residue is heavier than the average.

Browser Agent Session Bleed: When One Profile Serves Many Tenants

· 10 min read
Tian Pan
Software Engineer

A computer-use agent finishes a task on a customer's CRM, the worker pool returns the browser to its idle ring, the next request lands a few hundred milliseconds later, and the navigation to the dashboard succeeds — except it succeeds as the wrong user. The OAuth cookie from the previous session was still on the profile. The trace shows navigation succeeded, screenshot captured, action performed. Nothing in the run log says the agent was acting as someone who never asked it to.

This is the failure class that browser agents inherit silently from the libraries they're built on. Headless browser frameworks were designed for one user per profile because that's how a browser has worked for thirty years. When a worker pool reuses profiles to amortize the eight-second cold start of a fresh Chromium instance, that one-user assumption breaks, and the breakage is invisible to every layer of telemetry the team usually trusts.

The Eval Ceiling: When Your Golden Test Cases Stop Discriminating

· 10 min read
Tian Pan
Software Engineer

A year ago, your eval suite did its job beautifully. Candidate models came back with scores spread between 60 and 80, and the ranking told you something. The new fine-tune beat the baseline by six points; the cheaper model lost three. Decisions flowed from the numbers. Today, every candidate scores 95 or 96 or 97 on the same suite, and the spread has collapsed into noise. Your team is still running the eval, still reading the report, still using it to green-light migrations — but the report has stopped containing information.

This is not benchmark contamination. It is not world-drift decay. It is a measurement-instrument problem: your test cases were calibrated for a difficulty level that the platform passed. The ruler hasn't broken; the things you're measuring have outgrown it. And the team that doesn't notice keeps making model decisions with a tool whose discriminating range no longer overlaps the candidates being compared.

Eval Datasets Are Customer Data With a Right Answer Attached

· 12 min read
Tian Pan
Software Engineer

Your golden eval set is a privacy boundary your security team didn't know existed. It is built by sampling production traces, which means it is a curated collection of real customer queries — often containing names, emails, account numbers, transcripts of frustrated calls, half-typed credit card digits — paired with the canonical correct response on top, and then committed to whatever bucket the eval pipeline reads from.

That last part is what makes eval data uniquely dangerous. A raw production trace is sensitive because it captures what the customer said. An eval case is sensitive in a new way because it captures what the customer said plus the labeled correct answer. The label is a derivative work that someone, often an annotator or a domain expert, applied with intent. It signals "this is canonical." It gives the trace a longevity that the original log never had — log retention will eventually rotate the trace out, but the eval case is now a permanent test fixture that the team is committed to keeping green.

Eval Selection Bias: Why Your Test Set Goes Blind to the Failures That Drove Users Away

· 10 min read
Tian Pan
Software Engineer

There is a quiet failure mode in production-grade LLM evaluation that no leaderboard catches: your test set is built from the users who stayed, so it never asks the questions that made the others leave. Quarter over quarter the eval scores climb, the dashboards turn green, and net retention sags anyway. The team chases "is the eval gameable?" when the real story is simpler and harder. The eval distribution drifted toward survivors, and survivors are exactly the population whose feedback you least need.

This is the WWII bomber armor problem in a new costume. Abraham Wald looked at returning planes, noticed where the bullet holes clustered, and pointed out that the holes you should reinforce against are the ones on planes that didn't come back. Replace bombers with users, replace bullet holes with failed turns, and you have the central pathology of eval sets seeded from production traces.