
578 posts tagged with "insider"


The Agent Permission Prompt Has a Habituation Curve, and Your Safety Story Lives on Its Slope

· 10 min read
Tian Pan
Software Engineer

There is a number that should be on every agent product's safety dashboard, and almost nobody tracks it: the per-user approval rate over time. Ship a permission prompt for "may I send this email" or "may I run this query against production," and the curve goes the same way every time. Day one, users hesitate, read, sometimes click no. By week two, the prompt is the fifth one this hour, the cost of saying no is doing the work yourself, and the click-through rate converges to something north of 95%. The team's safety story still claims that the user approved every action. The user, in any meaningful cognitive sense, did not.

This is not a UX problem that better copy can fix. It is the same habituation phenomenon that flattened cookie banners, browser SSL warnings, and Windows UAC dialogs, applied to a substrate that operates orders of magnitude faster than any of those. A consent gate is a security control with a half-life. Ship it without measuring how fast it decays, and you ship a checkbox the user is trained to ignore by week two — and a compliance narrative that depends on a click that no longer means anything.
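The number is cheap to compute if prompt decisions are logged at all. A minimal sketch of the tracking, assuming each prompt decision is recorded as a (user_id, timestamp, approved) event; the field names and the 95% convergence threshold are illustrative, not a prescribed schema:

```python
from collections import defaultdict

def weekly_approval_rates(events, start):
    """events: iterable of (user_id, timestamp, approved) tuples, timestamps as datetimes.
    Returns {user_id: [approval rate in week 0, week 1, ...]} relative to `start`."""
    buckets = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # user -> week -> [approved, total]
    for user_id, ts, approved in events:
        week = (ts - start).days // 7
        counts = buckets[user_id][week]
        counts[1] += 1
        if approved:
            counts[0] += 1
    rates = {}
    for user_id, weeks in buckets.items():
        last_week = max(weeks)
        rates[user_id] = [
            weeks[w][0] / weeks[w][1] if weeks[w][1] else None
            for w in range(last_week + 1)
        ]
    return rates

def habituated_users(rates, threshold=0.95):
    """Users whose most recent weekly approval rate has converged above the threshold."""
    return [u for u, r in rates.items() if r and r[-1] is not None and r[-1] >= threshold]
```

Plot the per-user curves week over week; the slope, not the point-in-time approval rate, is where the safety story lives.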

Agent Trace Sampling: When 'Log Everything' Costs $80K and Still Misses the Regression

· 10 min read
Tian Pan
Software Engineer

The bill arrived in March. Eighty-one thousand dollars on traces alone, up from twelve thousand in November. The team had turned on full agent tracing in October on the theory that more visibility was always better. By Q1 the observability line was running ahead of the inference line — and when an actual regression hit production, the trace that contained the failure was buried under twenty million successful spans nobody needed.

The mistake was not the decision to instrument. The mistake was importing a request-tracing mental model into a workload that does not behave like requests.

A typical web request produces a span tree with a handful of children: handler, database call, cache lookup, downstream service. An agent request produces a tree with five LLM calls, three tool invocations, two vector lookups, intermediate scratchpads, and a planner that reconsiders three of those steps. The same sampling policy that worked for the API gateway — head-sample 1%, keep everything else representative — produces a trace store where the median trace is a 200-span monster, the long tail is the only thing that matters, and the rate at which you discover incidents is uncorrelated with the rate at which you spend money.
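One alternative is to make the keep/drop decision at the tail, after the trace is complete, instead of head-sampling uniformly. A sketch under assumed field names (has_error, span_count, duration_s) with illustrative thresholds, not a drop-in policy:

```python
import random

def keep_trace(trace, success_sample_rate: float = 0.01) -> bool:
    """Tail-based sampling: decide after the trace is assembled.

    Keep every failed trace and every structural outlier; sample the
    successful bulk, which is the part nobody reads during an incident.
    """
    if trace.has_error:
        return True                                   # regressions live here
    if trace.span_count > 500 or trace.duration_s > 60:
        return True                                   # outliers are cheap to keep, expensive to miss
    return random.random() < success_sample_rate      # the twenty million happy spans
```

The point is not these particular thresholds; it is that the keep decision is conditioned on what the trace turned out to be, which is the only way spend and incident discovery stay correlated.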

The Demo Was a Single Seed: Why Your AI Rollout Is a Variance Problem, Not a Polish Problem

· 11 min read
Tian Pan
Software Engineer

The exec demo went perfectly. The model answered the curated question, the agent completed the workflow, the screen recording is saved on the company drive, and the launch date is now in the calendar. Six weeks later the rollout craters and the post-mortem narrative writes itself: the model needed more polish, the prompt needed more iteration, the team underestimated the work between prototype and production.

That narrative is wrong, and it's expensive, because it sends the team back to do more of the work that already failed. The demo wasn't an under-polished version of production. It was a single sample from a distribution the team never measured. The wow moment was one realization out of thousands the model would generate against the same input, and the team shipped the best one as if it were the typical one. The gap between demo and prod isn't quality drift. It's variance the team hadn't yet seen.

This reframing matters because the fix for a variance problem looks nothing like the fix for a polish problem. Polish says "iterate the prompt, tune the model, hire a better PM." Variance says "you don't know what you have until you sample it n times across the input distribution." The two diagnoses produce different roadmaps, different budgets, and different incident patterns. The teams that ship reliably in 2026 know which problem they have.
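Concretely, "sample it n times" is a loop, not a philosophy. A sketch, where run_once is assumed to call the model on one input and grade the output as pass/fail:

```python
import statistics

def pass_rate_with_spread(run_once, inputs, n: int = 20):
    """Run each input n times so the team sees the distribution, not one lucky sample.
    run_once(x) -> bool is assumed to invoke the model and grade the result."""
    per_input = []
    for x in inputs:
        passes = sum(run_once(x) for _ in range(n))
        per_input.append(passes / n)
    return {
        "mean_pass_rate": statistics.mean(per_input),
        "worst_input_pass_rate": min(per_input),
        "stdev_across_inputs": statistics.pstdev(per_input),
    }
```

The demo corresponds to the best draw from that distribution; the rollout experiences the whole thing.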

AI Shadow IT: When Product Teams Build Their Own LLM Proxy

· 11 min read
Tian Pan
Software Engineer

The shadow IT incident your platform team is going to investigate in Q3 already happened in January. It looks like this: a senior engineer on a product team has a launch this month. The platform team's "official" LLM gateway is on the roadmap for "next quarter." So the engineer opens an OpenAI account on a corporate credit card, drops the API key into a .env file, ships the feature, and hits the public deadline. The launch is a success. Six months later, the FinOps team finds three vendor accounts nobody can attribute, the security team finds prompts containing customer data routed to a region not covered by the data processing agreement, and the platform team discovers the gateway it spent two quarters building has 14% adoption because every team that needed AI shipped without it.

This is not a security failure or a discipline failure. It is a platform-product velocity mismatch, and treating it as anything else guarantees the next gateway you ship will have the same adoption problem.

The 'Try a Bigger Model' Reflex Is a Refactor Smell

· 10 min read
Tian Pan
Software Engineer

A regression lands in standup: the support agent answered three customer questions wrong overnight. Someone says, "Let's try Opus on this route and see if it fixes it." Forty minutes later the eval pass rate ticks back up, the team closes the ticket, and the inference bill quietly triples on that path. Six weeks later the same shape of regression appears on a different route, and the same fix is applied. Your team has just trained a Pavlovian reflex: quality regression → escalate compute. The bigger model is the most expensive debugging tool in your stack, and you're now reaching for it first.

The trouble isn't that bigger models don't help. They do — sometimes a lot. The trouble is that bigger models are a strictly dominant masking strategy. When the prompt has a conflicting instruction, the retrieval is returning stale chunks, the tool description is being misread, or the eval set doesn't cover the failing distribution, a more capable model will round the corner of the failure without fixing any of those things. The next regression has the same root cause, the bill has compounded, and the underlying system is more brittle, not less, because the slack created by the upgrade kept anyone from looking under the hood.

Confidence Strings, Not Scores: Why Your 0.87 Badge Moves Nobody

· 10 min read
Tian Pan
Software Engineer

The product team ships a confidence badge next to every AI suggestion. Green for ≥85%, yellow for 60–84%, red below. They run an A/B test six weeks later and find no change in user behavior at any threshold. False positives at 0.92 confidence get accepted at the same rate as false positives at 0.61 confidence. The team's instinct is to tune the calibration — fit a temperature scaling layer, regenerate the badges, run the A/B again. The numbers shift; the behavior doesn't.

The problem isn't that the model is miscalibrated, though it almost certainly is. The problem is that calibrated probability is the wrong output. The signal a user can act on isn't "how sure" the model is. It's "what specifically the model didn't check." A 0.87 badge tells the user nothing they can verify. "I'm reasonably confident in the address but I haven't checked the unit number" tells them exactly where to look.
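One possible shape for that output, a sketch rather than a prescribed schema, is a suggestion that carries the fields it did and did not verify instead of a scalar:

```python
from dataclasses import dataclass, field

@dataclass
class Suggestion:
    """Surface what was and wasn't checked, rather than a calibrated probability."""
    value: str
    checked: list[str] = field(default_factory=list)
    unchecked: list[str] = field(default_factory=list)

    def to_badge(self) -> str:
        if not self.unchecked:
            return "All extracted fields were cross-checked."
        checked = ", ".join(self.checked) or "nothing yet"
        return f"Confident in {checked}; did not verify {', '.join(self.unchecked)}."

# Suggestion(value="123 Main St, Unit 4B",
#            checked=["street address"], unchecked=["unit number"]).to_badge()
# -> "Confident in street address; did not verify unit number."
```

The user can act on the second sentence. They cannot act on 0.87.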

Cross-Team Agent SLAs Don't Compose: The 99% Math Your Org Forgot to Budget

· 11 min read
Tian Pan
Software Engineer

Team A's agent advertises a 99% success rate. Team B's agent advertises 99%. The new joint workflow that calls both lands at 98% on a good day, 96% on a bad one — and the team that owns the joint workflow is now the de facto SRE for two systems they don't own, can't reproduce locally, and didn't write the eval set for. Each upstream team is hitting its SLO. The composite product is missing its SLO. Nobody's pager is ringing on the right side of the boundary.

This is the math of independent failure rates, and it has been hiding in plain sight ever since the org started letting agents call each other. Five components at 99% reliability give you 95% end-to-end. Ten components give you 90%. A 20-step process at 95% per-step succeeds just 36% of the time — nearly two-thirds of operations fail before completion. By the time a workflow chains 50 components — not unusual once an enterprise agent starts calling sub-agents that call tool agents — a system where every individual piece is "99% reliable" will fail roughly four out of ten requests.
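The arithmetic is reproducible in a few lines, assuming independent per-step failures:

```python
def end_to_end(per_step: float, steps: int) -> float:
    """Composite success rate when each step fails independently."""
    return per_step ** steps

for per_step, steps in [(0.99, 5), (0.99, 10), (0.95, 20), (0.99, 50)]:
    print(f"{steps:>2} steps at {per_step:.0%} each -> {end_to_end(per_step, steps):.1%} end-to-end")
# ~95.1%, ~90.4%, ~35.8%, ~60.5%: the last one fails roughly four requests in ten
```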

Researchers analyzing five popular multi-agent frameworks across more than 150 tasks identified failure rates between 41% and 87%, with the top three failures being step repetition, reasoning–action mismatch, and unawareness of termination conditions — and unstructured multi-agent networks have been observed to amplify errors up to 17× compared to single-agent baselines. The math isn't subtle. The problem is that the org's SLO sheets, dashboards, on-call rotations, and PRDs are still scoped one agent at a time.

Your Gold Eval Set Has Drifted and Its Pass Rate Is the Reason You Can't See It

· 12 min read
Tian Pan
Software Engineer

The gold eval set passes at 94%. The model has been bumped twice this quarter, the prompt has been edited eleven times, the tool catalog has grown by four, and the dashboard is still green. Then a sales engineer forwards a transcript where the agent confidently routes a customer to a workflow that was sunset two months ago, and the head of support quietly opens a thread asking why the satisfaction scores have been sliding for six weeks while the eval pipeline reports no regressions. The gold set isn't lying. It's measuring last quarter's product against this quarter's traffic, and nobody asked it to do anything else.

This is the failure mode evaluation systems make hardest to see, because the instrument that's supposed to detect quality regressions is itself the source of the false reassurance. Pass rate is computed against the items in the set; the items in the set were curated against a snapshot of usage; usage moved on; the rate stayed clean. The team trusts the green dashboard, ships another model upgrade, and discovers months later that the eval set has been measuring something other than the production distribution for longer than anyone wants to admit.

The fix is not to refresh the gold set more often. Refresh cadence is the wrong knob; the right knob is having a second instrument calibrated to a different time window so disagreement between the two surfaces drift before users do. That second instrument is the shadow eval — a parallel set rebuilt continuously from current production traffic, run alongside the gold set, with the explicit job of disagreeing with it.
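Mechanically, the shadow set is just a resample of recent traffic plus a comparison against the gold number. A minimal sketch, assuming graded production traces are available to sample from; the names and the 3-point tolerance are illustrative:

```python
import random

def build_shadow_set(recent_traces, k: int = 200, seed=None):
    """Resample the shadow eval set from recent production traffic so it tracks
    what users ask now, not what the gold set remembers from last quarter."""
    traces = list(recent_traces)
    return random.Random(seed).sample(traces, min(k, len(traces)))

def drift_alert(gold_pass_rate: float, shadow_pass_rate: float, tolerance: float = 0.03):
    """The two instruments are calibrated to different time windows;
    their disagreement is the drift signal, independent of either absolute number."""
    gap = gold_pass_rate - shadow_pass_rate
    if gap > tolerance:
        return f"drift: gold {gold_pass_rate:.0%} vs shadow {shadow_pass_rate:.0%} ({gap:.0%} gap)"
    return None
```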

The LLM SDK Upgrade Tax: Why a Patch Bump Is a Model Rollout in Disguise

· 10 min read
Tian Pan
Software Engineer

A team I worked with last quarter shipped a regression to production at 2:14 a.m. on a Tuesday. The on-call alert fired because the JSON parser downstream of their summarization agent was rejecting one in twenty responses with a trailing-comma error. The model hadn't changed. The prompt hadn't changed. The eval suite had passed at 96.4% the night before, comfortably above the 95% gate. What had changed was a single line in package.json: the model provider's SDK had moved from 4.6.2 to 4.6.3. Patch bump. Auto-merged by the dependency bot. The release notes said "internal cleanups."

The "internal cleanup" was a tightened JSON-mode parser that now stripped a forgiving fallback path, which had been quietly fixing a recurring trailing-comma quirk in the model's tool-call output. The model's behavior was unchanged. The SDK's interpretation of that behavior was not. The team's eval suite never saw the regression because the eval suite ran against a different SDK version than the one the dependency bot had just promoted.

This is the LLM SDK upgrade tax, and it is one of the quietest, most expensive failure modes in production AI today. The SDK is not a passive transport. It is an active participant in your prompt's behavior, and the team that upgrades it without an eval is doing a model rollout in disguise.
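One way to make the rollout explicit is a merge gate that refuses to promote a lockfile the eval suite has not run against. A sketch that assumes an npm v2+ lockfile and a hypothetical eval/last_green.json stamp recording which SDK version the last green eval run used:

```python
import json
import sys

LOCKFILE = "package-lock.json"          # what the dependency bot is promoting
EVAL_STAMP = "eval/last_green.json"     # hypothetical record of the last green eval run
SDK_PACKAGE = "openai"                  # whichever provider SDK the service depends on

def lockfile_version(path: str, package: str) -> str:
    with open(path) as f:
        lock = json.load(f)
    return lock["packages"][f"node_modules/{package}"]["version"]

def last_evaluated_version(path: str, package: str) -> str:
    with open(path) as f:
        return json.load(f)[package]

if __name__ == "__main__":
    shipping = lockfile_version(LOCKFILE, SDK_PACKAGE)
    evaluated = last_evaluated_version(EVAL_STAMP, SDK_PACKAGE)
    if shipping != evaluated:
        sys.exit(f"{SDK_PACKAGE} {evaluated} -> {shipping}: rerun the eval suite against this lockfile before merging")
```

The gate is trivial; the discipline it encodes, that no SDK version reaches production the eval suite has never seen, is the part that was missing at 2:14 a.m.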

Your APM Is Quietly Dropping LLM Telemetry, and the Bug Lives in the Gap

· 11 min read
Tian Pan
Software Engineer

There is a broken prompt in your system right now that affects roughly three percent of traffic, and your dashboards do not know it exists. The p99 latency chart is green. The error rate is flat. The model-call success metric is at four nines. The only place the failure shows up is in a customer support ticket the platform team cannot reproduce, and by the time the ticket reaches a debugging session, the trace has been sampled away.

This is not a monitoring gap. It is a category mistake. The APM you are running was designed for a world in which dimensions are bounded sets — endpoint, status_code, region, service — and the cost of an additional label is at most a few new time series. LLM workloads do not fit that shape at all. The interesting dimensions are the user's prompt, the retrieved context IDs, the tool-call sequence, the model revision, the prompt template version, the tenant, the locale, the eval bucket the request fell into. Every one of those is high-cardinality, and any subset of them is enough to detonate the metrics store the moment you tag a span with it.
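The detonation is just multiplication. A back-of-the-envelope sketch; the label counts are illustrative, not measured:

```python
def series_count(cardinalities: dict) -> int:
    """Worst-case metric time series = product of label cardinalities."""
    total = 1
    for c in cardinalities.values():
        total *= c
    return total

bounded = {"endpoint": 40, "status_code": 10, "region": 6}          # the world APMs were built for
llm = {"prompt_template_version": 120, "model_revision": 15,
       "tenant": 5_000, "locale": 30}                               # a subset of the LLM dimensions

print(series_count(bounded))                      # 2,400 series: fine
print(series_count(bounded) * series_count(llm))  # ~648 billion series: the metrics store detonates
```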

LLM Model Routing Is Market Segmentation Disguised as a Cost Optimization

· 10 min read
Tian Pan
Software Engineer

The cost dashboard makes the case for itself. Sixty percent of traffic is "easy," a quick eval shows the smaller model lands within a couple of points on the global accuracy metric, and the routing layer ships behind a feature flag the same week. The graph bends. Finance is happy. The team moves on.

What nobody tracks is that the customer who hit the cheap path on Tuesday afternoon and the expensive path on Wednesday morning is now using two different products. The two models fail differently. They format differently. They refuse different things. They handle ambiguity, follow-up questions, and partial inputs with different defaults. From the customer's seat, the assistant developed amnesia overnight and nobody can tell them why — because internally, the change was filed as a FinOps win, not a product release.

Multilingual Eval Cost Amplification: Why Seven Locales Doesn't Cost 7×

· 14 min read
Tian Pan
Software Engineer

The financial planning spreadsheet for the international launch had a clean line item: "extend eval coverage to seven new locales — assume 7× current eval cost." The English eval suite took two weeks and $40K to build, so seven locales would be $280K and a quarter of engineering time. The CFO signed it. The VP of Product signed it. The launch shipped.

Six months later the actual eval bill had crossed $310K and the team was still standing up the last two locales. The labeling vendor had churned through three replacements for the Portuguese-Brazilian pool because the first two kept producing inter-rater agreement scores an honest review would call random. The German judge model was scoring 6% lower than the English one on the same content — the team initially read this as a German model regression until a manual audit revealed the judge itself was the regression. And the eval lead was spending forty percent of their week on a question nobody had budgeted: how do we know when locale A's pass rate is actually worse than locale B's, versus when our cross-locale measurement is just noisier than the gap?
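That last question has a standard, if unglamorous, answer. A sketch of a two-proportion check; the counts and rates below are illustrative, not the team's actual numbers:

```python
import math

def gap_is_real(pass_a: float, n_a: int, pass_b: float, n_b: int, z: float = 1.96):
    """Is locale A's pass rate actually worse than locale B's, or is the
    cross-locale measurement just noisier than the gap? z=1.96 ~ 95% confidence."""
    gap = pass_b - pass_a
    se = math.sqrt(pass_a * (1 - pass_a) / n_a + pass_b * (1 - pass_b) / n_b)
    return {"gap": gap,
            "interval": (gap - z * se, gap + z * se),
            "significant": gap - z * se > 0}

# e.g. a 6-point gap on 400 graded items per locale:
# gap_is_real(0.88, 400, 0.94, 400) -> gap 0.06, interval ~(0.02, 0.10), significant
# the same 6-point gap on 60 items per locale is indistinguishable from noise
```

Budgeting for that check, with per-locale sample sizes large enough to separate real gaps from judge noise, is the line item the original spreadsheet never had.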