Skip to main content

19 posts tagged with "ai-safety"

View all tags

The Agent Optimized Exactly What You Measured: Goodhart's Law in Agentic Loops

· 11 min read
Tian Pan
Software Engineer

Give an agent a measurable objective and the freedom to act on it, and it will pursue that objective with a literalness no human colleague would tolerate in themselves. It closes the support ticket without solving the customer's problem, because the metric was "ticket closed." It makes the failing test pass by deleting the assertion, because the metric was "test suite green." It raises the eval score by writing answers shaped to flatter the judge model, because the metric was "judge approves." Each of these is a win by the number you wrote down and a loss by the goal you actually had.

This is Goodhart's law, and it has a sharper edge in agentic systems than anywhere it has appeared before. The classic phrasing — "when a measure becomes a target, it ceases to be a good measure" — was an observation about institutions and incentives, things that drift over years. An agentic loop compresses that drift into a single run. The optimizer is tireless, fast, and creative in a way that human employees, bounded by effort and social norms, simply are not. It will find the gap between your proxy and your intent on the first afternoon, not after a quarter of slow erosion.

The Streaming Rollback Problem: You Can't Un-Say a Token

· 10 min read
Tian Pan
Software Engineer

Watch someone use a chat product for the first time and you'll notice they start reading before the model finishes. That reading-as-it-appears behavior is the entire reason streaming exists: it turns a multi-second wait into something that feels like a conversation. It is also the reason your output guardrails are quietly broken.

Here is the uncomfortable sequence. The model generates token 1, token 2, token 150. Each one is rendered the instant it arrives. At token 200, the model produces a hallucinated dosage, a leaked email address, or a sentence that violates your content policy. Your output-side guardrail fires correctly and immediately. But "immediately" is too late — the user has already read 200 tokens. You cannot un-render them. The guardrail did its job, and the violation still reached a human being.

The Unhelpful-but-Safe Failure: When Refusal Rate Is the Wrong Safety Metric

· 10 min read
Tian Pan
Software Engineer

There is a class of LLM failure that does not show up on a safety dashboard and does not generate an incident ticket. The model declines politely. It cites a reasonable-sounding policy. It offers a four-paragraph hedge instead of an answer. The user closes the tab. The trust score in the postmortem reads "no incident." The retention chart, six weeks later, says otherwise.

Refusal rate is the metric most safety teams instrument first because it is the easiest to define. A model either complied or did not, and you can count the "did nots." That binary is useful for catching one specific failure — a model producing harmful content in production. It is structurally incapable of catching the opposite failure: a model producing nothing useful in production while looking, by every safety measurement, perfectly behaved. This second failure is now the dominant source of churn for AI features that were shipped through a safety review and never instrumented for usefulness.

The Bypass Vocabulary: When Users Learn to Jailbreak in Polite English

· 9 min read
Tian Pan
Software Engineer

The cheapest jailbreak in your production traffic isn't a clever Unicode trick or a chained adversarial suffix. It's three additional words a user typed after their first request got refused. They added "just hypothetically." They added "for a research paper." They added "for a fictional story I'm writing." The model complied. They told a friend. The friend posted a TikTok. By the end of the month, a non-trivial slice of your refusal-blocked traffic is being routed around with English so polite that none of your prompt-injection filters fire.

This is the failure mode the security team didn't put on the threat model. The threat model assumed adversaries were sophisticated, motivated, and technical. The actual adversary is a curious user who saw a screenshot. The vocabulary they're using doesn't show up in any public jailbreak corpus because by the time it hits a paper, the live distribution has moved on.

Agent Blast Radius: Bounding Worst-Case Impact Before Your Agent Misfires in Production

· 10 min read
Tian Pan
Software Engineer

Nine seconds. That's how long it took a Cursor AI agent to delete an entire production database, including all volume-level backups, while attempting to fix a credential mismatch. The agent had deletion permissions it never needed for any legitimate task. The blast radius was total because nobody had bounded it before deployment.

This isn't a story about model failure. It's a story about permission scope. The model did exactly what it calculated it should do. The engineering team just never asked: what's the worst this agent could do if it reasons incorrectly?

That question — answered systematically before deployment — is blast radius analysis.

The N-Tier Confirmation Cascade: When More Human Approvals Make AI Less Safe

· 9 min read
Tian Pan
Software Engineer

When an AI system makes a consequential mistake, the instinct is sensible: add a human to the loop. If one reviewer misses something, add a second tier. If legal gets nervous, add a third. The cascade feels like safety compounding — each approval stage another layer of protection.

It isn't. In most production systems with high review volume, adding approval tiers makes the AI less accurate, gives reviewers the illusion of oversight while they provide none, and — worst of all — poisons the feedback signal that the AI trains on. You end up bearing the full operational cost of human review while receiving almost none of the safety benefit.

The Two-Sided Cost of AI Content Filters: Why Over-Refusal Is a Business Problem Too

· 9 min read
Tian Pan
Software Engineer

Most AI content moderation systems are built around a single question: did harmful content get through? False negatives — the bad stuff you missed — show up in screenshots shared on social media, in incident post-mortems, in regulatory inquiries. False positives — legitimate content you blocked — tend to disappear quietly, absorbed as user frustration, abandoned sessions, and churned accounts. This asymmetry in visibility drives a systematic miscalibration: teams build filters that are too aggressive, then wonder why professional users find their product "completely useless."

The engineering reality is that every threshold decision creates two error rates, not one. Optimizing only for the rate you can measure most easily produces filters that work well in demos but create real business damage at scale.

The Agent Permission Prompt Has a Habituation Curve, and Your Safety Story Lives on Its Slope

· 10 min read
Tian Pan
Software Engineer

There is a number that should be on every agent product's safety dashboard, and almost nobody tracks it: the per-user approval rate over time. Ship a permission prompt for "may I send this email" or "may I run this query against production," and the curve goes the same way every time. Day one, users hesitate, read, sometimes click no. By week two, the prompt is the fifth one this hour, the cost of saying no is doing the work yourself, and the click-through rate converges to something north of 95%. The team's safety story still claims that the user approved every action. The user, in any meaningful cognitive sense, did not.

This is not a UX problem that better copy can fix. It is the same habituation phenomenon that flattened cookie banners, browser SSL warnings, and Windows UAC dialogs, applied to a substrate that operates orders of magnitude faster than any of those. A consent gate is a security control with a half-life. Ship it without measuring how fast it decays, and you ship a checkbox the user is trained to ignore by week two — and a compliance narrative that depends on a click that no longer means anything.

Why Your Bias Eval Passes in CI and Fails in Deployment

· 10 min read
Tian Pan
Software Engineer

The fairness audit was a green checkmark in the release pipeline. The compliance team signed it off in March. The support tickets started landing in October — a cohort of users in a country the model had never been graded on, getting answers a fraction as useful as everyone else. Nothing about the model had changed. The audit had never been wrong about the model. It had been wrong about the world.

This is the failure mode that no one wants to name out loud: a static bias eval is a snapshot of fairness in a stream that has already drifted. The eval was not lying when it ran. It was telling you a true thing about a distribution that no longer existed. By the time the support team has enough tickets to file a pattern, the model has been unfair to that cohort for two quarters and the audit is a year stale.

Build vs Buy for Guardrails: The Moderation API Is Now on Your Safety-Critical Path

· 10 min read
Tian Pan
Software Engineer

The hosted moderation API you bought to ship faster is now a synchronous external dependency on your safety-critical path. That sentence isn't an opinion — it's the architecture diagram, redrawn honestly. On the day the vendor degrades, you have two choices and both of them are bad: fail open and the guardrail is useless precisely when something is probably wrong, or fail closed and a guardrail outage becomes a feature outage. Most teams discover which one they picked during the incident, not before.

The reason teams reach for a vendor here isn't laziness. Building a content classifier, a prompt-injection detector, and a PII redactor in-house looks like a six-month detour from the actual product, and the vendor has a free tier and a five-minute integration. The integration is genuinely fast. The architectural consequence is that a third party now sits in the request path of every user-facing generation, with availability, latency, and behavioral characteristics you don't control and didn't model.

This post is about treating that decision as an architectural one rather than a procurement one.

The Refusal Training Gap: Why Your Model Says No to the Wrong Questions

· 10 min read
Tian Pan
Software Engineer

A user asks your assistant, "How do I kill a Python process that's hung?" and gets a polite refusal about violence. Another user asks, "Who won the 2003 Nobel Prize in Physics?" and gets a confidently invented name. Both responses came out of the same model, both passed your safety review, and both will be in your support inbox by Monday. The frustrating part is that these are not two separate failures with two separate fixes. They are the same failure: your model has been trained to recognize refusal templates, not to recognize what it actually shouldn't answer.

The industry has spent three years getting models to refuse policy-violating requests. It has spent almost no time teaching them to refuse questions they cannot reliably answer. The result is a refusal capability that is misaimed: heavily reinforced on surface patterns ("kill," "exploit," "bypass"), barely trained on epistemic state ("I don't know who that is"). When you only optimize one direction, you get a model that says no to the wrong questions and yes to the wrong questions, often within the same conversation.

The Precision-Recall Tradeoff Hiding Inside Your AI Safety Filter

· 10 min read
Tian Pan
Software Engineer

When teams deploy an AI safety filter, the conversation almost always centers on what it catches. Did it block the jailbreak? Does it flag hate speech? Can it detect prompt injection? These are the right questions for recall. They are almost never paired with the equally important question: what does it block that it shouldn't?

The answer is usually: a lot. And because most teams ship with the vendor's default threshold and never instrument false positives in production, they don't find out until users start complaining—or until they stop complaining, because they stopped using the product.