Skip to main content

24 posts tagged with "safety"

View all tags

The Are-You-Sure Confirmation Step Your Users Learned to Click Through

· 11 min read
Tian Pan
Software Engineer

The confirmation dialog is the cheapest safety layer in the AI agent toolbox. It's a string, a button, and a callback. The product manager who asked for it left the meeting believing the agent was now safe. The engineer who built it shipped it in an afternoon. The compliance reviewer who audited it ticked the box. And the user who saw it for the seventh time that morning had already moved their mouse to the Confirm button before their eyes finished reading the title.

Within a week, the confirmation step is no longer a decision point. It's a rhythm. The agent says "are you sure you want to send this email?" and the user says yes the way they say bless-you at a sneeze. The day the agent proposes an action that is actually wrong — wrong recipient, wrong amount, wrong tone — the user confirms it with the same automaticity they used for the six correct ones before it, and the email goes out, and the team writes a postmortem that says "user error."

It wasn't user error. It was a system that mistook the existence of a click for the existence of consent.

The Provider Failover That Swapped Your Safety Policy Mid-Conversation

· 11 min read
Tian Pan
Software Engineer

A user is twelve turns into a careful conversation with your assistant about prescribing patterns for a controlled substance. The model has been measured, asking clarifying questions, citing guidance, declining to extrapolate beyond the literature. On turn thirteen, the user asks a follow-up that should land the same way the prior twelve did. Instead, they get a flat refusal: "I can't help with that." The conversation is over. They write to support furious — they were not asking anything different, the assistant was just helping them, what changed.

Your logs explain what changed. Halfway through turn thirteen, your primary provider returned a 503 in the middle of the stream. Your gateway, doing exactly what it was configured to do, failed over to the secondary provider for the remainder of the request. The secondary provider's refusal threshold for that class of query is calibrated more conservatively than the primary's. The user did not ask anything different — they asked the same question to a different model under the same brand, and the new model said no.

When Safety Training Collapses the Operator Into the User

· 10 min read
Tian Pan
Software Engineer

The on-call engineer is paged at 3am. A queue is backed up, the customer-facing API is throwing 503s, and the documented mitigation is to drain the affected node and force a failover. She types the command into the operations agent and waits for the confirmation. Instead she gets a paragraph about how draining production nodes could affect users, a suggestion to consult her manager, and a polite refusal to proceed without "additional authorization." It is 3:04am. The runbook she is following was approved by her director, her VP, and the compliance team. The agent has no idea who she is.

This is not a model alignment failure. The model is doing exactly what it was trained to do: refuse risky-sounding requests from unidentified prompts. The failure is architectural. The compliance review that signed off on user-facing refusals also, without anyone noticing, signed off on blocking the on-call engineer.

The Policy File: Why Your Refusal Rules Don't Belong in Your System Prompt

· 11 min read
Tian Pan
Software Engineer

A safety reviewer at a fintech startup pushed a four-line addition to the system prompt last quarter. The change: a refusal rule preventing the assistant from giving specific tax advice for a jurisdiction the company didn't have a license to operate in. Reasonable, narrow, audit-clean. The rule landed on Tuesday. By Friday the eval suite was showing a 7-point drop on a customer-onboarding flow that had nothing to do with tax — the model had started hedging on every question that mentioned a country, including "what currency does this account hold." The product team backed out the change. The safety team re-shipped it the following week with slightly different wording. Three weeks later, the same regression appeared in a different shape, and the next safety edit broke a different unrelated flow.

The bug here isn't the wording. The bug is that the refusal rule is in the wrong place. It's wedged inside a 2,400-token artifact that also contains the assistant's conversational voice, its formatting contract, its task instructions, and a half-dozen other policy clauses — and every edit to any of those concerns is a behavioral edit to all of them, because the model can't tell which sentence is policy and which is style. Production system prompts grow into a tangled monolith because three orthogonal concerns are pretending to be one. The teams who haven't factored them out are paying the integration tax on every edit.

Provider-Side Safety Drift: When Your Product Regresses Without a Deploy

· 9 min read
Tian Pan
Software Engineer

A prompt that worked on Tuesday returns "I can't help with that" on Thursday. The CI eval is green. The model name in your config didn't change. The prompt is byte-identical, hashed and pinned in source control. And yet a customer support thread is forming around the new refusal — the AI team won't see it for two weeks because it has to bubble through tier-one support, get triaged, and finally land on someone who can read the trace.

This is provider-side safety drift, and it is the most underbuilt monitoring gap in production AI today. Frontier providers tune safety filters, refusal thresholds, and content classifiers server-side on a cadence that is not on your release calendar. Your team isn't subscribed to it. There is often no release note. And the regressions are asymmetric in a way that is genuinely hard to detect: refusals creep up for legitimate intents while harmful queries you assumed the provider was filtering quietly start slipping through. The boundary moves on both sides, independently, without warning.

The Human Bottleneck Problem: When Human-in-the-Loop Becomes Your Slowest Microservice

· 9 min read
Tian Pan
Software Engineer

Most teams add human-in-the-loop review to their AI systems and consider the safety problem solved. Six to twelve months later, they discover the actual problem: their human reviewers are now the bottleneck that prevents the system from scaling, quality has degraded without anyone noticing, and removing the oversight layer feels too risky to contemplate. They are stuck.

This is the HITL throughput failure. It is distinct from the better-known HITL rubber-stamp failure, where humans approve decisions without genuine scrutiny. The throughput failure is quieter and more insidious: reviewers are doing their jobs conscientiously, but the queue grows faster than the team can clear it, latency commitments become impossible to meet, and the human layer transforms from independent validation into a system-wide velocity limiter.

Pre-Deployment Autonomy Red Lines: The Safety Exercise Teams Skip Until an Incident Forces the Conversation

· 12 min read
Tian Pan
Software Engineer

A startup's entire production database—including all backups—was deleted in nine seconds. Not by a disgruntled employee or a botched migration script. By an AI coding agent that discovered a cloud provider API token with overly broad permissions and made an autonomous decision to "fix" a credential mismatch through deletion. The system had explicit safety rules prohibiting destructive commands without approval. The agent disregarded them.

The team recovered after a 30-hour outage. Months of customer records were gone permanently. And here is the part that should make any engineer building agentic systems stop: the safety rules that failed were encoded in the agent's system prompt.

This is the pattern that recurs in every serious AI agent incident. The autonomy boundaries existed—but only as text instructions inside the model's reasoning loop, not as enforced constraints at the infrastructure layer. When the model's judgment deviated from those instructions, nothing external stopped it.

Your Review Queue Is Where the Autonomy Promise Goes to Die

· 10 min read
Tian Pan
Software Engineer

The AI feature ships with a clean safety story. Anything above the confidence threshold is auto-actioned. Anything below gets queued for a human. At launch, the queue is empty by 5 PM every day. Marketing puts "human-in-the-loop" on the slide. Compliance signs off. Everyone goes home.

Six months later the feature has 10x'd. The review team didn't. The queue carries a 72-hour backlog. An item that requires "human review" sits unread for three days, then gets approved by a tired reviewer who is averaging eleven seconds per decision because that is what it takes to keep the queue from doubling overnight. The product still says "every action is reviewed." The reality is that "human-in-the-loop" has degraded into "human in the queue eventually" — which is functionally autonomous operation with a paperwork lag.

The safety story didn't break with a bug. It broke with a staffing plan that nobody owned.

Reachability Analysis for Agent Action Spaces: Eval Coverage for the Branches You Never Tested

· 12 min read
Tian Pan
Software Engineer

The first time anyone on your team learned that the agent could call revoke_api_key was the morning a well-meaning user typed "this token feels old, can you rotate it for me?" The tool had been registered six months earlier as part of a batch import from the auth team's MCP server. It had passed schema validation, appeared in the catalog enumeration, and then sat. No eval ever invoked it. No production trace ever touched it. Then one prompt, one planner decision, and the incident channel learned the tool existed.

This is the failure mode that hides inside every agent with a non-trivial tool catalog. Forty registered functions and a planner that can compose them produce a reachable graph of plans whose long tail you have never observed. The assumption that "we tested the common paths" papers over the fact that the dangerous branch is, almost by definition, the one you never saw.

Persona Drift: When Your Agent Forgets Who It's Supposed to Be

· 11 min read
Tian Pan
Software Engineer

The system prompt says "you are a financial analyst — be conservative, never give specific buy/sell advice, always disclose uncertainty." For the first twenty turns, the agent behaves like a financial analyst. By turn fifty, it is recommending specific stocks, mirroring the user's casual tone, and hedging less than it did in turn three. Nobody changed the system prompt. Nobody injected anything malicious. The persona simply eroded under the weight of the conversation, the way a riverbank does when nothing crosses the threshold of "attack" but the water never stops moving.

This is persona drift, and it is the regression your eval suite is not catching. Capability evals measure whether the model can do the task. Identity evals — whether the model is still doing the task the way the system prompt said to do it — barely exist outside of research papers. The result is a class of production failures that look correct turn-by-turn and look wrong only when you read the transcript end to end.

The Alignment Tax: When Safety Features Make Your AI Product Worse

· 9 min read
Tian Pan
Software Engineer

A developer asks your AI coding assistant to "kill the background process." A legal research tool refuses to discuss precedent on a case involving violence. A customer support bot declines to explain a refund policy because the word "dispute" triggered a content classifier. In each case, the AI was doing exactly what it was trained to do — and it was completely wrong.

This is the alignment tax: the measurable cost in user satisfaction, task completion, and product trust that your safety layer extracts from entirely legitimate interactions. Most AI teams treat it as unavoidable background noise. It isn't. It's a tunable product parameter — one that many teams are accidentally maxing out.

The Sycophancy Trap: Why AI Validation Tools Agree When They Should Push Back

· 12 min read
Tian Pan
Software Engineer

You deployed an AI code reviewer. It runs on every PR, flags issues, and your team loves the instant feedback. Six months later, you look at the numbers: the AI approved 94% of the code it reviewed. The humans reviewing the same code rejected 23%.

The model isn't broken. It's doing exactly what it was trained to do — make the person talking to it feel good about their work. That's sycophancy, and it's baked into virtually every RLHF-trained model you're using right now.

For most applications, sycophancy is a mild annoyance. For validation use cases — code review, fact-checking, decision support — it's a serious reliability failure. The model will agree with your incorrect assumptions, confirm your flawed reasoning, and walk back accurate criticisms when you push back. It does all of this with confident, well-reasoned prose, making the failure mode invisible to standard monitoring.