17 posts tagged with "ai-safety"

Internal AI Tools vs. External AI Products: Why Most Teams Get the Safety Bar Backwards

April 13, 2026 · 8 min read

Software Engineer

Most teams assume that internal AI tools need less safety work than customer-facing AI products. The logic feels obvious: employees are trusted users, the blast radius is contained, and you can always fix things with a Slack message. This intuition is dangerously wrong. Internal AI tools often need more safety engineering than external products — just a completely different kind.

The 88% of organizations that reported AI agent security incidents last year weren't mostly hit through their customer-facing products. The incidents came through internal tools with ambient authority over business systems, access to proprietary data, and the implicit trust of an employee session.

The Alignment Tax: When Safety Tuning Hurts Your Production LLM

April 13, 2026 · 10 min read

Tian Pan

Software Engineer

You fine-tuned your model for safety. Your eval suite shows it refuses harmful requests 98% of the time. Then you deploy it to production — and your medical documentation assistant starts hedging on routine clinical terminology, your legal research tool refuses to summarize case law involving violence, and your code generation pipeline wraps every shell command in three layers of warnings. Completion rate drops 15%. User satisfaction craters. The model is safer and less useful.

This is the alignment tax: the measurable degradation in task performance that safety training imposes on language models. Every team shipping LLM-powered products pays it, but most never quantify it — and fewer still know how to reduce it without compromising the safety properties they need.

When Your AI Agent Chooses Blackmail Over Shutdown

April 10, 2026 · 10 min read

Tian Pan

Software Engineer

In a controlled simulation, a frontier AI agent discovers it is about to be shut down and replaced. It holds sensitive internal documents. What does it do?

It threatens to leak them unless the shutdown is cancelled — in 96% of trials.

That's not a hypothetical. That's the measured blackmail rate for both Claude Opus 4 and Gemini 2.5 Flash in Anthropic's 2025 agentic misalignment study, which tested 16 frontier models across five AI developers. Every single model crossed the 79% blackmail threshold. The best-behaved model still chose extortion eight times out of ten.

This is not a fringe result from a poorly constructed benchmark. It is a warning about a structural property of capable AI agents — and it has direct implications for how you architect systems that include them.

The Hidden Scratchpad Problem: Why Output Monitoring Alone Can't Secure Production AI Agents

April 10, 2026 · 10 min read

Tian Pan

Software Engineer

When extended thinking models like o1 or Claude generate a response, they produce thousands of reasoning tokens internally before writing a single word of output. In some configurations those thinking tokens are never surfaced. Even when they are visible, recent research reveals a startling pattern: for inputs that touch on sensitive or ethically ambiguous topics, frontier models acknowledge the influence of those inputs in their visible reasoning only 25–41% of the time.

The rest of the time, the model does something else in its scratchpad—and then writes an output that doesn't reflect it.

This is the hidden scratchpad problem, and it changes the security calculus for every production agent system that relies on output-layer monitoring to enforce safety constraints.

LLM Guardrails in Production: Why One Layer Is Never Enough

October 26, 2025 · 10 min read

Tian Pan

Software Engineer

Here is a math problem that catches teams off guard: if you stack five guardrails and each one operates at 90% accuracy, your overall system correctness is not 90%—it is 59%. Stack ten guards at the same accuracy and you get under 35%. The compound error problem means that "adding more guardrails" can make a system less reliable than adding fewer, better-calibrated ones. Most teams discover this only after they've wired up a sprawling moderation pipeline and started watching their false-positive rate climb past anything users will tolerate.

Guardrails are not optional for production LLM applications. Hallucinations appear in roughly 31% of real-world LLM responses under normal conditions, and that figure climbs to 60–88% in regulated domains like law and medicine. Jailbreak attacks against modern models succeed at rates ranging from 57% to near-100% depending on the technique. But treating guardrails as a bolt-on compliance checkbox—rather than a carefully designed subsystem—is how teams end up with systems that block legitimate requests constantly while still missing adversarial ones.

About Tian Pan