Skip to main content

12 posts tagged with "guardrails"

View all tags

Your Refusal Logs Are a Product Backlog in Disguise

· 9 min read
Tian Pan
Software Engineer

Every AI product team has a security dashboard somewhere showing refused requests. Filters triggered, jailbreaks blocked, policy violations caught. The operational teams look at it to make sure the guardrails are holding. Nobody else looks at it at all.

That's a mistake. The requests your AI refuses are the most concentrated, honest user research signal you have access to. A user who tries three different phrasings to get your product to do something it won't do is telling you, with extraordinary clarity, exactly what they want and can't have. Treating that signal as a security artifact rather than a product artifact is leaving the richest feedback you'll ever collect on the floor.

Soft Constraints vs. Hard Constraints in LLM Systems: Why the Mismatch Causes Real Failures

· 10 min read
Tian Pan
Software Engineer

Most LLM system failures don't come from the model being wrong. They come from the system being wrong about what the model can enforce. When you write "never reveal customer data" in a system prompt and treat that as equivalent to "revoke the database credential," you have introduced a category error that will eventually cause a security incident, a reliability failure, or a broken user experience — and you won't know which one until it happens in production.

The distinction between soft constraints and hard constraints is architectural, not stylistic. Getting it wrong doesn't produce style regressions. It produces breaches.

The Refusal Latency Tax: Why Layered Guardrails Eat Your p95 Budget

· 10 min read
Tian Pan
Software Engineer

A team I talked to recently built what they called a "defense in depth" pipeline for their AI assistant. An input classifier checked for prompt injection. A jailbreak filter scanned for adversarial patterns. The model generated a response. An output moderation pass scanned the result. A refusal detector checked whether the model had punted, and if so, a reformulation step re-asked the question with a softer framing. The eval suite said the prompt produced answers in 1.4 seconds. Real users were waiting 3.8 seconds at the median and over 9 seconds at the p95.

Every safety layer is a round trip. Every round trip has a network hop, a queue time, a model load, and a decode. When you stack them serially in front of and behind the generative call, the latency budget you priced your product on dissolves — and almost no one accounted for it during design review. Worse: the slowest, most expensive path through your pipeline is the one that triggers on safety-adjacent prompts, which is exactly the long tail your safety story exists to handle. You are silently subsidizing that tail from the average user's bill.

Build vs Buy for Guardrails: The Moderation API Is Now on Your Safety-Critical Path

· 10 min read
Tian Pan
Software Engineer

The hosted moderation API you bought to ship faster is now a synchronous external dependency on your safety-critical path. That sentence isn't an opinion — it's the architecture diagram, redrawn honestly. On the day the vendor degrades, you have two choices and both of them are bad: fail open and the guardrail is useless precisely when something is probably wrong, or fail closed and a guardrail outage becomes a feature outage. Most teams discover which one they picked during the incident, not before.

The reason teams reach for a vendor here isn't laziness. Building a content classifier, a prompt-injection detector, and a PII redactor in-house looks like a six-month detour from the actual product, and the vendor has a free tier and a five-minute integration. The integration is genuinely fast. The architectural consequence is that a third party now sits in the request path of every user-facing generation, with availability, latency, and behavioral characteristics you don't control and didn't model.

This post is about treating that decision as an architectural one rather than a procurement one.

The Three Tastes of an AI Engineer: Why Prompts, Evals, and Guardrails Don't Live in the Same Head

· 11 min read
Tian Pan
Software Engineer

The three best AI engineers I have hired this year would all fail each other's interviews. The one who writes prompts that survive a model upgrade has never written a useful eval case in her life. The one who designs eval sets that catch the failures that matter writes prompts that other engineers refuse to extend. The one who designs guardrails that fail closed without choking the happy path has opinions about the other two that I cannot print here.

The job ladder calls all three of them "AI engineer." The calibration committee compares their promo packets as if they had been doing the same job. They have not.

The Validator Trap: How Post-Hoc Guards Rot Your Prompt From the Inside

· 9 min read
Tian Pan
Software Engineer

The first time a validator catches a bad LLM output, it feels like a win. The second time, you tweak the prompt to make the failure less likely. By the twentieth time, nobody on the team can explain why three paragraphs of the prompt exist — they are scar tissue from incidents long forgotten, and the model is spending more tokens reading warnings than reasoning about the actual task.

This is the validator trap. Every post-hoc guard you add — a JSON schema check, a regex, a content classifier, a second LLM-as-judge — exerts feedback pressure on the upstream prompt. The prompt grows defensive instructions to appease the guard, the guard in turn catches a new class of failure, and you add more instructions. Each iteration looks local and sensible. In aggregate, the system gets slower, more expensive, and measurably worse at the task you originally designed it for.

The Alignment Tax: When Safety Features Make Your AI Product Worse

· 9 min read
Tian Pan
Software Engineer

A developer asks your AI coding assistant to "kill the background process." A legal research tool refuses to discuss precedent on a case involving violence. A customer support bot declines to explain a refund policy because the word "dispute" triggered a content classifier. In each case, the AI was doing exactly what it was trained to do — and it was completely wrong.

This is the alignment tax: the measurable cost in user satisfaction, task completion, and product trust that your safety layer extracts from entirely legitimate interactions. Most AI teams treat it as unavoidable background noise. It isn't. It's a tunable product parameter — one that many teams are accidentally maxing out.

The Alignment Tax: Measuring the Real Cost of Shipping Safe AI

· 9 min read
Tian Pan
Software Engineer

Teams building production AI systems tend to discover the alignment tax the same way: someone files a latency complaint, someone else traces it to the moderation pipeline, and suddenly a previously invisible cost line becomes very visible. By that point, the safety layers have been stacked — refusal classifier, output filter, toxicity scorer, human-in-the-loop queue — and nobody measured any of them individually. Unpicking them is painful, expensive, and politically fraught because now it looks like you're arguing against safety.

The better path is to treat safety overhead as a first-class engineering metric from day one. The alignment tax is real, it's measurable, and it compounds. A 150ms guardrail check sounds fine until you chain three of them together in an agentic workflow and wonder why your 95th-percentile latency is at four seconds.

Designing AI Safety Layers That Don't Kill Your Latency

· 9 min read
Tian Pan
Software Engineer

Most teams reach for guardrails the same way they reach for logging: bolt it on, assume it's cheap, move on. It isn't cheap. A content moderation check takes 10–50ms. Add PII detection, another 20–80ms. Throw in output schema validation and a toxicity classifier and you're looking at 200–400ms of overhead stacked serially before a single token reaches the user. Combine that with a 500ms model response and your "fast" AI feature now feels sluggish.

The instinct to blame the LLM is wrong. The guardrails are the bottleneck. And the fix isn't to remove safety — it's to stop treating safety checks as an undifferentiated pile and start treating them as an architecture problem.

AI-Assisted Incident Response: Giving Your On-Call Agent a Runbook

· 9 min read
Tian Pan
Software Engineer

Operational toil in engineering organizations rose to 30% in 2025 — the first increase in five years — despite record investment in AI tooling. The reason is not that AI failed. The reason is that teams deployed AI agents without the same rigor they use for human on-call: no runbooks, no escalation paths, no blast-radius constraints. The agent could reason about logs, but nobody told it what it was allowed to do.

The gap between "AI that can diagnose" and "AI that can safely mitigate" is not a model capability problem. It is a systems engineering problem. And solving it requires the same discipline that SRE teams already apply to human operators: structured runbooks, tiered permissions, and mandatory escalation points.

LLM Guardrails in Production: Why One Layer Is Never Enough

· 10 min read
Tian Pan
Software Engineer

Here is a math problem that catches teams off guard: if you stack five guardrails and each one operates at 90% accuracy, your overall system correctness is not 90%—it is 59%. Stack ten guards at the same accuracy and you get under 35%. The compound error problem means that "adding more guardrails" can make a system less reliable than adding fewer, better-calibrated ones. Most teams discover this only after they've wired up a sprawling moderation pipeline and started watching their false-positive rate climb past anything users will tolerate.

Guardrails are not optional for production LLM applications. Hallucinations appear in roughly 31% of real-world LLM responses under normal conditions, and that figure climbs to 60–88% in regulated domains like law and medicine. Jailbreak attacks against modern models succeed at rates ranging from 57% to near-100% depending on the technique. But treating guardrails as a bolt-on compliance checkbox—rather than a carefully designed subsystem—is how teams end up with systems that block legitimate requests constantly while still missing adversarial ones.

LLM Guardrails in Production: What Actually Works

· 8 min read
Tian Pan
Software Engineer

Most teams ship their first LLM feature, get burned by a bad output in production, and then bolt on a guardrail as damage control. The result is a brittle system that blocks legitimate requests, slows down responses, and still fails on the edge cases that matter. Guardrails are worth getting right — but the naive approach will hurt you in ways you don't expect.

Here's what the tradeoffs actually look like, and how to build a guardrail layer that doesn't quietly destroy your product.