15 posts tagged with "guardrails"

The Legal Disclaimer That Leaked From The Answer Into The Tool Call Arguments

· 9 min read
Tian Pan
Software Engineer

Your counsel approved a one-line system-prompt directive: append "This information is not legal advice and should not be relied upon as such" to every response touching a regulated domain. Three weeks later, a user files a bug because their calendar event's description field opens with that same line, followed by the contract summary the agent was supposed to put into a meeting invite. The agent did not malfunction. It did exactly what the system prompt told it to do, and that instruction applied to every channel the model writes text into, including the JSON arguments of the next tool it called.

The instruction was a content-formatting rule and the model treated it as one. It did not distinguish "user-facing response" from "tool call argument" because nothing in the prompt told it those were different surfaces. The disclaimer ended up in the calendar, in the email draft, in the Slack message your agent posted on the user's behalf. Each of these was a separate downstream system whose author had no idea a compliance string was about to be injected into a structured field, and each had a different cleanup cost.
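One way to picture the fix is to attach the disclaimer in code at the user-facing boundary instead of asking the model to do it everywhere. The sketch below is illustrative only: `ModelTurn`, `finalize`, and the `"arguments"` key are hypothetical names, not part of any particular agent framework.

```python
from dataclasses import dataclass, field
from typing import Any

# The compliance string from the system prompt (taken from the example above).
DISCLAIMER = "This information is not legal advice and should not be relied upon as such"


@dataclass
class ModelTurn:
    """Hypothetical container: one model turn with user-facing text and tool calls."""
    user_text: str
    tool_calls: list = field(default_factory=list)


def strip_disclaimer(value: Any) -> Any:
    """Recursively remove the disclaimer from strings nested in tool arguments."""
    if isinstance(value, str):
        return value.replace(DISCLAIMER, "").strip()
    if isinstance(value, dict):
        return {k: strip_disclaimer(v) for k, v in value.items()}
    if isinstance(value, list):
        return [strip_disclaimer(v) for v in value]
    return value


def finalize(turn: ModelTurn, regulated: bool) -> ModelTurn:
    """Add the disclaimer to the user-facing surface only; scrub it from tool arguments."""
    text = turn.user_text
    if regulated and DISCLAIMER not in text:
        text = f"{text}\n\n{DISCLAIMER}"
    calls = [
        {**call, "arguments": strip_disclaimer(call.get("arguments", {}))}
        for call in turn.tool_calls
    ]
    return ModelTurn(user_text=text, tool_calls=calls)
```

The design choice here is that the compliance rule lives at a surface boundary the code controls, so a calendar description or email body never depends on the model guessing which channel the rule was meant for.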

The Budget Cap That Fires After the Action Already Shipped

· 9 min read
Tian Pan
Software Engineer

A single power user burns through your monthly token budget by 9am on day three. The kill-switch fires correctly — the gateway returns 429, the model calls stop, the bill flatlines. Meanwhile the agent has already booked the flight, sent the email confirmation, and closed the support ticket as resolved. The dashboard says "spend halted." The user says "why did you charge me for a trip I never asked for." Both are right. The budget cap stopped the model from thinking. It did not stop the world from changing.

This is the failure mode that almost every agent budget guardrail ships with: the cap is a signal in the spend plane, but the damage lives in the action plane, and the two planes were wired up with no shared transaction boundary. Telling the model to stop is not the same as telling the world to undo what the model just did.
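A rough sketch of what a shared boundary between the two planes could look like: check the cap before each side effect, and register a compensating action so a tripped cap can unwind what already ran. `ActionLedger`, `run_plan`, and the dollar estimates are hypothetical, and the sketch assumes every action has a compensation, which is not true of something like a sent email.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


class BudgetExceeded(Exception):
    """Raised before an action runs if it would push spend past the cap."""


# (estimated cost in USD, action, compensating action)
Step = Tuple[float, Callable[[], None], Callable[[], None]]


@dataclass
class ActionLedger:
    limit_usd: float
    spent_usd: float = 0.0
    undo_stack: List[Callable[[], None]] = field(default_factory=list)

    def run(self, est_cost_usd: float, action: Callable[[], None],
            compensate: Callable[[], None]) -> None:
        # The cap is checked in the action plane, before the world changes.
        if self.spent_usd + est_cost_usd > self.limit_usd:
            raise BudgetExceeded(f"cap of {self.limit_usd} USD would be exceeded")
        self.spent_usd += est_cost_usd
        action()
        self.undo_stack.append(compensate)

    def rollback(self) -> None:
        # Undo completed actions in reverse order.
        while self.undo_stack:
            self.undo_stack.pop()()


def run_plan(ledger: ActionLedger, steps: List[Step]) -> bool:
    """Run all steps; if the cap trips midway, compensate what already ran."""
    try:
        for cost, action, compensate in steps:
            ledger.run(cost, action, compensate)
    except BudgetExceeded:
        ledger.rollback()
        return False
    return True
```

Whether a tripped cap should roll back the whole plan or only block further actions is a product decision; the point of the sketch is that the decision gets made explicitly instead of by accident.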

The Streaming Rollback Problem: You Can't Un-Say a Token

· 10 min read
Tian Pan
Software Engineer

Watch someone use a chat product for the first time and you'll notice they start reading before the model finishes. That reading-as-it-appears behavior is the entire reason streaming exists: it turns a multi-second wait into something that feels like a conversation. It is also the reason your output guardrails are quietly broken.

Here is the uncomfortable sequence. The model generates token 1, token 2, token 150. Each one is rendered the instant it arrives. At token 200, the model produces a hallucinated dosage, a leaked email address, or a sentence that violates your content policy. Your output-side guardrail fires correctly and immediately. But "immediately" is too late — the user has already read 200 tokens. You cannot un-render them. The guardrail did its job, and the violation still reached a human being.
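One common mitigation is a holdback window: keep the most recent tokens unrendered until the guardrail has seen them in context. The sketch below assumes a synchronous `is_violation` callable (hypothetical) and a made-up default window of 20 tokens. It narrows the exposure but does not remove it, because a violation that only becomes apparent after the window has passed has still been shown.

```python
from typing import Callable, Iterable, Iterator, List


def stream_with_holdback(
    tokens: Iterable[str],
    is_violation: Callable[[str], bool],
    holdback: int = 20,
) -> Iterator[str]:
    """Yield tokens to the UI, keeping the newest `holdback` tokens unreleased."""
    pending: List[str] = []   # generated but not yet rendered
    released = ""             # text the user has already seen
    for tok in tokens:
        pending.append(tok)
        # Check everything generated so far, including the unrendered tail.
        if is_violation(released + "".join(pending)):
            return  # stop the stream; the pending tail is never rendered
        if len(pending) > holdback:
            out = pending.pop(0)
            released += out
            yield out
    # The stream ended cleanly and the full text was checked, so flush the tail.
    yield from pending
```

The trade-off is explicit: a larger `holdback` gives the guardrail more room to catch a violation before it renders, at the cost of added delay before the first visible token.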

Your Refusal Logs Are a Product Backlog in Disguise

· 9 min read
Tian Pan
Software Engineer

Every AI product team has a security dashboard somewhere showing refused requests. Filters triggered, jailbreaks blocked, policy violations caught. The operational teams look at it to make sure the guardrails are holding. Nobody else looks at it at all.

That's a mistake. The requests your AI refuses are the most concentrated, honest user research signal you have access to. A user who tries three different phrasings to get your product to do something it won't do is telling you, with extraordinary clarity, exactly what they want and can't have. Treating that signal as a security artifact rather than a product artifact is leaving the richest feedback you'll ever collect on the floor.

Soft Constraints vs. Hard Constraints in LLM Systems: Why the Mismatch Causes Real Failures

· 10 min read
Tian Pan
Software Engineer

Most LLM system failures don't come from the model being wrong. They come from the system being wrong about what the model can enforce. When you write "never reveal customer data" in a system prompt and treat that as equivalent to "revoke the database credential," you have introduced a category error that will eventually cause a security incident, a reliability failure, or a broken user experience — and you won't know which one until it happens in production.

The distinction between soft constraints and hard constraints is architectural, not stylistic. Getting it wrong doesn't produce style regressions. It produces breaches.
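A small illustration of the difference, using hypothetical names: the soft constraint is a sentence in the prompt, while the hard constraint is code that never hands the sensitive fields to the model in the first place.

```python
# Soft constraint: a request to the model, which it may or may not honor.
SYSTEM_PROMPT = "Never reveal customer data."

# Hard constraint: an allowlist enforced in code, outside the model's control.
# The field names are assumptions for illustration.
ALLOWED_FIELDS = frozenset({"order_id", "status", "eta"})


def fetch_for_model(record: dict, allowed: frozenset = ALLOWED_FIELDS) -> dict:
    """Return only allowlisted fields; everything else never enters the context."""
    return {key: value for key, value in record.items() if key in allowed}
```

With the second approach, no prompt injection or model error can leak a field the model never received.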

The Refusal Latency Tax: Why Layered Guardrails Eat Your p95 Budget

· 10 min read
Tian Pan
Software Engineer

A team I talked to recently built what they called a "defense in depth" pipeline for their AI assistant. An input classifier checked for prompt injection. A jailbreak filter scanned for adversarial patterns. The model generated a response. An output moderation pass scanned the result. A refusal detector checked whether the model had punted, and if so, a reformulation step re-asked the question with a softer framing. The eval suite said the prompt produced answers in 1.4 seconds. Real users were waiting 3.8 seconds at the median and over 9 seconds at the p95.

Every safety layer is a round trip. Every round trip has a network hop, a queue time, a model load, and a decode. When you stack them serially in front of and behind the generative call, the latency budget you priced your product on dissolves — and almost no one accounted for it during design review. Worse: the slowest, most expensive path through your pipeline is the one that triggers on safety-adjacent prompts, which is exactly the long tail your safety story exists to handle. You are silently subsidizing that tail from the average user's bill.
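To make the serial-stacking cost concrete, here is a minimal sketch comparing serial and concurrent execution of checks that do not depend on each other (for example, two classifiers that both read the same input). The `check` coroutine and its sleep durations are stand-ins for real network calls, not measurements from any system.

```python
import asyncio
import time
from typing import List, Tuple

Check = Tuple[str, float]  # (name, simulated round-trip time in seconds)


async def check(name: str, seconds: float) -> bool:
    """Stand-in for a guardrail call; sleeps to simulate a round trip."""
    await asyncio.sleep(seconds)
    return True


async def run_serial(checks: List[Check]) -> List[bool]:
    # Total latency is roughly the sum of all round trips.
    return [await check(name, seconds) for name, seconds in checks]


async def run_parallel(checks: List[Check]) -> List[bool]:
    # Total latency is roughly the slowest single round trip.
    return list(await asyncio.gather(*(check(n, s) for n, s in checks)))


def timed(runner, checks: List[Check]) -> Tuple[List[bool], float]:
    """Run a coroutine function and return its results plus elapsed seconds."""
    start = time.perf_counter()
    results = asyncio.run(runner(checks))
    return results, time.perf_counter() - start
```

This only helps for layers that are truly independent. An output moderation pass still has to wait for the generation it is moderating.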

Build vs Buy for Guardrails: The Moderation API Is Now on Your Safety-Critical Path

· 10 min read
Tian Pan
Software Engineer

The hosted moderation API you bought to ship faster is now a synchronous external dependency on your safety-critical path. That sentence isn't an opinion — it's the architecture diagram, redrawn honestly. On the day the vendor degrades, you have two choices and both of them are bad: fail open and the guardrail is useless precisely when something is probably wrong, or fail closed and a guardrail outage becomes a feature outage. Most teams discover which one they picked during the incident, not before.

The reason teams reach for a vendor here isn't laziness. Building a content classifier, a prompt-injection detector, and a PII redactor in-house looks like a six-month detour from the actual product, and the vendor has a free tier and a five-minute integration. The integration is genuinely fast. The architectural consequence is that a third party now sits in the request path of every user-facing generation, with availability, latency, and behavioral characteristics you don't control and didn't model.

This post is about treating that decision as an architectural one rather than a procurement one.
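One way to make the fail-open versus fail-closed choice before the incident is to encode it as an explicit parameter next to a timeout. The sketch below is an assumption-laden illustration: `moderate` stands for whatever vendor client call you use, and `FailMode` and `moderated` are hypothetical names.

```python
import concurrent.futures as cf
from enum import Enum
from typing import Callable


class FailMode(Enum):
    OPEN = "open"      # vendor failure => allow the request through
    CLOSED = "closed"  # vendor failure => block the request


def moderated(
    text: str,
    moderate: Callable[[str], bool],  # returns True if the text is allowed
    timeout_s: float,
    fail_mode: FailMode,
) -> bool:
    """Call the moderation dependency with a deadline and an explicit failure policy."""
    pool = cf.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(moderate, text)
    try:
        return future.result(timeout=timeout_s)
    except Exception:
        # Timeout or vendor error: apply the policy chosen at design time.
        return fail_mode is FailMode.OPEN
    finally:
        # Do not block the request on a hung vendor call.
        pool.shutdown(wait=False)
```

A call site such as `moderated(prompt, client_check, 0.3, FailMode.CLOSED)` then documents the decision in the code path itself, where a reviewer can see and challenge it.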

The Three Tastes of an AI Engineer: Why Prompts, Evals, and Guardrails Don't Live in the Same Head

· 11 min read
Tian Pan
Software Engineer

The three best AI engineers I have hired this year would all fail each other's interviews. The one who writes prompts that survive a model upgrade has never written a useful eval case in her life. The one who designs eval sets that catch the failures that matter writes prompts that other engineers refuse to extend. The one who designs guardrails that fail closed without choking the happy path has opinions about the other two that I cannot print here.

The job ladder calls all three of them "AI engineer." The calibration committee compares their promo packets as if they had been doing the same job. They have not.

The Validator Trap: How Post-Hoc Guards Rot Your Prompt From the Inside

· 9 min read
Tian Pan
Software Engineer

The first time a validator catches a bad LLM output, it feels like a win. The second time, you tweak the prompt to make the failure less likely. By the twentieth time, nobody on the team can explain why three paragraphs of the prompt exist — they are scar tissue from incidents long forgotten, and the model is spending more tokens reading warnings than reasoning about the actual task.

This is the validator trap. Every post-hoc guard you add — a JSON schema check, a regex, a content classifier, a second LLM-as-judge — exerts feedback pressure on the upstream prompt. The prompt grows defensive instructions to appease the guard, the guard in turn catches a new class of failure, and you add more instructions. Each iteration looks local and sensible. In aggregate, the system gets slower, more expensive, and measurably worse at the task you originally designed it for.

The Alignment Tax: When Safety Features Make Your AI Product Worse

· 9 min read
Tian Pan
Software Engineer

A developer asks your AI coding assistant to "kill the background process." A legal research tool refuses to discuss precedent on a case involving violence. A customer support bot declines to explain a refund policy because the word "dispute" triggered a content classifier. In each case, the AI was doing exactly what it was trained to do — and it was completely wrong.

This is the alignment tax: the measurable cost in user satisfaction, task completion, and product trust that your safety layer extracts from entirely legitimate interactions. Most AI teams treat it as unavoidable background noise. It isn't. It's a tunable product parameter — one that many teams are accidentally maxing out.
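Treating the tax as a tunable parameter implies measuring it. A minimal sketch, assuming you have classifier scores for a set of requests already labeled as legitimate (the scores and thresholds below are invented for illustration):

```python
from typing import Sequence


def false_refusal_rate(benign_scores: Sequence[float], threshold: float) -> float:
    """Fraction of known-legitimate requests that would be refused at this threshold."""
    if not benign_scores:
        return 0.0
    refused = sum(1 for score in benign_scores if score >= threshold)
    return refused / len(benign_scores)


# Hypothetical classifier scores for requests a human labeled as legitimate.
benign = [0.10, 0.40, 0.60, 0.90]
for threshold in (0.5, 0.8, 0.95):
    print(threshold, false_refusal_rate(benign, threshold))
```

Sweeping the threshold like this shows what each setting costs in refused legitimate requests, which is the number to weigh against the harmful requests the same setting catches.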

The Alignment Tax: Measuring the Real Cost of Shipping Safe AI

· 9 min read
Tian Pan
Software Engineer

Teams building production AI systems tend to discover the alignment tax the same way: someone files a latency complaint, someone else traces it to the moderation pipeline, and suddenly a previously invisible cost line becomes very visible. By that point, the safety layers have been stacked — refusal classifier, output filter, toxicity scorer, human-in-the-loop queue — and nobody measured any of them individually. Unpicking them is painful, expensive, and politically fraught because now it looks like you're arguing against safety.

The better path is to treat safety overhead as a first-class engineering metric from day one. The alignment tax is real, it's measurable, and it compounds. A 150ms guardrail check sounds fine until you chain three of them together in an agentic workflow and wonder why your 95th-percentile latency is at four seconds.
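Measuring each layer individually can start with something as small as a timing context manager. `LayerTimer` and `chained_overhead_ms` below are hypothetical helpers, and the number of agent steps in the arithmetic is an assumption chosen for illustration.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import Dict, Iterator, List


class LayerTimer:
    """Collects wall-clock samples per guardrail layer, in milliseconds."""

    def __init__(self) -> None:
        self.samples: Dict[str, List[float]] = defaultdict(list)

    @contextmanager
    def measure(self, layer: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            self.samples[layer].append((time.perf_counter() - start) * 1000)

    def totals_ms(self) -> Dict[str, float]:
        return {layer: sum(values) for layer, values in self.samples.items()}


def chained_overhead_ms(per_check_ms: float, checks_per_step: int, steps: int) -> float:
    """Serial guardrail overhead across an agentic workflow."""
    return per_check_ms * checks_per_step * steps


# Three 150 ms checks per step: 450 ms for one step, 1800 ms across four steps.
print(chained_overhead_ms(150, 3, 1), chained_overhead_ms(150, 3, 4))
```

Wrapping each check in `with timer.measure("pii"):` gives a per-layer breakdown, so a latency complaint can be traced to a specific layer instead of to "safety" as a whole.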

Designing AI Safety Layers That Don't Kill Your Latency

· 9 min read
Tian Pan
Software Engineer

Most teams reach for guardrails the same way they reach for logging: bolt it on, assume it's cheap, move on. It isn't cheap. A content moderation check takes 10–50ms. Add PII detection, another 20–80ms. Throw in output schema validation and a toxicity classifier and you're looking at 200–400ms of overhead stacked serially before a single token reaches the user. Combine that with a 500ms model response and your "fast" AI feature now feels sluggish.

The instinct to blame the LLM is wrong. The guardrails are the bottleneck. And the fix isn't to remove safety — it's to stop treating safety checks as an undifferentiated pile and start treating them as an architecture problem.