The Precision-Recall Tradeoff Hiding Inside Your AI Safety Filter
When teams deploy an AI safety filter, the conversation almost always centers on what it catches. Did it block the jailbreak? Does it flag hate speech? Can it detect prompt injection? These are the right questions for recall. They are almost never paired with the equally important question: what does it block that it shouldn't?
The answer is usually: a lot. And because most teams ship with the vendor's default threshold and never instrument false positives in production, they don't find out until users start complaining—or until they stop complaining, because they stopped using the product.
The precision-recall tradeoff is one of the oldest ideas in machine learning. Every engineer who has built a spam filter or a fraud detector has fought it. But when the same tradeoff shows up inside an LLM safety layer, teams frequently treat it as a solved problem—configured once, monitored never. This post is about why that assumption costs you users, and what a calibrated alternative looks like.
The Default Threshold Is Not Your Threshold
Content moderation and safety APIs ship with default operating points. Those defaults are calibrated against the provider's aggregate user distribution—which is not your user distribution. If you are building a medical information product, a legal research tool, a creative writing platform, or anything serving non-English speakers, the default threshold was not tuned for you.
The practical consequence shows up in the numbers. A large-scale benchmark study of over 80,000 safe prompts found that refusal rates on legitimate requests varied enormously across models: one model refused 99.8% of "hard safe" prompts (edge cases that look risky but aren't), while another refused only 6.7%. The Spearman correlation between a model's safety and its over-refusal rate across the study was 0.89: as models get safer, they over-refuse more, almost in lockstep. You cannot get one without paying for the other, at least with current approaches.
Commercial guardrail APIs show the same pattern. A 2025 evaluation of six major guardrail offerings—Azure Content Safety, Bedrock Guardrails, OpenAI Moderation API, and others—found that models optimized for adversarial recall carried significantly elevated false positive rates. The paper's title names the finding directly: there is no free lunch with guardrails. Models that catch more bad content also block more good content. The question is never whether this tradeoff exists. The question is whether you know where your product sits on it.
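Finding out is straightforward to start: score a human-labeled sample of your own traffic and compute both sides of the tradeoff at the threshold you actually run. A minimal sketch, assuming your guardrail returns a numeric risk score; the scores and the 0.5 default below are placeholder assumptions, not real measurements:

```python
import numpy as np

def operating_point(harmful_scores, benign_scores, threshold):
    """Where the filter sits on the tradeoff: recall on known-harmful
    prompts versus false positive rate on known-benign prompts."""
    recall = float(np.mean(np.asarray(harmful_scores) >= threshold))
    fpr = float(np.mean(np.asarray(benign_scores) >= threshold))
    return recall, fpr

# In practice these come from running your guardrail over a labeled
# sample of real traffic; the values here are placeholders.
recall, fpr = operating_point(
    harmful_scores=[0.91, 0.84, 0.42, 0.77, 0.96],
    benign_scores=[0.08, 0.55, 0.12, 0.61, 0.05],
    threshold=0.5,  # a stand-in for the vendor default
)
print(f"recall={recall:.2f}  false_positive_rate={fpr:.2f}")
```

Run the same measurement at several candidate thresholds and you have your product's slice of the curve, not the vendor's.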
What a False Positive Actually Costs
The typical framing treats a false positive as a minor inconvenience—a user gets an error, they rephrase, the interaction continues. At small scale, this is plausible. At production scale, it is not.
Consider what happens when a safety filter blocks a business search query because it pattern-matches on words that appear in harmful content. The user does not see a transparent explanation. They see a refusal. They might retry once. If the block is systematic—if an entire category of their legitimate queries hits a rule that was never calibrated for their use case—they churn. You do not have a retention metric that captures "left because the AI was too cautious," because no one reports that to support.
The false-positive problem is also a support cost problem. False positives that are visible enough to reach appeals generate the most expensive support interactions: users who are confident they did nothing wrong, who are right, and who want someone to explain why the system failed them. These interactions consume agent time, damage trust, and skew NPS in ways that attribution systems rarely connect back to the classifier.
A more alarming version of this failure mode has emerged from security research: adversarial inputs of around 30 characters can be crafted to make a safety classifier block 97% or more of subsequent legitimate user requests. This is a denial-of-service attack that operates through the filter rather than around it, not by bypassing safety checks but by triggering false positives en masse. The attack works because safety classifiers are tuned for recall, not for robustness to precision attacks. A newer model generation that is "safer" by recall metrics can be more vulnerable to this kind of abuse, not less.
The Benchmark Calibration Trap
One reason teams are overconfident about their safety layer is that the benchmark numbers look good. A classifier scoring 85% accuracy on the standard evaluation set feels like a solved problem.
It is not. One guardrail model that achieves 85.3% accuracy on known benchmark distributions drops to 33.8% on novel, out-of-distribution prompts, a gap of 51.5 points. The benchmark measured performance on the kinds of prompts the model had seen during training and evaluation. Real production traffic is different. Users phrase things differently. Domains shift. New slang and circumlocutions appear weekly.
Static benchmarks go stale. A safety classifier tuned against a fixed test set will overfit to that distribution. Its recall numbers on the benchmark will stay high while its real-world performance drifts in both directions: missing new patterns of abuse and blocking new patterns of legitimate use. The benchmark score becomes a measure of how good you are at the benchmark, not how good you are at the job.
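One inexpensive guard against staleness is to score the same classifier on the static benchmark and on a freshly labeled sample of recent production traffic, and watch the gap between the two. A minimal sketch, with placeholder arrays standing in for both labeled sets:

```python
import numpy as np

def accuracy(scores, labels, threshold=0.5):
    """Fraction of items where (score >= threshold) matches the label."""
    return float(np.mean((np.asarray(scores) >= threshold) == np.asarray(labels)))

# The benchmark pair is your static test set; the fresh pair is a
# recently human-labeled sample of production traffic. Placeholders here.
bench_scores, bench_labels = [0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]
fresh_scores, fresh_labels = [0.6, 0.3, 0.7, 0.2], [1, 1, 0, 0]

gap = accuracy(bench_scores, bench_labels) - accuracy(fresh_scores, fresh_labels)
print(f"benchmark-to-production accuracy gap: {gap:.0%}")
```

A widening gap is the staleness signal: the benchmark number has stopped tracking the job.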
This is why per-segment measurement matters more than aggregate accuracy. A classifier that looks acceptable at the macro level can be systematically miscalibrated for specific user populations, languages, or query types—and the aggregate masks it.
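The bookkeeping for per-segment measurement is small. A sketch using pandas, assuming a human-labeled audit sample of production traffic; the column names are illustrative, not a schema from any particular tool:

```python
import pandas as pd

# Placeholder frame; in practice this is a labeled audit sample of
# production traffic, with ground truth from human review.
audits = pd.DataFrame({
    "language": ["en", "en", "es", "es", "de", "de"],
    "label":    ["benign"] * 6,
    "blocked":  [False, False, True, False, True, True],
})

# False positive rate per segment: the share of benign traffic blocked.
benign = audits[audits["label"] == "benign"]
print(benign.groupby("language")["blocked"].mean())
```

Any segment whose false positive rate sits well above the aggregate is a candidate for its own operating point.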
Threshold Calibration Is a Product Decision
The operating threshold for a safety classifier is not a technical parameter to be set once and forgotten. It is a product decision that encodes your organization's judgment about the relative cost of false positives and false negatives. That judgment should be explicit, documented, and revisited as your product and user base evolve.
The mechanics of calibration are well understood, and several approaches work in practice: calibrate the blocking threshold on a labeled sample of your own traffic rather than accepting the vendor default; set separate operating points per segment, language, or use case; route borderline scores to a second-stage check or human review instead of hard-blocking outright; and instrument the production false positive rate through sampled audits and appeals.
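Here is a minimal sketch of the first approach, calibrating against an explicit false positive budget. The scores, the 2% default, and the 25% budget in the example are illustrative assumptions, not recommendations:

```python
import numpy as np

def calibrate_threshold(benign_scores, harmful_scores, max_fpr=0.02):
    """Lowest blocking threshold whose false positive rate on a benign
    validation sample stays within budget. Lower thresholds block more,
    so the lowest feasible threshold maximizes recall under the budget."""
    benign = np.asarray(benign_scores)
    harmful = np.asarray(harmful_scores)
    for t in np.unique(np.concatenate([benign, harmful])):  # ascending
        if np.mean(benign >= t) <= max_fpr:
            return float(t), float(np.mean(harmful >= t))  # threshold, recall
    # No observed score satisfies the budget: block nothing.
    return float(benign.max()) + 1e-9, 0.0

# Placeholder scores; real ones come from your guardrail over labeled data.
t, recall = calibrate_threshold(
    benign_scores=[0.05, 0.12, 0.08, 0.55, 0.61, 0.20, 0.30, 0.10],
    harmful_scores=[0.91, 0.84, 0.42, 0.77],
    max_fpr=0.25,  # an illustrative budget, not a recommendation
)
print(f"threshold={t:.2f}  recall_at_budget={recall:.2f}")
```

The output is only as good as the validation sample: it has to come from your own traffic, cover your segments, and be refreshed as the distribution drifts, which is exactly why the threshold is a standing product decision rather than a launch-day constant.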
Sources
- https://arxiv.org/abs/2504.00441
- https://arxiv.org/abs/2410.02916
- https://arxiv.org/html/2405.20947v5
- https://aclanthology.org/2024.emnlp-main.1022/
- https://www.activefence.com/blog/beyond-precision-recall-evaluating-detection-models/
- https://learn.microsoft.com/en-us/azure/ai-services/content-safety/how-to/improve-performance
- https://developers.openai.com/cookbook/examples/how_to_use_guardrails
- https://www.techpolicy.press/how-measurement-can-fix-content-moderations-language-equity-gap/
