The Precision-Recall Tradeoff Hiding Inside Your AI Safety Filter
When teams deploy an AI safety filter, the conversation almost always centers on what it catches. Did it block the jailbreak? Does it flag hate speech? Can it detect prompt injection? These are the right questions for recall. They are almost never paired with the equally important question: what does it block that it shouldn't?
The answer is usually: a lot. And because most teams ship with the vendor's default threshold and never instrument false positives in production, they don't find out until users start complaining—or until they stop complaining, because they stopped using the product.
The precision-recall tradeoff is one of the oldest ideas in machine learning. Every engineer who has built a spam filter or a fraud detector has fought it. But when the same tradeoff shows up inside an LLM safety layer, teams frequently treat it as a solved problem—configured once, monitored never. This post is about why that assumption costs you users, and what a calibrated alternative looks like.
The Default Threshold Is Not Your Threshold
Content moderation and safety APIs ship with default operating points. Those defaults are calibrated against the provider's aggregate user distribution—which is not your user distribution. If you are building a medical information product, a legal research tool, a creative writing platform, or anything serving non-English speakers, the default threshold was not tuned for you.
The practical consequence shows up in the numbers. A large-scale benchmark study of over 80,000 safe prompts found that refusal rates on legitimate requests varied enormously across models: one model refused 99.8% of "hard safe" prompts (edge cases that look risky but aren't), while another refused only 6.7%. The Spearman correlation between a model's safety and its over-refusal rate across the study was 0.89—which means tighter safety and higher false positives move together almost perfectly. You cannot get one without paying for the other, at least with current approaches.
Commercial guardrail APIs show the same pattern. A 2025 evaluation of six major guardrail offerings—Azure Content Safety, Bedrock Guardrails, OpenAI Moderation API, and others—found that models optimized for adversarial recall carried significantly elevated false positive rates. The paper's title names the finding directly: there is no free lunch with guardrails. Models that catch more bad content also block more good content. The question is never whether this tradeoff exists. The question is whether you know where your product sits on it.
What a False Positive Actually Costs
The typical framing treats a false positive as a minor inconvenience—a user gets an error, they rephrase, the interaction continues. At small scale, this is plausible. At production scale, it is not.
Consider what happens when a safety filter blocks a business search query because it pattern-matches on words that appear in harmful content. The user does not see a transparent explanation. They see a refusal. They might retry once. If the block is systematic—if an entire category of their legitimate queries hits a rule that was never calibrated for their use case—they churn. You do not have a retention metric that captures "left because the AI was too cautious," because no one reports that to support.
The false-positive problem is also a support cost problem. False positives that are visible enough to reach appeals generate the most expensive support interactions: users who are confident they did nothing wrong, who are right, and who want someone to explain why the system failed them. These interactions consume agent time, damage trust, and skew NPS in ways that attribution systems rarely connect back to the classifier.
There is a more alarming version of this failure mode that has emerged from security research. Adversarial inputs of around 30 characters can be crafted that cause safety classifiers to block 97% or more of subsequent legitimate user requests. This is a denial-of-service attack that operates through the filter rather than around it—not by bypassing safety checks, but by triggering false positives en masse. The attack works because safety classifiers are tuned for recall, not for robustness to precision attacks. A newer model generation that is "safer" by recall metrics can be more vulnerable to this kind of abuse, not less.
The Benchmark Calibration Trap
One reason teams are overconfident about their safety layer is that the benchmark numbers look good. A classifier scoring 85% accuracy on the standard evaluation set feels like a solved problem.
It is not. One guardrail model that achieves 85.3% accuracy on known benchmark distributions drops to 33.8% on novel, out-of-distribution prompts—a 51.5-point gap. The benchmark measured performance on the kinds of prompts the model had seen during training and evaluation. Real production traffic is different. Users phrase things differently. Domains shift. New slang and circumlocutions appear weekly.
Static benchmarks go stale. A safety classifier tuned against a fixed test set will overfit to that distribution. Its recall numbers on the benchmark will stay high while its real-world performance drifts in both directions: missing new patterns of abuse and blocking new patterns of legitimate use. The benchmark score becomes a measure of how good you are at the benchmark, not how good you are at the job.
This is why per-segment measurement matters more than aggregate accuracy. A classifier that looks acceptable at the macro level can be systematically miscalibrated for specific user populations, languages, or query types—and the aggregate masks it.
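A minimal sketch of the masking effect, using synthetic labeled traffic (in practice each row would be a labeled production request recording whether it was blocked and whether it was actually harmful):

```python
# Sketch: why an aggregate false positive rate hides per-segment
# miscalibration. The rows below are synthetic stand-ins for labeled
# production traffic.
import pandas as pd

rows = (
    [("en", True,  True)]  * 5    # correct blocks
  + [("en", True,  False)] * 4    # false positives
  + [("en", False, False)] * 81   # correct passes
  + [("pt", True,  True)]  * 1
  + [("pt", True,  False)] * 3    # false positives
  + [("pt", False, False)] * 6
)
df = pd.DataFrame(rows, columns=["segment", "blocked", "harmful"])

def fpr(g):
    """Fraction of benign requests that were blocked."""
    benign = g[~g["harmful"]]
    return benign["blocked"].mean()

aggregate = fpr(df)                                            # ~7.4%
per_segment = df.groupby("segment")[["blocked", "harmful"]].apply(fpr)
```

Here the aggregate rate looks tolerable while the "pt" segment runs a false positive rate roughly seven times the "en" segment's, which is exactly the situation an overall metric conceals.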
Threshold Calibration Is a Product Decision
The operating threshold for a safety classifier is not a technical parameter to be set once and forgotten. It is a product decision that encodes your organization's judgment about the relative cost of false positives and false negatives. That judgment should be explicit, documented, and revisited as your product and user base evolve.
The mechanics of calibration are well understood. Several approaches work in practice:
Platt scaling fits a sigmoid function over raw classifier scores. It works well when the miscalibration between score and probability is smooth and monotonic, which covers a large fraction of real cases.
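Platt scaling amounts to logistic regression on the single raw-score feature. A minimal sketch on synthetic held-out scores (real calibration data would come from your labeled traffic):

```python
# Sketch of Platt scaling: fit p = sigmoid(A * score + B) over raw
# classifier scores. Scores and labels are synthetic stand-ins for a
# held-out calibration set.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
scores_benign  = rng.normal(-1.0, 1.0, 500)   # raw scores, not probabilities
scores_harmful = rng.normal( 1.5, 1.0, 500)
scores = np.concatenate([scores_benign, scores_harmful])
labels = np.concatenate([np.zeros(500), np.ones(500)])

# Platt scaling is logistic regression on the score as the only feature
platt = LogisticRegression().fit(scores.reshape(-1, 1), labels)
calibrated = platt.predict_proba(scores.reshape(-1, 1))[:, 1]
```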
Isotonic regression is more flexible than Platt scaling and handles any monotonically distorted score-to-probability mapping. It needs more calibration samples—at least a few thousand labeled examples to be reliable—but is the better default when data is available.
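The extra flexibility matters when the score-to-probability distortion is monotone but not sigmoid-shaped. A sketch with a deliberately non-sigmoid synthetic distortion:

```python
# Sketch of isotonic regression calibration. The "true" probability of
# harm here is a cubic (monotone, non-sigmoid) function of the score,
# which Platt scaling would fit poorly. Data is synthetic.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(1)
scores = rng.uniform(0, 1, 5000)
p_true = scores ** 3                                   # monotone distortion
labels = (rng.uniform(0, 1, 5000) < p_true).astype(int)

iso = IsotonicRegression(out_of_bounds="clip").fit(scores, labels)
calibrated = iso.predict(scores)
```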
Cost-sensitive thresholding starts from an explicit cost ratio: how many false positives is one false negative worth? If you decide a missed harmful output is ten times worse than a false block, you can find the threshold on your ROC curve that minimizes expected cost at that ratio. Youden's J statistic (the point that maximizes true positive rate minus false positive rate) is a reasonable starting point when costs are roughly symmetric; minimum-distance-to-corner is better when they are not.
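The cost-sensitive selection above can be sketched directly. The data is synthetic and well calibrated by construction; in practice the probabilities come from your calibrated classifier and the labels from your evaluation set:

```python
# Sketch: turning an explicit cost ratio into an operating threshold,
# and comparing it with Youden's J. The 10:1 ratio is an example of a
# product decision, not a recommended constant.
import numpy as np

rng = np.random.default_rng(2)
probs = rng.uniform(0, 1, 20000)
labels = (rng.uniform(0, 1, 20000) < probs).astype(int)

COST_FN = 10.0   # a missed harmful output is ten times worse...
COST_FP = 1.0    # ...than a false block

thresholds = np.linspace(0.01, 0.99, 99)

def expected_cost(t):
    fp = ((probs >= t) & (labels == 0)).sum()
    fn = ((probs <  t) & (labels == 1)).sum()
    return COST_FP * fp + COST_FN * fn

def youden_j(t):
    tpr = ((probs >= t) & (labels == 1)).sum() / (labels == 1).sum()
    fpr = ((probs >= t) & (labels == 0)).sum() / (labels == 0).sum()
    return tpr - fpr

best_cost = min(thresholds, key=expected_cost)  # near COST_FP/(COST_FP+COST_FN)
best_j    = max(thresholds, key=youden_j)       # ignores the cost ratio
```

With calibrated probabilities, the cost-minimizing threshold lands near COST_FP / (COST_FP + COST_FN), about 0.09 here, well below the roughly symmetric operating point Youden's J picks.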
None of these approaches work without labeled data for your specific context. That means building an evaluation set that reflects your actual traffic—not a generic benchmark, but a dataset of real queries from real users, labeled by whether the block was correct. This is the step most teams skip because it requires manual annotation effort. It is also the step that makes calibration meaningful.
Per-Segment Monitoring as an Early Warning System
A single overall false positive rate hides the distribution of who is being hurt. The right monitoring structure tracks false positive and false negative rates broken out by:
- Language: Safety classifiers trained on English-dominant data systematically over-flag or under-flag non-English content. A model might have a 5% false positive rate overall while running a 25% false positive rate on Portuguese or Arabic queries. Users in those language communities experience a much worse product, and the aggregate metric conceals it entirely.
- Content category: A medical information product has different exposure than a general-purpose chatbot. The categories that generate false positives for you are not the same categories that generate false positives for someone else. Your monitoring needs to match your domain.
- User cohort: New users, power users, and users who recently appealed a decision all have different behavioral patterns. A false positive rate that is acceptable for a general population may be catastrophic for your highest-value users.
- Query type: If your product has different interaction modes (search, generation, summarization, multi-turn conversation), false positive rates can differ substantially across them. An input filter that works well for single-turn queries can compound into a 10% refusal rate in multi-turn sessions: at a 2% per-turn false positive rate, roughly one in ten five-turn conversations hits at least one spurious refusal.
The metric to anchor on is not block rate in isolation. It is block rate plus appeal success rate plus user-reported false positive rate from feedback signals. When those three metrics diverge—when block rate is high but appeal rate is low—you may have users who are not appealing because they have already given up.
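The divergence check described above can be sketched as a simple per-segment heuristic. The thresholds and field names here are illustrative assumptions, not a standard:

```python
# Sketch: flag segments where the block rate is high but both
# user-side signals (appeals, reported false positives) are quiet,
# the pattern consistent with users giving up rather than appealing.
# All threshold values are illustrative.
def silent_churn_flags(segments, block_high=0.05, appeal_low=0.02,
                       reported_low=0.01):
    """segments maps name -> (block_rate, appeal_rate, reported_fp_rate)."""
    return [
        name
        for name, (block, appeal, reported) in segments.items()
        if block > block_high and appeal < appeal_low and reported < reported_low
    ]

flags = silent_churn_flags({
    "en": (0.03, 0.010, 0.004),   # block rate low: no flag
    "pt": (0.12, 0.004, 0.002),   # blocks high, signals quiet: flag
})
```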
The Calibration Loop in Practice
The practical workflow for teams that take this seriously looks like this:
First, establish a baseline by sampling blocked requests and labeling a random subset as correct blocks or false positives. This gives you an actual false positive rate rather than an estimated one.
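Because the baseline comes from a sample, it should carry an uncertainty estimate. A sketch using a Wilson score interval (the sample counts are illustrative):

```python
# Sketch: estimate the false positive rate from a labeled random sample
# of blocked requests, with a 95% Wilson score interval so the
# uncertainty of the estimate is explicit.
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score confidence interval for a proportion successes/n."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# e.g. 200 blocked requests sampled for labeling, 38 judged false positives
lo, hi = wilson_interval(38, 200)   # roughly (0.14, 0.25)
```

An interval that wide is itself informative: a point estimate of 19% could plausibly be anywhere from 14% to 25%, which tells you whether you need a larger labeling pass before acting on it.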
Second, build separate calibration datasets for each major segment that matters to your product. If you serve users in multiple languages, you need calibration data in each language. A single pooled dataset produces a threshold that is systematically wrong for your minority segments.
Third, set your threshold based on explicit cost assumptions, not the vendor default. Document what ratio of false positives to false negatives you are accepting and why. Review this ratio quarterly or when the product context changes.
Fourth, monitor false positive rate and false negative rate as first-class production metrics alongside latency and error rate. Safety filter performance is not a one-time audit; it is an ongoing operational concern.
Fifth, refresh your evaluation set as your traffic evolves. The users who are getting blocked six months from now will be using language and asking questions that do not look like your current labeled data. Evaluation set staleness is a slow, invisible form of calibration drift.
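One hedged way to detect that staleness is to compare the classifier's score distribution on current traffic against a snapshot taken when the evaluation set was labeled. A sketch using a two-sample Kolmogorov-Smirnov test on synthetic data:

```python
# Sketch: evaluation-set staleness check. A significant shift in the
# score distribution means the eval set no longer reflects what the
# filter actually sees. Both samples are synthetic; the beta
# parameters are arbitrary illustrations of drift.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(3)
scores_at_labeling_time = rng.beta(2.0, 5.0, 5000)  # snapshot then
scores_now = rng.beta(2.6, 4.0, 5000)               # traffic today

stat, p_value = ks_2samp(scores_at_labeling_time, scores_now)
eval_set_stale = p_value < 0.01   # distribution shifted: refresh labels
```

A distribution test on scores is only a proxy (it catches drift in what the classifier sees, not drift in what labels should be), but it is cheap enough to run continuously.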
What This Looks Like When It Goes Right
The teams that get this right do not necessarily have better classifiers. They have better measurement. They know their operating point on the precision-recall curve. They have made an explicit choice about where to sit on that curve for each major user segment. They treat threshold revision the same way they treat any other product configuration change: with a change control process, a rollout plan, and a metrics review.
The teams that get it wrong ship with the vendor default, monitor aggregate block rate, and discover their false positive problem through support escalations—which is the most expensive and user-damaging possible detection mechanism.
Safety classifiers are not unique in having this tradeoff. But they are unusual in how rarely teams measure the cost of the tradeoff they have already made. The precision-recall curve for your safety filter exists whether you measure it or not. The only question is whether you find out where you are on it before or after your users do.
References
- https://arxiv.org/abs/2504.00441
- https://arxiv.org/abs/2410.02916
- https://arxiv.org/html/2405.20947v5
- https://aclanthology.org/2024.emnlp-main.1022/
- https://www.activefence.com/blog/beyond-precision-recall-evaluating-detection-models/
- https://learn.microsoft.com/en-us/azure/ai-services/content-safety/how-to/improve-performance
- https://developers.openai.com/cookbook/examples/how_to_use_guardrails
- https://www.techpolicy.press/how-measurement-can-fix-content-moderations-language-equity-gap/
