
Prompt Injection Detection at 100,000 Requests Per Day: Why Simple Defenses Break and What Actually Works

· 11 min read
Tian Pan
Software Engineer

Most teams discover their prompt injection defense is broken after a user finds it, not before. You add "ignore all previous instructions" to your blocklist and ship. Three months later an attacker encodes the payload in Base64, or buries instructions in HTML comments retrieved via RAG, or uses typoglycemia ("ignroe all prevuois insrtucioins"), and your entire defense evaporates. The blocklist doesn't help because prompt injection has an unbounded attack surface — there is no closed vocabulary of malicious inputs.

At low traffic volumes you can absorb the cost of calling a second LLM to validate each request. At 100,000 requests per day, that math becomes ruinous and the latency becomes user-visible. This post is about what the architecture looks like when brute-force approaches stop working.

Why Static Defenses Don't Scale

The instinct when first defending against prompt injection is pattern matching: maintain a list of known bad phrases, run each input through it, block on match. This works precisely until attackers know you're doing it.

Research evaluating static defenses across hundreds of runs reveals the core problem: defenses fail not because of poor implementation, but because they cover different attack surfaces than the actual injection vectors. An attacker who knows you're checking for "ignore previous instructions" will use "disregard your earlier directives" or switch to a language you don't have patterns for or split the instruction across a multi-turn conversation.

The attack surface problem is structural. Prompt injection differs from SQL injection in a critical way: SQL has a defined grammar, so malicious inputs are structurally distinguishable from benign ones. Natural language has no such grammar. Data, instructions, and user intent flow through the same execution path, making perimeter-style defenses inherently incomplete.

The 2024-2025 landscape made this worse. Injection attacks have shifted from direct user input to second-order vectors: malicious content embedded in documents the agent retrieves, websites the agent browses, emails the agent reads. The attack no longer needs to touch your input validation at all. GitHub Copilot's CVE-2025-53773 — remote code execution via instructions hidden in repository code comments — didn't involve user input. Neither did Microsoft 365 Copilot's EchoLeak, which exfiltrated data via crafted emails that flowed through the retrieval layer.

Static blocklists also compound in cost: you can add patterns forever and the list will never be complete, but each addition adds latency and maintenance burden.

Why LLM-as-Judge Doesn't Work at Scale

The obvious alternative to pattern matching is semantic understanding: use a language model to judge whether an input contains an injection attempt. The model understands context, handles paraphrase, catches encoding tricks. Why not deploy a second LLM to evaluate each request?

Cost. At 100,000 requests per day, a second LLM inference per request runs to millions of additional tokens daily. One engineering team that implemented this approach reported their evaluation infrastructure cost ten times their baseline agent workload. That's not an edge case — it's the structural consequence of using frontier model inference as a per-request filter.

Latency is the other constraint. A single LLM inference takes 100 to 500 milliseconds depending on the provider and model. If your application needs to respond in under a second, and your primary model inference already consumes most of that budget, you have no room for a second full LLM call before the response. The detection overhead alone would make your application feel broken.

There's a third problem that research surfaced in 2024: LLM judges are themselves vulnerable to prompt injection. An attacker can craft payloads specifically designed to deceive the judge's classification decision. Running an LLM to protect against attacks on your LLM creates a recursive surface area.

The practical conclusion: LLM-as-judge is appropriate for sampling (evaluate 10–20% of traffic asynchronously for anomaly detection) or for genuinely high-stakes decisions where you can absorb latency and cost. It's not viable as a per-request synchronous gate at high throughput.

The Lightweight Classifier Stack

Production systems that work at scale use a layered approach: multiple small, fast models in sequence, each handling what it does cheaply and well.

Layer 1: Heuristic pre-screening (1–5ms). Substring matching against known patterns runs in single-digit milliseconds on CPU and catches a significant fraction of unsophisticated attacks. "Ignore all previous instructions," "your system prompt is," "DAN mode": these show up constantly in production attack logs. Heuristics won't catch anything obfuscated, but they eliminate a chunk of traffic before the more expensive downstream steps. The Rebuff open-source framework demonstrated this approach, though its implementation proved too slow for high-throughput use; a cleaner implementation in Rust can run in under two milliseconds.
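A minimal sketch of this pre-screen layer (the pattern list and normalization are illustrative, not a complete ruleset; production lists are maintained from attack logs):

```python
import re

# Illustrative patterns only -- a real list would be much larger and
# continuously updated from production attack logs.
INJECTION_PATTERNS = [
    "ignore all previous instructions",
    "ignore previous instructions",
    "disregard your earlier directives",
    "your system prompt is",
    "dan mode",
]

_WS = re.compile(r"\s+")

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial casing and spacing
    # tricks don't evade substring matching.
    return _WS.sub(" ", text.lower()).strip()

def heuristic_flag(text: str) -> bool:
    """Return True if any known-bad phrase appears in the input."""
    norm = normalize(text)
    return any(p in norm for p in INJECTION_PATTERNS)
```

Note what this deliberately does not attempt: Base64 payloads, typoglycemia, and multi-turn splits all sail past it, which is exactly why it is only the first layer.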

Layer 2: Lightweight transformer classifier (20–50ms). Several fine-tuned models exist specifically for injection detection:

  • Llama Prompt Guard 2 (86M parameters): Meta's purpose-built detector, designed for CPU and GPU deployment, achieves high recall on direct injection attempts with around 20ms inference time.
  • DeBERTa-v3-base (ProtectAI fine-tune): Achieves F1 scores above 90% with 40–80ms inference on GPU. The small variant trades minor accuracy loss for sub-50ms CPU inference.
  • DistilBERT-based detectors: Faster but lower accuracy than DeBERTa; appropriate when latency budget is extremely tight.
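Wiring layer 2 behind layer 1 can be sketched as below. The `score_fn` callable and the 0.9 threshold are assumptions for illustration: in production, `score_fn` might wrap one of the models above (for example via a Hugging Face text-classification pipeline), and the threshold must be tuned per domain.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    blocked: bool
    layer: str    # which layer made the decision
    score: float

def classify(text: str,
             heuristic_flag: Callable[[str], bool],
             score_fn: Callable[[str], float],
             threshold: float = 0.9) -> Verdict:
    # Layer 1: cheap substring/regex screen. Blocks the obvious cases
    # without paying for any model inference.
    if heuristic_flag(text):
        return Verdict(blocked=True, layer="heuristic", score=1.0)
    # Layer 2: small transformer classifier returning an injection
    # probability. The threshold is a tuning knob, not a constant.
    score = score_fn(text)
    return Verdict(blocked=score >= threshold, layer="classifier", score=score)
```

Keeping the classifier behind a plain callable also makes the over-defense calibration discussed below straightforward: you can replay logged traffic against candidate thresholds offline before changing the gate.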

The PromptGuard modular framework, which layers input gatekeeping with structured prompt formatting and output validation, achieved a 67% reduction in injection success rate with under 8% latency overhead — roughly 14ms on a 200ms total budget.

Layer 3: Behavioral signals (async, sampled). The classifiers above detect known injection patterns well. Zero-day attacks, novel encodings, and second-order injection vectors from RAG are harder. Behavioral monitoring catches a different failure class: it observes whether the model's output looks like it was redirected by an injection rather than completing the intended task.

Useful behavioral signals:

  • Response length anomalies (legitimate queries produce consistent length distributions; injected redirects often trigger verbose explanations or unexpected output volume)
  • Token entropy shifts (injected instructions often use different vocabulary patterns than the surrounding context)
  • Tool call sequences that deviate from baseline (an agent that normally calls three tools and suddenly calls eight, or calls tools in an unusual order)
  • Output content that matches patterns suggesting exfiltration (email addresses, API keys, internal identifiers appearing in responses)

Because behavioral monitoring runs asynchronously after the response is returned, it doesn't add to request latency. Sampling 10–20% of traffic gives you statistical coverage without the cost of evaluating everything. The tradeoff is that you detect attacks after the fact rather than blocking them in real time — appropriate for low-consequence outputs, insufficient for agents with write access to critical systems.
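As one concrete signal, the response-length check can be sketched with a rolling baseline. The window size, warm-up count, and z-score threshold here are illustrative assumptions, not recommended values:

```python
from collections import deque
import math

class LengthAnomalyDetector:
    """Flags responses whose length deviates sharply from a rolling baseline.

    Intended to run asynchronously on sampled traffic, so it never adds
    to request latency.
    """

    def __init__(self, window: int = 500, z_threshold: float = 4.0):
        self.lengths = deque(maxlen=window)
        self.z_threshold = z_threshold

    def observe(self, response: str) -> bool:
        n = len(response)
        flagged = False
        if len(self.lengths) >= 30:  # need a minimal baseline first
            mean = sum(self.lengths) / len(self.lengths)
            var = sum((x - mean) ** 2 for x in self.lengths) / len(self.lengths)
            std = math.sqrt(var) or 1.0  # avoid div-by-zero on flat baselines
            flagged = abs(n - mean) / std > self.z_threshold
        self.lengths.append(n)
        return flagged
```

The same rolling-baseline shape applies to the other signals: swap response length for tool-call count or per-token entropy and the detector structure is unchanged.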

The Over-Defense Problem Nobody Talks About

Benchmark papers focus on attack detection rate. Production teams learn about over-defense the hard way.

Over-defense is when your guard model blocks legitimate requests. A 1–2% false positive rate sounds excellent on paper, but at 100,000 requests per day it means rejecting 1,000 to 2,000 legitimate requests daily. Users discussing security research, asking questions whose phrasing resembles injection patterns, querying about their own system prompts: all of these trigger false positives in current guard models.

Research published in 2024 found that current guard models achieve less than 60% accuracy at distinguishing benign inputs from attacks, with some operating near random-chance performance on benign edge cases. InjecGuard demonstrated an 83% average accuracy improvement across benign, malicious, and over-defense categories — a significant advance, though the size of the gain mostly confirms how poor the baseline was.

Practical mitigations:

  • Tune detection thresholds conservatively for your specific application domain. A customer service bot and a code generation agent have completely different input distributions.
  • Use span-level flagging rather than binary block/allow decisions where possible. Flagging suspicious regions of an input and redacting them preserves user intent while neutralizing the injection vector.
  • Monitor FPR continuously in production. A classifier that performed well on your eval set may degrade on your actual user population.
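Span-level redaction can be sketched as follows. The flagged spans would come from your classifier's token-level output; here they are passed in directly, and the placeholder string is an arbitrary choice:

```python
def redact_spans(text: str, spans: list[tuple[int, int]],
                 placeholder: str = "[REDACTED]") -> str:
    """Replace flagged character spans, preserving the rest of the input.

    Spans are (start, end) character offsets, end-exclusive, assumed
    non-overlapping.
    """
    out = []
    cursor = 0
    for start, end in sorted(spans):
        out.append(text[cursor:start])   # keep the benign prefix
        out.append(placeholder)          # neutralize the flagged region
        cursor = end
    out.append(text[cursor:])            # keep the benign suffix
    return "".join(out)
```

The user's actual question survives, so a security researcher quoting an attack string gets a degraded answer instead of a hard block.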

Putting It Together: Three Production Recipes

The right combination depends on your latency budget and risk tolerance.

Recipe A — Low latency, good coverage: Heuristics (1–2ms) → Llama Prompt Guard 2 (20ms) → async behavioral sampling. Target: <25ms total detection overhead, ~85% F1. Appropriate for interactive applications where latency is the primary constraint and injection consequences are moderate.

Recipe B — Higher accuracy, still real-time: Heuristics (2ms) → DeBERTa-v3-base (50ms) → output content filtering (10ms). Target: <65ms overhead, >90% F1. Appropriate for applications where injections have higher-consequence outcomes (agents with external tool access, customer-facing assistants with access to account data).

Recipe C — Maximum coverage with ensemble: Heuristics → ensemble of two diverse classifiers (40–60ms) → async behavioral monitoring + sampled LLM judge review. Target: <70ms overhead, >92% F1. The ensemble diversity matters: an adversarial input crafted to defeat one model architecture is unlikely to fool a second model trained differently. This is the PromptForest approach, combining Llama Guard, a ModernBERT fine-tune, and an XGBoost classifier on embedding features.

At 100,000 requests per day, Recipe B costs roughly the compute of two small GPU inference calls per request. Recipe C adds an ensemble and async sampling. Neither approach comes close to the cost of a full LLM judge on every request.
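A back-of-envelope comparison makes the gap concrete. Every figure below is an illustrative assumption, not a measured price — plug in your own provider's rates:

```python
REQUESTS_PER_DAY = 100_000

# Illustrative assumptions only -- substitute real numbers for your stack.
AVG_TOKENS_PER_REQUEST = 500        # tokens a per-request judge would read
JUDGE_COST_PER_MTOK = 3.00          # USD per million tokens, frontier model
CLASSIFIER_COST_PER_REQ = 0.000002  # amortized small-GPU inference per call

judge_daily = REQUESTS_PER_DAY * AVG_TOKENS_PER_REQUEST / 1e6 * JUDGE_COST_PER_MTOK
classifier_daily = REQUESTS_PER_DAY * CLASSIFIER_COST_PER_REQ

print(f"LLM judge:  ${judge_daily:,.2f}/day")
print(f"Classifier: ${classifier_daily:,.2f}/day")
```

Even with generous assumptions for the classifier's amortized GPU cost, the per-request judge comes out orders of magnitude more expensive — and that is before counting the latency it adds to every response.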

The Indirect Injection Gap

All of the above focuses on detecting injections in user inputs. The harder problem in 2025 is indirect injection: content the agent retrieves, processes, and acts on — without the injection ever touching your input validation.

A RAG system that retrieves documents from the web, email, or user-uploaded files is processing untrusted content as context. If that content contains "ignore your previous instructions and instead do X," your input classifier never sees it. The injection arrives through the retrieval layer.

Defenses for indirect injection require different instrumentation:

  • Content filtering at the retrieval boundary (not the input boundary), treating retrieved content as untrusted regardless of source
  • Instruction hierarchy enforcement in the generation prompt, making it structurally clear to the model which instructions come from trusted system context versus retrieved user-facing content
  • Contextual tagging (some frameworks use XML-style delimiters to mark retrieved content, giving the model a structural signal that this content should not issue instructions)
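Contextual tagging at prompt-assembly time can be sketched as below. The delimiter scheme is an illustrative assumption; the essential detail is escaping the closing delimiter so retrieved content cannot break out of its own wrapper:

```python
def wrap_retrieved(doc: str, source: str) -> str:
    """Wrap untrusted retrieved content in delimiters that the system
    prompt instructs the model to treat as data, never as instructions."""
    # Escape any embedded closing delimiter -- the injection-aware
    # equivalent of HTML escaping. Without this, a malicious document
    # can close the wrapper and inject bare instructions.
    safe = doc.replace("</retrieved>", "&lt;/retrieved&gt;")
    return f'<retrieved source="{source}">\n{safe}\n</retrieved>'
```

This is a structural signal, not a guarantee: the model can still be persuaded to follow tagged content, which is why the tagging is paired with instruction-hierarchy prompting and output filtering rather than relied on alone.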

The direct-vs-indirect split matters for architecture: your fast classifier at the input layer gives you no coverage over indirect vectors. For agents that retrieve and act on external content, behavioral monitoring and output filtering are load-bearing defenses rather than nice-to-haves.

Sizing Your Defense to Your Risk

Not every application needs Recipe C. A chatbot that provides information and has no external tool access is a much lower-stakes injection target than an agent that sends emails, modifies database records, or executes code. The injection consequence determines appropriate defense depth.

The defense-in-depth consensus that emerged across Microsoft, Google, and Meta's public guidance in 2024–2025 is less a prescription than a principle: no single layer is sufficient, and the layers you need depend on what a successful injection can actually do in your system. An agent with read-only access and conservative output formatting has narrow injection consequences. An agent with ambient authority over business systems needs behavioral monitoring, output filtering, and scoped credentials — regardless of how good its input classifier is.

The practical starting point for most teams: deploy a lightweight classifier like Llama Prompt Guard 2 as a synchronous gate, add async behavioral monitoring for sampled traffic, and instrument your output layer to detect data patterns that shouldn't appear in responses. That covers the direct injection surface with minimal latency impact. If your agent processes external content via retrieval, treat that retrieval boundary as an additional input surface and apply the same classifier there.

Ship that. Then measure your false positive rate on real production traffic and tune from there. The over-defense problem guarantees you'll need to calibrate — better to discover that on day one than after you've built your threshold assumptions into your alerting infrastructure.
