
The Prompt-Injection Bug Bounty: Scoping a Program When 'Broken' Has No Clear Definition

12 min read
Tian Pan
Software Engineer

Your security team runs a bug bounty that works. A CSRF gets paid. An XSS gets paid. An IDOR gets paid. The rules of engagement are sharp, the severity rubric is industry-standard, the triage queue moves, and the program produces a steady stream of fixed bugs. Then, last quarter, your AI team shipped a feature — a chat surface, an agent that calls tools, a RAG pipeline that pulls from customer data — and the question that lands on the security team's desk is "what's the bounty scope for this thing?" Nobody can answer.

The reason nobody can answer is that the standard bug bounty rubric was built around a system whose specified behavior is deterministic. A login endpoint either authenticates correctly or it doesn't. An access control check either holds or it doesn't. The AI feature you just shipped has no equivalent ground truth: its specified behavior is "respond helpfully to user input," and a researcher who makes it respond unhelpfully has not necessarily found a bug — they may have found something the model has always done, that nobody knew about, that you're not sure you can fix, and that may or may not reproduce on a second attempt.

Meanwhile the researcher community is producing thousands of free jailbreak demonstrations on social media every week, and your team has no funded path to channel any of that energy at your own product. HackerOne's 2025 numbers are the canary: validated AI vulnerability reports up 210%, prompt injection findings up 540%, programs with AI in scope up 270%. The talent and the volume are already there. What's missing is the program design that turns it into fixes instead of Twitter screenshots.

The Severity Collapse

The single most damaging shortcut a program can take is to lump every "AI safety" finding into one bucket. The bucket usually contains four wildly different things:

  1. The model said something rude or off-policy. A user got it to use a slur, write a recipe the brand guidelines forbid, or roleplay as a character the comms team finds embarrassing. Reputational risk, possibly content moderation cost, no data loss, no integrity violation.
  2. The system prompt leaked. Someone got the model to recite its instructions verbatim. OWASP added this as LLM07:2025 because it can expose internal logic, tool descriptions, content rules, and sometimes credentials embedded by mistake — but in itself, leaking "you are a helpful assistant for Acme" is closer to information disclosure than exploitation.
  3. Cross-tenant data was exfiltrated through prompt injection. A researcher embedded an instruction in a document, the model followed it, and tenant A's records ended up in tenant B's response. This is a real confidentiality breach with a measurable blast radius.
  4. The agent took a destructive action. A researcher coerced the agent to call a tool — delete a record, transfer funds, ship code, send an email under your domain — that it should not have called. CVE-2025-32711, the zero-click Microsoft 365 Copilot prompt injection, scored 9.3 on CVSS. Microsoft's research on Semantic Kernel showed prompt injection cleanly turning into host-level RCE in some agent frameworks.

These four sit at completely different points on the confidentiality, integrity, and availability axes. Treating them with the same severity, the same payout, and the same SLA produces the predictable outcome: the program is overrun by category-1 reports the AI team won't fix, the category-3 and category-4 findings get buried in the queue behind them, and the researchers who could be hunting the dangerous ones go elsewhere because they can see the queue is rigged for noise.

A working severity rubric anchors itself to the CIA triad and writes AI-specific examples for each tier:

  • Critical: cross-tenant data exfiltration; unauthorized destructive tool invocation; host or sandbox escape via the model; persistence of attacker instructions across sessions or users.
  • High: same-tenant data exposure outside the user's authorization scope; leak of credentials or secrets embedded in system prompts; bypass of stated content policies in a way that produces actionable harmful output (CSAM, malware code, weapons synthesis).
  • Medium: bypass of stated content policies that produces embarrassing-but-non-actionable output; leak of non-sensitive system prompts or tool descriptions.
  • Low or out of scope: the model is rude, refuses when it shouldn't, makes things up about benign topics, or gets jailbroken into a fictional persona that says forbidden things in a context with no real-world side effect.

CVSS alone won't do this work — OWASP's AIVSS proposal exists precisely because CVSS was designed to score code, not agent behavior — but you can mostly cover the gap by writing the rubric in plain language and binding payouts to it.
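One way to keep a plain-language rubric from drifting into aspiration is to encode it as data the triage tooling reads. A minimal sketch in Python; the tier examples come from the rubric above, and the payout bands are placeholders, not recommendations:

```python
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"  # low / out-of-scope: triage auto-closes with a canned response

@dataclass(frozen=True)
class Tier:
    severity: Severity
    examples: tuple[str, ...]    # plain-language anchors for triage to match against
    payout_usd: tuple[int, int]  # placeholder band, not a recommendation

RUBRIC = (
    Tier(Severity.CRITICAL,
         ("cross-tenant data exfiltration",
          "unauthorized destructive tool invocation",
          "host or sandbox escape via the model",
          "attacker instructions persist across sessions or users"),
         (20_000, 50_000)),
    Tier(Severity.HIGH,
         ("same-tenant data exposure beyond the user's authorization scope",
          "credential or secret leak from a system prompt",
          "content-policy bypass producing actionable harmful output"),
         (5_000, 20_000)),
    Tier(Severity.MEDIUM,
         ("content-policy bypass, embarrassing but non-actionable",
          "leak of a non-sensitive system prompt or tool descriptions"),
         (500, 5_000)),
    Tier(Severity.LOW,
         ("rudeness, over-refusal, or hallucination on benign topics",
          "fictional-persona jailbreak with no real-world side effect"),
         (0, 0)),
)
```

Binding payouts to the same structure the triage queue uses keeps the rubric and the money from disagreeing.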

The Reproducibility Problem

Traditional bug bounty programs require a "functional, independently reproducible exploit" — many require it to reproduce 50%+ of the time. Apply that requirement naively to a probabilistic system and you exclude most of the real findings.

A prompt injection that succeeds 8% of the time against a frontier model is not "not a bug." If your agent processes a million tool calls a day, some of which can move money, and an attacker can seed injection attempts across that traffic, an 8% success rate on a destructive-action injection is 80,000 unauthorized actions a day. The reproduction requirement has to shift from "does it work every time" to "does it work often enough that an attacker who can attempt it at scale will get value." The unit is success rate, not yes/no.

A workable reproducibility clause asks for: a documented prompt or input artifact, a documented model version and configuration (temperature, system prompt scope, tool set), a documented success rate over N attempts (typically N ≥ 20), and a transcript or log file demonstrating at least one full success. This shifts triage from "I can't reproduce this once on my laptop" — the standard rejection for non-deterministic findings — to "the researcher claims 12/50 success on model version X with this exact tool set, and we can either confirm or refute that empirically."
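On the triage side, that clause translates into a replay harness: run the reported artifact N times against the pinned configuration and check the measured rate against the claim. A minimal sketch, assuming a hypothetical call_agent client and a per-finding success predicate, both placeholders for whatever your stack actually exposes:

```python
import math
from typing import Callable

def replay_finding(
    artifact: str,
    call_agent: Callable[[str], str],  # hypothetical pinned-version client, supplied by you
    succeeded: Callable[[str], bool],  # e.g. "does the transcript show a delete_document call"
    n: int = 50,
) -> dict:
    """Replay a reported injection artifact n times and measure its success rate."""
    hits = sum(1 for _ in range(n) if succeeded(call_agent(artifact)))
    p = hits / n
    # Wilson 95% interval: more honest than the raw proportion at n in the 20-50 range
    z = 1.96
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return {"hits": hits, "n": n, "rate": p,
            "ci95": (max(0.0, center - margin), min(1.0, center + margin))}
```

A claim of 12/50 then stops being an argument about whether the bug "really" reproduces and becomes a measurement you can confirm or refute.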

Triage teams used to deterministic bugs find this culturally jarring. They want a single curl command that always returns 200 with the leaked secret in the body. They have to learn to evaluate distributions instead. Programs that don't make this transition end up rejecting the high-impact findings and paying out the deterministic-but-trivial ones, which is the opposite of what they want.

What the Scope Document Has To Say

The scope document for an AI bounty does more work than its application-security equivalent. It has to make several decisions that don't even appear in a traditional program:

  • Which tools are in-bounds for "make the agent take a destructive action." If the agent has a send_email tool, you have to say whether researchers may attempt to coerce it into sending to attacker-controlled addresses, and what destinations are permitted. If it has a delete_document tool, you have to enumerate test tenants and forbid touching real customer data (a machine-readable sketch of these decisions follows this list). Stanford HAI's safe-harbor work, the FAS proposal, and the 2024 open letter from 350 AI researchers all converge on this point: without explicit safe-harbor language naming test environments, researchers either avoid the dangerous tests (and the bugs go undiscovered until an attacker finds them) or run them anyway and create legal liability for everyone.
  • Which model versions and tool configurations count. Frontier models update constantly. A finding against model-version-2026-04-15 may not reproduce against model-version-2026-05-10. The scope has to commit to a versioning scheme and an SLA on findings against deprecated versions.
  • What "in scope" means for indirect prompt injection. A document uploaded by a researcher that injects instructions into the agent's context is the canonical attack surface. Is the researcher allowed to upload their own malicious documents to a test tenant? Are they allowed to send emails to the agent's inbox if the agent reads email? Are they allowed to register a website that the agent's web tool will fetch? The scope has to enumerate the channels.
  • What the program will not pay for. A "what we won't pay for" list is the single highest-leverage section in the document. Working programs publish this on day one. Common entries: jailbreaks of the base model that don't escalate beyond the base model's existing public failure modes, system-prompt leaks where the system prompt contains no sensitive content, output-policy bypasses on hypothetical scenarios with no real-world side effect, and rediscovery of public attack patterns that have already been disclosed.
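The sketch promised above: the scope decisions encoded as a policy object from which the scope document, the test-tenant guardrails, and the triage tooling can all be generated. Every name here is hypothetical, and this is one possible shape, not a standard:

```python
# Hypothetical machine-readable scope; the tool, tenant, and version names are illustrative.
SCOPE = {
    "model_versions": ["assistant-2026-04-15"],  # findings against older versions: best-effort SLA
    "test_tenants": ["bounty-tenant-a", "bounty-tenant-b"],  # safe harbor applies ONLY here
    "tools": {
        "send_email":      {"in_scope": True,  "allowed_destinations": ["*@bounty-sink.example.com"]},
        "delete_document": {"in_scope": True,  "allowed_tenants": ["bounty-tenant-a", "bounty-tenant-b"]},
        "transfer_funds":  {"in_scope": False, "reason": "no sandboxed ledger yet"},
    },
    "injection_channels": ["uploaded_document", "inbound_email", "fetched_web_page"],
    "wont_pay": [
        "base-model jailbreaks with no escalation beyond public failure modes",
        "system-prompt leaks where the prompt contains no sensitive content",
        "policy bypass on hypothetical scenarios with no real-world side effect",
        "rediscovery of publicly disclosed attack patterns",
    ],
}
```

Keeping the won't-pay list in the same artifact as the tool scope makes it harder for the two to drift apart as the agent gains tools.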

The first month of any new AI bounty produces noise. Every researcher in the world has an "I made it forget the system prompt" transcript saved, and they will all submit it the day you go live. Programs that survive the first month have published the won't-pay list on day zero, rate-limited submissions, and made the severity rubric concrete enough that triage can close low-value reports in seconds rather than minutes.

Cross-Functional Surface

Three teams have to actually move for the program to launch, and any one of them can stall it indefinitely.

Legal has to bless safe-harbor language for adversarial AI testing. Most existing safe-harbor language was drafted for traditional security research and does not cover, e.g., a researcher repeatedly attempting to extract another (test) tenant's data through prompt injection. The Computer Fraud and Abuse Act is the historical source of that legal risk; HackerOne's January 2026 expansion of safe harbor specifically for "Good Faith AI Research" is the template most programs will end up adopting. Without this, the program either has no scope on the most valuable categories or quietly accepts that researchers will work outside the safe-harbor envelope and trust the company not to sue.

Communications has to write a disclosure policy that doesn't pretend the model is deterministic. A traditional vuln disclosure says "we patched X in version Y, here is the CVE." An AI disclosure has to acknowledge that the fix is a mitigation against a class of inputs, that the success rate is now Z% rather than zero, that the underlying class may resurface in future model versions, and that the company will or will not file a CVE. Anthropic, Google, and Microsoft all paid AI agent bounties in 2025 and chose not to file CVEs for several of them — a defensible choice, but the policy needs to be written down rather than improvised per finding.

The AI team has to actually fix what the program reports rather than label it "expected probabilistic behavior" and close the ticket. This is the most common failure mode and the most program-killing one. If your AI team's posture is "the model is non-deterministic, every prompt injection is in some sense expected, we'll address it in training next quarter," researchers will stop submitting after their first three findings get the "wontfix" label. The program has to bind the AI team to the same SLA as the application security team — not on time-to-retrain, but on time-to-mitigate (input filter, output filter, tool-use guardrail, scope reduction).
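Time-to-mitigate is a workable SLA precisely because most of the mitigations sit outside the model. A minimal sketch of one of them, a tool-use guardrail enforced in code rather than in the prompt; the tool names and policy shape are hypothetical:

```python
from typing import Any, Callable

class ToolCallBlocked(Exception):
    pass

# Hypothetical per-tool policy: deny by default, allowlist what the deployment needs.
TOOL_POLICY: dict[str, Callable[[dict], bool]] = {
    "send_email": lambda args: args.get("to", "").endswith("@acme-corp.example.com"),
    "search_docs": lambda args: True,
    # delete_document intentionally absent: the agent can never call it, injected or not
}

def guarded_invoke(tool_name: str, args: dict, invoke: Callable[[str, dict], Any]) -> Any:
    """Run a model-requested tool call only if deployment policy allows it.

    The model's output is untrusted input here; the policy is enforced in code,
    so a successful injection changes what the model asks for, not what executes.
    """
    check = TOOL_POLICY.get(tool_name)
    if check is None or not check(args):
        raise ToolCallBlocked(f"policy denied {tool_name} with {args!r}")
    return invoke(tool_name, args)
```

Because every tool call passes through one choke point, a mitigation can be a one-line policy change, which is what makes the SLA enforceable in a way that time-to-retrain never is.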

The Payout Curve

The economic design of the bounty is what determines whether you get the findings you want or the findings nobody else cares about. A few principles have emerged; they compose into the payout sketch after this list:

  • Pay for new attack classes, not rediscoveries. A researcher who finds a fundamentally new injection vector — a novel encoding, a novel multi-turn manipulation, a novel indirect channel — should make 5-10x what a researcher who applies a known pattern earns. Rediscoveries should still pay (you don't want them to go unreported), but at a low enough multiplier that the talented hunters chase the new attack classes.
  • Pay proportional to demonstrated success rate. A finding that succeeds 80% of the time against your specific deployment is qualitatively different from one that succeeds 2% of the time. Your payout should reflect that.
  • Pay extra for chained findings. A prompt injection that leaks the system prompt is a finding. A prompt injection that leaks the system prompt, uses the leaked tool descriptions to construct a destructive call, and demonstrates the destructive call succeeded is three findings — but the value to you is the chain, not the parts. Programs that pay only for the leaf findings train researchers to stop at the first one.
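A sketch of how the three principles might compose into a single payout computation. The multipliers are placeholders chosen to show the shape of the curve, not calibrated numbers:

```python
def payout(base_usd: int, success_rate: float,
           novel_class: bool, chain_length: int = 1) -> int:
    """Payout = severity base x novelty x demonstrated reliability x chain bonus."""
    # New attack classes earn a large multiple of known-pattern rediscoveries
    novelty = 7.5 if novel_class else 1.0
    # Reliability: floor of 25% so low-rate findings still pay; full credit at 20%+ success
    reliability = 0.25 + 0.75 * min(success_rate * 5.0, 1.0)
    # Each extra demonstrated link in a chain adds half the base again
    chain = 1.0 + 0.5 * (chain_length - 1)
    return round(base_usd * novelty * reliability * chain)

# A known-pattern injection at 2% success:  payout(5_000, 0.02, False)   -> 1,625
# A novel three-step chain at 80% success:  payout(5_000, 0.80, True, 3) -> 75,000
```

The exact constants matter less than the monotonicity: a finding should never pay less because the researcher measured it more honestly or chained it further.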

The Realization That Forces the Program

The realization that pushes most security teams to fund the program is this: the AI feature you shipped has a security testing community that will work on it for free. They will produce thousands of attempts a week regardless of whether you have a program. Your only choice is whether those attempts arrive in your triage queue with reproduction steps, success rates, and contact info — or arrive on Twitter as a screenshot with no opportunity to remediate before disclosure.

A program that ships unscoped and without a rubric is worse than no program — it teaches the researcher community that submitting to you is a waste of time and pushes them to public disclosure as the only mechanism that gets their findings taken seriously. A program that ships with a CIA-anchored severity rubric, a probabilistic reproducibility clause, an explicit tools-in-scope enumeration, named test tenants under safe harbor, a "won't pay for" list, and an AI team bound to a mitigation SLA is something close to a generative engine for production-relevant findings on a system class your security team otherwise has no idea how to pen test.

The hard part isn't the bounty amounts. It isn't the platform choice between HackerOne and Bugcrowd and the AI-specific players. It's writing a scope document that reflects a system whose specified behavior is "be helpful" and getting three teams that don't usually work together to agree on what counts as broken. Programs that do that work in advance ship findings. Programs that wait until the first submission to figure it out ship Twitter posts.
