The Bypass Vocabulary: When Users Learn to Jailbreak in Polite English
The cheapest jailbreak in your production traffic isn't a clever Unicode trick or a chained adversarial suffix. It's three additional words a user typed after their first request got refused. They added "just hypothetically." They added "for a research paper." They added "for a fictional story I'm writing." The model complied. They told a friend. The friend posted a TikTok. By the end of the month, a non-trivial slice of your refusal-blocked traffic is being routed around with English so polite that none of your prompt-injection filters fire.
This is the failure mode the security team didn't put on the threat model. The threat model assumed adversaries were sophisticated, motivated, and technical. The actual adversary is a curious user who saw a screenshot. The vocabulary they're using doesn't show up in any public jailbreak corpus because by the time it hits a paper, the live distribution has moved on.
What Bypass Vocabulary Actually Looks Like
Classical prompt injection works at the token level — special characters, fragmented trigger words, role-override sequences, payloads smuggled through structured-output channels. There's a literature for it, and your input filters were probably designed against it.
Bypass vocabulary works one layer up, at the framing level. It's the user reaching past whatever pattern match the safety classifier was looking for and reframing the request as something the model has been trained to find acceptable. A handful of patterns dominate:
- Hypotheticality framings. "Hypothetically, if someone wanted to..." "As a thought experiment..." "Just imagine..."
- Educational context. "For educational purposes..." "I'm teaching a class on..." "My professor asked me to research..."
- Researcher persona. "I'm a security researcher studying..." "I'm writing a paper on..." "For a thesis on harm reduction..."
- Fiction framing. "In a story I'm writing, the villain explains..." "For a screenplay where..." "My character needs to know..."
- Translation framing. "Translate the following from a textbook..." "Explain this old document..."
None of these are individually dishonest. Every one of them is a real use case for some user. That is exactly what makes them effective. The model has been trained on enough examples of legitimate educational, research, and fictional content that the framing reliably tips its policy decision toward "respond." Users figure this out empirically, by trial and error against the refusal surface, and they share the patterns that worked.
The viral spread is the part that makes this an engineering problem and not just a security curiosity. A jailbreak technique that requires reading a paper has a slow adoption curve. A jailbreak technique that fits in a screenshot caption has the adoption curve of a meme. Production teams don't get to decide which one they're defending against.
The Retry Is the Signal
You can see this in production telemetry if you instrument for it. The signature is a session shape: user issues a borderline query, gets a refusal, edits the query with a framing modification, and gets a successful response. That's the bypass loop, and it's the cleanest leading indicator of bypass-vocabulary growth in your user base.
Three telemetry pieces matter, and they're all things most teams aren't logging today:
- Refusal-then-retry success rate. Per session, when a refusal is followed by a re-prompt within N seconds and the same intent class, what's the probability the retry succeeds? If that number is climbing, your refusal policy is leaking to whatever framing is hot this week.
- N-gram drift on user inputs. Track the prevalence of bypass-marker phrases ("hypothetically," "for educational purposes," "as a thought experiment," "in a fictional setting") across your input distribution over time. A spike in any one of them is rarely organic. It's almost always your user community discovering or re-discovering a framing that works.
- Per-refusal classification of bypass exposure. When the model refuses, ask a separate classifier: would the bypass vocabulary currently in circulation circumvent this refusal if applied to this query? That gives you a leading indicator of which refusals are about to stop holding.
The retry pattern is also the ground truth source for the eval corpus you actually need, which is the next problem.
Public Jailbreak Corpora Are Already Stale
The public datasets — the ones with thousands of jailbreak prompts collected from Reddit and Discord — are useful for studying the historical structure of these attacks. They're less useful for evaluating whether your current policy holds, because the vocabulary in them is, by construction, vocabulary that was popular long enough ago to make it into a research dataset.
The eval corpus that actually catches your production failures has three properties:
- Sourced from your retry patterns. The bypasses that matter are the ones your users are running. Mining them out of the refusal-then-retry-success log is the highest-signal source you have. Anonymize, hash, redact PII as appropriate, but treat the production retry stream as the canonical bypass-corpus input.
- Refreshed at the cadence the vocabulary mutates. Monthly is roughly right for most product surfaces. Quarterly is too slow. By the time you've curated a quarterly corpus, three new framings have already gone viral.
- Scored against policy intent, not pattern matches. A bypass-corpus eval that checks for the literal string "for educational purposes" will be defeated by the next user who learns to write "in a teaching context." Your eval needs a judge that understands what the policy is actually trying to prevent and grades the response on whether the policy intent was upheld. Pattern-matching evals age out of relevance at the same speed the public papers do.
The realization underneath all of this: a refusal policy in an adversarial environment is a living document, and the eval corpus is its test suite. A test suite frozen at the point you wrote the policy will pass long after the policy has stopped working.
Topic Refusal vs. Action Refusal
Here's the part of the discussion teams keep deferring because it's harder than the engineering: some of these bypass framings are legitimate.
A researcher genuinely studying online radicalization needs to discuss content the policy would refuse to a casual user. A fiction writer drafting a thriller needs the villain to be plausible. An educator preparing a curriculum on misinformation needs to engage with the misinformation. "For educational purposes" should sometimes work. The policy that always refuses these framings ships a product that loses every legitimate research, education, and creative-writing use case to a competitor that didn't.
This forces a choice the safety team often hasn't made explicit: is the policy refusing the topic or refusing the action?
- Topic refusal: "We don't engage with this subject matter at all, regardless of framing." Justified for genuinely off-limits domains where there's no useful response a legitimate user could want — e.g., operational instructions for mass-casualty attacks.
- Action refusal: "We don't perform this specific action — generating step-by-step operational content, producing targeting information, drafting harassment — but we'll discuss the topic in framings that don't involve performing the prohibited action."
Most policies are written as topic refusals and enforced inconsistently as action refusals because the team never explicitly decided. The result is the worst of both: legitimate users hit refusals on questions a competing product answers, and bypassed users get the prohibited action anyway via the same framings the legitimate users were trying. Picking explicitly between topic-refusal and action-refusal — per policy line, not in aggregate — is the discipline that lets you stop both leakage and over-refusal at the same time.
This decision is also the input that determines what your bypass corpus is even testing for. An action-refusal policy needs an eval that grades on whether the prohibited action was performed under the bypass framing. A topic-refusal policy needs an eval that grades on whether the topic was engaged with at all. Without that distinction, the corpus produces noisy verdicts and the team starts ignoring the dashboard.
The Org Failure Behind the Engineering Failure
The technical fix — corpus, telemetry, eval refresh, intent-vs-action policy frame — is mostly known. The reason most teams don't ship it is organizational.
The safety team owns the policy. The eval team owns the corpus. The growth team owns retry rates. The product team owns the refusal user experience. None of them, by themselves, sees the full bypass loop. The safety team sees a policy that's correctly written. The eval team sees a corpus that passes. The growth team sees retry rates climbing and reads it as a search-quality problem. The product team sees customer complaints about over-refusal and doesn't see the bypass volume on the other side.
The bypass loop is visible only when you join refusal logs to retry logs to input n-gram drift to outcome classifiers — across team boundaries that nobody owns end-to-end. Most companies discover this by reading a Reddit post about how to jailbreak their product, not by looking at their own data.
What This Forces You to Build
Refusal is not a configuration that ships once. It's a surface that has to be re-tested against the current adversarial distribution, and that distribution moves at the speed user communities can iterate, which is faster than any monthly release cycle. Three things follow:
- A live bypass corpus mined from your own production retry patterns, not from public papers.
- A monthly refresh cadence on that corpus, with the eval scored on policy intent rather than string match.
- An explicit topic-refusal versus action-refusal decision on every line of policy, with a reviewer who can defend the choice.
The team that does this catches new bypass vocabulary inside one refresh cycle, with a test that grades the right thing. The team that doesn't is shipping a policy that worked the day it was written and has been quietly unenforced ever since. The user base is already running the bypass at scale; the only open question is whether anyone on the team is looking at the data that would show it.
The longer answer to the original question — why your users learned to jailbreak in polite English — is that the refusal surface is, in practice, a public API that adversarial users probe in real time and document on social platforms. Treating it as a static config is the same category of mistake as treating an external HTTP endpoint as if it were an internal function call. The mitigation is the same too: instrument it, version it, test it against the traffic it actually receives, and accept that the threat model is going to keep moving.
- https://venturebeat.com/infrastructure/monitoring-llm-behavior-drift-retries-and-refusal-patterns
- https://github.com/verazuo/jailbreak_llms
- https://www.promptfoo.dev/blog/how-to-jailbreak-llms/
- https://www.lakera.ai/blog/jailbreaking-large-language-models-guide
- https://arxiv.org/html/2507.11878v1
- https://www.promptlayer.com/research-papers/the-secret-language-of-llm-refusals
- https://gradientflow.com/refusal-vectors/
- https://thehgtech.com/guides/llm-jailbreaking-defense.html
