Statistical Watermarking for LLM Output: How Token Logit Bias Creates Detectable Signatures
Google has been watermarking Gemini output for every user since October 2024, validated in a live experiment spanning roughly 20 million responses with no perceptible quality degradation, and the signal is algorithmically detectable. OpenAI has a working prototype that requires only a few hundred tokens to produce a reliable signal. Anthropic says it's on the roadmap. The EU AI Act's Article 50 mandates machine-readable marking of AI-generated content for covered providers. And yet: a $0.88-per-million-token attack achieves roughly 100% evasion success against seven recent watermarking schemes.
This is the actual state of LLM text watermarking. The gap between what's deployed, what the papers claim, and what adversaries can do is wider than most teams realize — and the engineering decisions you make about watermarking depend heavily on which side of that gap you're standing on.
How Statistical Text Watermarking Works
The foundational approach — and the one most production systems trace back to — partitions the model's vocabulary into two sets at each token generation step: a green list and a red list. A hash function keyed on the preceding token (or a window of preceding tokens) determines which tokens land in which list. The model then applies a positive logit bias to every green-list token before sampling.
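To make the mechanics concrete, here is a minimal sketch of that biasing step in Python. The hashing and seeding choices are illustrative assumptions, not the exact construction from any specific paper:

```python
# Minimal KGW-style green-list biasing sketch. Hash/seed construction is illustrative.
import hashlib
import numpy as np

def green_list(prev_token_id: int, vocab_size: int, key: int, gamma: float = 0.5) -> np.ndarray:
    """Derive this position's green list from the previous token and a secret key."""
    seed = int(hashlib.sha256(f"{key}:{prev_token_id}".encode()).hexdigest(), 16) % (2**32)
    perm = np.random.default_rng(seed).permutation(vocab_size)
    return perm[: int(gamma * vocab_size)]  # first gamma fraction of a keyed shuffle is "green"

def bias_logits(logits: np.ndarray, prev_token_id: int, key: int,
                gamma: float = 0.5, delta: float = 2.0) -> np.ndarray:
    """Add +delta to every green-list logit before sampling the next token."""
    biased = logits.copy()
    biased[green_list(prev_token_id, logits.shape[-1], key, gamma)] += delta
    return biased
```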
The result is that watermarked text statistically overrepresents green tokens. A detector measures this overrepresentation using a z-score and flags text above a threshold as watermarked. Because the partition is key-dependent and changes at each position, an outside observer seeing only the text can't tell which tokens "should" have been green — only someone with the key can verify the signal.
In practice the scheme has two tunable knobs. The partition ratio controls what fraction of the vocabulary ends up in the green list (50/50 is common). The logit bias delta controls how strongly the model favors green tokens. Turn up delta and the watermark is easier to detect and harder to remove, but the model starts making unnatural word choices. Turn it down and quality improves but detection becomes unreliable on short passages.
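The matching detector, under the same illustrative assumptions, recomputes each position's green list with the key, counts how many sampled tokens landed in it, and converts the surplus into a z-score:

```python
# Detector counterpart to the sketch above; same illustrative seeding.
import hashlib
import math
import numpy as np

def green_set(prev_token_id: int, vocab_size: int, key: int, gamma: float = 0.5) -> set:
    seed = int(hashlib.sha256(f"{key}:{prev_token_id}".encode()).hexdigest(), 16) % (2**32)
    return set(np.random.default_rng(seed).permutation(vocab_size)[: int(gamma * vocab_size)].tolist())

def watermark_z_score(token_ids: list, vocab_size: int, key: int, gamma: float = 0.5) -> float:
    """z = (green_hits - gamma*T) / sqrt(T * gamma * (1 - gamma)); large z suggests watermarked text."""
    hits = sum(
        1 for prev, cur in zip(token_ids, token_ids[1:])
        if cur in green_set(prev, vocab_size, key, gamma)
    )
    T = len(token_ids) - 1
    return (hits - gamma * T) / math.sqrt(T * gamma * (1 - gamma))
```

With gamma at 0.5, a 200-token passage clears a 4-sigma threshold with a green-token rate of roughly 64%, while a 30-token snippet would need about 87%, which is why short texts are where detection fails first.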
Research reports low false-positive rates (under 3%) and false-negative rates (under 1%) under controlled conditions. The key phrase is "under controlled conditions." Those figures assume the text hasn't been touched after generation.
What Google Actually Shipped
Google's SynthID-Text (deployed in Gemini, open-sourced on GitHub and Hugging Face) is the most mature production implementation. Rather than a simple green/red partition, SynthID uses a pseudo-random function applied to the generation context to adjust token probability scores in a way that's invisible to humans but algorithmically detectable.
The important technical choices SynthID makes:
- Watermarking happens entirely at inference time — no changes to model training, weights, or fine-tuning.
- Detection doesn't require running the underlying LLM. The detector is lightweight and can be deployed independently.
- The detector outputs three states: watermarked, not watermarked, or uncertain. The Bayesian framing lets you reason about detection confidence rather than treating it as a binary.
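For a sense of what the inference-time integration looks like in practice, here is a generation-side sketch using the Hugging Face Transformers support, assuming a recent release that ships SynthIDTextWatermarkingConfig; the checkpoint name and key values are placeholders:

```python
# Sketch of SynthID-Text watermarked generation via Hugging Face Transformers.
# Assumes a transformers version that includes SynthIDTextWatermarkingConfig.
from transformers import AutoModelForCausalLM, AutoTokenizer, SynthIDTextWatermarkingConfig

model_name = "google/gemma-2-2b-it"  # placeholder checkpoint; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

watermarking_config = SynthIDTextWatermarkingConfig(
    keys=[654, 400, 836, 123, 340, 443, 597, 160, 57, 29],  # placeholder secret keys
    ngram_len=5,                                            # context length used by the keyed function
)

inputs = tokenizer("Explain how tides work.", return_tensors="pt")
out = model.generate(
    **inputs,
    do_sample=True,                          # watermarking acts on sampling, not greedy decoding
    max_new_tokens=200,
    watermarking_config=watermarking_config,
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
# Detection runs separately, using the Bayesian detector from the open-sourced
# synthid-text code; it does not require loading this generator model.
```

The watermarking config is the only change to an otherwise ordinary generate() call, which is what the inference-time-only design cashes out to.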
The live experiment across roughly 20 million Gemini responses showed no measurable quality degradation. But Google's own documentation is candid about where the signal weakens: factual responses give the model less freedom to choose among semantically equivalent tokens, so there are fewer opportunities to embed the watermark without forcing unnatural choices. When you constrain the model to produce a specific fact, you constrain its ability to choose green tokens.
The other acknowledged limitation: thorough rewriting or back-translation through another language degrades detector confidence significantly. Google describes the watermark as "semi-fragile" — resilient to incidental modifications, vulnerable to deliberate transformation.
The Attacks That Actually Work
This is where the research-practice gap bites hardest.
Paraphrase attacks are the obvious threat, and they work. Semantically equivalent rewrites preserve meaning while disrupting token-level statistics. Adaptively trained paraphrasers (models fine-tuned to rewrite while evading detection) can defeat most current schemes with negligible impact on text quality.
Back-translation — routing text through another language and back — is surprisingly effective. The sentence structure changes enough to scatter the token statistics, and the semantic content survives. Cross-lingual attacks that permanently shift to another language are even more effective, since the watermark signal lives in the source language's token choices and doesn't survive the switch to a different vocabulary.
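If you want to stress-test a detector against this yourself, a round trip through public translation models takes a few lines; the specific opus-mt checkpoints here are one convenient choice, not a recommendation:

```python
# Back-translation round trip (English -> French -> English) for detector stress-testing.
from transformers import pipeline

en_to_fr = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
fr_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

def back_translate(text: str) -> str:
    """Meaning largely survives the round trip; token-level statistics do not."""
    fr = en_to_fr(text, max_length=512)[0]["translation_text"]
    return fr_to_en(fr, max_length=512)[0]["translation_text"]
```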
The Self-Information Rewrite Attack (SIRA) is more systematic. It identifies the tokens most critical to the watermark signal (the high-entropy positions where the model had genuine choice) and rewrites those positions while preserving the rest. The attack achieved ~100% success against seven watermarking schemes at $0.88 per million tokens. That's not a rounding error — it's a fundamental problem: the tokens that carry the watermark are the same tokens that carry the semantic flexibility, so attacks that target one automatically target the other.
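The scoring idea behind that targeting is easy to sketch: rank token positions by their self-information (surprisal) under a reference language model. This is only the scoring step, not the published attack pipeline, and gpt2 here is a stand-in reference model:

```python
# Per-token self-information under a reference LM; high values mark positions
# where the generating model had genuine choice (and where the watermark lives).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def self_information(text: str):
    """Return (token, bits) pairs for every token after the first."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = log_probs.gather(-1, ids[:, 1:, None]).squeeze(-1)[0]
    bits = (-token_lp / torch.log(torch.tensor(2.0))).tolist()
    return list(zip(tokenizer.convert_ids_to_tokens(ids[0, 1:].tolist()), bits))

# The highest-scoring positions are the ones a targeted rewrite goes after first.
ranked = sorted(self_information("The committee convened at dawn."), key=lambda p: -p[1])
```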
Black-box scrubbing (the "b4" attack, NAACL 2025) requires no knowledge of the watermarking scheme — just API access. This matters because it means watermarks can be evaded without reverse-engineering the specific scheme being used.
Code is especially fragile. Variable renaming, dead code insertion, and minor structural changes are semantically transparent to a human reader but statistically damaging to a watermark. Code generation contexts provide far more opportunities for meaning-preserving rewrites than prose.
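A toy illustration: the two functions below behave identically, so a human reviewer sees no meaningful change, but their token sequences, and therefore any token-level watermark statistics, barely overlap:

```python
# Semantically equivalent rewrites; only the surface form (and the token statistics) change.

def moving_average(values, window):
    return [sum(values[i:i + window]) / window for i in range(len(values) - window + 1)]

def rolling_mean(data, k):
    out = []
    for start in range(len(data) - k + 1):  # same computation, different tokens throughout
        out.append(sum(data[start:start + k]) / k)
    return out
```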
Multi-Bit Watermarking: The Attribution Promise and Its Cost
The green/red scheme encodes one bit: watermarked or not. More ambitious approaches encode multi-bit payloads — model version, timestamp, user ID — embedded in the generated text. The potential is real: you could trace a piece of AI-generated content back to the specific model call that produced it.
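A toy sketch of where those extra bits would live, reusing the keyed green-list idea from earlier: each payload bit selects which of two keyed partitions gets the bias at the positions assigned to it. Real multi-bit schemes, and especially their extraction procedures, are considerably more involved than this:

```python
# Toy multi-bit embedding: the payload bit assigned to a position picks the sub-key
# used to derive that position's green list. Illustrative only.
import hashlib
import numpy as np

def green_list(prev_token_id, vocab_size, key, gamma=0.5):
    seed = int(hashlib.sha256(f"{key}:{prev_token_id}".encode()).hexdigest(), 16) % (2**32)
    return np.random.default_rng(seed).permutation(vocab_size)[: int(gamma * vocab_size)]

def bias_with_payload(logits, position, prev_token_id, payload_bits, base_key,
                      gamma=0.5, delta=2.0):
    """Position i carries payload bit i mod len(payload_bits); the bit selects the sub-key."""
    bit = payload_bits[position % len(payload_bits)]
    sub_key = base_key * 2 + bit  # toy key derivation
    biased = logits.copy()
    biased[green_list(prev_token_id, logits.shape[-1], sub_key, gamma)] += delta
    return biased
```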
The problem is extraction cost. Some 32-bit schemes require hours of computation to extract the payload from a piece of text. MirrorMark embeds 54 bits in 300 tokens with meaningful accuracy improvements over earlier approaches, but the infrastructure needed to run extraction at production scale remains a significant engineering commitment for most teams.
Undetectable watermarks — based on cryptographic assumptions, impossible to distinguish from non-watermarked output without the secret key — exist as a theoretical construction. They're not deployed in any known production system.
What This Means if You're Evaluating Watermarking
Compliance-motivated deployment is different from adversarial-robustness deployment, and conflating them leads to bad decisions. If your goal is regulatory compliance (EU AI Act, internal content-provenance policies), a watermark that's detectable under ordinary conditions and degrades gracefully under casual modification may be sufficient. You're not defending against a determined adversary with API access; you're creating a machine-readable signal for auditing and attribution pipelines.
If your goal is adversarial robustness — detecting deliberate evasion by someone trying to launder AI-generated content — the picture is harder. Current schemes don't hold up against targeted attacks. No point in the robustness-imperceptibility-quality trade-off space wins on all three axes at once.
Open-source LLMs are a different problem. Inference-time logit bias watermarking only works if you control the decoding process. When users run open-source models locally or on their own infrastructure, they can simply disable the watermarking step. Parameter-level watermarking — embedding signals in the model weights themselves — provides a partial answer, but it's a different technique with different robustness properties and deployment complexity.
The detector needs to be part of your system design, not an afterthought. Detection without the secret key is computationally equivalent to guessing. Detection with the key is fast for single-bit schemes, expensive for multi-bit payloads. If your use case requires high-throughput attribution, benchmark the detector, not just the generator.
The Regulatory Pressure Shaping the Field
EU AI Act Article 50 requires general-purpose AI providers to mark outputs in machine-readable format. This applies to providers, not to every deployer of a model — but it creates a strong incentive for frontier model vendors to ship watermarking before enforcement begins.
The incentive structure matters for the research agenda. Compliance watermarking tolerates a weaker signal than adversarial detection. If the regulatory goal is "creates an audit trail," then a watermark that's detectable under honest conditions and degrades under deliberate attack may satisfy the requirement while being straightforwardly defeatable. The field is grappling with this gap: a watermark that's legally sufficient may not be technically meaningful.
The White House voluntary commitments from major labs (2023) included watermarking research. Google followed through with the SynthID deployment. OpenAI has a prototype but no announced rollout date. Anthropic has published no timeline.
This asymmetry matters if you're building on top of these APIs: watermarking availability varies by provider in ways that aren't surfaced through standard documentation. You can't assume a watermark exists — or plan around one — unless you know exactly which model and which deployment configuration you're hitting.
Where to Invest if You're Building on This Today
If you need provenance tracking on AI-generated content, the pragmatic path involves a few concrete decisions:
- Use an existing implementation, not a custom one. SynthID-Text is open-sourced and integrates with Hugging Face Transformers. The KGW (Kirchenbauer et al.) reference implementation of the green/red-list scheme described above is available and well-documented. Rolling your own token-level watermarking scheme introduces attack surface without improving robustness.
- Design for graceful degradation. Plan for the watermark to be absent or degraded in content you receive — don't architect workflows that hard-fail when detection returns uncertain. Treat detection confidence as a signal, not a binary gate (a minimal sketch of this follows the list).
- Separate compliance from fraud detection. Watermarks that satisfy auditing requirements won't stop determined evasion. If you need to detect deliberate manipulation, combine watermarking with other signals: metadata preservation, provenance chains, behavioral fingerprinting. No single mechanism is sufficient.
- Benchmark extraction cost before committing to multi-bit attribution. A scheme that takes eight hours to extract a 32-bit payload is a forensics tool, not a production attribution layer.
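As a minimal sketch of the graceful-degradation point above, treat the detector's three states as a routing signal rather than a gate; the state names and routing logic here are assumptions for illustration, not any particular detector's API:

```python
# Three-state handling of detector output; never hard-fail on "uncertain".
from enum import Enum

class WatermarkVerdict(Enum):
    WATERMARKED = "watermarked"
    NOT_WATERMARKED = "not_watermarked"
    UNCERTAIN = "uncertain"

def annotate_provenance(verdict: WatermarkVerdict, record: dict) -> dict:
    if verdict is WatermarkVerdict.WATERMARKED:
        record["provenance"] = "ai-generated (watermark detected)"
    elif verdict is WatermarkVerdict.UNCERTAIN:
        record["provenance"] = "unknown"           # log and continue
        record["needs_secondary_signals"] = True   # e.g. metadata, provenance chain
    else:
        record["provenance"] = "no watermark detected"
    return record
```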
The field is moving fast. The gap between the attack literature and the defense literature narrows only in places: specific attack vectors get closed while new ones open, rather than defenses catching up across the board. Schemes that looked robust in 2024 are being systematically broken in 2025 papers. Any engineering decision you make here should include a plan for the underlying scheme becoming obsolete.
- https://www.nature.com/articles/s41586-024-08025-4
- https://dl.acm.org/doi/10.1145/3691626
- https://arxiv.org/html/2411.18479v1
- https://ai.google.dev/responsible/docs/safeguards/synthid
- https://github.com/google-deepmind/synthid-text
- https://arxiv.org/html/2505.05190v1
- https://aclanthology.org/2025.naacl-long.460.pdf
- https://arxiv.org/html/2502.16699v1
- https://arxiv.org/html/2411.13425v1
- https://proceedings.mlr.press/v247/christ24a/christ24a.pdf
- https://arxiv.org/html/2601.22246v1
- https://www.usenix.org/system/files/usenixsecurity25-qu-watermarking.pdf
