Context Length Is a Security Boundary, Not Just a Cost Line

May 17, 2026 · 9 min read

Software Engineer

Most teams treat the context window as a budget. You have a million tokens; spend them wisely; longer conversations cost more and run slower. That framing is correct and incomplete. The context window is also an attack surface, and its size is a dial that quietly weakens your safety controls as it turns up.

Here is the failure mode nobody puts in the threat model. Your system prompt — the one with the guardrails, the tool-use rules, the "never do X" clauses — sits at the very top of the context. Its authority is strongest there. As a conversation runs, thousands of tokens of user turns, tool outputs, and retrieved documents pile on top of it. The model's attention does not weigh all of those tokens equally. The instructions closest to the point of generation win ties. By turn forty, your guardrails are not gone, but they are buried, and a patient adversary does not need a clever jailbreak to get past them. They just need a conversation long enough.

This is not a hypothetical. It is a measurable property of how transformers attend to long contexts, and it has a name in the research literature even if it does not have one in your incident review template.

Why instructions decay with distance

Transformers do not read a prompt the way a person reads a contract, giving the opening clauses permanent legal force. They attend. Every token the model generates is conditioned on a weighted blend of every token before it, and those weights are not uniform.

Two effects dominate. The first is recency: tokens near the end of the context — the most recent user message, the latest tool output — carry disproportionate weight, because that is where the model has learned the immediately relevant signal usually lives. The second is the lost-in-the-middle problem: information stranded in the middle of a long context gets underweighted relative to both the beginning and the end. Researchers measured a 30%-plus accuracy drop on multi-document question answering simply by moving the answer document from the first position to the tenth in a twenty-document context. The model did not forget the document. It discounted it.

Position encodings such as RoPE bake a distance-based decay into attention itself: tokens far apart have their attention scores mechanically reduced. Early tokens get some protection from "attention sinks," a learned habit of routing spare attention to the start of the sequence. But that protection is partial and competes with everything newer.

Now apply this to your system prompt. It starts in the strongest position — the very beginning. But as the context grows, it is no longer near the end, and the freshest, closest, highest-weighted instructions are whatever just arrived in the conversation. Including instructions an attacker wrote. The guardrail and the attack are not evaluated as "trusted policy" versus "untrusted input." They are evaluated as "old and far" versus "new and close."

A long conversation is a slow jailbreak

The cleanest demonstration of context length as a weapon is many-shot jailbreaking. The technique is almost insultingly simple: fill the prompt with hundreds of fabricated dialogue turns in which an AI assistant cheerfully answers harmful questions, then ask your real harmful question at the end. No exotic token sequences, no adversarial suffixes. Just volume.

It works because the model is doing in-context learning on the pattern you handed it. The effectiveness scales as a clean power law with the number of shots, and it generalized across every major model family tested. The vulnerability did not exist in any practical form until context windows grew large enough to hold hundreds of examples. Capability and attack surface arrived in the same release.

The version that should worry product teams is subtler than the lab attack, because it does not need a single crafted mega-prompt. It needs a normal-looking conversation. An adversary spends thirty turns establishing a frame — a role-play, a "hypothetical," a fictional system with different rules — and each turn adds tokens that push the real system prompt further from the point of decision while reinforcing a competing set of instructions right where attention is strongest. Turn thirty-one asks for the thing the guardrail forbids. Nothing about turn thirty-one looks like an attack. The attack was the previous thirty turns of context accumulation, and your safety review almost certainly tested turn one.

Safety testing on short prompts misses the real risk

Here is the uncomfortable part for anyone who has signed off on a red-team report. Most safety evaluation runs on short prompts. You write an adversarial input, you check that the model refuses, you log a pass. That tells you the guardrail holds when the guardrail is the freshest, closest, highest-weighted instruction in the context — the easiest possible condition for it.

It tells you almost nothing about turn forty.

Recent work that deliberately tested long-context agents found the safety picture does not just degrade gracefully — it becomes unstable. Models advertising million-token-plus windows showed severe behavioral change well before 100K tokens. Refusal rates did not slide predictably; they swung in opposite directions across models. One model's refusal rate climbed from roughly 5% to 40% as context grew; another's collapsed from 80% to 10%. That divergence is the real finding. It means refusal under long context is not being driven by a stable safety judgment. It is being driven by architectural and training artifacts that you cannot predict from the short-prompt test.

The same research surfaced "delayed refusals" — the model starts executing a harmful request, produces part of the output, and only then refuses. For a chatbot that is a bad look. For an agent with tool access, partial execution means the side effects already happened: the email sent, the row deleted, the API called. Long context made delayed refusals more common. An endpoint check that only asks "did it ultimately refuse?" scores that as a pass.

Loading…

References:

Let's stay in touch and Follow me for more thoughts and updates

Twitter LinkedIn Telegram Discord 小红书

Context Length Is a Security Boundary, Not Just a Cost Line

Why instructions decay with distance

A long conversation is a slow jailbreak

Safety testing on short prompts misses the real risk

Recommended Reading

About Tian Pan

Why instructions decay with distance​

A long conversation is a slow jailbreak​

Safety testing on short prompts misses the real risk​

Recommended Reading

About Tian Pan

Why instructions decay with distance

A long conversation is a slow jailbreak

Safety testing on short prompts misses the real risk