
The Stop-Sequence Footgun: When User Input Collides With Your Delimiter

Tian Pan · Software Engineer · 10 min read

A user pastes a chunk of markdown into your support agent. The first heading in their paste is ### Steps I tried. Your prompt template uses ### as a stop sequence. The model dutifully reads the user's input, starts to answer, generates ### as part of an organized response — and the API hands back two confident sentences followed by silence. The ticket lands in your queue as "model quality regression." It is not. The fix is one line in the gateway.

Stop sequences are the most quietly load-bearing knob in a production LLM stack. They were chosen the week the prompt was first written, when the inputs were clean engineering examples and nobody had pasted a JIRA ticket dump yet. Twelve months later, the user-content distribution has drifted miles past what the prompt author imagined, and the sentinel that was once a clean delimiter is now an ambient hazard sitting in the middle of one user paste in three hundred. Nothing alerted. The eval suite still passes. The CSAT chart sags by half a point on the affected slice and stays there.

This is not a model problem. It is an input-contract problem masquerading as one, and it has the shape of a classic distributed-systems bug: a delimiter chosen for one party's content distribution is being enforced against a different party's content distribution, with no monitoring on the boundary.

Why the Sentinel Made Sense on Day One

Stop sequences exist for good reasons. They let the model end on a logical boundary instead of on a token-count cliff. They keep the response from running past a structural marker the parser is about to look for anyway. They reduce per-call cost by trimming generation as soon as the useful output is done. The OpenAI API allows up to four; Anthropic's Messages API accepts a list. The mechanic is simple: when the model emits the configured string, the server stops streaming, sets finish_reason to stop (on Anthropic, stop_reason to stop_sequence), and returns.
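To make the mechanic concrete, here is a minimal sketch using the OpenAI Python SDK; the model name, prompt text, and the ### stop sequence are illustrative, not a recommendation.

```python
# Minimal sketch: configure a stop sequence and read back the finish reason.
from openai import OpenAI

client = OpenAI()

pasted_ticket = "### Steps I tried\n1. Restarted the agent\n2. Cleared the cache"

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[
        {"role": "system", "content": "You are a support agent. Summarize the ticket and suggest next steps."},
        {"role": "user", "content": pasted_ticket},
    ],
    stop=["###"],  # OpenAI accepts up to four stop sequences
)

choice = response.choices[0]
# finish_reason is "stop" whether the model ended naturally or hit "###",
# which is exactly why a dashboard that only reads finish_reason misses the bug.
print(choice.finish_reason, repr(choice.message.content))
```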

The problem is that the choice of sentinel is rarely revisited. Teams pick ###, User:, </response>, a closing brace, or a markdown horizontal rule, because those tokens appeared at the boundary of the original prompt and felt natural. Then user content joins the prompt — sometimes as a paste, sometimes as a quoted passage, sometimes as an attachment whose text was extracted by a preprocessing step that nobody on the AI team owns. The user-content distribution includes markdown headers, role-play dialogues, transcripts with User: and Assistant: labels, code with closing braces, and HTML with the same tags the prompt template uses. The probability that a given user payload contains the sentinel is small per request and inevitable at scale.

The effect on the model is subtle. The sentinel can fire two ways. The model can echo the sentinel back as part of its own output (when the user input contained it and the model is summarizing or quoting), causing immediate truncation. The model can also generate the sentinel naturally because its output style was nudged in that direction by the user content (markdown-heavy input often produces markdown-heavy output, and a prompt that sets ### as a stop sequence is implicitly asking the model not to generate H3 headings, which is fine until the user asks for an H3 heading).

The Eval Trap

Standard eval suites do not catch this. The eval cases were drawn either from synthetic test fixtures or from a curated subset of past production traffic, and in both cases the inputs were sanitary. There is no reason the eval team would seed ### into a test ticket — the test ticket is about whether the model summarizes correctly, not whether the I/O contract holds under adversarial delimiters. The model passes every eval case, the truncation manifests only in production, and the gap between eval and production behavior reads as "model quality drift" because the dashboards do not separate truncation-by-sentinel from truncation-by-token-limit.

The instrumentation that closes the gap is not exotic, but it has to be built deliberately. Three signals matter.

The first is the finish_reason (or stop_reason) histogram, sliced by route, model, prompt version, and — critically — by whether the user input contained any of the configured sentinels. A spike in stop-by-sentinel responses correlated with sentinel-containing inputs is the smoking gun. Most teams do not log whether the sentinel appeared in the input at all, because the input-sanitation layer does not know what the prompt's stop config is, and the prompt layer does not know what the input contained. This is the seam the bug lives in.
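A sketch of what closing that seam can look like at the gateway: one structured log line per call that records which configured sentinels literally appear in the user payload, next to the finish reason the provider returned. The STOP_SEQUENCES catalog and field names here are placeholders for whatever the real gateway uses.

```python
# Gateway-side logging sketch: sentinel-in-input recorded alongside finish_reason.
import json
import logging

logger = logging.getLogger("llm_gateway")

STOP_SEQUENCES = ["###", "User:"]  # the route's configured sentinels (illustrative)

def sentinels_in_input(user_text: str) -> list[str]:
    """Return the configured sentinels that literally appear in the user payload."""
    return [s for s in STOP_SEQUENCES if s in user_text]

def log_completion(route: str, prompt_version: str, model: str,
                   user_text: str, finish_reason: str, output_tokens: int) -> None:
    # One structured line per call; the dashboard slices finish_reason
    # by sentinel_in_input to surface the collision.
    logger.info(json.dumps({
        "route": route,
        "prompt_version": prompt_version,
        "model": model,
        "finish_reason": finish_reason,  # or stop_reason, provider-dependent
        "output_tokens": output_tokens,
        "sentinel_in_input": sentinels_in_input(user_text),
    }))
```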

The second is a length-percentile monitor on output token counts, again sliced by sentinel-in-input. The truncated outputs sit in the lower tail; if you compare the p10 of "sentinel present" to the p10 of "sentinel absent," the gap is the magnitude of the bug.
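Computing that gap from the structured records above takes a few lines; the field names follow the previous sketch.

```python
# p10 comparison: output length for sentinel-present vs. sentinel-absent traffic.
import statistics

def p10(values):
    return statistics.quantiles(values, n=10)[0]  # 10th percentile

def truncation_gap(records):
    with_sentinel = [r["output_tokens"] for r in records if r["sentinel_in_input"]]
    without_sentinel = [r["output_tokens"] for r in records if not r["sentinel_in_input"]]
    # A large positive gap is the magnitude of the bug.
    return p10(without_sentinel) - p10(with_sentinel)
```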

The third is a per-user-segment retry rate. Users who hit a truncation rarely accept it silently; they retry, often by rephrasing. That retry doubles the bill on the affected slice and adds latency to a user whose prior interaction already failed. Tracking retries-after-short-response is a leading indicator that converges on the same root cause from the other side.

The Reserved Namespace Discipline

The structural fix is to draw stop sequences from a namespace the user cannot accidentally produce. There are three viable approaches and they trade off against each other.

Model-specific special tokens are the cleanest option when the model exposes them. Many chat-tuned models reserve tokens like end-of-turn or end-of-message for exactly this purpose, and these tokens cannot appear in a user paste because the tokenizer encodes the byte sequence differently — there is no UTF-8 string a user can type that encodes to the special token. This is the same discipline tokenizers use to keep system prompts from being injected character-by-character, and it generalizes naturally to delimiters. The tradeoff is that you give up some portability across model providers and you have to track the special-token mapping per model version.

UUID sentinels work everywhere and require no model cooperation. Generate a fresh UUID per request, splice it into the prompt as the delimiter, and pass it as the stop sequence. The probability that any user content contains a fresh random UUID is effectively zero, and the technique survives provider migration. The cost is a few extra tokens per request and a small amount of prompt-template plumbing.
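A minimal sketch of the per-request UUID sentinel, again assuming the OpenAI Python SDK; the prompt wording and model name are illustrative.

```python
# Per-request UUID sentinel: spliced into the prompt, passed as the stop sequence.
import uuid
from openai import OpenAI

client = OpenAI()

def answer(user_text: str) -> str:
    sentinel = str(uuid.uuid4())  # fresh per request; a collision is effectively impossible
    prompt = (
        "Answer the ticket below. When your answer is complete, write the line "
        f"{sentinel} and nothing after it.\n\n"
        f"Ticket:\n{user_text}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative
        messages=[{"role": "user", "content": prompt}],
        stop=[sentinel],  # the user cannot have typed this string
    )
    return response.choices[0].message.content
```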

Unicode private-use characters land between the two. They are valid Unicode codepoints from a range the standard reserves for application-specific use, so well-behaved input pipelines never produce them. They are short (usually one or two tokens), invisible in most rendering pipelines, and free to use across providers. The risk is that a buggy preprocessing step somewhere upstream decides to "normalize" them out, at which point your delimiter disappears and the whole prompt stops parsing. Test your full input pipeline before adopting them.
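That pipeline test can be small. In the sketch below, normalize_input stands in for whatever preprocessing the real ingestion path applies and is hypothetical.

```python
# Check that a private-use-area delimiter survives the input pipeline.
import unicodedata

PUA_DELIMITER = "\uE000"  # first codepoint of the BMP private-use area

def pipeline_preserves_delimiter(normalize_input) -> bool:
    sample = f"before {PUA_DELIMITER} after"
    return PUA_DELIMITER in normalize_input(sample)

# Example: NFKC normalization alone leaves PUA codepoints intact, so this passes;
# a stricter "strip non-printable characters" step might not.
assert pipeline_preserves_delimiter(lambda s: unicodedata.normalize("NFKC", s))
```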

The wrong move, but a tempting one, is to keep the human-readable sentinel and add an input-sanitization pass that escapes or strips the sentinel from user content. This works until the day the sanitizer has a bug, or until the day a new ingestion path is added that bypasses the sanitizer, or until the day a downstream consumer relies on the sanitized text being literal. Defense in depth is worth doing, but the primary defense should be a delimiter the user cannot produce, not a delimiter the user can produce that you hope to scrub.

The Adversarial Eval Seed

Even with a reserved-namespace delimiter, the eval suite needs to be seeded with adversarial inputs. The point is not just to catch the current bug but to catch the next one — a future engineer might add a new prompt template with a new stop sequence, and the same regression will land if the eval set does not already test for sentinel collisions.

Build a corpus of adversarial inputs that includes every known sentinel from the prompt catalog, plus common cousins (####, **, ---, code fences, every standard role label, common closing tags). Run the corpus against every prompt version in CI. The signal is not "did the model give a good answer" — it is "did the response truncate before reaching the configured minimum length." A truncation-detection grader is much cheaper than a quality grader and catches the failure mode directly.
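One possible shape for that grader is sketched below; call_model stands in for the production call path, and the corpus entries and length floor are assumptions.

```python
# Truncation-detection grader: flag any adversarial case that comes back short.
KNOWN_SENTINELS = ["###", "User:", "</response>"]
COUSINS = ["####", "**", "---", "Assistant:", "</answer>"]  # plus code fences, etc.

def adversarial_corpus():
    for sentinel in KNOWN_SENTINELS + COUSINS:
        yield f"Please summarize this ticket.\n\n{sentinel} Steps I tried\n1. Restarted the agent"

def truncation_failures(call_model, min_output_tokens: int = 50):
    failures = []
    for case in adversarial_corpus():
        output_tokens, finish_reason = call_model(case)  # assumed to return (int, str)
        if output_tokens < min_output_tokens:
            failures.append((case, output_tokens, finish_reason))
    return failures
```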

The corpus has to be refreshed when the catalog changes. The lightweight discipline is a CI step that diffs the current catalog of stop sequences against the corpus and fails the build if any sentinel is missing from the adversarial set. This converts "remember to update the eval when you change the prompt" into a build-time check that does not depend on memory.
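The build-time check itself can be a few lines of plain Python; the file paths and formats below are assumptions about how the catalog and corpus are stored.

```python
# CI check: fail the build if any cataloged stop sequence is missing from the corpus.
import json
import sys
from pathlib import Path

def check_corpus_covers_catalog(catalog_path="prompts/stop_sequences.json",
                                corpus_path="evals/adversarial_corpus.txt"):
    catalog = set(json.loads(Path(catalog_path).read_text(encoding="utf-8")))
    corpus = Path(corpus_path).read_text(encoding="utf-8")
    missing = sorted(s for s in catalog if s not in corpus)
    if missing:
        print(f"Adversarial corpus is missing sentinels: {missing}")
        sys.exit(1)

if __name__ == "__main__":
    check_corpus_covers_catalog()
```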

Why the Bill Is Worse Than It Looks

The cost of this bug comes in two parts, and both are worth understanding. The visible cost is the truncated response itself: the user gets a worse answer, sometimes notices, sometimes does not. The invisible cost is the retry — when a user gets a half-thought response, the retry rate climbs, and each retry pays full inference cost. On the affected slice the per-user inference bill roughly doubles, and the doubling is concentrated on the segment of users who paste the most context, who tend to be the highest-intent power users.

The cost is also asymmetric across user tiers. Free-tier users may not retry at all and just churn. Enterprise users retry until they get an answer and then file a ticket. The free-tier signal is invisible (the user is gone), the enterprise signal is delayed (the ticket arrives a week later through support), and neither hits the dashboard the AI team watches in real time. The dashboard reads "average response length looks stable" because the rest of the distribution is fine, and the bug persists for months.

The Architectural Reframing

The deeper realization is that the stop sequence is part of your input contract with the model, not just an output formatting choice. Treat it the way you treat any other delimiter in a parser: reserved, escaped at the boundary, monitored for collisions, and tested against adversarial input. The team that treats it as an output knob — something the prompt author tunes when they want shorter responses — is shipping a parser that relies on a content distribution it never tested.

This generalizes. Every place where structured generation meets user-supplied text is a parser boundary, and every parser boundary needs the same hygiene: a delimiter the user cannot produce, validation at the seam, and instrumentation that surfaces collisions before they become support tickets. The prompt is not a format string with user data interpolated into it; it is a wire protocol with two distinct trust zones, and the contract between them is the team's responsibility to enforce.

The good news is that the fix is small once the bug is seen. A UUID sentinel and a finish_reason dashboard sliced by sentinel-in-input will catch most of the damage and prevent recurrence. The bad news is that without that dashboard the bug is invisible, and most teams will not build the dashboard until something forces them to. The forcing function is usually a customer who notices the truncation, files a ticket, and waits a month for the AI team to track it down to the gateway. Build the dashboard first and you skip the month.
