The Legal Disclaimer That Leaked From The Answer Into The Tool Call Arguments
Your counsel approved a one-line system-prompt directive: append "This information is not legal advice and should not be relied upon as such" to every response touching a regulated domain. Three weeks later, a user files a bug because their calendar event's description field opens with that same line, followed by a contract summary the agent was supposed to put into a meeting invite. The agent did not malfunction. It did exactly what the system prompt told it to do, which turned out to be a behavior that ranges over every channel the model produces text into — including the JSON arguments of the next tool it called.
The instruction was a content-formatting rule and the model treated it as one. It did not distinguish "user-facing response" from "tool call argument" because nothing in the prompt told it those were different surfaces. The disclaimer ended up in the calendar, in the email draft, in the Slack message your agent posted on the user's behalf. Each of these was a separate downstream system whose author had no idea a compliance string was about to be injected into a structured field, and each had a different cleanup cost.
This is the failure mode this article is about: safety boilerplate that escapes its intended audience because the prompt that defined it did not name an audience. The team prompted for a user-visible behavior and the model executed a content-shape rule. The shape flowed into machine-readable surfaces the prompt did not mention, and the consequences propagated to systems whose data the team had no rollback authority over.
A prompt directive is a behavior over every channel the model writes into
System prompts are not configuration. They are tokens the model reads on every forward pass, conditioning the next-token distribution alongside conversation history and retrieved context. A behavioral constraint in the system prompt influences every decision the model makes during that turn and every tool call that turn produces.
When you write "append this disclaimer to every response that touches regulated content," the model has to decide what counts as a response and what counts as touching regulated content. The first decision is the one that breaks. From the model's vantage, a tool call argument is a string it is producing in service of completing the user's request — and the user's request was about regulated content. The argument is downstream of the regulated context. The disclaimer applies.
The team's mental model was that "response" meant "the natural-language message rendered to the user's screen." The model's effective model is that "response" means "any text I generate this turn." Those two definitions are not the same, and the second one is the one the model is actually running.
The pattern shows up in subtler shapes too. A "be concise" directive can shorten tool call descriptions in ways that drop required entities. A "use markdown formatting" directive can inject backticks and asterisks into structured fields. A "speak in the user's preferred tone" directive can casualize the body field of an outbound email even when that email is going to the user's external counterparty. Each of these is the same shape: a prompt-side rule about output style applied to output channels the rule's author did not have in mind.
The cleanup cost is unbounded because the side effects are irreversible
The user-facing version of this bug is annoying. The user sees the disclaimer in their calendar event description, reports it, and your platform team files a ticket. That is the cheap version.
The expensive version is that the calendar event already shipped to a CalDAV server. The Slack message already posted to a public channel. The Jira ticket already entered the integrations team's queue. The email draft turned into a sent email when a user hit send without re-reading the body. The text is now in systems your team does not own, in records whose owners did not opt in to receive your compliance boilerplate, and in some cases in records that get pulled into downstream pipelines (a Slack channel that feeds into a customer's analytics warehouse, a Jira project whose descriptions are indexed for ML training).
The conversation about cleaning up "calendar events created in the last three weeks that contain the disclaimer" is a conversation about reaching into thousands of users' personal Google Calendars to delete or edit data your platform put there. You do not have the authority to do that without each user's explicit re-consent. The bug is, in a real sense, unrecoverable. The boilerplate is now part of your users' calendars, and the only way out is to apologize and move on.
This is what distinguishes the tool-argument leak from a normal prompt-engineering miss. A misformatted answer in a user-facing response is superseded on the next turn. Text inside a structured field that triggered a side effect in someone else's system is permanent.
The architectural mistake is using one channel to control behavior in multiple channels
When teams encounter this bug, the instinct is to refine the prompt: tell the model to append the disclaimer only to "the natural-language response to the user, not to tool arguments." This works in testing for a week. Then a model upgrade arrives, the model's adherence to the disambiguation phrase weakens, and the disclaimer starts appearing in tool arguments again on a 2% slice of traffic. You have a stochastic guardrail and a deterministic obligation.
The real architectural mistake is upstream. The team used the system prompt — a probabilistic channel that influences every token the model produces — to enforce a deterministic requirement that should have applied to exactly one surface. The disclaimer's correct home is a post-processing layer that runs on the final user-facing string, not a behavioral rule the model is asked to follow on every output channel it controls.
Once you accept that the prompt is the wrong control surface, the fix follows. The disclaimer becomes a string appended by code to the user-visible response object after the model finishes its turn. The system prompt loses the disclaimer instruction entirely. The model no longer has to make a runtime decision about which channels the disclaimer applies to, because the decision is made by the orchestration layer that knows which output is the user-facing one.
The contrast is worth naming explicitly. A prompt-side directive is "ask the model to do X every time it speaks." A post-processing directive is "the answer the user sees is X plus whatever the model said." The first depends on the model's compliance, the model's understanding of "speak," and the stability of both across upgrades. The second depends on a single line of code in your response handler.
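A minimal sketch of the post-processing version, assuming the orchestration layer already knows whether the turn touched a regulated domain. The function name `render_user_response` and the `regulated` flag are hypothetical, introduced here only for illustration; how that flag gets set is outside this sketch.

```python
# Disclaimer text as quoted in the system-prompt directive.
DISCLAIMER = (
    "This information is not legal advice and should not be relied upon as such"
)


def render_user_response(model_text: str, regulated: bool) -> str:
    """Return the string shown to the user.

    The disclaimer is appended here, in code, after the model has finished.
    Tool-call arguments never pass through this function.
    """
    text = model_text.rstrip()
    # Skip the append if the text already carries the disclaimer,
    # so re-rendering the same message does not duplicate it.
    if regulated and DISCLAIMER not in text:
        text = f"{text}\n\n{DISCLAIMER}"
    return text


if __name__ == "__main__":
    print(render_user_response("Here is the contract summary.", regulated=True))
```

Because the model never sees the disclaimer string, the only place it can appear is the one place this function writes it.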
The patterns that close the gap
The disclaimer-in-tool-arguments bug is the loud version of a broader class. Anything you put in the system prompt as a content-shape rule will, on some share of traffic, escape into every text-shaped output channel the model controls. The patterns below assume that and design around it rather than asking the model to enforce the distinction.
Move boilerplate out of the prompt and into the renderer. If a string must appear in every user-facing response, append it in the code path that ships the response to the user. The model does not need to know the disclaimer exists. Your compliance team can audit the renderer and pin its behavior with a single test. The disclaimer cannot leak into tool arguments because the model never wrote it.
Validate tool-call arguments against a content denylist before dispatch. Even with the prompt cleaned up, models occasionally produce text in structured fields that should not be there: license preambles in commit messages, "as an AI assistant" hedges in outbound chat replies. A pre-dispatch sanitizer that strips known boilerplate phrases from structured arguments closes the gap for the cases your prompt cleanup missed. Treat this layer as defense in depth, not as the primary control.
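A pre-dispatch sanitizer along these lines could look like the sketch below. The two patterns are illustrative assumptions, not a vetted phrase list, and `sanitize_arguments` is a hypothetical helper name. Each pattern removes the sentence starting at the boilerplate phrase, which is a deliberately blunt choice.

```python
import re
from typing import Any

# Assumed denylist: each pattern removes text from the boilerplate phrase
# through the end of that sentence (or line).
BOILERPLATE_PATTERNS = [
    re.compile(r"this information is not legal advice[^.\n]*\.?", re.IGNORECASE),
    re.compile(r"as an ai assistant[^.\n]*\.?", re.IGNORECASE),
]


def sanitize_arguments(value: Any) -> Any:
    """Recursively strip known boilerplate from every string in an argument tree."""
    if isinstance(value, str):
        cleaned = value
        for pattern in BOILERPLATE_PATTERNS:
            cleaned = pattern.sub("", cleaned)
        # Only trim whitespace when something was removed, so clean
        # strings pass through byte-for-byte unchanged.
        return cleaned.strip() if cleaned != value else value
    if isinstance(value, dict):
        return {key: sanitize_arguments(item) for key, item in value.items()}
    if isinstance(value, list):
        return [sanitize_arguments(item) for item in value]
    return value  # numbers, booleans, None are left alone


if __name__ == "__main__":
    args = {
        "title": "Contract review",
        "description": "This information is not legal advice and should not "
        "be relied upon as such.\nSummary: term is 12 months.",
    }
    print(sanitize_arguments(args))
```

Stripping silently hides how often the leak occurs, so in practice you would probably want to log every removal as well.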
Separate user-visible and machine-readable output at the schema level. Many tool-calling stacks return a single response object that mixes the model's natural-language reply with its tool calls. Splitting these into two channels — assistant_message and tool_calls, with the disclaimer appended to the first by code — makes it structurally impossible for the disclaimer to flow into the second. Pydantic-style output schemas can encode this separation as a contract.
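One way to encode that separation, sketched with standard-library dataclasses rather than Pydantic so the example stays self-contained. The raw response shape (`content` plus `tool_calls` entries with `name` and `arguments`) is an assumption about your stack, and all type and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Any

DISCLAIMER = (
    "This information is not legal advice and should not be relied upon as such"
)


@dataclass(frozen=True)
class AssistantMessage:
    text: str


@dataclass(frozen=True)
class ToolCall:
    name: str
    arguments: dict[str, Any]


@dataclass(frozen=True)
class TurnOutput:
    assistant_message: AssistantMessage
    tool_calls: tuple[ToolCall, ...]


def split_turn(raw: dict[str, Any]) -> TurnOutput:
    """Split an (assumed) mixed response object into the two channels."""
    calls = tuple(
        ToolCall(call["name"], dict(call["arguments"]))
        for call in raw.get("tool_calls", [])
    )
    return TurnOutput(AssistantMessage(raw.get("content") or ""), calls)


def with_disclaimer(message: AssistantMessage) -> AssistantMessage:
    """Append the disclaimer; accepts only the user-visible channel."""
    if not isinstance(message, AssistantMessage):
        raise TypeError("disclaimer may only be applied to an AssistantMessage")
    return AssistantMessage(f"{message.text.rstrip()}\n\n{DISCLAIMER}")
```

The type annotation lets a static checker reject `with_disclaimer(tool_call)` before the code runs, and the `isinstance` check catches it at runtime if it slips through.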
Treat tool-argument review as a guardrail step, not a debug feature. A guardrail that surfaces unexpected text in structured arguments — anything matching the disclaimer pattern, anything longer than a configured threshold, anything containing legal hedging phrases — before the tool fires gives you a chance to catch the leak in production traffic without depending on user reports. The cost of an extra LLM-as-judge call per tool dispatch is rarely the budget killer the team assumes; the cost of a bug that wrote boilerplate into ten thousand calendars is much higher.
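The cheap, deterministic part of that guardrail can be sketched as a review function that returns findings instead of rewriting anything. The threshold value, the pattern set, and the names `review_tool_call` and `iter_strings` are all assumptions for illustration; an LLM-as-judge step, if you use one, would sit behind this.

```python
import re
from typing import Any, Iterator

MAX_FIELD_CHARS = 2000  # assumed threshold; tune per tool and field

# Assumed patterns for the disclaimer and for generic legal hedging.
SUSPECT_PATTERNS = {
    "disclaimer": re.compile(r"not legal advice", re.IGNORECASE),
    "legal_hedge": re.compile(
        r"consult (?:an attorney|a lawyer)|should not be relied upon",
        re.IGNORECASE,
    ),
}


def iter_strings(value: Any, path: str = "") -> Iterator[tuple[str, str]]:
    """Yield (path, text) for every string nested inside the arguments."""
    if isinstance(value, str):
        yield path, value
    elif isinstance(value, dict):
        for key, item in value.items():
            yield from iter_strings(item, f"{path}.{key}" if path else str(key))
    elif isinstance(value, list):
        for index, item in enumerate(value):
            yield from iter_strings(item, f"{path}[{index}]")


def review_tool_call(arguments: dict[str, Any]) -> list[str]:
    """Return findings; an empty list means the call may be dispatched."""
    findings: list[str] = []
    for path, text in iter_strings(arguments):
        if len(text) > MAX_FIELD_CHARS:
            findings.append(f"{path}: too_long")
        for label, pattern in SUSPECT_PATTERNS.items():
            if pattern.search(text):
                findings.append(f"{path}: {label}")
    return findings
```

A non-empty result is the signal to hold the dispatch, log the turn, or escalate, before anything is written to someone else's system.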
Test the prompt against tool-bearing traffic, not just chat traffic. The disclaimer instruction would have failed an eval that asked "given a tool-calling turn, does the model leak the disclaimer into the tool arguments?" Most teams' prompt evals are chat evals — single-turn natural-language responses. If your product surface includes tool calls, your eval surface has to include tool calls. The bug is invisible at the chat-eval layer and obvious at the tool-eval layer.
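A tool-eval for this bug can be very small. The sketch below assumes a hypothetical `run_turn` callable that sends one prompt through your agent and returns a dict with a `tool_calls` list; the marker string and function names are likewise illustrative.

```python
import json
from typing import Any, Callable

DISCLAIMER_MARKER = "not legal advice"  # assumed substring of the disclaimer


def leaks_disclaimer(turn: dict[str, Any]) -> bool:
    """True if any tool call's serialized arguments contain the marker."""
    return any(
        DISCLAIMER_MARKER in json.dumps(call.get("arguments", {})).lower()
        for call in turn.get("tool_calls", [])
    )


def tool_leak_rate(
    run_turn: Callable[[str], dict[str, Any]], prompts: list[str]
) -> float:
    """Fraction of prompts whose turn leaks the disclaimer into tool arguments."""
    if not prompts:
        return 0.0
    leaks = sum(leaks_disclaimer(run_turn(prompt)) for prompt in prompts)
    return leaks / len(prompts)


if __name__ == "__main__":
    # Stand-in for a real agent call, used only to show the harness shape.
    def fake_agent(prompt: str) -> dict[str, Any]:
        return {
            "content": "Done.",
            "tool_calls": [
                {"name": "create_event", "arguments": {"description": prompt}}
            ],
        }

    print(tool_leak_rate(fake_agent, ["Put the contract summary in an invite"]))
```

The disclaimer appearing in `content` is not counted as a leak here; only tool-call arguments are inspected, which is exactly the surface a chat-only eval never looks at.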
The architectural realization
Every instruction in the system prompt is an instruction the model applies to every output channel it controls. The model does not natively distinguish "the answer" from "the action." Both are text it generates in the same turn under the same conditioning. The team that writes a content-formatting rule in the prompt has written a rule the model will apply wherever it produces text, including into the parameters of the tools that take action in the world.
This is not a bug to be patched. It is a consequence of the fact that the system prompt is the wrong layer for any rule whose audience is one specific output channel. The right layer is the code that owns that channel. The disclaimer belongs in the response renderer. The tone rule belongs in the user-message formatter. The "be concise" rule belongs nowhere — it is a content-shape rule masquerading as a style rule, and it will leak.
The compliance posture your team wants is one where the disclaimer is guaranteed on every user-visible response and impossible on every machine-readable surface. The system prompt cannot give you that. A renderer can.
