The prompt injection that survived your sanitizer because the agent read it through a tool
A team I talked to last month had a clean prompt-injection story. Their gateway ran every user message through a classifier. Anything that scored above a threshold got bounced with a polite error. They benchmarked it against a public adversarial set, hit 99.4% block rate, and shipped. Two weeks later, a customer-success ticket revealed that the agent had quietly drafted, approved, and sent an email instructing an internal billing tool to refund a stranger's invoice to a new account. The malicious instruction had never touched the user input. It came in through a Confluence page the agent fetched when the user asked, perfectly innocently, "what does our refund policy say?"
That is the failure mode no input sanitizer catches, and it is now the dominant prompt-injection vector in production agents. The classifier you trained on user prompts never saw the payload, because the payload arrived through a different door. By the time the bytes hit the model, the agent had already labeled them as "context I retrieved to help the user," not "untrusted text from a stranger on the internet." The model treats both with the same compliance instinct, because the model has no concept of trust at all.
