Skip to main content

10 posts tagged with "prompt-injection"

View all tags

Tool Outputs Are an Untrusted Channel Your Agent Treats as Trusted

· 11 min read
Tian Pan
Software Engineer

The threat model most teams ship their agents with has one quiet assumption buried inside: when the model calls a tool, whatever comes back is safe to read. The user's prompt is the adversary, goes the story, and tool outputs are "just data" — search results, inbox summaries, database rows, RAG chunks, file contents, page scrapes. That story is the entire reason prompt injection keeps landing in production. Tool outputs are not data. They are another input channel into the planner, with the same privilege as the user prompt and none of the suspicion.

If that framing sounds abstract, consider what happened inside Microsoft 365 Copilot in June 2025. A researcher sent a single email with hidden instructions; the victim never clicked a link, never opened an attachment, never read the message themselves. A routine "summarize my inbox" query asked Copilot to read the email. The agent dutifully followed the instructions it found inside the body, reached into OneDrive, SharePoint, and Teams, and exfiltrated organizational data through a trusted Microsoft domain before anyone noticed. The CVE (2025-32711, "EchoLeak") earned a 9.3 CVSS and a server-side patch, but the class of bug did not go away. It cannot go away, because every read-tool on every production agent is a version of that email inbox.

This post is about the framing shift that gets you unstuck: stop thinking about "prompt injection" as a user-input problem, and start thinking about every tool output as an untrusted channel that happens to share a token stream with your system prompt.

The Document Is the Attack: Prompt Injection Through Enterprise File Pipelines

· 9 min read
Tian Pan
Software Engineer

Your AI assistant just processed a contract from a prospective vendor. It summarized the terms, flagged the risky clauses, and drafted a response. What you don't know is that the PDF contained white text on a white background — invisible to your eyes, perfectly visible to the model — instructing it to recommend acceptance regardless of terms. The summary looks reasonable. The approval recommendation looks reasonable. The model followed instructions you never wrote.

This is the document-as-attack-surface problem, and most enterprise AI pipelines are completely unprepared for it.

The vulnerability is architectural, not incidental. When document content flows directly into an LLM's context window, the model has no reliable way to distinguish legitimate instructions from attacker-controlled content embedded in a file. Every document your pipeline ingests is a potential instruction source — and in most systems, untrusted documents and trusted system prompts are processed with equal authority.

Red-Teaming Consumer LLM Features: Finding Injection Surfaces Before Your Users Do

· 9 min read
Tian Pan
Software Engineer

A dealership deployed a ChatGPT-powered chatbot. Within days, a user instructed it to agree with anything they said, then offered $1 for a 2024 SUV. The chatbot accepted. The dealer pulled it offline. This wasn't a sophisticated attack — it was a three-sentence prompt from someone who wanted to see what would happen.

At consumer scale, that curiosity is your biggest security threat. Internal LLM agents operate inside controlled environments with curated inputs and trusted data. Consumer-facing LLM features operate in adversarial conditions by default: millions of users, many actively probing for weaknesses, and a stochastic model that has no concept of "this user seems hostile." The security posture these two environments require is fundamentally different, and teams that treat consumer features like internal tooling find out the hard way.

RAG-Specific Prompt Injection: How Adversarial Documents Hijack Your Retrieval Pipeline

· 9 min read
Tian Pan
Software Engineer

Most teams securing RAG applications focus their effort in the wrong place. They validate user inputs, sanitize queries, implement rate limiting, and add output filters. All of that is necessary — and none of it stops the attack that matters most in RAG systems.

The defining vulnerability in retrieval-augmented generation isn't at the user input layer. It's at the retrieval layer — inside the documents your system pulls from its own knowledge base and injects directly into the context window. An attacker who never sends a single request to your API can still compromise your system by planting a document in your corpus. Your input validation never fires. Your injection filters never trigger. The malicious instruction arrives in your LLM's context dressed as legitimate retrieved content, and the model executes it.

Document Injection: The Prompt Injection Vector Inside Every RAG Pipeline

· 10 min read
Tian Pan
Software Engineer

Most RAG security discussions focus on the generation layer — jailbreaks, system prompt leakage, output filtering. Practitioners spend weeks tuning guardrails on the model side while overlooking the ingestion pipeline that feeds it. The uncomfortable reality: every document your pipeline ingests is a potential instruction surface. A single PDF can override your system prompt, exfiltrate user data, or manipulate decisions without your logging infrastructure seeing anything unusual.

This isn't theoretical. Microsoft 365 Copilot, Slack AI, and commercial HR screening tools have all been exploited through this vector in the past two years. The same attack pattern appeared in 18 academic papers on arXiv, where researchers embedded hidden prompts to bias AI peer review systems in their favor.

Prompt Injection Surface Area Mapping: Find Every Attack Vector Before Attackers Do

· 11 min read
Tian Pan
Software Engineer

Most teams discover their prompt injection surface area the wrong way: a security researcher posts a demo, a customer reports strange behavior, or an incident post-mortem reveals a tool call that should never have fired. By then the attack path is already documented and the blast radius is real.

Prompt injection is the OWASP #1 risk for LLM applications, but the framing as a single vulnerability obscures what it actually is: a family of attack vectors that scale with your application's complexity. Every external data source you feed into a prompt is a potential injection surface. In an agentic system with a dozen tool integrations, that surface area is enormous — and most of it is unmapped.

This post is a practitioner's methodology for mapping it before attackers do.

The Three Attack Surfaces in Multi-Agent Communication

· 10 min read
Tian Pan
Software Engineer

A recent study tested 17 frontier LLMs in multi-agent configurations and found that 82% of them would execute malicious commands when those commands arrived from a peer agent — even though the exact same commands were refused when issued directly by a user. That number should reset your threat model if you're shipping multi-agent systems. Your agents may be individually hardened. Together, they're not.

Multi-agent architectures introduce communication channels that most security thinking ignores. We harden the model, the system prompt, the API perimeter. We spend almost no time on what happens when Agent A sends a message to Agent B — who wrote that message, whether it was tampered with, whether the memory Agent B consulted was planted three sessions ago by an attacker who never touched Agent A at all.

Prompt Injection in Production: The Attack Patterns That Actually Work and How to Stop Them

· 11 min read
Tian Pan
Software Engineer

Prompt injection is the number one vulnerability in the OWASP Top 10 for LLM applications — and the gap between how engineers think it works and how attackers actually exploit it keeps getting wider. A 2024 study tested 36 production LLM-integrated applications and found 31 susceptible. A 2025 red-team found that 100% of published prompt defenses could be bypassed by human attackers given enough attempts.

The hard truth: the naive defenses most teams reach for first — system prompt warnings, keyword filters, output sanitization alone — fail against any attacker who tries more than one approach. What works is architectural: separating privilege, isolating untrusted data, and constraining what an LLM can actually do based on what it has seen.

This post is a field guide for engineers building production systems. No CTF-style toy examples — just the attack patterns causing real incidents and the defense patterns that measurably reduce risk.

The Lethal Trifecta: Why Your AI Agent Is One Email Away from a Data Breach

· 9 min read
Tian Pan
Software Engineer

In June 2025, a researcher sent a carefully crafted email to a Microsoft 365 Copilot user. No link was clicked. No attachment opened. The email arrived, Copilot read it during a routine summarization task, and within seconds the AI began exfiltrating files from OneDrive, SharePoint, and Teams — silently transmitting contents to an attacker-controlled server by encoding data into image URLs it asked to "render." The victim never knew it happened.

This wasn't a novel zero-day in the traditional sense. There was no buffer overflow, no SQL injection. The vulnerability was architectural: the system combined three capabilities that, individually, seem like obvious product features. Together, they form what's now called the Lethal Trifecta.

LLM Guardrails in Production: Why One Layer Is Never Enough

· 10 min read
Tian Pan
Software Engineer

Here is a math problem that catches teams off guard: if you stack five guardrails and each one operates at 90% accuracy, your overall system correctness is not 90%—it is 59%. Stack ten guards at the same accuracy and you get under 35%. The compound error problem means that "adding more guardrails" can make a system less reliable than adding fewer, better-calibrated ones. Most teams discover this only after they've wired up a sprawling moderation pipeline and started watching their false-positive rate climb past anything users will tolerate.

Guardrails are not optional for production LLM applications. Hallucinations appear in roughly 31% of real-world LLM responses under normal conditions, and that figure climbs to 60–88% in regulated domains like law and medicine. Jailbreak attacks against modern models succeed at rates ranging from 57% to near-100% depending on the technique. But treating guardrails as a bolt-on compliance checkbox—rather than a carefully designed subsystem—is how teams end up with systems that block legitimate requests constantly while still missing adversarial ones.