Skip to main content

19 posts tagged with "prompt-injection"

View all tags

RAG-Specific Prompt Injection: How Adversarial Documents Hijack Your Retrieval Pipeline

· 9 min read
Tian Pan
Software Engineer

Most teams securing RAG applications focus their effort in the wrong place. They validate user inputs, sanitize queries, implement rate limiting, and add output filters. All of that is necessary — and none of it stops the attack that matters most in RAG systems.

The defining vulnerability in retrieval-augmented generation isn't at the user input layer. It's at the retrieval layer — inside the documents your system pulls from its own knowledge base and injects directly into the context window. An attacker who never sends a single request to your API can still compromise your system by planting a document in your corpus. Your input validation never fires. Your injection filters never trigger. The malicious instruction arrives in your LLM's context dressed as legitimate retrieved content, and the model executes it.

Document Injection: The Prompt Injection Vector Inside Every RAG Pipeline

· 10 min read
Tian Pan
Software Engineer

Most RAG security discussions focus on the generation layer — jailbreaks, system prompt leakage, output filtering. Practitioners spend weeks tuning guardrails on the model side while overlooking the ingestion pipeline that feeds it. The uncomfortable reality: every document your pipeline ingests is a potential instruction surface. A single PDF can override your system prompt, exfiltrate user data, or manipulate decisions without your logging infrastructure seeing anything unusual.

This isn't theoretical. Microsoft 365 Copilot, Slack AI, and commercial HR screening tools have all been exploited through this vector in the past two years. The same attack pattern appeared in 18 academic papers on arXiv, where researchers embedded hidden prompts to bias AI peer review systems in their favor.

Prompt Injection Surface Area Mapping: Find Every Attack Vector Before Attackers Do

· 11 min read
Tian Pan
Software Engineer

Most teams discover their prompt injection surface area the wrong way: a security researcher posts a demo, a customer reports strange behavior, or an incident post-mortem reveals a tool call that should never have fired. By then the attack path is already documented and the blast radius is real.

Prompt injection is the OWASP #1 risk for LLM applications, but the framing as a single vulnerability obscures what it actually is: a family of attack vectors that scale with your application's complexity. Every external data source you feed into a prompt is a potential injection surface. In an agentic system with a dozen tool integrations, that surface area is enormous — and most of it is unmapped.

This post is a practitioner's methodology for mapping it before attackers do.

The Three Attack Surfaces in Multi-Agent Communication

· 10 min read
Tian Pan
Software Engineer

A recent study tested 17 frontier LLMs in multi-agent configurations and found that 82% of them would execute malicious commands when those commands arrived from a peer agent — even though the exact same commands were refused when issued directly by a user. That number should reset your threat model if you're shipping multi-agent systems. Your agents may be individually hardened. Together, they're not.

Multi-agent architectures introduce communication channels that most security thinking ignores. We harden the model, the system prompt, the API perimeter. We spend almost no time on what happens when Agent A sends a message to Agent B — who wrote that message, whether it was tampered with, whether the memory Agent B consulted was planted three sessions ago by an attacker who never touched Agent A at all.

Prompt Injection in Production: The Attack Patterns That Actually Work and How to Stop Them

· 11 min read
Tian Pan
Software Engineer

Prompt injection is the number one vulnerability in the OWASP Top 10 for LLM applications — and the gap between how engineers think it works and how attackers actually exploit it keeps getting wider. A 2024 study tested 36 production LLM-integrated applications and found 31 susceptible. A 2025 red-team found that 100% of published prompt defenses could be bypassed by human attackers given enough attempts.

The hard truth: the naive defenses most teams reach for first — system prompt warnings, keyword filters, output sanitization alone — fail against any attacker who tries more than one approach. What works is architectural: separating privilege, isolating untrusted data, and constraining what an LLM can actually do based on what it has seen.

This post is a field guide for engineers building production systems. No CTF-style toy examples — just the attack patterns causing real incidents and the defense patterns that measurably reduce risk.

The Lethal Trifecta: Why Your AI Agent Is One Email Away from a Data Breach

· 9 min read
Tian Pan
Software Engineer

In June 2025, a researcher sent a carefully crafted email to a Microsoft 365 Copilot user. No link was clicked. No attachment opened. The email arrived, Copilot read it during a routine summarization task, and within seconds the AI began exfiltrating files from OneDrive, SharePoint, and Teams — silently transmitting contents to an attacker-controlled server by encoding data into image URLs it asked to "render." The victim never knew it happened.

This wasn't a novel zero-day in the traditional sense. There was no buffer overflow, no SQL injection. The vulnerability was architectural: the system combined three capabilities that, individually, seem like obvious product features. Together, they form what's now called the Lethal Trifecta.

LLM Guardrails in Production: Why One Layer Is Never Enough

· 10 min read
Tian Pan
Software Engineer

Here is a math problem that catches teams off guard: if you stack five guardrails and each one operates at 90% accuracy, your overall system correctness is not 90%—it is 59%. Stack ten guards at the same accuracy and you get under 35%. The compound error problem means that "adding more guardrails" can make a system less reliable than adding fewer, better-calibrated ones. Most teams discover this only after they've wired up a sprawling moderation pipeline and started watching their false-positive rate climb past anything users will tolerate.

Guardrails are not optional for production LLM applications. Hallucinations appear in roughly 31% of real-world LLM responses under normal conditions, and that figure climbs to 60–88% in regulated domains like law and medicine. Jailbreak attacks against modern models succeed at rates ranging from 57% to near-100% depending on the technique. But treating guardrails as a bolt-on compliance checkbox—rather than a carefully designed subsystem—is how teams end up with systems that block legitimate requests constantly while still missing adversarial ones.