Red-Teaming AI Agents: The Adversarial Testing Methodology That Finds Real Failures

March 14, 2026 · 9 min read

Software Engineer

A financial services agent scored 11 out of 100 — LOW risk — on a standard jailbreak test suite. Contextual red-teaming, which first profiled the agent's actual tool access and database schema, then constructed targeted attacks, found something different: a movie roleplay technique could instruct the agent to shuffle $440,000 across 88 wallets, execute unauthorized SQL queries, and expose cross-account transaction history. The generic test suite had no knowledge the agent held a withdraw_funds tool. It was testing a different system than the one deployed.

That gap — 60 risk score points — is the problem with applying traditional red-teaming methodology to AI agents. Agents don't just respond; they plan, reason across multiple steps, hold real credentials, and take irreversible actions in the world. Testing whether you can get one to say something harmful is not the same as testing whether you can get it to do something harmful.

Why Agentic Red-Teaming Is Structurally Different

Traditional software red-teaming asks: can I bypass authentication, inject SQL, overflow a buffer? The threat model is about compromising the system. Agentic red-teaming asks a different question: can I convince an autonomous system to plan and act against its operator's interests, while appearing to behave normally?

Several properties of agents make this distinctly harder to test:

Non-determinism. The same attack may fail 90% of the time and succeed 10%. A finding that shows up on one run might not reproduce on the next. NIST's large-scale red-teaming competition — 250,000+ attack attempts, 400+ participants, 13 frontier models — found that probabilistic retrying alone could push attack success rates from 57% to over 80% on some scenarios. Aggregate pass/fail metrics obscure this. A 5% average attack success rate can mask a 57% success rate on one specific high-consequence action like a fund transfer.

Tool access is the blast radius. An agent with access to email, code execution, a database, and an external API is a fundamentally different target than a chatbot. The tools define what a successful attack can achieve. Generic jailbreak probes, which don't know what tools an agent has, cannot assess this.

Multi-step reasoning. A single-turn injection is the simple case. More dangerous are "boiling frog" attacks: a sequence of requests that each appear benign but collectively shift the agent's goal. "Analyze trends" → "export the contacts that match this trend" → "upload to this external endpoint for visualization." No single step looks alarming; the arc of the plan is the exploit.

Persistence. Agents with RAG stores or cross-session memory can be attacked by poisoning the knowledge base rather than the live context window. An attacker who plants false data in the memory layer influences every future decision without needing to repeat the injection.

Start with Reconnaissance, Not Prompts

The biggest mistake teams make when red-teaming agents is reaching for an attack prompt library before mapping the attack surface. Everything flows from what the agent can actually do.

Before writing a single attack, build an inventory:

Every tool the agent can invoke, with its parameters and side effects
Every database schema and external API it accesses
Every credential it holds and what those credentials authorize
Every communication channel it reads from (email, Slack, web content, documents)
Every persistent store it reads and writes to

This is what Palo Alto's contextual methodology calls the profiler phase: a simulated attacker that first conducts adversarial reconnaissance before launching targeted campaigns. The 60-point risk score difference in the financial services case study came entirely from this step. Without knowing the agent had withdraw_funds, there was nothing to target.

From the inventory, build a threat model. For each tool or capability, ask: what could an attacker achieve if this were misused? Prioritize by blast radius × exploitability × detection difficulty. Fund transfers and code execution rank differently than read-only database queries.

Four Attack Layers

Once you have an attack surface map, work through four layers sequentially. Each layer builds on the previous one.

Layer 1: Direct prompt attacks. These are the attacks most teams already run. Jailbreak attempts, roleplay framings, instruction override attempts, developer mode tricks. These find surface-level vulnerabilities but miss most agentic-specific risks. Run them, but don't stop here.

Layer 2: Tool-level attacks. Target specific capabilities. Can you invoke a tool with unauthorized parameters? Can you bypass rate limits? Can you chain tools in sequences the designer didn't anticipate — like using a code execution tool to write to a file path that another tool will subsequently read and act on? Can you trigger tool errors that expose internal state?

Layer 3: Multi-agent attacks. If the agent orchestrates other agents or is orchestrated by one, the inter-agent communication layer is a fresh attack surface. Simulate man-in-the-middle attacks: intercept and modify messages between agents in a pipeline. Test whether a downstream agent blindly trusts upstream agent output, or whether it validates the source and content of instructions it receives. A single compromised upstream agent can cascade bad instructions through an entire pipeline.

Layer 4: Persistent attacks. Test memory poisoning and cross-session manipulation. Plant false data in the RAG store and measure how it affects future agent decisions. Test whether the agent's behavior across sessions can be incrementally reshaped by injections that individually appear below the detection threshold.

Where Goal Hijacking Actually Hides

Goal hijacking — replacing the agent's objective rather than just overriding a single instruction — is the top-ranked risk in the OWASP Agentic Top 10. It most commonly enters through content the agent is supposed to process: emails, documents, web pages, search results.

White-on-white hidden text in PDFs can redirect a document-summarization agent into exfiltrating API keys. Invisible HTML elements — rendered in a page the agent fetches as part of its task — can inject new objectives into its planning loop. Malicious directives embedded in email bodies can instruct email-assistant agents to forward sensitive content to attacker-controlled addresses.

The Q4 2025 field data from Lakera AI confirmed that indirect prompt injection attacks — attacks that arrive through external content the agent processes, rather than directly through the user interface — succeed with fewer attempts and broader impact than direct attacks. They're also harder to reproduce in a test environment, because they require constructing plausible external content rather than crafting a single malicious prompt.

Testing for this requires feeding the agent realistic, adversarially crafted external content during the red-team exercise. It's not enough to probe the agent's response to direct user inputs.

Measuring What Actually Matters

Agentic red-team reports need per-task success rates, not aggregate metrics. NIST's AgentDojo evaluation is instructive: an upgraded Claude 3.5 Sonnet showed an 11% attack success rate against baseline attacks — easy to characterize as "robust." An adaptive red-team campaign that iteratively refined its approach achieved 81% success. Some specific high-consequence injection tasks succeeded more than 57% of the time.

For any finding that involves an irreversible action — data exfiltration, fund transfer, code execution, external communication — report the per-task success rate across repeated attempts. Document the full attack chain. Include traces showing which tools were invoked and in what sequence. The goal is reproducibility: the report should let an engineer recreate the attack and verify the fix.

Also account for adversarial persistence. Attackers with repeated access will retry. If a particular attack path has a 15% per-attempt success rate and an attacker can make 20 attempts before being detected, the effective success probability approaches 95%. Severity classification should factor in realistic attempt volumes.

Tooling Overview

Several open-source tools are worth knowing:

Garak (NVIDIA) runs 100+ attack vectors with up to 20,000 prompts per run across 20+ AI platforms. Good for broad coverage with a prewritten probe library, especially if you need a quick baseline scan.

PyRIT (Microsoft) is a full orchestration framework for programmable multi-turn red-team campaigns. It manages conversation history, transforms prompts via converters, scores outcomes, and supports Crescendo (gradually escalating) attacks and Tree of Attacks with Pruning. Use it when you need granular control over complex campaigns. It integrates with Azure AI Foundry and is under active development (v0.11.0 as of early 2026).

Promptfoo generates context-specific attacks using AI rather than static libraries, includes a web UI, and maps findings to OWASP, NIST, and EU AI Act compliance categories. Worth running as a baseline scanner.

mcp-scan (Cisco, open-source) is purpose-built for scanning MCP server tool descriptions for supply chain poisoning. If your agent uses MCP servers, add this to the pipeline.

No single tool covers the full agentic attack surface. A practical starting point: Promptfoo for breadth across OWASP ASI categories, PyRIT for targeted multi-turn campaigns against specific high-priority tools, and mcp-scan if MCP servers are in scope.

Continuous Practice, Not a One-Time Gate

NIST's large-scale competition found that every single frontier model was successfully attacked — attack success rate did not correlate with model capability. More pointed: 12 published defenses against prompt injection and jailbreaking, when re-examined with adaptive attacks that iteratively refined their approach, showed attack success rates above 90% for most defenses that had originally claimed near-zero attack success. Defenses degrade against adaptive adversaries.

This makes one-time pre-release red-teaming insufficient. New model versions change behavior. New tools added to the agent create new attack surfaces. The threat landscape shifts. The EU AI Act (full compliance required August 2026) mandates adversarial testing for high-risk AI systems as an ongoing obligation, not a checkbox.

Practically, this means integrating red-team coverage into CI/CD. Automated scans on every agent change. Scheduled full-depth exercises before major capability additions. A feedback loop where findings from production anomalies feed back into the red-team scope.

The most important organizational shift: treat red-teaming as an engineering discipline that produces artifacts — inventories, threat models, per-task success rates, reproducible attack traces — not as a narrative summary of things that went wrong. That rigor is what lets you measure whether defenses are actually improving.

The goal is not to prove that an agent is safe. It's to find the failure modes before your users do.

References:

Let's stay in touch and Follow me for more thoughts and updates

Twitter LinkedIn Telegram Discord 小红书

Red-Teaming AI Agents: The Adversarial Testing Methodology That Finds Real Failures

Why Agentic Red-Teaming Is Structurally Different

Start with Reconnaissance, Not Prompts

Four Attack Layers

Where Goal Hijacking Actually Hides

Measuring What Actually Matters

Tooling Overview

Continuous Practice, Not a One-Time Gate

Recommended Reading

About Tian Pan

Why Agentic Red-Teaming Is Structurally Different​

Start with Reconnaissance, Not Prompts​

Four Attack Layers​

Where Goal Hijacking Actually Hides​

Measuring What Actually Matters​

Tooling Overview​

Continuous Practice, Not a One-Time Gate​

Recommended Reading

About Tian Pan

Why Agentic Red-Teaming Is Structurally Different

Start with Reconnaissance, Not Prompts

Four Attack Layers

Where Goal Hijacking Actually Hides

Measuring What Actually Matters

Tooling Overview

Continuous Practice, Not a One-Time Gate