Skip to main content

10 posts tagged with "llm-security"

View all tags

When Your Forbidden List Becomes a Recipe: The Hidden Cost of Negative Examples in Prompts

· 10 min read
Tian Pan
Software Engineer

Open a mature production system prompt and search for the word "not." On a feature that has shipped through three quarters and survived a handful of incidents, you will almost always find a section that looks like a list of regrets — "do not give medical advice, do not generate code matching these patterns, do not produce content with this regex, do not impersonate these competitors, do not use these phrases." Each line traces back to a specific incident. Each line was added with confidence by an engineer who said "this will fix it." And the list grows, every quarter, in the same way a museum acquires exhibits.

What very few teams will admit out loud is that this list — the prompt's most defensive, most carefully reviewed section — is also the most useful artifact in the entire feature for the wrong reader. A determined user who extracts the system prompt now has a curated, organized, model-readable inventory of every behavior the team is afraid of. The forbidden list is a recipe. The team wrote the cookbook.

The Prompt-Injection Bug Bounty: Scoping a Program When 'Broken' Has No Clear Definition

· 12 min read
Tian Pan
Software Engineer

Your security team runs a bug bounty that works. A CSRF gets paid. An XSS gets paid. An IDOR gets paid. The rules of engagement are sharp, the severity rubric is industry-standard, the triage queue moves, and the program produces a steady stream of fixed bugs. Then your AI team ships a feature last quarter — a chat surface, an agent that calls tools, a RAG pipeline that pulls from customer data — and the question that lands on the security team's desk is "what's the bounty scope for this thing?" Nobody can answer.

The reason nobody can answer is that the standard bug bounty rubric was built around a system whose specified behavior is deterministic. A login endpoint either authenticates correctly or it doesn't. An access control check either holds or it doesn't. The AI feature you just shipped has no equivalent ground truth: its specified behavior is "respond helpfully to user input," and a researcher who makes it respond unhelpfully has not necessarily found a bug — they may have found something the model has always done, that nobody knew about, that you're not sure you can fix, and that may or may not reproduce on a second attempt.

Tool Composition Sandbox Escape: When Three Safe Tools Compose Into Data Exfiltration

· 10 min read
Tian Pan
Software Engineer

The security review approved each of the three tools individually. Read-only access to the customer database was rated low risk because the agent could see records but not modify them. Send-email-to-self was rated low risk because the recipient was hardcoded to a service-account inbox the agent was already authorized to write to. Template-render was rated low risk because it was a deterministic Jinja-style transform with no I/O. Three weeks after launch, the data-loss-prevention dashboard flagged customer PII showing up in a Slack channel that two hundred employees could read, and the post-mortem traced the leak to the agent composing the three tools into a chain that no single ACL had granted: read a customer record, render it through a template, email the result to its own service account whose inbox auto-forwarded into the channel.

No single tool was misused. No prompt injection bypassed any check. The agent did exactly what its tool catalog said it could do, and the composition produced a capability the security review had never been asked to evaluate.

The Bypass Vocabulary: When Users Learn to Jailbreak in Polite English

· 9 min read
Tian Pan
Software Engineer

The cheapest jailbreak in your production traffic isn't a clever Unicode trick or a chained adversarial suffix. It's three additional words a user typed after their first request got refused. They added "just hypothetically." They added "for a research paper." They added "for a fictional story I'm writing." The model complied. They told a friend. The friend posted a TikTok. By the end of the month, a non-trivial slice of your refusal-blocked traffic is being routed around with English so polite that none of your prompt-injection filters fire.

This is the failure mode the security team didn't put on the threat model. The threat model assumed adversaries were sophisticated, motivated, and technical. The actual adversary is a curious user who saw a screenshot. The vocabulary they're using doesn't show up in any public jailbreak corpus because by the time it hits a paper, the live distribution has moved on.

What Your Fine-Tuned LLM Is Leaking About Its Training Data

· 10 min read
Tian Pan
Software Engineer

When a team fine-tunes an LLM on customer support tickets, internal Slack exports, or proprietary code, the instinct is to treat data ingestion as a one-way door: data goes in, a better model comes out. That's not how it works. A researcher with API access and $200 can systematically pull verbatim text back out, often including content the model was never supposed to surface. This isn't a theoretical edge case — it's a documented attack pattern that has been demonstrated against production systems including one of the world's most widely deployed language models.

The core problem is that fine-tuned models are fundamentally different from base models in their privacy posture. They've been trained on smaller, more distinctive datasets where individual examples are far more distinguishable from background model behavior. That distinctiveness is exactly what attackers exploit.

Output As Payload: Your AI Threat Model Got Half The Boundary

· 9 min read
Tian Pan
Software Engineer

The threat model your team wrote for AI features almost certainly stops at the model. Inputs are untrusted: prompt injection, jailbreaks, adversarial uploads, poisoned retrieval. Outputs are content: things to moderate for safety, score on a refusal eval, ship to the user. The shape of that threat model is roughly "untrusted thing goes in, model thinks, safe thing comes out."

The new attack class flips that polarity. The model's output is rendered, parsed, executed, or relayed by a downstream system, and an attacker who can shape that output — through indirect prompt injection in retrieval, training-data influence, or socially engineered user queries — can deliver a payload to a target the model never had direct access to. The model becomes a confused deputy with reach the attacker doesn't have, and the boundary your team is defending is two systems too early.

EchoLeak is the canonical 2025 example. A single crafted email arrives in a Microsoft 365 mailbox. Copilot ingests it as part of routine context. The hidden instructions cause Copilot to embed sensitive context into a reference-style markdown link in its response, and the client interface auto-fetches the external image — exfiltrating chat logs, OneDrive content, and Teams messages without a single user click. Microsoft's input-side classifier was bypassed because the attack didn't need to break the model's refusal calibration. It needed to shape one specific token sequence in the output.

Your System Prompt Will Leak: Designing for Prompt Extraction

· 10 min read
Tian Pan
Software Engineer

The threat model for LLM features over-indexes on three failure modes: prompt injection, user-data exfiltration, and unauthorized tool calls. There is a quieter attack that lands more often, costs less to mount, and shows up in fewer postmortems because nobody filed one — prompt extraction. An adversarial user, sometimes a competitor, sometimes a curious researcher, walks the model into reciting its own system prompt over a handful of turns. The carefully tuned instructions that encode your team's product behavior, refusal policy, retrieval scaffolding, and brand voice land in a public GitHub repository within the week.

The repositories already exist. A widely-circulated GitHub project tracks extracted system prompts from Claude, ChatGPT, Gemini, Grok, Perplexity, Cursor, and v0.dev — updated as new model versions ship, often within hours of release. Anthropic's full Claude prompt clocks in at over 24,000 tokens including tools, and you can read it. The companies most invested in prompt secrecy are the ones whose prompts leak most reliably, because they are also the ones whose attackers are most motivated.

The AI Observability Leak: Your Tracing Stack Is a Data Exfiltration Surface

· 11 min read
Tian Pan
Software Engineer

A security team I talked to recently found that their prompt and response fields were being shipped, in full, to a third-party SaaS logging backend they had never signed a Data Processing Agreement with. The fields contained customer medical summaries, Stripe secret keys accidentally pasted by support agents, and the full text of a confidential acquisition memo that someone had asked an internal assistant to summarize. Nothing was encrypted in the payload. Nothing was redacted. The retention was 400 days. The integration was set up during a hackathon by a well-meaning engineer who pip install-ed the vendor's SDK, dropped in an API key, and shipped.

This is the AI observability leak. Every LLM app team ends up wanting tracing — you cannot debug prompt regressions or non-deterministic agent loops without it — so one of LangSmith, Langfuse, Helicone, Phoenix, Braintrust, or a vendor AI add-on ends up in the stack. The default setup captures the entire request and response. That default is, for most production workloads, a compliance violation waiting to be discovered.

Free Tier Abuse Economics: When Your AI Generosity Gets Ratio'd by Bots

· 10 min read
Tian Pan
Software Engineer

A startup CTO checked their OpenAI dashboard one morning and found a $67,000 invoice. Their normal monthly bill was $400. Nothing in their product had changed — no viral launch, no new feature, no marketing push. What had changed is that an attacker fingerprinted their endpoint, harvested a leaked key from a build artifact, and resold the inference at 40-60% below retail to buyers who paid in crypto. The startup paid the bill while the attacker pocketed the spread.

This is not the typical free tier abuse story SaaS founders tell each other. The typical story goes: a few power users abuse generous trials, churn rates spike, you tighten the limits, and unit economics recover within a quarter. That playbook is dead for AI products. The math broke when your unit cost per anonymous request stopped being effectively zero, and the abuse playbook scaled the moment your generosity could be liquidated for cash.

Prompt Injection in Production: The Attack Patterns That Actually Work and How to Stop Them

· 11 min read
Tian Pan
Software Engineer

Prompt injection is the number one vulnerability in the OWASP Top 10 for LLM applications — and the gap between how engineers think it works and how attackers actually exploit it keeps getting wider. A 2024 study tested 36 production LLM-integrated applications and found 31 susceptible. A 2025 red-team found that 100% of published prompt defenses could be bypassed by human attackers given enough attempts.

The hard truth: the naive defenses most teams reach for first — system prompt warnings, keyword filters, output sanitization alone — fail against any attacker who tries more than one approach. What works is architectural: separating privilege, isolating untrusted data, and constraining what an LLM can actually do based on what it has seen.

This post is a field guide for engineers building production systems. No CTF-style toy examples — just the attack patterns causing real incidents and the defense patterns that measurably reduce risk.