
The Debug Logger That Put Your System Prompt in a Customer-Readable Audit Feed

· 10 min read
Tian Pan
Software Engineer

A security-conscious customer pulled their tenant's audit export, opened the JSON, and read the verbatim refusal policy, retrieval pipeline structure, and a handful of internal product identifiers from a field called llm.request.system. No exploit. No prompt injection. No jailbreak. Just a log field your platform team added six months earlier so engineers could correlate prompt versions with incidents — surfaced through a feed your enterprise team had separately opened to tenants for SOC 2 reasons.

The disclosure happened during a normal Wednesday afternoon. Your security team got paged by the customer, not by an alert. The incident timeline doesn't show a deploy on the day of the leak — the misconfiguration shipped on the day the audit feed expanded its field allowlist, which was a different team, a different sprint, and a different ticket. Both reviewers signed off on what they were looking at. Neither was looking at the composition.

This is the failure mode that prompt-extraction research keeps treating as an adversarial problem when it is increasingly a configuration problem. Recent studies show system prompt extraction succeeding around 60% of the time across enterprise AI assessments, and the published mitigations are still framed as "harden the model against the user." But the leaks that show up in postmortems aren't from cleverly-worded jailbreaks. They are from the system prompt being written to a place your access-control model treats as ordinary metadata.

The seam nobody owns

A debug field's lifecycle has at least three handoffs that look uncontroversial in isolation.

The platform team adds llm.request.system to a structured log to support a real engineering need: correlating an in-flight prompt version with an incident. This is a defensible decision. Without it, you cannot answer "which version of the system prompt produced this answer" during an outage, and you cannot retire a bad prompt with confidence.

The observability team ingests the field into the trace store the same way it ingests every other field — by type. It is a string. Strings have a configured redactor that scrubs email addresses, credit card patterns, and anything matching a few PII regexes. The redactor does not recognize a system prompt because a system prompt does not look like PII. It looks like product copy.
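To make the miss concrete, here is a minimal sketch of a shape-based redactor. The two patterns and the `redact` helper are illustrative assumptions, not any particular vendor's rule set. The point is only that a system prompt gives a PII pattern nothing to match.

```python
import re

# Hypothetical PII patterns; a production redactor would carry many more.
PII_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),   # email addresses
    re.compile(r"\b(?:\d[ -]?){13,16}\b"),     # credit-card-like digit runs
]

def redact(value: str) -> str:
    """Replace anything that looks like PII; leave everything else untouched."""
    for pattern in PII_PATTERNS:
        value = pattern.sub("[REDACTED]", value)
    return value

# Made-up sample values for illustration.
user_message = "Contact me at jane@example.com about card 4111 1111 1111 1111."
system_prompt = "You are the support assistant. Refuse questions about pricing experiments."

redacted_user = redact(user_message)      # email and card number are scrubbed
redacted_prompt = redact(system_prompt)   # no pattern fires, so nothing changes
```

The user message comes back scrubbed. The system prompt comes back unchanged, because nothing in it has the shape the redactor was written to find.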

The enterprise team, working from a different roadmap, extends the customer-visible audit feed to surface request-level fields so tenants can satisfy their own auditors. The allowlist is built by the product manager who knows which fields the customer is asking for. They include llm.request.system because tenants who run multi-model evaluations want to see which prompt their queries hit. The PM treats it like a debug field. The customer's auditor treats it like documented vendor behavior. Both are correct from their respective vantage points.

None of those three reviewers had the full picture. The platform team thought of the log as engineer-visible. The observability team thought sensitivity meant PII. The enterprise team thought the feed was scoped to a tenant. None of them named who owned the question "is the system prompt sensitive across all of these surfaces?" because that question crossed every team's edges.

The OWASP LLM Top 10 entry on system prompt leakage explicitly tells you to treat the system prompt as potentially public, and to never rely on it as a security control. That is good advice for what the prompt contains. It is not advice that helps you when the question is whether the prompt itself is the disclosure. A refusal policy isn't sensitive because it hides a credential. It is sensitive because it is product surface area — the encoded shape of how your assistant says no, what tools it has, what topics it routes around, and what your retrieval pipeline considers retrievable. That is competitive material. Sometimes it is regulatory material.

Why the redactor missed it

A PII redactor is built on the assumption that sensitive data has a recognizable surface form. Phone numbers have shapes. Emails have shapes. Even API keys and JWTs have recognizable prefixes and entropy distributions. The reason regex- and ML-based redactors achieve any precision at all is that they exploit those structural priors.

A system prompt has the structural prior of "an instruction in your product's voice." It is indistinguishable from documentation, from a marketing FAQ, from a support macro. The redactor sees something that looks like content and waves it through. The Safe Observability research from the OpenTelemetry community has been pushing for hierarchical, residency-aware classification precisely because the regex era cannot solve this — sensitivity is a property of the field's provenance, not the field's shape.

This is why field-typed sensitivity tags are the closest mechanism to a real fix. The platform team that creates llm.request.system should be required to attach a classification at field-definition time, not at log-line time. The classification should be confidential-product or model-attribution-only or whatever your taxonomy is — but the absence of a tag should be a default-deny rather than a default-allow.
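A minimal sketch of what definition-time tagging with default-deny could look like. The tier names, the `FIELD_TAGS` registry, and the `may_surface` helper are all hypothetical; your taxonomy will differ. What matters is that a field with no entry resolves to the most restrictive tier rather than the least.

```python
from enum import IntEnum

class Sensitivity(IntEnum):
    # Hypothetical taxonomy, ordered from least to most restrictive.
    PUBLIC = 0
    TENANT_VISIBLE = 1
    CONFIDENTIAL_PRODUCT = 2
    UNCLASSIFIED = 3  # no tag on file: treated as the most restrictive tier

# Registry populated when a field is defined, not when a log line is written.
FIELD_TAGS = {
    "http.status_code": Sensitivity.TENANT_VISIBLE,
    "llm.request.system": Sensitivity.CONFIDENTIAL_PRODUCT,
}

def classify(field_name: str) -> Sensitivity:
    """Look up a field's tag; a missing tag is default-deny, not default-allow."""
    return FIELD_TAGS.get(field_name, Sensitivity.UNCLASSIFIED)

def may_surface(field_name: str, max_tier: Sensitivity) -> bool:
    """True only if the field's tier is at or below what the surface may show."""
    return classify(field_name) <= max_tier
```

Under this sketch, a brand-new field such as `llm.request.new_debug_field` is invisible to a tenant-facing surface until someone classifies it on purpose.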

Datadog and similar log platforms have shipped APIs for restricting access to log paths, but the restriction is opt-in. The team adding the field has to remember that the field is sensitive and to wire up the restriction. Memory is not a security control. Defaulting unknown fields to the most permissive tier is what produces the seam.

The composition rule

The harder lesson is that two independently-correct decisions about visibility compose into a disclosure, and your existing review processes are not set up to catch compositions.

A change request to extend the audit feed allowlist is a small ticket. The reviewer asks: are these fields safe to share with the tenant? They look at each field in isolation, see a string, see that the field already has redaction applied for PII, see no flag from the data classification tool, and approve. The reviewer for the original llm.request.system change asked: is this field safe to log internally? They confirmed it does not contain user PII, confirmed engineers need it for incident triage, and approved.

Neither reviewer is the reviewer for the composition. There is no "audit feed × LLM field" reviewer in your org chart, and there shouldn't be — you cannot add a reviewer for every cross-product of features. The mechanism has to be the field's own classification, traveling with the field across systems. Treat sensitivity tags the way you treat type signatures: a function that accepts a string from one system has no idea what the string means unless the type tells it, and "string" is not a useful type.
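One way to read the type-signature analogy, sketched under assumptions: the value and its tag are serialized together, and the receiving system rejects a record that arrives as a bare string. `ClassifiedField`, `to_record`, and `from_record` are hypothetical names, and the tag value is just an example from an imagined taxonomy.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ClassifiedField:
    name: str
    value: str
    sensitivity: str  # hypothetical taxonomy value, e.g. "confidential-product"

def to_record(field: ClassifiedField) -> dict:
    # Serialize the tag next to the value so the next system never has to guess.
    return {"name": field.name, "value": field.value, "sensitivity": field.sensitivity}

def from_record(record: dict) -> ClassifiedField:
    # A value with no tag is rejected instead of being accepted as "just a string".
    if not record.get("sensitivity"):
        raise ValueError(f"field {record.get('name')!r} arrived without a sensitivity tag")
    return ClassifiedField(record["name"], record["value"], record["sensitivity"])
```

The design choice is that the boundary between systems does the checking, so no reviewer has to remember which strings are special.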

What changes when prompts are product surface

If you accept that system prompts are part of the product — that they encode behavior the way a config file encodes behavior — a handful of decisions move:

System prompts become version-controlled artifacts with the same review discipline as your auth layer. They get diffed. They get changelog entries. They get attribution. You probably already do this informally. Making it formal means the prompt has a known set of consumers — your inference layer, your eval harness, your A/B framework — and any new consumer requires a security review, the same way adding a new consumer of your session-token table would.

Logging fields that reference prompts get sensitivity tags at definition time. A CI check fails the PR if a new field name matches an llm.request.* or agent.system.* pattern without a tag. The check does not need to understand the field's semantics; it only needs to enforce that the engineer making the change made an explicit classification decision. The PR description carries the decision into the review.
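A sketch of that check, assuming some earlier step in your pipeline has already extracted the fields a PR adds into a mapping of field name to tag (or None when no tag was given). That extraction step and the `untagged_guarded_fields` name are hypothetical; the two patterns are the ones named above.

```python
import fnmatch

# Field-name patterns that must never ship without an explicit tag.
GUARDED_PATTERNS = ["llm.request.*", "agent.system.*"]

def untagged_guarded_fields(new_fields: dict) -> list:
    """Return guarded field names that were added without a sensitivity tag.

    new_fields maps field name -> tag string, or None if the PR gave no tag.
    """
    failures = []
    for name, tag in new_fields.items():
        guarded = any(fnmatch.fnmatchcase(name, p) for p in GUARDED_PATTERNS)
        if guarded and not tag:
            failures.append(name)
    return sorted(failures)
```

A CI job would fail the build whenever this returns a non-empty list. Note that the check never inspects the tag's value, only that one exists.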

The customer-visible audit feed and the engineer-visible trace store have separately-maintained allowlists, and both default-deny. Adding a field to either is a deliberate act. The team that owns the audit feed has a documented process for what kinds of fields a tenant should see, and that process explicitly lists "any LLM request or response field" as requiring classification review.

Red-team passes that previously focused on prompt extraction at inference time also run against your customer-facing audit feeds, log exports, and trace UIs. The red-team script is trivial — look for prompt-shaped strings in any tenant-accessible export — and it catches the exact failure mode that defeats your inference-layer mitigations. If your existing pen-test coverage doesn't include this, the cost of adding it is one engineer-day.
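A rough sketch of what such a script might look like, assuming the export is JSON. The marker phrases and the length cutoff are guesses at what "prompt-shaped" means and would need tuning against your own prompts; `find_prompt_shaped` is a hypothetical name.

```python
import json
import re

# Heuristic phrases that tend to appear in instruction-style text (an assumption).
PROMPT_MARKERS = re.compile(
    r"\b(?:you are (?:a|an|the)\b"
    r"|you (?:must|should) (?:never|always|not)\b"
    r"|do not (?:reveal|disclose|discuss)\b"
    r"|system prompt\b)",
    re.IGNORECASE,
)

def walk(node, path=""):
    """Yield (path, string_value) for every string anywhere in a JSON document."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from walk(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from walk(value, f"{path}[{index}]")
    elif isinstance(node, str):
        yield path, node

def find_prompt_shaped(export_json: str, min_length: int = 40) -> list:
    """Return the paths of string fields that look like instructions to a model."""
    hits = []
    for path, value in walk(json.loads(export_json)):
        if len(value) >= min_length and PROMPT_MARKERS.search(value):
            hits.append(path)
    return hits
```

Run it against a real tenant export pulled with real tenant credentials. A heuristic this crude will produce false positives, which is acceptable for a red-team pass where a human reads every hit.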

The architectural lesson the incident review will rediscover

Every postmortem for this category lands on a version of the same finding: the disclosure required no attacker. The combined system did the work an attacker would have had to do. The model never gave up the prompt, because nobody ever asked it to. The audit feed handed it over.

The temptation in the writeup is to recommend a tighter audit feed allowlist and a smarter redactor. Both are correct as immediate remediations. Neither addresses the underlying property: that an LLM application has many more surfaces that touch the prompt than a non-LLM application has surfaces that touch its config. Every observability tool, every replay harness, every eval pipeline, every A/B logger sees the prompt because the prompt is what made the call interesting to log. Each of those is a potential exfiltration path, and each one's reviewer is a different person.

The architectural move is to push classification into the field's origin and force it to travel. The logger that emits llm.request.system should be calling an API that requires a sensitivity tag and that refuses to emit if the tag is missing. The trace store should refuse to index untagged fields whose names match a guarded pattern such as llm.request.*. The audit feed should refuse to surface fields above a certain sensitivity tier without an explicit tenant-export approval recorded against a ticket. None of these are novel mechanisms. They are how you already treat session tokens, payment data, and PII. The shift is recognizing that the prompt belongs in that company.
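Two of those refusals can be sketched in a few lines, under stated assumptions: the tier names, the `emit` and `export_for_tenant` functions, and the shape of the approvals mapping are all hypothetical, and a list stands in for the real log sink.

```python
from typing import Optional

# Hypothetical tiers, ordered from least to most restrictive.
TIERS = {"public": 0, "tenant-visible": 1, "confidential-product": 2}

class MissingTagError(ValueError):
    """Raised when a field is emitted without a known sensitivity tag."""

def emit(sink: list, name: str, value: str, sensitivity: Optional[str] = None) -> None:
    # Refuse to write the field at all if the caller did not classify it.
    if sensitivity not in TIERS:
        raise MissingTagError(f"refusing to emit {name!r} without a known sensitivity tag")
    sink.append({"name": name, "value": value, "sensitivity": sensitivity})

def export_for_tenant(records: list, approvals: dict) -> list:
    """Build the tenant-facing view of the records.

    approvals maps field name -> ticket ID recording an explicit export approval.
    """
    exported = []
    for record in records:
        within_tier = TIERS[record["sensitivity"]] <= TIERS["tenant-visible"]
        if within_tier or record["name"] in approvals:
            exported.append(record)
    return exported
```

With an empty approvals mapping, llm.request.system never reaches the tenant view. Surfacing it takes a recorded decision, which is the property the original incident lacked.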

Prompt extraction is not a research problem you have because attackers got clever. It is an architectural problem you have because your system prompts are valuable product assets currently being treated as debug strings. The next incident in your industry will be one of the two — a customer noticing the field in their audit feed and quietly archiving it, or a competitor reading your refusal policy out of a log export they paid for. The first you'll see in your support queue. The second you may never see at all. The teams that close this gap before either happens are the ones that started classifying their LLM fields like they classify their tokens — by where the field was born, not by what the field looks like.
