The AI Observability Leak: Your Tracing Stack Is a Data Exfiltration Surface
A security team I talked to recently found that their prompt and response fields were being shipped, in full, to a third-party SaaS logging backend they had never signed a Data Processing Agreement with. The fields contained customer medical summaries, Stripe secret keys accidentally pasted by support agents, and the full text of a confidential acquisition memo that someone had asked an internal assistant to summarize. Nothing was encrypted in the payload. Nothing was redacted. The retention was 400 days. The integration was set up during a hackathon by a well-meaning engineer who pip install-ed the vendor's SDK, dropped in an API key, and shipped.
This is the AI observability leak. Every LLM app team ends up wanting tracing — you cannot debug prompt regressions or non-deterministic agent loops without it — so one of LangSmith, Langfuse, Helicone, Phoenix, Braintrust, or a vendor AI add-on ends up in the stack. The default setup captures the entire request and response. That default is, for most production workloads, a compliance violation waiting to be discovered.
The uncomfortable part is that nothing about this is the vendor's fault. Hosted tracing is doing exactly what you wired it up to do. The failure is organizational: legal owns your DPAs, security owns your data classification policy, and engineering owns the SDK config that silently decides which fields egress the trust boundary. No one is auditing the seam.
Free Tier Is Rarely Free
The narrative around AI tracing platforms has rightly shifted away from the "we train on your data" framing. LangSmith explicitly states it doesn't train on customer data. Langfuse Cloud is SOC 2 Type II and ISO 27001 certified. Helicone and most competitors make similar claims. Treating "are they training on it?" as the primary risk misses the three risks that still matter:
Retention. LangSmith stores base traces for 14 days and extended traces (ones with feedback attached) for 400 days, and retention is not configurable on the hosted plan — only on self-hosted. That means a single thumbs-down click on a trace containing a customer's mental health query extends its lifetime by nearly a year. Your incident-response window for a data subject access request just got a lot wider.
Breach surface. Every hosted tracing platform is a soft target. Its whole value proposition is having full prompts, full responses, tool call arguments, and chain-of-thought in a searchable UI. A breach at your observability vendor — by any vector, including a support engineer's stolen laptop — is a breach of the most sensitive text your users ever typed. Your incident disclosure will have to name the vendor. Your customers did not sign up for that vendor.
Subpoena and lawful access. Trace data sitting in a third party's US cloud is subject to US legal process. For EU customer data, that is a GDPR problem even when the vendor is technically compliant, because compliance does not immunize against Schrems-style challenges. Most teams never think about this until a regulator asks.
"Free" tiers are particularly rough because they tend to have the weakest retention controls and the loosest isolation guarantees. If you are shipping production prompts through a free tier, you are the product in the same structural sense that a free social network user is the product — not necessarily because of training, but because the vendor's economics require keeping your data cheap to store and slow to delete.
The SDK Is the Policy
Here is the pattern that gets teams in trouble. Legal negotiates a DPA. Security writes a data classification policy — PHI goes here, PII goes there, secrets never touch shared storage. Engineering wires up the AI SDK. The SDK ships {prompt, response, tool_calls} to api.hosted-vendor.com. Nobody ties the three together.
The effective privacy policy of your AI stack is not the PDF legal signed. It is whatever your @traceable decorator, langfuse.trace() call, or dd-trace instrumentation actually sends over the wire. If engineering flipped on a new SDK version that started capturing tool_result payloads, the privacy policy changed. No approvals, no ticket, just a minor version bump.
The Samsung incident in 2023 — where engineers pasted source code and meeting transcripts into ChatGPT and Samsung's data ended up in OpenAI's hands — became a cautionary tale about prompt inputs. The hosted-tracing version of that story is harder to reason about because it is not a careless human typing; it is your production code faithfully doing what it was told. There is no prompt to review. The leak is structural.
A Data Classification Schema That Actually Fits LLMs
Traditional data classification (public / internal / confidential / restricted) maps poorly to AI traces because a single LLM call can contain all four tiers simultaneously. The system prompt is internal. The user message is whatever the user typed, which could be any tier. The tool result is often restricted (a row from your customer database). The model output is a synthesis of everything above.
A workable schema for AI workloads treats each field independently and assigns one of four dispositions:
- Loggable verbatim. Metadata: request ID, model name, latency, token counts, cost, status codes, tool names. Safe to send to any tracing backend; this is what observability actually needs.
- Loggable hashed. Identifiers that help you join traces to sessions but never need to be human-readable in the UI: user ID, session ID, trace ID. Send a salted hash, keep the salt inside your infrastructure.
- Loggable redacted. Prompt and response text after a PII/secret scrubber has replaced entities with type tokens (<EMAIL>, <CREDIT_CARD>, <API_KEY>). You lose the ability to reproduce a trace character-for-character but retain 90% of the debugging value.
- Never loggable. Tool results against internal systems (DB rows, search results with PHI, S3 object contents), raw auth tokens, signed URLs, anything under legal hold. These never leave your VPC regardless of how the tracing SDK is configured.
The discipline this forces is that engineering has to decide disposition per field, not per service. The tracing SDK becomes a dumb pipe; the decisions live in a middleware layer you own.
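A minimal sketch of what that middleware layer can decide, assuming trace data arrives as a flat dict of attributes. The field names, the FIELD_POLICY table, and the redactor callable are illustrative assumptions, not any particular SDK's schema:

```python
import hashlib
import hmac
import os

# Illustrative policy table: attribute name -> disposition. In a real system
# this comes from your data classification policy, not engineering folklore.
FIELD_POLICY = {
    "request_id":  "verbatim",
    "model":       "verbatim",
    "latency_ms":  "verbatim",
    "token_count": "verbatim",
    "tool_name":   "verbatim",
    "user_id":     "hash",
    "session_id":  "hash",
    "prompt":      "redact",
    "response":    "redact",
    "tool_result": "drop",   # never loggable: stays inside the VPC
    "auth_token":  "drop",
}

# The salt lives in your infrastructure, never in the tracing backend.
HASH_SALT = os.environ.get("TRACE_HASH_SALT", "dev-only-salt").encode()


def scrub(attributes: dict, redactor) -> dict:
    """Apply per-field dispositions before anything reaches a tracing SDK.

    `redactor` is any callable that replaces PII/secrets with type tokens.
    Unknown fields fail closed: they are redacted, not passed through.
    """
    out = {}
    for key, value in attributes.items():
        disposition = FIELD_POLICY.get(key, "redact")  # fail-closed default
        if disposition == "verbatim":
            out[key] = value
        elif disposition == "hash":
            out[key] = hmac.new(HASH_SALT, str(value).encode(),
                                hashlib.sha256).hexdigest()
        elif disposition == "redact":
            out[key] = redactor(str(value))
        # disposition == "drop": the field never leaves the process
    return out
```

The property that matters is the fail-closed default: a field nobody bothered to classify gets redacted, not forwarded.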
Scrub Before Egress, Not After
The wrong place to redact is inside the tracing backend. Yes, Datadog's Sensitive Data Scanner can scan LLM traces and redact at ingestion. Yes, LangSmith and Langfuse have field-masking features. But by the time the data reaches the vendor's ingestion endpoint, it has already left your trust boundary. If the redaction has a bug, if a new field appears that the vendor's rule library doesn't cover, or if the vendor's ingestion path is ever compromised, what leaks is the unredacted version, because that is what crossed the wire.
The right architecture is a scrubber that sits between your application code and the tracing SDK. Two reasonable implementations:
Inline middleware. A wrapper around your tracing library that intercepts span attributes before they get serialized. In Python-land, an OpenTelemetry SpanProcessor is the canonical extension point; it runs before the exporter ships the span. Your processor applies a policy that knows which attributes are classified how, and either drops, hashes, or redacts each one.
AI gateway. A reverse proxy between your application and the model provider (and, ideally, between your application and any outbound observability endpoint) that enforces redaction in a single chokepoint. Teams already running Envoy, Kong, or a homegrown proxy find this the cleanest fit because the same proxy can also enforce rate limits, provider failover, and cost accounting.
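A sketch of the inline-middleware option against the OpenTelemetry Python SDK, reusing the scrub() policy function above. Two caveats are baked into it as assumptions: the ordering of add_span_processor calls is what makes scrubbing run before export, and rewriting span._attributes leans on a private field of the SDK's span object rather than a stable contract.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import SpanProcessor, TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter


class ScrubbingSpanProcessor(SpanProcessor):
    """Applies the field-disposition policy to span attributes before export."""

    def __init__(self, redactor):
        self._redactor = redactor

    def on_start(self, span, parent_context=None):
        pass

    def on_end(self, span):
        # Runs before the exporting processor's on_end because this processor
        # is registered first. `_attributes` is a private SDK field (assumption).
        if span._attributes:
            span._attributes = scrub(dict(span._attributes), self._redactor)

    def shutdown(self):
        pass

    def force_flush(self, timeout_millis=30_000):
        return True


provider = TracerProvider()
# Order matters: scrub first, then hand the span to whatever exporter you use.
# ConsoleSpanExporter stands in for your real OTLP/vendor exporter, and the
# lambda is a placeholder for a real redactor (see the Presidio sketch below).
provider.add_span_processor(ScrubbingSpanProcessor(redactor=lambda s: "<REDACTED>"))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
```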
The redaction engine itself can be Microsoft's Presidio (open source, combines NER with regex and checksum validators), a hosted PII classifier, or — if your domain has stable patterns — a handful of compiled regexes. The engine matters less than the placement: before the data crosses your network boundary, not after.
A trap worth naming: do not rely solely on regexes for entity types with semantic meaning. A regex catches \b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b as an email, but misses "reach me at first dot last at gmail" written out. For structured secrets (AWS keys, JWTs, Stripe tokens), regex is usually sufficient and high-precision. For PII narratives, you need a real NER model.
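For the redactor itself, here is a sketch combining Presidio's NER-backed detection with a few high-precision regexes for structured secrets. The secret patterns are illustrative rather than exhaustive, and Presidio's default NLP engine assumes a spaCy model is installed:

```python
import re

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

# High-precision patterns for structured secrets (illustrative, not exhaustive).
SECRET_PATTERNS = {
    "AWS_ACCESS_KEY": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "STRIPE_SECRET_KEY": re.compile(r"\bsk_(?:live|test)_[0-9a-zA-Z]{16,}\b"),
    "JWT": re.compile(r"\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),
}

analyzer = AnalyzerEngine()      # NER plus built-in pattern recognizers
anonymizer = AnonymizerEngine()  # default operator replaces with <ENTITY_TYPE>


def redact(text: str) -> str:
    """Replace secrets and PII with type tokens before egress."""
    # Structured secrets first: regex is cheap and high-precision here.
    for label, pattern in SECRET_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    # Then NER-backed PII detection for names, emails written out, and the like.
    results = analyzer.analyze(text=text, language="en")
    return anonymizer.anonymize(text=text, analyzer_results=results).text
```

Plugged into scrub() as the redactor callable, this gives you the before-egress placement the rest of this section argues for.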
The Zero-Retention Asymmetry
One awkward truth: the model providers themselves are often handling your data more carefully than the observability layer you added on top.
OpenAI, Anthropic, Azure OpenAI, and Vercel AI Gateway all support zero data retention (ZDR) agreements for enterprise customers, where eligible endpoints don't persist API inputs or outputs at all beyond the request lifetime. Anthropic reduced default API log retention from 30 to 7 days in late 2025. Most of this is opt-in and requires a sales conversation, but the capability exists.
Now compare: LangSmith retains base traces for 14 days and extended traces for 400 days on the hosted plan, non-configurable. Langfuse Cloud's retention depends on plan. Most vendor AI-APM add-ons inherit the parent platform's retention (often 15 months for metrics, 30 days for logs, by default).
The asymmetry is striking. Your ZDR deal with Anthropic buys you 0-day retention at the model. Your tracing setup then stores the same prompt and response, in plaintext, for up to 400 days at a third party you don't have a ZDR with. The legal work done upstream gets silently undone downstream. Teams discover this during their first serious privacy review and realize they have been retaining via the side channel what they explicitly contracted not to retain through the front door.
Self-Hosting Is a Real Option Again
The historical objection to self-hosted observability was operational: why run Postgres and ClickHouse when the vendor will do it for you? The calculus is shifting.
Langfuse is MIT-licensed and self-hostable on a few containers; the OSS version is the same codebase as the cloud. LangSmith offers a self-hosted tier. OpenLLMetry, Phoenix, and Helicone all have first-class self-host paths. Running these internally is not free, but it is now an afternoon of Terraform plus an ongoing on-call rotation, not a quarter-long platform project.
The trade-off is real: self-hosting means you own the storage, the backups, the TTL enforcement, the access controls, and the breach surface. It means you can't punt compliance to the vendor's SOC 2 report. What you gain is that the trust boundary stops at your VPC, which — for teams in healthcare, finance, legal, or anything EU-regulated — is often the only posture that actually passes a serious audit.
The useful question is not "self-hosted vs cloud" in the abstract, but "which of my data tiers can egress, and where does each tier live." A hybrid setup where metadata goes to a hosted backend and redacted-or-hashed content goes to a self-hosted backend is common and often correct.
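In code, the hybrid split can be as small as a routing step after scrub(); hosted_sink and self_hosted_sink here are placeholders for whichever clients your two backends expose, not real SDK objects:

```python
# Metadata tier is safe to egress; everything else stays inside the VPC.
METADATA_FIELDS = {"request_id", "model", "latency_ms", "token_count", "tool_name"}


def emit(attributes: dict, hosted_sink, self_hosted_sink) -> None:
    """Route one already-scrubbed trace record to two backends with different trust levels."""
    # Hosted backend gets metadata only: enough for latency and cost dashboards.
    hosted_sink.send({k: v for k, v in attributes.items() if k in METADATA_FIELDS})
    # Self-hosted backend, inside the VPC, gets the full redacted-and-hashed record.
    self_hosted_sink.send(attributes)
```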
What to Actually Do This Quarter
If you read this and suspect your stack matches one of the bad patterns above, the remediation path is short:
- Grep your codebase for every tracing SDK, every @traceable, every track_llm_call, every dd-trace integration. Write down what each one captures. This is the list of what actually egresses, not what you think egresses.
- Map each captured field to one of the four dispositions above. Anything in "never loggable" that is currently egressing is the priority-one fix.
- Insert a middleware layer — span processor, AI gateway, or SDK wrapper — that enforces the disposition. Make it fail-closed: unknown fields default to redacted.
- Review the retention policy of every tracing vendor you send data to. Compare it to your contractual and regulatory retention obligations. If they disagree, either the policy changes or the vendor does.
- Bring legal and engineering into the same room once per quarter to walk through which fields actually go where. The seam between DPA and SDK config only gets audited when someone makes it a meeting.
The deeper lesson is that observability, like every other piece of infrastructure in an AI stack, is not a pure benefit. It concentrates your most sensitive data in a searchable form, in a system your users never consented to, operated by a vendor whose breach risk you don't control. That can still be the right trade-off — debuggable agents are worth a lot — but only if the trade is made explicitly, not inherited from a default SDK config.
Tracing that you don't own is, in the end, a database that you don't own. Treat it like one.
- https://docs.langchain.com/langsmith/self-host-ttl
- https://changelog.langchain.com/announcements/eu-data-residency-for-langsmith
- https://support.langchain.com/articles/6604776514-can-i-configure-data-retention-periods-for-traces
- https://langfuse.com/security
- https://langfuse.com/self-hosting
- https://docs.datadoghq.com/security/sensitive_data_scanner/
- https://docs.datadoghq.com/observability_pipelines/processors/sensitive_data_scanner/
- https://github.com/microsoft/presidio
- https://privacy.claude.com/en/articles/8956058-i-have-a-zero-data-retention-agreement-with-anthropic-what-products-does-it-apply-to
- https://developers.openai.com/api/docs/guides/your-data
- https://openrouter.ai/docs/guides/features/zdr
- https://www.datadoghq.com/blog/observability-pipelines-sensitive-data-redaction/
- https://www.gravitee.io/blog/how-to-prevent-pii-leaks-in-ai-systems-automated-data-redaction-for-llm-prompt
- https://radicalbit.ai/resources/blog/llm-data-privacy/
- https://www.cybersecuritydive.com/news/Samsung-Electronics-ChatGPT-leak-data-privacy/647219/
