2 posts tagged with "ai-observability"

The AI Observability Leak: Your Tracing Stack Is a Data Exfiltration Surface

Tian Pan · Software Engineer · 11 min read

A security team I talked to recently found that their prompt and response fields were being shipped, in full, to a third-party SaaS logging backend they had never signed a Data Processing Agreement with. The fields contained customer medical summaries, Stripe secret keys accidentally pasted by support agents, and the full text of a confidential acquisition memo that someone had asked an internal assistant to summarize. Nothing was encrypted in the payload. Nothing was redacted. The retention was 400 days. The integration was set up during a hackathon by a well-meaning engineer who pip install-ed the vendor's SDK, dropped in an API key, and shipped.

This is the AI observability leak. Every LLM app team ends up wanting tracing — you cannot debug prompt regressions or non-deterministic agent loops without it — so one of LangSmith, Langfuse, Helicone, Phoenix, Braintrust, or a vendor AI add-on ends up in the stack. The default setup captures the entire request and response. That default is, for most production workloads, a compliance violation waiting to be discovered.
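The safer pattern is to make redaction the default at the SDK boundary, so nothing leaves the process unscrubbed. Here is a minimal Python sketch; the `tracer.log_span` interface is hypothetical, standing in for whichever tracing backend you use (several of the tools above expose a masking hook or callback for this, so check your vendor's docs), and the regexes are illustrative, not exhaustive:

```python
import re

# Hypothetical minimal scrubber: patterns for secrets that commonly leak into
# prompts. Extend for your own domain; this is not a substitute for a real DLP pass.
SECRET_PATTERNS = [
    re.compile(r"sk_live_[0-9a-zA-Z]+"),    # Stripe live secret keys
    re.compile(r"sk-[0-9a-zA-Z]{20,}"),     # OpenAI-style API keys
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),   # US Social Security numbers
]

def scrub(text: str) -> str:
    """Replace known secret patterns with a placeholder before anything is traced."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text

def traced_completion(client, tracer, messages, model="gpt-4o-mini"):
    """Call the model, then hand the tracer only scrubbed copies of the payload."""
    response = client.chat.completions.create(model=model, messages=messages)
    tracer.log_span(  # hypothetical tracing call; substitute your backend's API
        name="llm-call",
        input=[{**m, "content": scrub(m.get("content") or "")} for m in messages],
        output=scrub(response.choices[0].message.content),
    )
    return response
```

If your vendor's SDK supports a masking callback, wire `scrub` into that instead of wrapping every call by hand. The point is that redaction happens before the payload leaves your process, not in a nightly cleanup job after it has already landed in someone else's retention window.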

The Debug Tax: Why Debugging AI Systems Takes 10x Longer Than Building Them

Tian Pan · Software Engineer · 10 min read

Building an LLM feature takes days. Debugging it in production takes weeks. This asymmetry — the debug tax — is the defining cost structure of AI engineering in 2026, and most teams don't account for it until they're already drowning.

A 2025 METR study found that experienced developers using LLM-assisted coding tools took 19% longer to complete real tasks, even as they believed the tools had made them about 20% faster. That gap between perceived and actual productivity is a microcosm of the larger problem: AI systems feel fast to build because the hard part, debugging probabilistic behavior in production, hasn't started yet.

The debug tax isn't a skill issue. It's a structural property of systems built on probabilistic inference. Traditional software fails with stack traces, error codes, and deterministic reproduction paths. LLM-based systems fail with plausible but wrong answers, intermittent quality degradation, and bugs that can't be reproduced because the same input produces different outputs on consecutive runs. Debugging them requires a fundamentally different methodology, different tooling, and different mental models.
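The reproduction problem fits in a few lines. A sketch assuming the OpenAI Python SDK (the model name is a placeholder): even with temperature pinned to 0 and a fixed seed, the API documents determinism as best-effort, so two consecutive calls can still disagree, and "it gave a wrong answer" becomes a bug report with no replay.

```python
from openai import OpenAI

client = OpenAI()

def sample(prompt: str) -> str:
    """One completion with every determinism knob we have pinned."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,        # greedy-ish decoding
        seed=42,              # the API treats seeding as best-effort, not a guarantee
    )
    return response.choices[0].message.content

prompt = "Summarize the tradeoffs of optimistic vs. pessimistic locking."
a, b = sample(prompt), sample(prompt)
print("identical runs:", a == b)  # can print False even with temperature=0 and a fixed seed
```

This is why serious LLM debugging leans on tracing that captures the exact call context at the moment of failure, messages, parameters, and model version included, rather than hoping to re-run the input later and see the same behavior.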