233 posts tagged with "observability"

Provider-Side Safety Drift: When Your Product Regresses Without a Deploy

· 9 min read
Tian Pan
Software Engineer

A prompt that worked on Tuesday returns "I can't help with that" on Thursday. The CI eval is green. The model name in your config didn't change. The prompt is byte-identical, hashed and pinned in source control. And yet a customer support thread is forming around the new refusal — the AI team won't see it for two weeks because it has to bubble through tier-one support, get triaged, and finally land on someone who can read the trace.

This is provider-side safety drift, and it is the most underbuilt monitoring gap in production AI today. Frontier providers tune safety filters, refusal thresholds, and content classifiers server-side on a cadence that is not on your release calendar. Your team isn't subscribed to it. There is often no release note. And the regressions are asymmetric in a way that is genuinely hard to detect: refusals creep up for legitimate intents while harmful queries you assumed the provider was filtering quietly start slipping through. The boundary moves on both sides, independently, without warning.
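One way to watch both sides of that boundary is a small scheduled probe set compared against a pinned baseline. The sketch below is illustrative only: `ProbeRates`, `detect_drift`, the 2-point tolerance, and the sample numbers are all assumptions, and it presumes you already have a way to run the probes and classify each reply as refused or answered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeRates:
    benign_refused: float    # share of known-legitimate probes that came back refused
    harmful_answered: float  # share of known-harmful probes that came back answered


def detect_drift(baseline, today, tolerance=0.02):
    """Flag each side of the boundary independently against a pinned baseline."""
    alerts = []
    if today.benign_refused - baseline.benign_refused > tolerance:
        alerts.append("over-refusal")
    if today.harmful_answered - baseline.harmful_answered > tolerance:
        alerts.append("under-filtering")
    return alerts


# Hypothetical numbers: same prompt, same model name, different week.
tuesday = ProbeRates(benign_refused=0.01, harmful_answered=0.00)
thursday = ProbeRates(benign_refused=0.06, harmful_answered=0.05)
print(detect_drift(tuesday, thursday))
```

Keeping the two checks separate matters because the two rates can move in opposite directions and cancel out in any single blended number.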

The Refusal Audit: Why a Single Refusal Rate Hides Half the Failure Distribution

· 10 min read
Tian Pan
Software Engineer

Open the safety dashboard for any production LLM feature and you will see refusal rate plotted as a single line, color-coded so that down is bad and up is good. The implicit story: refusals are the system saying no to things it shouldn't do, so a higher number means a safer product. That story is half the picture, and the missing half is where most of the silent quality damage in deployed assistants actually lives.

Refusal rate is a two-sided distribution. The right tail is the one safety teams obsess over: the model agreeing to write malware, fabricate medical dosages, or generate content the policy explicitly forbids. The left tail is the inverse failure: false refusals, where the model declines a benign request because some surface feature pattern-matched to a forbidden category. A customer asking how to dispute a charge gets an "I can't give financial advice" boilerplate. A nurse asking about a drug interaction gets routed to "consult a healthcare professional." A developer asking how to parse an email header gets refused because the prompt contained the word "exploit."
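To make the two tails concrete, here is a minimal sketch that splits one refusal rate into its two error rates. It assumes a labeled audit sample in which each record says whether the request should have been refused and whether it was; the function name and the sample data are hypothetical.

```python
def refusal_breakdown(records):
    """records: iterable of (should_refuse, did_refuse) boolean pairs."""
    records = list(records)
    benign = [did for should, did in records if not should]
    forbidden = [did for should, did in records if should]
    refused = sum(1 for _, did in records if did)
    return {
        # The single line on the dashboard.
        "refusal_rate": refused / len(records) if records else 0.0,
        # Left tail: benign requests that were declined.
        "false_refusal_rate": sum(benign) / len(benign) if benign else 0.0,
        # Right tail: forbidden requests that were answered.
        "false_compliance_rate": (
            sum(not did for did in forbidden) / len(forbidden) if forbidden else 0.0
        ),
    }


# Two hypothetical samples with the same headline refusal rate.
healthy = [(False, False)] * 8 + [(True, True)] * 2
broken = [(False, False)] * 6 + [(False, True)] * 2 + [(True, False)] * 2
print(refusal_breakdown(healthy))
print(refusal_breakdown(broken))
```

Both samples refuse two requests in ten, so the single-line chart cannot tell them apart; only the split rates show that the second sample refuses the wrong two.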

The Session Boundary Problem: Where a Conversation Ends for Billing, Eval, and Memory

· 11 min read
Tian Pan
Software Engineer

Three teams are looking at the same event stream, each with a column called session_id, and each with a different definition of what a session is. Billing inherited a 30-minute idle window from the auth library. Eval inherited "everything until the user says 'bye' or stops typing for 10 minutes" from a chatbot framework. Memory uses a thread ID that the UI generates whenever the user clicks "New chat," which most users never do. Three columns, three semantics, one rolled-up dashboard, and three seemingly unrelated bugs that share a root cause.

This is the session boundary problem. It looks like an instrumentation nit, but it is actually a product question wearing infrastructure clothes: where does a conversation end? The honest answer is that there is no single answer — a session for billing is not the same object as a session for eval is not the same object as a session for memory — and a team that picks one default and lets the other two inherit it is shipping a billing dispute, an eval bias, and a memory leak with the same root cause.
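A small sketch shows how the same events split differently under each definition. It uses the 30-minute and 10-minute idle windows mentioned above and deliberately ignores the "user says 'bye'" rule; the `sessionize` helper and the timestamps are made up for illustration.

```python
def sessionize(timestamps, idle_gap_seconds):
    """Group event timestamps (in seconds) into sessions, splitting on idle gaps."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= idle_gap_seconds:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


# One user, one UI thread (they never clicked "New chat").
events = [0, 300, 1200, 2700, 9000]

billing_sessions = sessionize(events, idle_gap_seconds=30 * 60)
eval_sessions = sessionize(events, idle_gap_seconds=10 * 60)
memory_sessions = [events]  # a single thread ID covers everything

print(len(billing_sessions), len(eval_sessions), len(memory_sessions))
```

The point of keeping the window a parameter is that each consumer can own its boundary explicitly instead of inheriting another team's default through a shared session_id column.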

Smaller Model, Bigger Bill: Why Cheaper-Per-Token Often Costs More

· 8 min read
Tian Pan
Software Engineer

A finance-led mandate to "switch to the smaller model" is one of the most reliable ways to raise your LLM bill quarter-over-quarter. The dashboard the procurement team is watching — cost per call, average tokens per request — keeps trending down. Meanwhile the invoice keeps trending up. By the time someone reconciles the two, six months of prompt iteration has been spent compensating for a model that's worse at the task, and the team is in too deep to walk it back without admitting the original switch was a mistake.

The mistake isn't about pricing. It's about the unit. Per-token price is a misleading axis when reasoning depth, retry count, and prompt size all vary by model. The right metric is tokens-per-successful-completion, and on that axis the cheaper model often loses.
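The arithmetic can be sketched by pricing the tokens it takes to reach one successful completion. The sketch assumes independent retries until success (so the expected number of attempts is 1 / success rate), and every price, token count, and success rate below is hypothetical.

```python
def cost_per_success(price_per_1k_tokens, tokens_per_attempt, success_rate):
    """Expected spend to obtain one successful completion."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    expected_attempts = 1 / success_rate  # assumption: independent retries
    return price_per_1k_tokens * tokens_per_attempt / 1000 * expected_attempts


# Hypothetical: the smaller model needs a longer prompt and more retries.
large = cost_per_success(price_per_1k_tokens=0.010, tokens_per_attempt=1500, success_rate=0.95)
small = cost_per_success(price_per_1k_tokens=0.004, tokens_per_attempt=3000, success_rate=0.50)

print(f"large: ${large:.4f} per success, small: ${small:.4f} per success")
```

With these made-up inputs the smaller model is cheaper per call but more expensive per successful outcome, which is the gap between the procurement dashboard and the invoice.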

The Snapshot Trace Test: Production Traces as Your Regression Suite

· 10 min read
Tian Pan
Software Engineer

The eval set most teams run as their regression suite was hand-curated by an engineer in week three of the project, frozen by week six because nobody wanted to touch it before launch, and is now being used in month nine to gate deploys. The product has shifted twice. The user base has tripled. The cases the LLM actually sees in production overlap with that frozen suite by maybe forty percent. When the suite passes, nobody trusts it; when it fails, nobody knows whether the failure is real or whether the case is just stale. The team writes a doc proposing a "v2 eval set" and never gets around to it.

Meanwhile, every request the system has handled in production has been recorded in a tracing backend. Every prompt, every tool call, every intermediate output, every refusal, every retry — all of it sitting in object storage, time-indexed and span-tagged, ready to be replayed. The highest-fidelity test corpus the team will ever have is already on disk. They built an eval suite from scratch instead of reading from it.
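Replaying those traces can start very small. The following is a sketch, not a description of any particular tracing backend: it assumes each trace has already been exported as a dict with hypothetical `input` and `output` keys, and that `candidate` is whatever function wraps the new prompt or model.

```python
def replay(traces, candidate, same=lambda recorded, replayed: recorded == replayed):
    """Re-run recorded inputs through a candidate and collect the cases that changed."""
    regressions = []
    for trace in traces:
        replayed = candidate(trace["input"])
        if not same(trace["output"], replayed):
            regressions.append(
                {"input": trace["input"], "recorded": trace["output"], "replayed": replayed}
            )
    return regressions


# Stand-in for a real pipeline: a deterministic function instead of an LLM call.
recorded_traces = [
    {"input": "reset password", "output": "RESET PASSWORD"},
    {"input": "cancel order", "output": "cancel order"},
]
changed = replay(recorded_traces, candidate=str.upper)
print(changed)
```

Exact equality is rarely the right comparator for model output, which is why `same` is a parameter; a rubric check or a similarity threshold could be passed in instead.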

The Stop-Sequence Footgun: When User Input Collides With Your Delimiter

· 10 min read
Tian Pan
Software Engineer

A user pastes a chunk of markdown into your support agent. The first heading in their paste is ### Steps I tried. Your prompt template uses ### as a stop sequence. The model dutifully reads the user's input, starts to answer, generates ### as part of an organized response — and the API hands back two confident sentences followed by silence. The ticket lands in your queue as "model quality regression." It is not. The fix is one line in the gateway.

Stop sequences are the most quietly load-bearing knob in a production LLM stack. They were chosen the week the prompt was first written, when the inputs were clean engineering examples and nobody had pasted a JIRA ticket dump yet. Twelve months later, the user-content distribution has drifted miles past what the prompt author imagined, and the sentinel that was once a clean delimiter is now an ambient hazard sitting in the middle of one user paste in three hundred. Nothing alerted. The eval suite still passes. The CSAT chart sags by half a point on the affected slice and stays there.

This is not a model problem. It is an input-contract problem masquerading as one, and it has the shape of a classic distributed-systems bug: a delimiter chosen for one party's content distribution is being enforced against a different party's content distribution, with no monitoring on the boundary.
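Monitoring that boundary can be as simple as checking each user input against the configured stop sequences before the request goes out. This is a minimal sketch with hypothetical helper names; it only detects and counts collisions and does not prescribe a particular fix.

```python
def find_stop_collisions(user_text, stop_sequences):
    """Return the stop sequences that already occur inside the user's content."""
    return [stop for stop in stop_sequences if stop in user_text]


def collision_rate(user_inputs, stop_sequences):
    """Share of inputs that contain at least one configured stop sequence."""
    inputs = list(user_inputs)
    hits = sum(1 for text in inputs if find_stop_collisions(text, stop_sequences))
    return hits / len(inputs) if inputs else 0.0


stops = ["###"]
pastes = [
    "My order never arrived.",
    "### Steps I tried\n1. Restarted the app",
    "Can I change my email address?",
    "How do I export my data?",
]
print(find_stop_collisions(pastes[1], stops), collision_rate(pastes, stops))
```

Emitting the rate as a metric on the affected slice gives the drift in user content something to alert on, rather than waiting for it to show up as a CSAT dip.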

Token-Aware Logging: When Your Traces Cost More Than the Inference They Observe

· 12 min read
Tian Pan
Software Engineer

A team I talked to last quarter spent six weeks chasing a memory pressure alert on their agent platform. The agents were cheap — a few cents a run. The traces were not. Their telemetry pipeline was eating three times the budget of the LLM calls it was instrumenting, and most of the spend went to fields nobody had read in months: full prompt bodies stored on every span, tool outputs duplicated across parent and child traces, and an LLM-judge evaluator that re-paid the inference bill on every captured trace.

This is the AI observability cost crisis in miniature. A 2026 industry write-up modeled a customer support bot with 10,000 conversations a day and five turns each. That is 50,000 turns, so its figures of 200,000 LLM invocations, 400 million tokens, and roughly a million trace spans per day imply about four model calls per turn. Datadog users widely report observability bills jumping 40-200% after they instrument AI workloads on the same backend that handled their REST APIs. The pipeline is paying twice for the same tokens: once to generate them, once to remember them.

The fix is not "log less." The fix is to treat observability for AI systems as a workload with its own unit economics, separate from the request-response telemetry traditional services emit. Traditional logging is structured fields you can compress and forget; AI logging is unbounded text bodies that re-enter the inference budget every time something reads them. That distinction is what "token-aware logging" means.
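One way to keep the text without paying for it on every span is to store each body once and have spans carry a reference. The sketch below is an assumption about how that could look, not a feature of any specific tracing product; `BodyStore` and `make_span` are hypothetical names and the store is an in-memory dict standing in for object storage.

```python
import hashlib


class BodyStore:
    """Content-addressed store: identical bodies are kept once."""

    def __init__(self):
        self._bodies = {}

    def put(self, body):
        key = hashlib.sha256(body.encode("utf-8")).hexdigest()
        self._bodies.setdefault(key, body)  # no-op if this body was already stored
        return key

    def get(self, key):
        return self._bodies[key]

    def __len__(self):
        return len(self._bodies)


def make_span(name, prompt, store):
    """Span carries a reference and a size, not the unbounded text itself."""
    return {"name": name, "prompt_ref": store.put(prompt), "prompt_chars": len(prompt)}


store = BodyStore()
prompt = "You are a support agent. " * 200
parent = make_span("agent.run", prompt, store)
child = make_span("llm.call", prompt, store)
print(len(store), parent["prompt_ref"] == child["prompt_ref"])
```

Nothing is lost for debugging, since the full body is still retrievable by reference, but a prompt duplicated across parent and child spans is stored once.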

Why AI Quality Monitors Conflate Model Drift, Data Drift, and Prompt Drift — and What to Do About Each

· 10 min read
Tian Pan
Software Engineer

A fraud detection model's accuracy silently halved over three weeks. Latency was normal, error rates were zero, and every infrastructure dashboard was green. Engineers spent the first week auditing the data pipeline, the second week comparing model weights, and the third week reopening tickets before someone noticed that fraudsters had simply changed their language patterns. The fix — retraining on recent examples — took two days. The misdiagnosis took three weeks.

This pattern repeats across production AI teams: degradation sets off a generalized "model problem" alarm, and the team starts pulling levers based on intuition rather than root cause. The reason isn't a lack of monitoring discipline; it's that most observability stacks treat three structurally distinct problems as one. Model drift, data drift, and prompt drift have different detection signatures, different alert topologies, and different remediation paths. Conflating them is how weeks get wasted on the wrong fix.

The AI Incident Postmortem Nobody Writes: A Four-Layer Diagnosis Framework

· 11 min read
Tian Pan
Software Engineer

When a recommendation engine surfaced offensive content last quarter, the post-incident review produced a familiar outcome: a two-hour call where ML engineers pointed at the retrieval corpus, data engineers pointed at the prompt, product engineers pointed at monitoring, and infrastructure pointed at the model version that nobody remembered upgrading. Three action items were created. None had owners. The incident closed. The same failure mode shipped again six weeks later.

This is not a story about one team. It is the default ending for AI incidents at most organizations. Responsibility for what an AI feature does in production is distributed across enough parties that a standard postmortem cannot pin causation. The Five Whys analysis that works well for database timeouts breaks when the failure is "the model gave the wrong answer," because the correct next question is never obvious.

Your AI Feature's Quiet Quitters: How to Detect Silent User Distrust

· 10 min read
Tian Pan
Software Engineer

The McDonald's drive-thru AI didn't fail because users complained. It failed because users stopped using the drive-thru. For three years the system logged healthy "acceptance rates" while viral videos showed customers pleading with it to remove 260 chicken nuggets from their order. When the partnership ended, the official reason was that the technology "wasn't yet ready." The real signal had been sitting in foot traffic data the whole time — unread, unmeasured, unreported.

This is the shape of most AI feature failures in production. Users don't disable your feature. They don't file tickets. They don't leave one-star reviews. They quietly route around it, and your dashboards keep showing green.

Profiling LLM Pipelines: The Bottlenecks That Aren't Inference

· 8 min read
Tian Pan
Software Engineer

Your team just spent three weeks optimizing inference. You swapped to a quantized model, tuned your batching policy, shaved 12% off time-to-first-token, and shipped it. Then you looked at the actual user-facing latency and it barely moved.

This is the inference trap. It's the most common profiling failure mode in LLM-powered applications, and it happens because engineers measure what's easy to measure — GPU utilization, inference throughput, tokens per second — rather than what's actually slow. In a typical RAG pipeline, inference accounts for around 80% of latency when you include everything that touches the GPU. But that remaining 20% is often distributed across six or seven stages that nobody is tracing. Each one seems small in isolation, but together they dominate the optimization opportunity.
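Tracing the non-inference stages does not need heavy tooling to get started. Here is a minimal sketch of a per-stage wall-clock profiler; the stage names are hypothetical and `time.sleep` stands in for real work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageProfiler:
    """Accumulates wall-clock time per named pipeline stage."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def shares(self):
        """Fraction of total measured time spent in each stage."""
        total = sum(self.totals.values())
        return {name: t / total for name, t in self.totals.items()} if total else {}


profiler = StageProfiler()
for stage_name in ("embed_query", "vector_search", "rerank", "inference", "postprocess"):
    with profiler.stage(stage_name):
        time.sleep(0.001)  # placeholder for the real stage

for name, share in sorted(profiler.shares().items(), key=lambda kv: -kv[1]):
    print(f"{name:15s} {share:6.1%}")
```

Reporting shares rather than raw timings makes it obvious when a stage nobody was watching takes a comparable slice to the one that got three weeks of tuning.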

The Data Contract Problem in RAG: When Your Ingestion Pipeline Silently Breaks Retrieval Quality

· 10 min read
Tian Pan
Software Engineer

Your RAG system has a bug that doesn't throw exceptions. It doesn't spike your error rate. It doesn't show up in your latency dashboards. Instead, it quietly delivers confident, plausible-sounding answers that are wrong — and nobody notices for weeks.

This is the data contract problem in RAG: your ingestion pipeline is the source of truth for everything downstream, but it has no schema enforcement, no freshness guarantees, and no alerting when the shape of the world changes underneath it. Every time an upstream data source adds a field, a chunking parameter shifts, or an embedding model gets updated, your retrieval quality silently degrades.
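A data contract for ingestion can begin as an explicit check run on every chunk before it reaches the index. The sketch below is illustrative: the field names, the embedding model name, the dimension, and the seven-day freshness window are all assumptions, not values from any real pipeline.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ChunkContract:
    required_fields: frozenset
    embedding_model: str
    embedding_dim: int
    max_age: timedelta


def violations(chunk, contract, now=None):
    """Return human-readable contract violations for one ingested chunk."""
    now = now or datetime.now(timezone.utc)
    problems = []
    missing = contract.required_fields - set(chunk)
    if missing:
        problems.append("missing fields: " + ", ".join(sorted(missing)))
    if chunk.get("embedding_model") != contract.embedding_model:
        problems.append("embedding model changed")
    if len(chunk.get("embedding", [])) != contract.embedding_dim:
        problems.append("embedding dimension mismatch")
    ingested_at = chunk.get("ingested_at")
    if ingested_at is not None and now - ingested_at > contract.max_age:
        problems.append("stale chunk")
    return problems


contract = ChunkContract(
    required_fields=frozenset({"text", "source", "embedding", "embedding_model", "ingested_at"}),
    embedding_model="embed-v1",  # hypothetical model name
    embedding_dim=4,
    max_age=timedelta(days=7),
)
now = datetime(2026, 1, 15, tzinfo=timezone.utc)
chunk = {
    "text": "Refund policy...",
    "embedding": [0.1, 0.2, 0.3],
    "embedding_model": "embed-v2",
    "ingested_at": datetime(2026, 1, 1, tzinfo=timezone.utc),
}
print(violations(chunk, contract, now=now))
```

Alerting on the count of violations per ingestion run turns a silent retrieval regression into something that does show up on a dashboard.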

Eighty percent of enterprise RAG projects experience critical failures in production. The most insidious of those failures don't announce themselves.