Skip to main content

299 posts tagged with "observability"

View all tags

The SIEM Bill Your AI Feature Forgot to Include

· 10 min read
Tian Pan
Software Engineer

The math is simple and nobody did it. Pre-AI, a single user action — "summarize this ticket," "send this email" — produced one application log line. Post-AI, the same action emits a request log, an LLM call trace, a tool-invocation span for each tool the agent called, a retrieval span per chunk it read, a response log, and an eval log if you sample for offline scoring. The fan-out for one user click is now 30 to 50 records on the floor of your observability pipeline, and that's before retries, before sub-agents, before the planner-executor split that 2x's everything again.

You shipped an AI feature in Q1. In Q2, your security director walks into a budget review with a Splunk renewal that's 4x higher than last cycle. Nobody on the AI team is in the room. The conversation that happens next — about who owns the cost, why the threat-detection rules stopped working, and whether legal hold on every conversation is actually mandatory — is a conversation you should have had at design time and didn't, because the cost didn't show up on the LLM invoice. It showed up downstream, in a tool the AI team has never logged into.

Why The Weekly Transcript Review Beats Your AI Dashboard

· 12 min read
Tian Pan
Software Engineer

The most underpriced asset in your AI organization is the hour every week when three people sit in a room and read what your product actually said to users. Not the aggregate scores. Not the rolling averages. Not the dashboard. The actual transcripts. The verbatim outputs. The lazy phrasing the model has quietly settled into. The intent your taxonomy doesn't have a bucket for. The user trying for the third time to express what they want, in three different ways, while your eval rubric scores all three turns "satisfactory."

Teams who institutionalize this hour develop a mental model of their AI feature their dashboards will never surface. Teams who skip it ship for six months on metrics that look fine and learn at the next QBR that the median experience drifted somewhere unfortunate when nobody was looking.

The Brownout Pattern: When Your LLM Provider Is Slow but Not Down

· 10 min read
Tian Pan
Software Engineer

The pager that wakes you at 3 a.m. for an outage is the easy one. The provider returned 503 for forty minutes, your fallback kicked in, your runbook fired, your post-mortem writes itself. The pager that does not wake you — the one that lets your support queue fill up over six hours while every dashboard stays green — is the brownout. The provider's API still answers. The status page still says "operational." Your p99 latency has quietly drifted from 2.1 seconds to 14 seconds, your error rate from 0.1% to 4%, and the only people who noticed are the users who already left.

Provider availability is not binary. The fallback story most teams write — "if provider is down, switch to backup" — is a state machine with two states for a continuous variable, and it does not fire when the provider is sad rather than dead. Building for brownouts is a different design problem than building for outages, and almost every production agent harness I have seen ships without solving it.

Eval Set Rot: Why Your Score Trends Up While Users Trend Down

· 10 min read
Tian Pan
Software Engineer

The eval score has been trending up for two quarters. The dashboard is green, the regression suite has not flagged a real failure since March, and the team has gotten faster at shipping prompt changes because the eval gives crisp pass/fail answers. Meanwhile, user-reported quality is sliding. NPS is down four points, the support queue is full of failure modes nobody has labels for, and the head of product has started asking why the evals look great if customers are angry.

The eval set is not lying. It is answering the question it was built to answer, six months ago, against the traffic distribution that existed in launch week. The product has shifted. The user base has shifted. The long-tail use cases the team did not anticipate at launch now make up a third of traffic. The eval set is still measuring the world that existed in week one, and the team is averaging today's model against yesterday's product.

This is eval set rot. It is one of the quietest failure modes in modern AI engineering, and it gets worse as the eval set gets bigger, because the people maintaining it confuse "more cases" with "better coverage."

Latency Budgets for Multi-Step Agents: Why P50 Lies and P99 Is What Users Feel

· 10 min read
Tian Pan
Software Engineer

The dashboard said the agent was fast. P50 sat at 1.2 seconds, the team had a meeting to celebrate, and then the abandonment rate kept climbing. Nobody was looking at the graph the user actually lives on.

This is the reliable failure mode of multi-step agents in production: the median is the metric you can hit, the tail is the metric your users feel, and the gap between the two grows non-linearly with every sub-call you bolt onto the pipeline. A four-step agent where each step is "fast at the median" routinely produces a P99 that is six or eight times worse than any single step. Users do not experience the median. They experience the worst step in their particular trip.

If your team optimizes the wrong percentile, you will ship a system that benchmarks well, demos beautifully, and bleeds users in the long tail you never instrumented.

Agent Incident Forensics: Capture Before You Need It

· 11 min read
Tian Pan
Software Engineer

The customer sends a screenshot to support on a Tuesday. Their account shows a refund posted six days ago that they never asked for. Your CRO forwards the screenshot with one question: "What produced this?" You know an agent did it — the audit log says actor: refund-agent-v3. But the prompt has been edited four times since. The model id rotated last Thursday when finance switched providers to chase a 12% cost cut. The system prompt is templated from three retrieved documents, and the retrieval index was reindexed Monday. The conversation history was trimmed by the runtime to fit a smaller context window.

You can tell the CRO the agent did it. You cannot tell them why. That gap — between knowing an action happened and being able to reconstruct the inputs that caused it — is the gap most agent teams discover the first time someone outside engineering asks a real forensic question.

Your Agent Release Notes List Files. Your Integrators Need Behavior Diffs.

· 13 min read
Tian Pan
Software Engineer

A platform team ships their weekly agent release on a Wednesday afternoon. The internal changelog is dutiful: three system-prompt commits, a model-alias bump from a -0815 snapshot to -1019, four edits to tool descriptions, a new eval-rubric weighting, and a refreshed retriever index. By Friday, the support queue has eighteen tickets that nobody on the platform team can pattern-match. Tickets two and seven say "the bot is suddenly refusing to summarize private repos." Ticket eleven says "every code block in the output now starts with a language tag, and our downstream parser breaks on it." Ticket fifteen says "tool X is being called twice as often on long inputs and we're hitting our rate limit."

None of these tickets reference any of the lines in the changelog. The platform team's release notes are a list of files moved. The integrator tickets are a list of behaviors changed. The two documents do not meet in the middle, and that gap is where the trust leaks out.

Agent Trace Sampling: When 'Log Everything' Costs $80K and Still Misses the Regression

· 10 min read
Tian Pan
Software Engineer

The bill arrived in March. Eighty-one thousand dollars on traces alone, up from twelve in November. The team had turned on full agent tracing in October on the theory that more visibility was always better. By Q1 the observability line was running ahead of the inference line — and when an actual regression hit production, the trace that contained the failure was buried under twenty million successful spans nobody needed.

The mistake was not the decision to instrument. The mistake was importing a request-tracing mental model into a workload that does not behave like requests.

A typical web request produces a span tree with a handful of children: handler, database call, cache lookup, downstream service. An agent request produces a tree with five LLM calls, three tool invocations, two vector lookups, intermediate scratchpads, and a planner that reconsiders three of those steps. The same sampling policy that worked for the API gateway — head-sample 1%, keep everything else representative — produces a trace store where the median trace is a 200-span monster, the long tail is the only thing that matters, and the rate at which you discover incidents is uncorrelated with the rate at which you spend money.

Argument Hallucination Is a Drift Signal, Not a Model Bug

· 10 min read
Tian Pan
Software Engineer

The ticket says "model hallucinated a user ID." The triage label is model-quality. The fix is one more sentence in the system prompt. Six weeks later a different tool starts hallucinating a date format, and the loop runs again. After a year of this, the prompt has grown into a 4,000-token apology for the entire backend, and the team is convinced the model is just unreliable on tool arguments.

The model isn't unreliable. The model is a contract-conformance machine reading the contract you gave it — and the contract you gave it has been quietly drifting away from the contract on the other side of the wire. Most production "argument hallucinations" are not model failures. They are integration tests your tool description is silently failing, surfacing as model output because that is the only place in the stack where the divergence becomes visible.

Why Your Bias Eval Passes in CI and Fails in Deployment

· 10 min read
Tian Pan
Software Engineer

The fairness audit was a green checkmark in the release pipeline. The compliance team signed it off in March. The support tickets started landing in October — a cohort of users in a country the model had never been graded on, getting answers a fraction as useful as everyone else. Nothing about the model had changed. The audit had never been wrong about the model. It had been wrong about the world.

This is the failure mode that no one wants to name out loud: a static bias eval is a snapshot of fairness in a stream that has already drifted. The eval was not lying when it ran. It was telling you a true thing about a distribution that no longer existed. By the time the support team has enough tickets to file a pattern, the model has been unfair to that cohort for two quarters and the audit is a year stale.

Eval Sets Have Seasons: Why Quality Drops on the First Monday of Tax Season

· 12 min read
Tian Pan
Software Engineer

The dashboard fired its first regression alert on a Monday morning in late January. Quality score on the support assistant dropped three points overnight. No prompt change shipped over the weekend. No model swap. The eval suite — a hand-curated 800-row gold set that the team had built six months earlier — was unchanged. Somebody opened an incident.

Two days of bisecting later, the answer was uninteresting and structural. It was the first business Monday after the IRS opened tax filing for the year. Half the inbound queries had shifted from "where is my paycheck deposit" to "how do I report a 1099-K from a payment app." The eval set, sampled in summer, had nothing to say about a 1099-K. The model wasn't worse. The customer was different. The gate was calibrated against a customer who no longer existed.

This pattern repeats every quarter in every product that has a seasonal user — fintech in tax season, sales tools at end-of-quarter, education at back-to-school, e-commerce in returns season, travel at booking season, healthcare at enrollment season. The eval-set-as-fixed-asset is a comfortable abstraction, and it is wrong on a calendar that nobody updates.

Your Gold Eval Set Has Drifted and Its Pass Rate Is the Reason You Can't See It

· 12 min read
Tian Pan
Software Engineer

The gold eval set passes at 94%. The model has been bumped twice this quarter, the prompt has been edited eleven times, the tool catalog has grown by four, and the dashboard is still green. Then a sales engineer forwards a transcript where the agent confidently routes a customer to a workflow that was sunset two months ago, and the head of support quietly opens a thread asking why the satisfaction scores have been sliding for six weeks while the eval pipeline reports no regressions. The gold set isn't lying. It's measuring last quarter's product against this quarter's traffic, and nobody asked it to do anything else.

This is the failure mode evaluation systems make hardest to see, because the instrument that's supposed to detect quality regressions is itself the source of the false positive. Pass rate is computed against the items in the set; the items in the set were curated against a snapshot of usage; usage moved on; the rate stayed clean. The team trusts the green dashboard, ships another model upgrade, and discovers months later that the production distribution has been measuring something different than the eval set has been measuring for longer than anyone wants to admit.

The fix is not to refresh the gold set more often. Refresh cadence is the wrong knob; the right knob is having a second instrument calibrated to a different time window so disagreement between the two surfaces drift before users do. That second instrument is the shadow eval — a parallel set rebuilt continuously from current production traffic, run alongside the gold set, with the explicit job of disagreeing with it.