161 posts tagged with "observability"

AI On-Call Psychology: Rebuilding Operator Intuition for Non-Deterministic Alerts

· 11 min read
Tian Pan
Software Engineer

The first time an on-call engineer closes a page with "the model was just being weird again," the team has quietly crossed a line. That phrase does three things at once: it declares the issue un-investigable, it classifies future similar alerts as noise, and it absolves the rotation of documenting what happened. A week later the same signature will fire, someone else will see "already dismissed once," and a real regression will live in production until a customer tweets about it.

This pattern is not laziness. It is the predictable outcome of running standard SRE intuition on a system that no longer behaves deterministically. Classical on-call training teaches engineers to treat identical inputs producing different outputs as a bug in the observability stack — it cannot be a bug in the system, because systems don't do that. LLM-backed systems do exactly that, every request, by design. An on-call rotation built without internalizing this will drift toward either paralysis (every stochastic wobble is a P2) or nihilism (the model is always weird, stop paging me).

AI Product Metrics That Don't Lie: Behavioral Signals Over Thumbs-Up Scores

· 9 min read
Tian Pan
Software Engineer

Your AI feature has a 4.2/5 satisfaction score. Users click thumbs-up 68% of the time. The A/B test shows task completion rate is up 12%. Your team ships it. Six weeks later, users have quietly routed around it for anything they actually care about.

This is metric theater. You optimized for signals that look like success but aren't. The feedback you collected came from the 8% of users who bother rating anything — skewed toward the delighted and the furious, silent on the vast middle who found the feature unreliable just often enough to stop trusting it.

Building AI features requires a different measurement philosophy than traditional software. The signals you instrument from day one determine whether you learn fast enough to improve or spend six months chasing a satisfaction score that doesn't move.
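
What does that look like in instrumentation? A minimal sketch, assuming hypothetical event names and a made-up trust proxy; the real signal set depends on your product surface:

```python
from collections import Counter
from dataclasses import dataclass, field

# Illustrative implicit signals; the events worth logging depend on
# your product surface.
BEHAVIORAL_EVENTS = {
    "accepted_output",      # user kept the AI's answer as-is
    "edited_output",        # user modified it before using it
    "regenerated",          # user asked for another attempt
    "abandoned",            # user left mid-interaction
    "did_task_manually",    # user routed around the feature entirely
}

@dataclass
class SessionSignals:
    events: Counter = field(default_factory=Counter)

    def log(self, event: str) -> None:
        assert event in BEHAVIORAL_EVENTS, f"unknown event: {event}"
        self.events[event] += 1

    def trust_proxy(self) -> float:
        """Fraction of interactions where the user acted on the output.
        Unlike a thumbs-up rate, every user contributes to this number,
        not just the minority who bother rating anything."""
        positive = self.events["accepted_output"] + self.events["edited_output"]
        total = sum(self.events.values())
        return positive / total if total else 0.0

session = SessionSignals()
session.log("accepted_output")
session.log("regenerated")
session.log("did_task_manually")
print(f"trust proxy: {session.trust_proxy():.2f}")  # 0.33
```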

Debugging AI at 3am: Incident Response for LLM-Powered Systems

· 10 min read
Tian Pan
Software Engineer

You're on-call. It's 3am. Your alert fires: customer satisfaction on the AI chat feature dropped 18% in the last hour. You open the logs and see... nothing. Every request returned HTTP 200. Latency is normal. No errors anywhere.

This is the AI incident experience. Traditional on-call muscle memory — grep for stack traces, find the exception, deploy the fix — doesn't work here. The system isn't broken. It's doing exactly what it was designed to do. The outputs are just wrong.
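
What works instead is alerting on scored output samples rather than status codes. A hedged sketch; the scorer below is a placeholder for an LLM judge or a task-specific heuristic:

```python
import statistics

def quality_score(response: str) -> float:
    """Placeholder scorer. In production this might be an LLM judge, an
    embedding check against known-good answers, or a task heuristic.
    Every response scored here already returned HTTP 200."""
    text = response.strip().lower()
    return 1.0 if text and "i'm sorry" not in text else 0.0

def should_page(recent_scores: list[float], baseline_mean: float,
                drop_threshold: float = 0.15) -> bool:
    """Page on a drop in sampled output quality against a rolling
    baseline, because error rate and latency will not move."""
    if len(recent_scores) < 30:  # avoid paging on stochastic wobble
        return False
    return baseline_mean - statistics.mean(recent_scores) > drop_threshold
```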

The Eval Smell Catalog: Anti-Patterns That Make Your LLM Eval Suite Worse Than No Evals At All

· 12 min read
Tian Pan
Software Engineer

A team I worked with last year had an eval suite with 847 test cases, a green dashboard, and a shipping cadence that looked disciplined from the outside. Then their flagship summarization feature started generating confidently wrong summaries for roughly one in twenty customer support threads. The eval score for that capability had been 94% for six months straight. When we audited the suite, the problem wasn't that the evals were lying. The problem was that the evals had quietly rotted into something that measured the wrong thing, punished correct model behavior, and shared blind spots with the very model it was evaluating. The suite wasn't broken in the loud way tests break. It was broken in the way a thermometer is broken when it reads room temperature no matter where you put it.

Test smells have been studied for two decades in traditional software. The Van Deursen catalog, the xUnit patterns taxonomy, and more recent work have documented how tests that look fine can actively harm a codebase — by encoding the wrong specification, by making refactors expensive, by creating false confidence that pushes the real bugs deeper. LLM evals are new enough that the equivalent literature barely exists, but the same dynamic is already playing out in every AI team I talk to. The difference is that LLM eval smells have mechanisms traditional tests don't: training data overlap, stochastic outputs, judge-model feedback loops, capability drift. You can't just port the old taxonomy. You need a new one.
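
Some smells in the new taxonomy are cheap to detect mechanically. One example, the score that never moves, can be flagged from run history alone; a sketch assuming you store per-case scores across runs:

```python
def find_dead_evals(history: dict[str, list[float]],
                    min_runs: int = 10) -> list[str]:
    """Flag eval cases whose score has been identical across every
    recorded run. On a stochastic system, a score that never moves is
    the thermometer reading room temperature: more likely measuring
    nothing than measuring something perfectly stable."""
    dead = []
    for case_id, scores in history.items():
        if len(scores) >= min_runs and len(set(scores)) == 1:
            dead.append(case_id)
    return dead

history = {
    "summarize-thread-001": [0.94] * 12,         # suspicious: never moves
    "summarize-thread-002": [0.91, 0.88, 0.93],  # too few runs to judge
}
print(find_dead_evals(history))  # ['summarize-thread-001']
```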

The Implicit API Contract: What Your LLM Provider Doesn't Document

· 10 min read
Tian Pan
Software Engineer

Your LLM provider's SLA covers HTTP uptime and Time to First Token. It says nothing about whether the model will still follow your formatting instructions next month, refuse requests it accepted last week, or return valid JSON under edge-case conditions you haven't tested. Most engineering teams discover this the hard way — via a production incident, not a changelog.

This is the implicit API contract problem. Traditional APIs promise stable, documented behavior. LLM providers promise a connection. Everything between the request and what your application does with the response is on you.
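
Making the implicit contract explicit starts with a pinned probe suite that runs on a schedule against the live API. A minimal sketch; `PROBES` and `call_model` are illustrative stand-ins for your own checks and your provider SDK wrapper:

```python
import json

# Pinned probes encoding behavior your application depends on but the
# provider never promised. Run daily; diff the results over time.
PROBES = [
    {
        "name": "returns_valid_json",
        "prompt": "List three primary colors as a JSON array of strings.",
        "check": lambda out: isinstance(json.loads(out), list),
    },
    {
        "name": "respects_length_instruction",
        "prompt": "In exactly one sentence, define latency.",
        "check": lambda out: out.count(".") <= 2,
    },
]

def run_contract_suite(call_model) -> dict[str, bool]:
    """`call_model` is whatever function wraps your provider's SDK.
    A probe flipping from pass to fail between runs is a silent
    contract change, which is exactly what the SLA will not tell
    you about."""
    results = {}
    for probe in PROBES:
        try:
            results[probe["name"]] = bool(probe["check"](call_model(probe["prompt"])))
        except Exception:
            results[probe["name"]] = False
    return results
```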

Multi-Session Eval Design: Catching the AI Feature That Gets Worse Over Time

· 11 min read
Tian Pan
Software Engineer

Your AI feature passed every eval at launch. Six weeks in, churn in the cohort that talks to it most has doubled, while your CSAT dashboard shows a flat line no one can explain. The prompts haven't changed, the model hasn't been swapped, and the retrieval index has grown, but nobody thinks it's broken. What shipped was fine on turn one. What rots is what happens on turn four hundred, in session seventeen, three weeks after signup.

Most teams' eval suites can't see this failure. They test single-turn accuracy on a fixed dataset, maybe single-session multi-turn if they're ambitious, and then declare the feature shippable. The failure mode that matters — quality that degrades as the system accumulates state about a user — lives in a temporal dimension the eval harness was never built to cover. Researchers call it "self-degradation" in the memory literature: a clear, sustained performance decline after the initial phase, driven by memory inflation and the accumulation of flawed memories. Production engineers call it the reason their retention cohort silently bleeds.
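
Covering that dimension means the harness has to replay a user across sessions and report quality per session index, not as one aggregate. A skeleton sketch, with `agent` and `scorer` as assumed placeholders:

```python
def multi_session_eval(agent, scorer,
                       scripted_sessions: list[list[str]]) -> list[float]:
    """Replay one synthetic user across consecutive sessions, letting the
    agent accumulate whatever memory it accumulates in production. The
    result is quality per session index, which is where self-degradation
    shows up and where a single aggregate number hides it."""
    per_session = []
    for turns in scripted_sessions:            # session 1, session 2, ...
        scores = [scorer(turn, agent.respond(turn)) for turn in turns]
        per_session.append(sum(scores) / len(scores))
    return per_session

# A downward slope across indices, e.g. [0.92, 0.90, 0.84, 0.71], is the
# regression that single-session evals structurally cannot see.
```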

The Prompt Entropy Budget: Measuring Output Variance as a First-Class Production Metric

· 11 min read
Tian Pan
Software Engineer

When your LLM feature ships, your monitoring dashboard probably tracks accuracy, latency, and error rate. What it almost certainly does not track is variance — how wildly different the output is each time a user sends the same prompt. That gap is where production AI features quietly collapse.

Variance determines whether your product feels trustworthy or capricious. A feature that scores 88% on your eval suite but delivers a two-sentence answer 40% of the time and a ten-paragraph essay the other 60% will erode user trust faster than one that scores 80% but behaves consistently. Teams optimizing exclusively for accuracy are solving the wrong half of the reliability problem.

The prompt entropy budget is the concept that fills this gap: a structured approach to measuring, budgeting, and controlling the distribution of outputs your model produces over identical inputs — treated the same way you treat p99 latency or error budget in your SLO framework.
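
Measuring the budget can start cheap: sample N completions for an identical input and compute a dispersion statistic, then gate on it like any other SLO. A sketch using length dispersion as the simplest proxy; real deployments would likely add semantic measures such as pairwise embedding distance:

```python
import statistics

def length_dispersion(completions: list[str]) -> float:
    """Coefficient of variation of completion lengths. 0.0 means every
    sample had the same shape; the two-sentences-or-essay failure mode
    shows up as a value well above 0.5."""
    lengths = [len(c.split()) for c in completions]
    if len(lengths) < 2:
        return 0.0
    mean = statistics.mean(lengths)
    return statistics.stdev(lengths) / mean if mean else 0.0

def over_budget(completions: list[str], budget: float = 0.35) -> bool:
    """Treat the budget like an SLO: identical input, N samples, and the
    dispersion must stay under the agreed ceiling."""
    return length_dispersion(completions) > budget

samples = [
    "Latency is the delay before data transfer begins.",
    "Latency is the delay before a transfer of data begins following an "
    "instruction. It is usually measured in milliseconds and matters most "
    "for interactive workloads, where users perceive anything over about "
    "100 ms as sluggish.",
]
print(over_budget(samples))  # True: same prompt, wildly different shapes
```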

Agentic Audit Trails: What Compliance Looks Like When Decisions Are Autonomous

· 12 min read
Tian Pan
Software Engineer

When a human loan officer denies an application, there is a name attached to that decision. That officer received specific information, deliberated, and acted. The reasoning may be imperfect, but it is attributable. There is someone to call, question, and hold accountable.

When an AI agent denies that same application, there is a database row. The row says the decision was made. It does not say why, or what inputs drove it, or which version of the model was running, or whether the system prompt had been quietly updated two weeks prior. When your compliance team hands that row to a regulator, the regulator is not satisfied.

This is the agentic audit trail problem, and most engineering teams building on AI agents have not solved it yet.
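
Solving it starts with widening what gets written at decision time. A hedged sketch of an audit record; the field names are illustrative, but everything a regulator will ask about is captured when the decision happens, not reconstructed afterward:

```python
import hashlib, json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass(frozen=True)
class AgentDecisionRecord:
    """Everything 'the database row' leaves out, captured at decision time."""
    decision_id: str
    decision: str                 # e.g. "deny"
    reasoning_summary: str        # the agent's stated rationale
    inputs_digest: str            # hash of the exact inputs the agent saw
    model_version: str            # pinned model identifier
    system_prompt_sha256: str     # detects the quietly-updated prompt
    tool_calls: list[str]         # what the agent consulted, in order
    decided_at: str

def record_decision(decision_id, decision, reasoning, inputs: dict,
                    model_version, system_prompt, tool_calls):
    return AgentDecisionRecord(
        decision_id=decision_id,
        decision=decision,
        reasoning_summary=reasoning,
        inputs_digest=hashlib.sha256(
            json.dumps(inputs, sort_keys=True).encode()).hexdigest(),
        model_version=model_version,
        system_prompt_sha256=hashlib.sha256(system_prompt.encode()).hexdigest(),
        tool_calls=list(tool_calls),
        decided_at=datetime.now(timezone.utc).isoformat(),
    )

# Append asdict(record) to an append-only store; mutability defeats the point.
```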

Debugging LLM Failures Systematically: A Field Guide for Engineers Who Can't Read Logs

· 12 min read
Tian Pan
Software Engineer

A fintech startup added a single comma to their system prompt. The next day, their invoice generation bot was outputting gibberish and they'd lost $8,500 before anyone traced the cause. No error was thrown. No alert fired. The application kept running, confident and wrong.

This is what debugging LLMs in production actually looks like. There are no stack traces pointing to line numbers. There's no core dump you can inspect. The system doesn't crash — it continues to operate while silently producing degraded output. Traditional debugging instincts don't transfer. Most engineers respond by randomly tweaking prompts until something looks better, deploying based on three examples, and calling it fixed. Then the problem resurfaces two weeks later in a different shape.

There's a better way. LLM failures follow systematic patterns, and those patterns respond to structured investigation. This is the methodology.
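
The methodology begins before the incident, with a regression gate on every prompt change, including the one-character ones. A sketch assuming a small golden set of inputs with check predicates, and a `call_model` wrapper around your provider:

```python
def gate_prompt_change(call_model, new_prompt: str,
                       golden_set: list[dict], pass_rate: float = 0.95) -> bool:
    """Replay a golden set against the candidate prompt before deploy.
    `call_model(prompt, case_input)` wraps your provider call; each case
    carries a `check` predicate rather than an exact expected string,
    because outputs are stochastic. A one-comma diff runs through the
    same gate as a rewrite: small diffs are not small changes in prompt
    space."""
    passed = sum(
        1 for case in golden_set
        if case["check"](call_model(new_prompt, case["input"]))
    )
    return passed / len(golden_set) >= pass_rate
```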

Why Gradual Rollouts Don't Work for AI Features (And What to Do Instead)

· 9 min read
Tian Pan
Software Engineer

Canary deployments work because bugs are binary. Code either crashes or it doesn't. You route 1% of traffic to the new version, watch error rates and latency for 30 minutes, and either roll back or proceed. The system grades itself. A bad deploy announces itself loudly.

AI features don't do that. A language model that starts generating subtly wrong advice, outdated recommendations, or plausible-sounding nonsense will produce zero 5xx errors. Latency stays within SLOs. The canary looks green while the product is silently failing its users.

This isn't a tooling problem. It's a conceptual mismatch. The entire mental model behind gradual rollouts — deterministic code, self-grading systems, binary pass/fail — breaks down the moment you introduce a component whose correctness cannot be measured by observing the request itself.
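
The alternative is a canary that is graded externally: the same traffic split, but the gate compares scored output quality between arms instead of 5xx rates. A minimal sketch; the scores are assumed to come from an LLM judge, heuristics, or sampled human labels:

```python
import statistics

def canary_verdict(control_scores: list[float], canary_scores: list[float],
                   min_samples: int = 200, max_drop: float = 0.05) -> str:
    """Grade the canary on scored outputs, not on errors it will never
    throw. Scores come from an external grader applied to sampled
    responses from both arms."""
    if min(len(control_scores), len(canary_scores)) < min_samples:
        return "keep_waiting"   # AI canaries need far more than 30 minutes
    drop = statistics.mean(control_scores) - statistics.mean(canary_scores)
    return "rollback" if drop > max_drop else "proceed"
```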

The Operational Model Card: Deployment Documentation Labs Don't Publish

· 11 min read
Tian Pan
Software Engineer

A model card tells you whether a model was red-teamed for CBRN misuse and which demographic groups it underserves. What it doesn't tell you: the p95 TTFT at 10,000 concurrent requests, the accuracy cliff at 80% of the advertised context window, the percentage of complex JSON schemas it malforms, or how much the model's behavior has drifted since the card was published.

The gap is structural, not accidental. Model cards were designed in 2019 for fairness and safety documentation, with civil society organizations and regulators as the intended audience. Engineering teams shipping production systems were not the use case. Seven years of adoption later, that framing is unchanged — while the cost of treating a model card as a deployment specification has never been higher.

The 2025 Foundation Model Transparency Index (Stanford CRFM + Berkeley) confirmed the scope of the omission: OpenAI scored 24/100, Anthropic 32/100, Google 27/100 across 100 transparency indicators. Average scores dropped from 58 to 40 year-over-year, meaning AI transparency is getting worse, not better, as models get more capable. None of the four major labs disclose training data composition, energy usage, or deployment-relevant performance characteristics.
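
Until labs publish them, teams can maintain their own operational card: a machine-readable record of measured, deployment-relevant characteristics, re-measured whenever behavior shifts. A sketch with illustrative fields and placeholder numbers:

```python
from dataclasses import dataclass

@dataclass
class OperationalModelCard:
    """The half of the model card the lab doesn't publish: measured by
    your own harness, re-measured on every provider change you detect."""
    model_id: str
    measured_at: str
    p95_ttft_ms_at_load: float        # at your concurrency, not the demo's
    usable_context_tokens: int        # where accuracy cliffs, not the ad copy
    json_schema_failure_rate: float   # complex schemas, your schemas
    drift_vs_card_baseline: float     # behavioral diff since last measurement

# Values below are placeholders, not measurements of any real model.
card = OperationalModelCard(
    model_id="example-model-2025-01",
    measured_at="2025-01-15",
    p95_ttft_ms_at_load=850.0,
    usable_context_tokens=104_000,    # cliff found well under the advertised window
    json_schema_failure_rate=0.031,
    drift_vs_card_baseline=0.12,
)
```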

Silent Async Agent Failures: Why Your AI Jobs Die Without Anyone Noticing

· 8 min read
Tian Pan
Software Engineer

Async AI jobs have a problem that traditional background workers don't: they fail silently and confidently. A document processing agent returns HTTP 200, logs a well-formatted result, and moves on — while the actual output is subtly wrong, partially complete, or based on a hallucinated fact three steps back. Your dashboards stay green. Your on-call engineer sleeps through it. Your customers eventually notice.

This is not an edge case. It's the default behavior of async AI systems that haven't been deliberately designed for observability. The tools that keep background job queues reliable in conventional distributed systems — dead letter queues, idempotency keys, saga logs — also work for AI agents. But the failure modes are different enough that they require some translation.
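
Most of that translation happens at the acknowledgment step: an AI job is not done when it returns, it is done when its output passes validation, and anything that fails goes to a dead letter queue with enough context to replay. A sketch with illustrative checks:

```python
import queue

dead_letters: "queue.Queue[dict]" = queue.Queue()

def semantic_checks(output: dict) -> list[str]:
    """Illustrative validators; the real list is task-specific."""
    failures = []
    if not output.get("fields"):
        failures.append("empty_extraction")   # 'complete' but hollow
    if output.get("confidence", 1.0) < 0.6:
        failures.append("low_confidence")
    return failures

def complete_job(job: dict, output: dict) -> bool:
    """Ack only after the output survives semantic validation; an HTTP 200
    and a well-formatted result are not evidence the work was done."""
    failures = semantic_checks(output)
    if failures:
        dead_letters.put({"job": job, "output": output, "failures": failures})
        return False   # leave the job unacked for retry or human review
    return True
```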