299 posts tagged with "observability"

The Prompt Entropy Budget: Measuring Output Variance as a First-Class Production Metric

· 11 min read
Tian Pan
Software Engineer

When your LLM feature ships, your monitoring dashboard probably tracks accuracy, latency, and error rate. What it almost certainly does not track is variance: how much the output differs each time a user sends the same prompt. That gap is where production AI features quietly collapse.

Variance determines whether your product feels trustworthy or capricious. A feature that scores 88% on your eval suite but delivers a two-sentence answer 40% of the time and a ten-paragraph essay the other 60% will erode user trust faster than one that scores 80% but behaves consistently. Teams optimizing exclusively for accuracy are solving the wrong half of the reliability problem.

The prompt entropy budget is the concept that fills this gap: a structured approach to measuring, budgeting, and controlling the distribution of outputs your model produces over identical inputs, treated the same way you treat p99 latency or an error budget in your SLO framework.
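One way to make the idea concrete is to sample the same prompt several times and score the spread of the results. The sketch below is only an illustration, not the post's own method: the function names, the choice of Shannon entropy over whitespace-normalized outputs, and the length-based coefficient of variation are all assumptions made for the example.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def output_entropy(outputs):
    """Shannon entropy (in bits) over distinct outputs after light normalization."""
    # Assumption: lowercasing and collapsing whitespace is "same enough" for this metric.
    normalized = [" ".join(o.lower().split()) for o in outputs]
    counts = Counter(normalized)
    total = len(normalized)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def length_spread(outputs):
    """Coefficient of variation of word counts (population std dev / mean)."""
    lengths = [len(o.split()) for o in outputs]
    avg = mean(lengths)
    return pstdev(lengths) / avg if avg else 0.0


def within_budget(outputs, max_entropy_bits, max_length_cv):
    """True when both variance measures stay inside the (hypothetical) budget."""
    return (
        output_entropy(outputs) <= max_entropy_bits
        and length_spread(outputs) <= max_length_cv
    )
```

In practice the `outputs` list would come from replaying one prompt N times against the deployed model. Exact-match entropy is a blunt instrument for free-form text, so a semantic similarity measure would likely be needed for anything beyond short, structured answers.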

Agentic Audit Trails: What Compliance Looks Like When Decisions Are Autonomous

· 12 min read
Tian Pan
Software Engineer

When a human loan officer denies an application, there is a name attached to that decision. That officer received specific information, deliberated, and acted. The reasoning may be imperfect, but it is attributable. There is someone to call, question, and hold accountable.

When an AI agent denies that same application, there is a database row. The row says the decision was made. It does not say why, or what inputs drove it, or which version of the model was running, or whether the system prompt had been quietly updated two weeks prior. When your compliance team hands that row to a regulator, the regulator is not satisfied.

This is the agentic audit trail problem, and most engineering teams building on AI agents have not solved it yet.
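As a rough illustration of what the missing row could carry, here is a minimal sketch of a decision record that captures the inputs, the model version, and a hash of the system prompt alongside the decision itself. The field names and the `record_decision` helper are hypothetical, chosen for this example rather than taken from any compliance standard.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class DecisionRecord:
    decision: str               # what the agent decided
    inputs: dict                # the inputs that drove the decision
    model_version: str          # which model was running
    system_prompt_sha256: str   # detects quiet system prompt updates
    rationale: str              # the agent's stated reasoning, stored verbatim
    recorded_at: str            # UTC timestamp, ISO 8601


def sha256_text(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def record_decision(decision, inputs, model_version, system_prompt, rationale):
    """Build an audit record at decision time (hypothetical helper)."""
    return DecisionRecord(
        decision=decision,
        inputs=inputs,
        model_version=model_version,
        system_prompt_sha256=sha256_text(system_prompt),
        rationale=rationale,
        recorded_at=datetime.now(timezone.utc).isoformat(),
    )


def to_log_line(record):
    """Serialize with sorted keys so records diff cleanly."""
    return json.dumps(asdict(record), sort_keys=True)
```

Hashing the system prompt rather than storing it inline keeps the row small while still showing that the prompt changed between two decisions. The full prompt text would need to be stored somewhere addressable by that hash.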

Debugging LLM Failures Systematically: A Field Guide for Engineers Who Can't Read Logs

· 12 min read
Tian Pan
Software Engineer

A fintech startup added a single comma to their system prompt. The next day, their invoice generation bot was outputting gibberish and they'd lost $8,500 before anyone traced the cause. No error was thrown. No alert fired. The application kept running, confident and wrong.

This is what debugging LLMs in production actually looks like. There are no stack traces pointing to line numbers. There's no core dump you can inspect. The system doesn't crash — it continues to operate while silently producing degraded output. Traditional debugging instincts don't transfer. Most engineers respond by randomly tweaking prompts until something looks better, deploying based on three examples, and calling it fixed. Then the problem resurfaces two weeks later in a different shape.

There's a better way. LLM failures follow systematic patterns, and those patterns respond to structured investigation. This is the methodology.

Why Gradual Rollouts Don't Work for AI Features (And What to Do Instead)

· 9 min read
Tian Pan
Software Engineer

Canary deployments work because bugs are binary. Code either crashes or it doesn't. You route 1% of traffic to the new version, watch error rates and latency for 30 minutes, and either roll back or proceed. The system grades itself. A bad deploy announces itself loudly.

AI features don't do that. A language model that starts generating subtly wrong advice, outdated recommendations, or plausible-sounding nonsense will produce zero 5xx errors. Latency stays within SLOs. The canary looks green while the product is silently failing its users.

This isn't a tooling problem. It's a conceptual mismatch. The entire mental model behind gradual rollouts — deterministic code, self-grading systems, binary pass/fail — breaks down the moment you introduce a component whose correctness cannot be measured by observing the request itself.

The Operational Model Card: Deployment Documentation Labs Don't Publish

· 11 min read
Tian Pan
Software Engineer

A model card tells you whether a model was red-teamed for CBRN (chemical, biological, radiological, and nuclear) misuse and which demographic groups it underserves. What it doesn't tell you: the p95 time to first token (TTFT) at 10,000 concurrent requests, the accuracy cliff at 80% of the advertised context window, the percentage of complex JSON schemas it malforms, or how much the model's behavior has drifted since the card was published.

The gap is structural, not accidental. Model cards were designed in 2019 for fairness and safety documentation, with civil society organizations and regulators as the intended audience. Engineering teams shipping production systems were not the use case. Seven years of adoption later, that framing is unchanged — while the cost of treating a model card as a deployment specification has never been higher.

The 2025 Foundation Model Transparency Index (Stanford CRFM + Berkeley) confirmed the scope of the omission: OpenAI scored 24/100, Anthropic 32/100, Google 27/100 across 100 transparency indicators. Average scores dropped from 58 to 40 year-over-year, meaning AI transparency is getting worse, not better, as models get more capable. None of the four major labs disclose training data composition, energy usage, or deployment-relevant performance characteristics.

Silent Async Agent Failures: Why Your AI Jobs Die Without Anyone Noticing

· 9 min read
Tian Pan
Software Engineer

Async AI jobs have a problem that traditional background workers don't: they fail silently and confidently. A document processing agent returns HTTP 200, logs a well-formatted result, and moves on — while the actual output is subtly wrong, partially complete, or based on a hallucinated fact three steps back. Your dashboards stay green. Your on-call engineer sleeps through it. Your customers eventually notice.

This is not an edge case. It's the default behavior of async AI systems that haven't been deliberately designed for observability. The tools that keep background job queues reliable in conventional distributed systems — dead letter queues, idempotency keys, saga logs — also work for AI agents. But the failure modes are different enough that they require some translation.

The AI Rollback Ritual: Post-Incident Recovery When the Damage Is Behavioral, Not Binary

· 11 min read
Tian Pan
Software Engineer

In April 2025, OpenAI deployed an update to GPT-4o. No version bump appeared in the API. No changelog entry warned developers. Within days, enterprise applications that had been running stably for months started producing outputs that were subtly, insidiously wrong — not crashing, not throwing errors, just enthusiastically agreeing with users about terrible ideas. A model that had been calibrated and tested was now validating harmful decisions with polished confidence. OpenAI rolled it back three days later. By then, some applications had already shipped those outputs to real users.

This is the failure mode that traditional SRE practice has no template for. There was no deploy to revert. There was no diff to inspect. There was no test that failed, because behavioral regressions don't fail tests — they degrade silently across distributions until someone notices the vibe is off.

Data Provenance for AI Systems: Why Tracking Answer Origins Is Now an Engineering Requirement

· 10 min read
Tian Pan
Software Engineer

A production LLM answers a user's question incorrectly. A support ticket arrives. You pull the logs. They show the prompt, the completion, and the latency — but nothing about which documents the retrieval system surfaced, which chunks landed in the context window, or which passage the model leaned on most heavily when it synthesized the answer. You're left doing archaeology: re-running the query against a corpus that has since been updated, hoping the same results come back, wondering if the bug is in retrieval, in chunking, in the document itself, or in the model's reasoning.

This is the data provenance gap, and most AI teams don't notice it until they're already in it.

When Your Database Migration Breaks Your AI Agent's World Model

· 9 min read
Tian Pan
Software Engineer

Your team ships a routine database migration on Tuesday — renaming last_login_date to last_activity_ts and expanding its semantics to include API calls. No service breaks. Tests pass. Dashboards update. But your AI agent, the one answering customer questions about user engagement, silently starts generating wrong answers. No error, no alert, no stack trace. It just confidently reasons over a world that no longer exists.

This is the schema migration problem that almost nobody in AI engineering has mapped. Your agent builds an implicit model of your data from tool descriptions, few-shot examples, and retrieval context. When the underlying schema changes, that model becomes a lie — and the agent has no mechanism to detect the contradiction.
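A simple guard against the rename half of this problem is to have each tool declare the columns it assumes and compare that declaration with the live schema at deploy time. The sketch below assumes both sides are available as plain dictionaries; `find_stale_columns` and the dictionary shapes are hypothetical.

```python
def find_stale_columns(assumed_columns, live_schema):
    """Return {table: [columns the agent assumes but the live schema lacks]}.

    assumed_columns: {table: [column, ...]} declared by tool descriptions
    live_schema:     {table: [column, ...]} read from the database
    """
    stale = {}
    for table, columns in assumed_columns.items():
        missing = sorted(set(columns) - set(live_schema.get(table, ())))
        if missing:
            stale[table] = missing
    return stale


# Example using the rename described above.
assumed = {"users": ["id", "last_login_date"]}
live = {"users": ["id", "last_activity_ts"]}
print(find_stale_columns(assumed, live))
```

This catches renamed or dropped columns only. A column whose name stays the same while its meaning widens, such as a timestamp that starts including API calls, passes this check and needs a separate review of the tool descriptions.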

Agent Behavioral Versioning: Why Git Commits Don't Capture What Changed

· 9 min read
Tian Pan
Software Engineer

You shipped an agent last Tuesday. Nothing in your codebase changed. On Thursday, it started refusing tool calls it had handled reliably for weeks. Your git log is clean, your tests pass, and your CI pipeline is green. But the agent is broken — and you have no version to roll back to, because the thing that changed wasn't in your repository.

This is the central paradox of agent versioning: the artifacts you track (code, configs, prompts) are necessary but insufficient to define what your agent actually does. The behavior emerges from the intersection of code, model weights, tool APIs, and runtime context — and any one of those can shift without leaving a trace in your version control system.
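One partial mitigation is to fingerprint everything you can observe about a deployment, not just the commit. The sketch below hashes a canonical JSON description of the code revision, model identifier, system prompt, tool schemas, and runtime config. The function name and the choice of fields are assumptions for illustration.

```python
import hashlib
import json


def behavior_fingerprint(code_commit, model_id, system_prompt, tool_schemas, runtime_config):
    """Hash the observable inputs to agent behavior into one identifier.

    tool_schemas and runtime_config must be JSON-serializable.
    """
    payload = {
        "code_commit": code_commit,
        "model_id": model_id,
        "system_prompt_sha256": hashlib.sha256(system_prompt.encode("utf-8")).hexdigest(),
        "tool_schemas": tool_schemas,
        "runtime_config": runtime_config,
    }
    # Sorted keys and fixed separators make the serialization deterministic.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Logging this value with every agent run makes it possible to ask which fingerprint was live on Thursday. It cannot see changes hidden behind an unchanged model identifier, so it narrows the search without replacing behavioral regression tests.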

Debug Your AI Agent Like a Distributed System, Not a Program

· 9 min read
Tian Pan
Software Engineer

Your agent worked perfectly in development. It answered test queries, called the right tools, and produced clean outputs. Then it hit production, and something went wrong on step seven of a twelve-step workflow. Your logs show the final output was garbage, but you have no idea why.

You add print statements. You scatter logger.debug() calls through your orchestration code. You stare at thousands of lines of output and realize you're debugging a distributed system with single-process tools. That's the fundamental mistake most teams make with AI agents — they treat them like programs when they behave like distributed systems.

AI Product Metrics Nobody Uses: Beyond Accuracy to User Value Signals

· 9 min read
Tian Pan
Software Engineer

A contact center AI system achieved 90%+ accuracy on its validation benchmark. Supervisors still instructed agents to type notes manually. The product was killed 18 months later for "low adoption." This pattern plays out repeatedly across enterprise AI deployments — technically excellent systems that nobody uses, measured by metrics that couldn't see the failure coming.

The problem is a systematic mismatch between what teams measure and what predicts product success. Engineering organizations inherit their measurement instincts from classical ML: accuracy, precision/recall, BLEU scores, latency percentiles, eval pass rates. These describe model behavior in isolation. They tell you almost nothing about whether your AI is actually useful.