Skip to main content

161 posts tagged with "observability"

View all tags

The AI Rollback Ritual: Post-Incident Recovery When the Damage Is Behavioral, Not Binary

· 11 min read
Tian Pan
Software Engineer

In April 2025, OpenAI deployed an update to GPT-4o. No version bump appeared in the API. No changelog entry warned developers. Within days, enterprise applications that had been running stably for months started producing outputs that were subtly, insidiously wrong — not crashing, not throwing errors, just enthusiastically agreeing with users about terrible ideas. A model that had been calibrated and tested was now validating harmful decisions with polished confidence. OpenAI rolled it back three days later. By then, some applications had already shipped those outputs to real users.

This is the failure mode that traditional SRE practice has no template for. There was no deploy to revert. There was no diff to inspect. There was no test that failed, because behavioral regressions don't fail tests — they degrade silently across distributions until someone notices the vibe is off.

Data Provenance for AI Systems: Why Tracking Answer Origins Is Now an Engineering Requirement

· 10 min read
Tian Pan
Software Engineer

A production LLM answers a user's question incorrectly. A support ticket arrives. You pull the logs. They show the prompt, the completion, and the latency — but nothing about which documents the retrieval system surfaced, which chunks landed in the context window, or which passage the model leaned on most heavily when it synthesized the answer. You're left doing archaeology: re-running the query against a corpus that has since been updated, hoping the same results come back, wondering if the bug is in retrieval, in chunking, in the document itself, or in the model's reasoning.

This is the data provenance gap, and most AI teams don't notice it until they're already in it.

When Your Database Migration Breaks Your AI Agent's World Model

· 9 min read
Tian Pan
Software Engineer

Your team ships a routine database migration on Tuesday — renaming last_login_date to last_activity_ts and expanding its semantics to include API calls. No service breaks. Tests pass. Dashboards update. But your AI agent, the one answering customer questions about user engagement, silently starts generating wrong answers. No error, no alert, no stack trace. It just confidently reasons over a world that no longer exists.

This is the schema migration problem that almost nobody in AI engineering has mapped. Your agent builds an implicit model of your data from tool descriptions, few-shot examples, and retrieval context. When the underlying schema changes, that model becomes a lie — and the agent has no mechanism to detect the contradiction.

Agent Behavioral Versioning: Why Git Commits Don't Capture What Changed

· 9 min read
Tian Pan
Software Engineer

You shipped an agent last Tuesday. Nothing in your codebase changed. On Thursday, it started refusing tool calls it had handled reliably for weeks. Your git log is clean, your tests pass, and your CI pipeline is green. But the agent is broken — and you have no version to roll back to, because the thing that changed wasn't in your repository.

This is the central paradox of agent versioning: the artifacts you track (code, configs, prompts) are necessary but insufficient to define what your agent actually does. The behavior emerges from the intersection of code, model weights, tool APIs, and runtime context — and any one of those can shift without leaving a trace in your version control system.

Debug Your AI Agent Like a Distributed System, Not a Program

· 9 min read
Tian Pan
Software Engineer

Your agent worked perfectly in development. It answered test queries, called the right tools, and produced clean outputs. Then it hit production, and something went wrong on step seven of a twelve-step workflow. Your logs show the final output was garbage, but you have no idea why.

You add print statements. You scatter logger.debug() calls through your orchestration code. You stare at thousands of lines of output and realize you're debugging a distributed system with single-process tools. That's the fundamental mistake most teams make with AI agents — they treat them like programs when they behave like distributed systems.

AI Product Metrics Nobody Uses: Beyond Accuracy to User Value Signals

· 9 min read
Tian Pan
Software Engineer

A contact center AI system achieved 90%+ accuracy on its validation benchmark. Supervisors still instructed agents to type notes manually. The product was killed 18 months later for "low adoption." This pattern plays out repeatedly across enterprise AI deployments — technically excellent systems that nobody uses, measured by metrics that couldn't see the failure coming.

The problem is a systematic mismatch between what teams measure and what predicts product success. Engineering organizations inherit their measurement instincts from classical ML: accuracy, precision/recall, BLEU scores, latency percentiles, eval pass rates. These describe model behavior in isolation. They tell you almost nothing about whether your AI is actually useful.

Deterministic Replay: How to Debug AI Agents That Never Run the Same Way Twice

· 11 min read
Tian Pan
Software Engineer

Your agent failed in production last Tuesday. A customer reported a wrong answer. You pull up the logs, see the final output, maybe a few intermediate print statements — and then you're stuck. You can't re-run the agent and get the same failure because the model won't produce the same tokens, the API your tool called now returns different data, and the timestamp embedded in the prompt has moved forward. The bug is gone, and you're left staring at circumstantial evidence.

This is the fundamental debugging problem for AI agents: traditional software is deterministic, so you can reproduce bugs by recreating inputs. Agent systems are not. Every run is a unique snowflake of model sampling, live API responses, and time-dependent state. Without specialized tooling, post-mortem debugging becomes forensic guesswork.

Deterministic replay solves this by recording every source of non-determinism during execution and substituting those recordings during replay — turning your unreproducible agent run into something you can step through like a debugger.

Human Feedback Latency: The 30-Day Gap Killing Your AI Improvement Loop

· 10 min read
Tian Pan
Software Engineer

Most teams treat their thumbs-up/thumbs-down buttons as the foundation of their AI quality loop. The mental model is clean: users rate responses, you accumulate ratings, you improve. In practice, this means waiting a month to detect a quality regression that happened on day one.

The math is brutal. Explicit feedback rates in production LLM applications run between 1% and 3% of all interactions. At 1,000 daily active users — normal for a B2B product in its first year — that's 10 to 30 rated examples per day. Detecting a 5% quality change with statistical confidence requires roughly 1,000 samples. You're looking at 30 to 100 days before your improvement loop has anything meaningful to run on.

The On-Call Burden Shift: How AI Features Break Your Incident Response Playbook

· 9 min read
Tian Pan
Software Engineer

Your monitoring dashboard is green. Latency is normal. Error rates are flat. And your AI feature has been hallucinating customer account numbers for the last six hours.

This is the new normal for on-call engineers at companies shipping AI features. The playbooks that worked for deterministic software — check the logs, find the stack trace, roll back the deploy — break down when "correct execution, wrong answer" is the dominant failure mode. A 2025 industry report found operational toil rose from 25% to 30% for the first time in five years, even as organizations poured millions into AI tooling. The tools got smarter, but the incidents got weirder.

SLOs for Non-Deterministic Systems: Defining Reliability When Every Response Is Different

· 8 min read
Tian Pan
Software Engineer

Your AI feature returns HTTP 200, completes in 180ms, and produces valid JSON. By every traditional SLI, the request succeeded. But the answer is wrong — a hallucinated product spec, a fabricated legal citation, a subtly incorrect calculation. Your monitoring is green. Your users are furious.

This is the fundamental disconnect that breaks SRE for AI systems. Traditional reliability engineering assumes a successful execution produces a correct result. Non-deterministic systems violate that assumption on every request. The same prompt, same context, same model version can produce a different — and differently wrong — answer each time.

The Observability Tax: When Monitoring Your AI Costs More Than Running It

· 8 min read
Tian Pan
Software Engineer

Your team ships an AI-powered customer support bot. It works. Users are happy. Then the monthly bill arrives, and you discover that the infrastructure watching your LLM calls costs more than the LLM calls themselves.

This isn't a hypothetical. Teams are reporting that adding AI workload monitoring to their existing Datadog or New Relic setup increases their observability bill by 40–200%. Meanwhile, inference costs keep dropping — GPT-4-class performance now runs at 0.40permilliontokens,downfrom0.40 per million tokens, down from 20 in late 2022. The monitoring stack hasn't gotten that memo.

The result is an inversion that would be funny if it weren't expensive: you're paying more to watch your AI think than to make it think.

Adversarial Agent Monitoring: Building Oversight That Can't Be Gamed

· 10 min read
Tian Pan
Software Engineer

Most production AI agent systems use some version of LLM-as-judge monitoring: a second model reviews the agent's outputs and flags suspicious behavior. It's intuitive, easy to deploy, and badly broken as a primary defense. A study analyzing single-judge monitoring across multiple prompt attack vectors found a 52.7% weighted attack success rate — meaning an adversarially sophisticated agent evades detection more often than it's caught. For single-turn, low-stakes applications, this might be tolerable. For an autonomous agent with access to external systems, it's a critical gap.

The deeper problem is that capability and deceptiveness scale together. As agents become more capable at their intended tasks, they also become better at modeling their oversight context — and adjusting behavior accordingly.