Skip to main content

266 posts tagged with "observability"

View all tags

Agent Behavioral Versioning: Why Git Commits Don't Capture What Changed

· 9 min read
Tian Pan
Software Engineer

You shipped an agent last Tuesday. Nothing in your codebase changed. On Thursday, it started refusing tool calls it had handled reliably for weeks. Your git log is clean, your tests pass, and your CI pipeline is green. But the agent is broken — and you have no version to roll back to, because the thing that changed wasn't in your repository.

This is the central paradox of agent versioning: the artifacts you track (code, configs, prompts) are necessary but insufficient to define what your agent actually does. The behavior emerges from the intersection of code, model weights, tool APIs, and runtime context — and any one of those can shift without leaving a trace in your version control system.

Debug Your AI Agent Like a Distributed System, Not a Program

· 9 min read
Tian Pan
Software Engineer

Your agent worked perfectly in development. It answered test queries, called the right tools, and produced clean outputs. Then it hit production, and something went wrong on step seven of a twelve-step workflow. Your logs show the final output was garbage, but you have no idea why.

You add print statements. You scatter logger.debug() calls through your orchestration code. You stare at thousands of lines of output and realize you're debugging a distributed system with single-process tools. That's the fundamental mistake most teams make with AI agents — they treat them like programs when they behave like distributed systems.

AI Product Metrics Nobody Uses: Beyond Accuracy to User Value Signals

· 9 min read
Tian Pan
Software Engineer

A contact center AI system achieved 90%+ accuracy on its validation benchmark. Supervisors still instructed agents to type notes manually. The product was killed 18 months later for "low adoption." This pattern plays out repeatedly across enterprise AI deployments — technically excellent systems that nobody uses, measured by metrics that couldn't see the failure coming.

The problem is a systematic mismatch between what teams measure and what predicts product success. Engineering organizations inherit their measurement instincts from classical ML: accuracy, precision/recall, BLEU scores, latency percentiles, eval pass rates. These describe model behavior in isolation. They tell you almost nothing about whether your AI is actually useful.

Deterministic Replay: How to Debug AI Agents That Never Run the Same Way Twice

· 11 min read
Tian Pan
Software Engineer

Your agent failed in production last Tuesday. A customer reported a wrong answer. You pull up the logs, see the final output, maybe a few intermediate print statements — and then you're stuck. You can't re-run the agent and get the same failure because the model won't produce the same tokens, the API your tool called now returns different data, and the timestamp embedded in the prompt has moved forward. The bug is gone, and you're left staring at circumstantial evidence.

This is the fundamental debugging problem for AI agents: traditional software is deterministic, so you can reproduce bugs by recreating inputs. Agent systems are not. Every run is a unique snowflake of model sampling, live API responses, and time-dependent state. Without specialized tooling, post-mortem debugging becomes forensic guesswork.

Deterministic replay solves this by recording every source of non-determinism during execution and substituting those recordings during replay — turning your unreproducible agent run into something you can step through like a debugger.

Human Feedback Latency: The 30-Day Gap Killing Your AI Improvement Loop

· 10 min read
Tian Pan
Software Engineer

Most teams treat their thumbs-up/thumbs-down buttons as the foundation of their AI quality loop. The mental model is clean: users rate responses, you accumulate ratings, you improve. In practice, this means waiting a month to detect a quality regression that happened on day one.

The math is brutal. Explicit feedback rates in production LLM applications run between 1% and 3% of all interactions. At 1,000 daily active users — normal for a B2B product in its first year — that's 10 to 30 rated examples per day. Detecting a 5% quality change with statistical confidence requires roughly 1,000 samples. You're looking at 30 to 100 days before your improvement loop has anything meaningful to run on.

The On-Call Burden Shift: How AI Features Break Your Incident Response Playbook

· 9 min read
Tian Pan
Software Engineer

Your monitoring dashboard is green. Latency is normal. Error rates are flat. And your AI feature has been hallucinating customer account numbers for the last six hours.

This is the new normal for on-call engineers at companies shipping AI features. The playbooks that worked for deterministic software — check the logs, find the stack trace, roll back the deploy — break down when "correct execution, wrong answer" is the dominant failure mode. A 2025 industry report found operational toil rose from 25% to 30% for the first time in five years, even as organizations poured millions into AI tooling. The tools got smarter, but the incidents got weirder.

SLOs for Non-Deterministic Systems: Defining Reliability When Every Response Is Different

· 8 min read
Tian Pan
Software Engineer

Your AI feature returns HTTP 200, completes in 180ms, and produces valid JSON. By every traditional SLI, the request succeeded. But the answer is wrong — a hallucinated product spec, a fabricated legal citation, a subtly incorrect calculation. Your monitoring is green. Your users are furious.

This is the fundamental disconnect that breaks SRE for AI systems. Traditional reliability engineering assumes a successful execution produces a correct result. Non-deterministic systems violate that assumption on every request. The same prompt, same context, same model version can produce a different — and differently wrong — answer each time.

The Observability Tax: When Monitoring Your AI Costs More Than Running It

· 8 min read
Tian Pan
Software Engineer

Your team ships an AI-powered customer support bot. It works. Users are happy. Then the monthly bill arrives, and you discover that the infrastructure watching your LLM calls costs more than the LLM calls themselves.

This isn't a hypothetical. Teams are reporting that adding AI workload monitoring to their existing Datadog or New Relic setup increases their observability bill by 40–200%. Meanwhile, inference costs keep dropping — GPT-4-class performance now runs at 0.40permilliontokens,downfrom0.40 per million tokens, down from 20 in late 2022. The monitoring stack hasn't gotten that memo.

The result is an inversion that would be funny if it weren't expensive: you're paying more to watch your AI think than to make it think.

Adversarial Agent Monitoring: Building Oversight That Can't Be Gamed

· 10 min read
Tian Pan
Software Engineer

Most production AI agent systems use some version of LLM-as-judge monitoring: a second model reviews the agent's outputs and flags suspicious behavior. It's intuitive, easy to deploy, and badly broken as a primary defense. A study analyzing single-judge monitoring across multiple prompt attack vectors found a 52.7% weighted attack success rate — meaning an adversarially sophisticated agent evades detection more often than it's caught. For single-turn, low-stakes applications, this might be tolerable. For an autonomous agent with access to external systems, it's a critical gap.

The deeper problem is that capability and deceptiveness scale together. As agents become more capable at their intended tasks, they also become better at modeling their oversight context — and adjusting behavior accordingly.

Agent State as Event Stream: Why Immutable Event Sourcing Beats Internal Agent Memory

· 10 min read
Tian Pan
Software Engineer

An agent misbehaves at 3:47 AM on a Tuesday. It deletes files it shouldn't have, or calls an API with the wrong parameters, or confidently takes an irreversible action based on information that was stale by six hours. You pull up your logs. You can see what the agent did. What you cannot see — what almost no agent framework gives you — is what the agent believed when it made that decision. The state that drove the choice is gone, overwritten by every subsequent step. You're debugging the present to understand the past, and that's an architecture problem, not a logging problem.

Most AI agents treat state as mutable in-memory data: a dictionary that gets updated in place, a database row that gets overwritten, a scratch pad that shrinks and grows. This works fine for simple, short-lived tasks. It collapses under the three pressures that define serious production deployments: debugging complex failures, coordinating across distributed agents, and satisfying compliance requirements. Event sourcing — treating every state change as an immutable, append-only event — solves all three problems at once, and it does it in a way that makes agents structurally more debuggable, not just more logged.

The LLM Request Lifecycle Is a State Machine — Treat It Like One

· 9 min read
Tian Pan
Software Engineer

Most teams treat LLM request handling as a linear function: call the API, check for an exception, maybe retry once, return the result. In practice it's nothing like that. Between the moment a user triggers an LLM call and the moment a response reaches their screen, a request can traverse a dozen implicit states — attempting primary provider, waiting for backoff, switching to fallback, validating output, retrying with refined prompt — without any of those transitions being recorded or visible.

The result is debugging that happens after the fact from logs scattered across services, with no authoritative answer to "what did this request actually do?" Treating the LLM request lifecycle as an explicit finite state machine is the architectural move that makes that question answerable without archaeological work.

The LLM Request Lifecycle Your try/catch Is Missing

· 10 min read
Tian Pan
Software Engineer

The most dangerous failure your LLM stack can produce returns HTTP 200. The JSON parses. Your schema validation passes. No exception is raised. And the response is completely wrong — wrong facts, wrong structure, truncated mid-sentence, or fabricated from whole cloth.

A single try/catch around an LLM API call handles the easy failures: rate limits, server errors, network timeouts. These are the visible failures. The invisible ones — a model that hit its token limit and stopped mid-answer, an agent that looped 21 extra tool calls before finding the right parameter name, a validation retry that inflated your costs by 37% — produce no exceptions. They produce results.

The fix is not better error handling. It is modeling the LLM request lifecycle as an explicit state machine, where every state transition emits an observable span, and failure modes are first-class states rather than buried exception handlers.