86 posts tagged with "observability"

Production AI Incident Response: When Your Agent Goes Wrong at 3am

· 11 min read
Tian Pan
Software Engineer

A runaway loop in a multi-agent cost-tracking system at a fintech startup ran for eleven days before anyone noticed. The cause: Agent A asked Agent B for clarification. Agent B asked Agent A for help interpreting the response. Neither had logic to break the loop. The $127 weekly bill became $47,000 before a human looked at the invoice.

No errors were thrown. No alarms fired. Latency was normal. The system was running exactly as designed—just running forever.

This is what AI incidents actually look like. They're not stack traces and 500 errors. They're silent behavioral failures, runaway loops, and plausible wrong answers delivered at production scale with full confidence. Your existing incident runbook almost certainly doesn't cover any of them.
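The loop in the incident above had no termination logic. One common guard is a hop budget carried on every inter-agent message. The following is a minimal sketch under assumptions: the `Message` envelope, the `reply` helper, and the `max_hops` default are hypothetical names for illustration, not taken from the system described.

```python
from dataclasses import dataclass


class HopLimitExceeded(RuntimeError):
    """Raised when an agent-to-agent exchange exceeds its hop budget."""


@dataclass(frozen=True)
class Message:
    # Hypothetical envelope: every inter-agent message carries a hop counter.
    sender: str
    recipient: str
    body: str
    hops: int = 0


def reply(msg: Message, body: str, max_hops: int = 8) -> Message:
    """Build a reply to `msg`, refusing once the hop budget is spent."""
    next_hops = msg.hops + 1
    if next_hops > max_hops:
        # Fail loudly so the exchange surfaces as an error instead of a bill.
        raise HopLimitExceeded(
            f"{msg.recipient} -> {msg.sender}: exceeded {max_hops} hops"
        )
    return Message(
        sender=msg.recipient,
        recipient=msg.sender,
        body=body,
        hops=next_hops,
    )
```

Raising an exception is a deliberate choice in this sketch: it turns a silent loop into the kind of failure that existing error monitoring already knows how to catch.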

The Semantic Failure Mode: When Your AI Runs Perfectly and Does the Wrong Thing

· 9 min read
Tian Pan
Software Engineer

Your AI agent completes the task. No errors in the logs. Latency looks normal. The output is well-formatted JSON, grammatically perfect prose, or a valid SQL query that executes without complaint. Every dashboard is green.

And the user stares at the result, sighs, and starts over from scratch.

This is the semantic failure mode — the class of production AI failures where the system runs correctly, the model responds confidently, and the output is delivered on time, but the agent didn't do what the user actually needed. Traditional error monitoring is completely blind to these failures because there is no error. The HTTP status is 200. The model didn't refuse. The output conforms to the schema. By every technical metric, the system succeeded.

The Model Upgrade Trap: How Foundation Model Updates Silently Break Production Systems

· 9 min read
Tian Pan
Software Engineer

Your production system is running fine. Uptime is 99.9%. Latency is nominal. Zero error-rate alerts. Then a user files a ticket: "The summaries have been weirdly off lately." You pull logs. Nothing looks wrong. You check the model version — same one you deployed three months ago. What changed?

The model provider did. Silently.

This is the model upgrade trap: foundation models change beneath you without announcement, and standard observability infrastructure is completely blind to the behavioral drift. By the time users notice, the degradation has been compounding for weeks.
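One way to notice this kind of drift is to replay a fixed set of canary prompts on a schedule and compare the new outputs against stored baselines. The sketch below assumes you already have both sets of outputs as dictionaries keyed by a prompt ID; the `drifted_prompts` name and the 0.8 threshold are illustrative assumptions, and lexical similarity from `difflib` is only a crude proxy for behavioral change.

```python
from difflib import SequenceMatcher


def drifted_prompts(baseline: dict, current: dict, threshold: float = 0.8) -> list:
    """Return the IDs of canary prompts whose output moved away from baseline.

    `baseline` and `current` map a prompt ID to the model's output text.
    A prompt missing from `current` is treated as fully drifted.
    """
    flagged = []
    for prompt_id, old_output in baseline.items():
        new_output = current.get(prompt_id, "")
        similarity = SequenceMatcher(None, old_output, new_output).ratio()
        if similarity < threshold:
            flagged.append(prompt_id)
    return sorted(flagged)
```

In practice a semantic comparison (embeddings or a rubric-based grader) would likely replace the string ratio; the structure of the check stays the same.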

What Your APM Dashboard Won't Tell You: LLM Observability in Production

· 10 min read
Tian Pan
Software Engineer

Your Datadog dashboard shows 99.4% uptime, sub-500ms P95 latency, and a 0.1% error rate. Everything is green. Meanwhile, your support queue is filling with users complaining the AI gave them completely wrong answers. You have no idea why, because every request returned HTTP 200.

This is the fundamental difference between traditional observability and what you actually need for LLM systems. A language model can fail in ways that leave no trace in standard APM tooling: hallucinating facts, retrieving documents from the wrong product version, ignoring the system prompt after a code change modified it, or silently degrading on a specific query type after a model update. All of these look fine on your latency graph.

LLM Observability in Production: The Four Silent Failures Engineers Miss

· 9 min read
Tian Pan
Software Engineer

Most teams shipping LLM applications to production have a logging setup they mistake for observability. They store prompts and responses in a database, track token counts in a spreadsheet, and set up latency alerts in Datadog. Then a user reports the chatbot gave wrong answers for two days, and nobody can tell you why — because none of the data collected tells you whether the model was actually right.

Traditional monitoring answers "is the system up and how fast is it?" LLM observability answers a harder question: "is the system doing what it's supposed to do, and when did it stop?" That distinction matters enormously when your system's behavior is probabilistic, context-dependent, and often wrong in ways that don't trigger any alert.

The Eval-to-Production Gap: Why 92% on Your Test Suite Means 40% User Satisfaction

· 10 min read
Tian Pan
Software Engineer

Your team spent three weeks building a rigorous eval suite. It covers edge cases. It includes adversarial examples. The LLM-as-judge scores 92% across all dimensions. You ship.

Then the support tickets start. Users say the AI "doesn't understand what they're asking." Session abandonment is up 30%. Satisfaction scores come back at 41%.

This gap — between eval performance and real-world outcomes — is the most common failure mode in production AI systems today. It's not a model problem. It's a measurement problem.

Building AI Agents That Actually Work in Production

· 10 min read
Tian Pan
Software Engineer

Most teams building AI agents make the same mistake: they architect for sophistication before they have evidence that sophistication is needed. A production analysis of 47 agent deployments found that 68% would have achieved equivalent or better outcomes with a well-designed single-agent system. The multi-agent tax — higher latency, compounding failure modes, operational complexity — often eats the gains before they reach users.

This isn't an argument against agents. It's an argument for building them the same way you'd build any serious production system: start with the simplest thing that works, instrument everything, and add complexity only when the simpler version demonstrably fails.

Mastering AI Agent Observability: Why Your Dashboards Are Lying to You

· 9 min read
Tian Pan
Software Engineer

Your agent is returning HTTP 200s. Latency is within SLA. Error rates are flat. Everything on the dashboard looks green — and your users are getting confidently wrong answers.

This is the core observability gap in AI systems: the metrics that traditionally signal system health are almost entirely irrelevant to whether your agent is actually doing its job. An agent can fluently hallucinate, skip required tools, use stale retrieval results, or reason itself into logical contradictions — all while your monitoring shows zero anomalies. The standard playbook for service observability doesn't transfer to agentic systems, and teams that don't understand this gap ship agents they can't trust, debug, or improve.

Systematic Debugging for AI Agents: From Guesswork to Root Cause

· 9 min read
Tian Pan
Software Engineer

When an AI agent fails in production, you rarely know exactly when it went wrong. You see the final output — a hallucinated answer, a skipped step, a tool called with the wrong arguments — but the actual failure could have happened three steps earlier. This is the core debugging problem that software engineering hasn't solved yet: agents execute as a sequence of decisions, and by the time you notice something is wrong, the evidence is buried in a long trace of interleaved LLM calls, tool invocations, and state mutations.

Traditional debugging assumes determinism. You can reproduce the bug, set a breakpoint, inspect the state. Agent debugging breaks all three of those assumptions simultaneously. The same input can produce different execution paths. Reproducing a failure requires capturing the exact context, model temperature, and external state at the moment it happened. And "setting a breakpoint" in a live reasoning loop is not something most agent frameworks even support.
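Reproducing a failure therefore starts with recording what the agent saw at each step. A minimal sketch of such a record, assuming a hypothetical `StepSnapshot` structure (the field names are illustrative and not tied to any particular agent framework):

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class StepSnapshot:
    # Hypothetical record of everything needed to replay one agent step.
    step: int
    model: str
    temperature: float
    messages: list  # the exact context sent to the model
    tool_results: dict = field(default_factory=dict)  # external state observed

    def to_json(self) -> str:
        # sort_keys gives a stable serialization, so equal snapshots hash equally.
        return json.dumps(asdict(self), sort_keys=True)

    def fingerprint(self) -> str:
        """Short content hash for spotting where two runs diverge."""
        return hashlib.sha256(self.to_json().encode("utf-8")).hexdigest()[:12]
```

Comparing fingerprints step by step across a good run and a bad run points at the first step where the inputs differed, which is often earlier than the step where the output looked wrong.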

Evaluating AI Agents: Why Grading Outcomes Alone Will Lie to You

· 10 min read
Tian Pan
Software Engineer

An agent you built scores 82% on final-output evaluations. You ship it. Two weeks later, your support queue fills up with users complaining that the agent is retrieving the wrong data, calling APIs with wrong parameters, and producing confident-sounding responses built on faulty intermediate work. You go back and look at the traces — and realize the agent was routing incorrectly on 40% of queries the whole time. The final-output eval never caught it because, often enough, the agent stumbled into a correct answer anyway.

This is the core trap in agent evaluation: measuring only what comes out the other end tells you nothing about how the agent got there, and "getting there" is where most failures live.
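To make the gap concrete, the sketch below scores the same traces twice: once on the final answer and once on the routing decision. It assumes each trace has been labeled with an expected route; `TraceResult` and `score_traces` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class TraceResult:
    # Hypothetical labeled trace: which route was right, which was taken,
    # and whether the final answer was judged correct.
    expected_route: str
    actual_route: str
    answer_correct: bool


def score_traces(traces: list) -> dict:
    """Report outcome accuracy alongside routing accuracy for the same traces."""
    if not traces:
        raise ValueError("need at least one trace")
    n = len(traces)
    routed_ok = [t.actual_route == t.expected_route for t in traces]
    return {
        "outcome_accuracy": sum(t.answer_correct for t in traces) / n,
        "routing_accuracy": sum(routed_ok) / n,
        # Correct answers reached through the wrong route: the cases an
        # outcome-only eval counts as wins.
        "correct_despite_wrong_route": sum(
            t.answer_correct and not ok for t, ok in zip(traces, routed_ok)
        ) / n,
    }
```

A large `correct_despite_wrong_route` share is the signal that outcome scores are being propped up by luck.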

Building a Generative AI Platform: Architecture, Trade-offs, and the Components That Actually Matter

· 12 min read
Tian Pan
Software Engineer

Most teams treating their GenAI stack as a model integration project eventually discover they've actually built—or need to build—a platform. The model is the easy part. The hard part is everything around it: routing queries to the right model, retrieving context reliably, filtering unsafe outputs, caching redundant calls, tracing what went wrong in a chain of five LLM calls, and keeping costs from tripling month-over-month as usage scales.

This article is about that platform layer. Not the model weights, not the prompts—the surrounding infrastructure that separates a working proof of concept from something you'd trust to serve a million users.

LLM Observability in Production: Tracing What You Can't Predict

· 10 min read
Tian Pan
Software Engineer

Your monitoring stack tells you everything about request rates, CPU, and database latency. It tells you almost nothing about whether your LLM just hallucinated a refund policy, why a customer-facing agent looped through three tool calls to answer a simple question, or which feature in your product is quietly burning $800 a day in tokens.

Traditional observability was built around deterministic systems. LLMs are structurally different: the same input can produce a different output on each call. The failure mode isn't a 500 error or a timeout; it's a confident, plausible-sounding answer that happens to be wrong. The cost isn't steady and predictable; it spikes when a single misconfigured prompt hits a traffic wave. Debugging isn't "find the exception in the stack trace"; it's "reconstruct why the agent chose this tool path at 2 AM on Tuesday."

This is the problem LLM observability solves — and the discipline has matured significantly over the past 18 months.
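Answering the question of which feature is burning tokens requires tagging every LLM call with the feature that issued it. A minimal sketch of the aggregation step, assuming each call is logged with a `feature` tag and token counts; the per-1K-token prices below are placeholder values, not any provider's actual rates.

```python
from collections import defaultdict

# Placeholder prices in dollars per 1,000 tokens (assumed, not real rates).
DEFAULT_PRICES = {"input": 0.003, "output": 0.015}


def cost_by_feature(calls: list, prices: dict = DEFAULT_PRICES) -> dict:
    """Sum estimated spend per feature tag from logged LLM calls.

    Each call is a dict with `feature`, `input_tokens`, and `output_tokens`.
    """
    totals = defaultdict(float)
    for call in calls:
        totals[call["feature"]] += (
            call["input_tokens"] / 1000 * prices["input"]
            + call["output_tokens"] / 1000 * prices["output"]
        )
    return dict(totals)


calls = [
    {"feature": "search", "input_tokens": 2000, "output_tokens": 1000},
    {"feature": "search", "input_tokens": 2000, "output_tokens": 1000},
    {"feature": "summarize", "input_tokens": 1000, "output_tokens": 0},
]
by_feature = cost_by_feature(calls)
```

Sorting the result by spend gives a per-feature cost ranking that a request-rate dashboard cannot provide.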