Last week I watched our customer support agent confidently tell a user their subscription was active. HTTP 200. Latency: 340ms. Every dashboard was green. The subscription had been canceled three weeks ago.
This is the failure mode nobody talks about at standup: your agent succeeds technically while failing semantically. No error code. No timeout. No stack trace. Just a perfectly formatted JSON response with completely wrong content.
The Invisible Failure Category
Traditional monitoring answers: “Did it respond? How fast? Did it crash?” AI agents create a new question: “Was the response correct?”
And that question is brutally hard to answer at scale.
I started digging into this after our third silent failure in a month. Here is what I found:
- 57% of organizations now run AI agents in production, yet observability is consistently rated the weakest part of the AI stack (Deepak Gupta market report)
- Agents can fail silently by misrouting tickets, skipping steps, or looping endlessly—failures that only surface when users complain (Arize research)
- Hallucination rates range from 3% on summarization to over 90% on specialized tasks, depending on the model and domain (Braintrust buyer guide)
Your Datadog setup catches none of this.
What “Observability” Means Now
The old stack: logs, metrics, traces. Request in, response out, measure the middle.
The new stack needs entirely different primitives:
1. Traces that follow reasoning, not just requests. An agent making 10-50+ decisions per interaction needs trace-level visibility into tool selection rationale, context propagation, and reasoning chain progression. OpenTelemetry is adding vendor-neutral instrumentation for this, but adoption is early (a minimal sketch follows this list).
2. Continuous evaluation, not just pre-deployment evals. You need canary queries running against production. Track output consistency. Monitor semantic similarity between current outputs and a golden dataset. If your eval suite only runs in CI/CD, you are flying blind in production.
3. Session coherence across multi-turn interactions. An agent that answers correctly in isolation but contradicts itself across a conversation is broken in a way no single-request metric captures.
4. Quality scoring alongside operational metrics. Confidence scores, response entropy, semantic drift—these need to live next to your p99 latency dashboard, not in a separate “AI team” tool.
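To make points 1 and 4 concrete, here is a minimal sketch using the OpenTelemetry Python API: one span per agent decision, plus a quality histogram recorded next to the usual operational metrics. The span, attribute, and metric names (`agent.decision`, `agent.tool.selected`, `agent.output.confidence`) are illustrative rather than an official semantic convention, and `select_tool` / `run_tool` stand in for whatever your agent framework actually exposes.

```python
from opentelemetry import trace, metrics

tracer = trace.get_tracer("agent-observability-sketch")
meter = metrics.get_meter("agent-observability-sketch")

# Quality signal that lives alongside latency and error-rate metrics.
confidence_hist = meter.create_histogram(
    "agent.output.confidence",
    description="Model-reported confidence per agent decision",
)

def traced_decision(agent, step: int, query: str):
    # One span per decision, so the reasoning chain shows up in the
    # same trace view as the HTTP request that triggered it.
    with tracer.start_as_current_span("agent.decision") as span:
        span.set_attribute("agent.step", step)

        # Record *why* a tool was chosen, not just that it was called.
        tool, rationale = agent.select_tool(query)   # placeholder for your framework
        span.set_attribute("agent.tool.selected", tool.name)
        span.set_attribute("agent.tool.rationale", rationale)

        result = agent.run_tool(tool, query)          # placeholder
        span.set_attribute("agent.output.length", len(result.text))

        # Point 4: the quality score lands next to p99 latency, not in a silo.
        confidence_hist.record(result.confidence, {"tool": tool.name})
        return result
```

This is a sketch of the shape of the instrumentation, not a drop-in integration; the point is that every decision becomes a span with its rationale attached, and quality scores flow through the same pipeline as everything else.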
The Market Response
Capital is flowing fast. Braintrust just raised $80M Series B at an $800M valuation. Arize has $131M in total funding. Langfuse got acquired by ClickHouse as part of a $400M round.
But here is the uncomfortable question: most engineering teams I talk to have zero budget allocated for AI observability. They bolted agents onto existing infrastructure and assumed their current monitoring stack would catch problems. It does not.
The gap between “we deployed agents” and “we can tell when agents are wrong” is enormous—and it is where users lose trust.
What I Am Actually Doing About It
We are building a lightweight eval layer that runs in production (rough sketch after the list):
- Golden query sets that execute hourly against our agents
- Semantic similarity scoring against expected outputs
- Automated alerting when output quality drifts below thresholds
- Trace capture for every agent decision chain (not just the final response)
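Here is roughly what the hourly canary pass looks like, assuming sentence-transformers for embeddings. `call_agent()` and `send_alert()` are placeholders for your own agent client and alerting hook, and the threshold is something you tune against your golden set, not a universal constant.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
SIMILARITY_THRESHOLD = 0.80  # tuned against our golden set, not universal

GOLDEN_SET = [
    {
        "query": "Is my subscription active?",
        "expected": "Your subscription was canceled and is no longer active.",
    },
    # ...more canary queries, ideally covering every agent capability
]

def run_canary_pass(call_agent, send_alert):
    """Run every golden query against the live agent and flag semantic drift."""
    for case in GOLDEN_SET:
        actual = call_agent(case["query"])                     # placeholder
        embeddings = model.encode([actual, case["expected"]])
        score = util.cos_sim(embeddings[0], embeddings[1]).item()
        if score < SIMILARITY_THRESHOLD:
            send_alert(                                        # placeholder
                f"Canary drift on {case['query']!r}: "
                f"similarity {score:.2f} < {SIMILARITY_THRESHOLD:.2f}"
            )
```

Scheduled hourly with whatever job runner you already have, this plus somewhere to send the alert is essentially the whole layer; the similarity score can also be emitted as a regular metric so it sits next to the latency dashboards rather than in a separate tool.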
Total cost so far: about 3 weeks of engineering time and ~$200/month in eval compute.
It is not elegant. But it caught two silent failures last week that would have taken days to surface through user complaints.
The Question I Cannot Answer
Are we over-engineering this? Is a lightweight eval layer enough, or do teams need full-blown observability platforms like Arize or Braintrust? At what agent count does DIY stop scaling?
For the teams running agents in production: how are you catching the HTTP 200 failures? Are you building custom eval pipelines, buying platforms, or just hoping users report problems fast enough?
Genuinely curious what the range of approaches looks like out there.