SRE for AI Agents: What Actually Breaks at 3am
A market research pipeline ran uninterrupted for eleven days. Two LangChain agents, an Analyzer and a Verifier, passed requests back and forth, made no progress on the original task, and accumulated $47,000 in API charges before anyone noticed. The system never returned an error. No alert fired. The billing dashboard finally caught it, days after the damage was done.
This is not an edge case. It is the canonical AI agent incident. And if you are running agents in production today, your existing SRE runbooks almost certainly do not cover it.
Traditional operations practice is built around a core assumption: failures produce signals. Services crash, timeouts fire, error rates spike. You write alerts that catch those signals, runbooks that explain what they mean, and playbooks that tell oncall how to respond. The system works because broken code tends to announce itself.
AI agents break this assumption. The failure mode is not a crash — it is a system that runs correctly while doing the wrong thing. The LLM responds, the tool calls execute, the loop continues. From an infrastructure perspective, everything is healthy. From a business perspective, you are burning money and accomplishing nothing. The gap between those two perspectives is where agent incidents live.
The Failure Modes That Traditional Monitoring Misses
Agent failures cluster into four categories, each invisible to standard error-rate monitoring.
Infinite tool loops are the most expensive. An agent assigned a goal with no completion criteria will continue issuing tool calls indefinitely. The variant that causes the most damage is the soft loop: the agent is not calling the exact same tool repeatedly (which deduplication would catch), but rather re-phrasing the same query, adding one word to a search, or shifting a parameter slightly, while making no actual progress toward its goal. Each iteration consumes tokens. The context window grows. Cost per call increases as the window fills. Teams typically discover these through billing data, not monitoring.
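One lightweight detector for this pattern compares recent tool arguments by similarity rather than exact match, so rephrased queries still register. The sketch below assumes tool calls are logged as (tool name, argument string) pairs; `is_soft_loop` and its 0.8 similarity threshold are illustrative choices, not part of any framework:

```python
from difflib import SequenceMatcher

def is_soft_loop(calls, window=5, threshold=0.8):
    """Flag a session whose last `window` tool calls are near-duplicates.

    `calls` is a list of (tool_name, argument_string) tuples. The check is
    a similarity ratio, not exact-match deduplication, so rephrased
    queries and one-word tweaks still count as repeats.
    """
    if len(calls) < window:
        return False
    recent = calls[-window:]
    if len({tool for tool, _ in recent}) != 1:
        return False              # different tools: likely real progress
    args = [a for _, a in recent]
    sims = [SequenceMatcher(None, args[i], args[i + 1]).ratio()
            for i in range(len(args) - 1)]
    return min(sims) >= threshold

calls = [("web_search", "ai agent market size 2025"),
         ("web_search", "ai agent market size in 2025"),
         ("web_search", "the ai agent market size 2025"),
         ("web_search", "ai agent market size 2025 usa"),
         ("web_search", "ai agent market size 2025 us")]
print(is_soft_loop(calls))  # True: five rephrasings of one query
```

The threshold needs tuning per tool; search queries tolerate a lower bar than structured API arguments.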
Context overflow in long-running workflows appears when a multi-step task accumulates tool outputs, prior reasoning, and conversation history until the context window fills. Unlike a memory allocation failure, this does not crash the process — the agent simply starts losing track of earlier steps, producing increasingly incoherent reasoning while continuing to execute. One documented materials science workflow consumed 20 million tokens before reaching this state; re-implementing it with memory pointers (short identifiers replacing full data) reduced token usage by over 99%.
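The memory-pointer pattern can be sketched in a few lines: large tool outputs live in a store, and only a short identifier enters the context window. `MemoryStore` and its `mem:` key format are hypothetical names for illustration:

```python
import hashlib

class MemoryStore:
    """Replace bulky tool outputs with short pointers in the agent context."""
    def __init__(self):
        self._data = {}

    def put(self, payload: str) -> str:
        key = "mem:" + hashlib.sha256(payload.encode()).hexdigest()[:8]
        self._data[key] = payload
        return key            # only this short id enters the context window

    def get(self, key: str) -> str:
        return self._data[key]

store = MemoryStore()
raw = "x" * 200_000          # a large tool output that would bloat context
ptr = store.put(raw)
print(ptr, len(ptr))         # e.g. mem:3f1a9c2b 12
savings = 1 - len(ptr) / len(raw)
print(f"{savings:.2%}")      # 99.99%, the order of reduction reported above
```

The agent dereferences a pointer only when a step actually needs the full payload, which is what keeps long workflows from dragging every prior output into every request.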
Hallucinated API calls occur when an agent constructs a request that looks syntactically valid but uses parameters that do not exist in the actual API schema. The specific failure mode that causes runaway cost is this: the agent cannot distinguish "the API rejected my request" from "the task is impossible," so it retries the same malformed call, with a growing context window, hundreds of times. One reported incident cost $2,000 from 847 retries of the same failed call.
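A guard that caps retries of an identical failing call is straightforward to sketch. `RetryGuard` and its failure threshold are illustrative, not a real library API:

```python
class RetryGuard:
    """Stop an agent from re-issuing the same failing call indefinitely.

    Tracks failures per (tool, frozen-arguments) key; after max_failures
    the call is escalated instead of retried, which is the difference
    between a few dollars of waste and a $2,000 incident.
    """
    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = {}

    def check(self, tool, args):
        key = (tool, tuple(sorted(args.items())))
        if self.failures.get(key, 0) >= self.max_failures:
            raise RuntimeError(
                f"escalate: {tool} failed {self.max_failures}x with same args")
        return key

    def record_failure(self, key):
        self.failures[key] = self.failures.get(key, 0) + 1

guard = RetryGuard(max_failures=3)
for attempt in range(5):
    try:
        key = guard.check("get_report", {"report_idd": "q3"})  # hallucinated param
        # ... the real API rejects the unknown parameter ...
        guard.record_failure(key)
    except RuntimeError as e:
        print(attempt, e)   # fires on the 4th attempt, not the 847th
        break
```

Freezing the arguments into the key is what distinguishes "same malformed call again" from a legitimate retry with corrected parameters.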
Schema drift and credential rot are the most common silent production failures. When a tool's API schema changes — parameter names renamed, required fields added, authentication flows updated — the agent continues operating against its cached understanding of the schema. Calls fail with validation errors the agent does not interpret correctly, leading to confused behavior rather than clean error escalation. Certificate expirations discovered only after months-long silent renewal failures are another variant of this pattern.
The compound math is brutal. An agent with 85% per-step accuracy achieves roughly 20% end-to-end success on a ten-step workflow. That number assumes failures produce clean signals. When failures produce silent loops instead, the effective success rate looks fine while cost and time spiral.
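The compound figure follows directly from treating each step as an independent trial:

```python
per_step = 0.85
steps = 10
# End-to-end success is the product of per-step success probabilities.
print(f"{per_step ** steps:.1%}")  # 19.7%
```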
What Your SRE Stack Is Not Measuring
Most observability stacks track availability, latency, and error rates. These metrics are necessary but insufficient for agents. A healthy-looking system can simultaneously be:
- Burning 10x the expected token budget on a single session
- Making no progress on the user's actual goal
- Running a tool loop that has been active for six hours
- Executing against a schema that has been deprecated for three months
The observability gap is structural. Token cost is an asynchronous metric — by the time a budget alert fires on a daily spend threshold, additional API cycles have already completed. Progress toward goal is a semantic metric — it does not appear in HTTP status codes or database write counts. Tool argument similarity is a per-call behavioral metric that requires instrumentation most teams have not built.
The industry is converging on OpenTelemetry as the standard framework for agent tracing. The key insight driving adoption is the distinction between infrastructure observability (is the service running?) and behavioral observability (is the agent doing the right thing?). You need both.
What behavioral observability requires:
- Per-step token consumption tracked against per-session budgets, not daily aggregates
- Tool invocation history with argument deduplication to detect soft loops
- Step count toward goal with a halt trigger after N steps without measurable state change
- Context window utilization tracked as a percentage, with alerts at 70% capacity
- Halt reason codes that distinguish timeout, max_steps, cost ceiling, and error from intentional completion
A three-tier tracing architecture — trace level for full user interactions, span level for individual LLM calls and tool invocations, attribute level for structured metadata — gives you the granularity to debug agent incidents without reconstructing behavior from logs after the fact.
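The three tiers can be illustrated with plain dictionaries; in a real deployment these would map onto OpenTelemetry traces, spans, and attributes rather than a custom structure, and the field names here are assumptions for the sketch:

```python
import time
import uuid

def new_trace(session_id):
    """Trace level: one record per full user interaction."""
    return {"trace_id": uuid.uuid4().hex, "session": session_id, "spans": []}

def record_span(trace, kind, name, **attributes):
    """Span level: one record per LLM call or tool invocation,
    carrying attribute-level structured metadata."""
    span = {"span_id": uuid.uuid4().hex[:16], "kind": kind, "name": name,
            "start": time.time(), "attributes": attributes}
    trace["spans"].append(span)
    return span

trace = new_trace("sess-042")
record_span(trace, "llm_call", "plan_step",
            tokens_in=1850, tokens_out=240, cost_usd=0.031)
record_span(trace, "tool_call", "web_search",
            arguments="ai agent market size 2025",
            halt_reason=None, context_utilization=0.62)

# Attribute-level metadata is what makes behavioral queries possible:
total_cost = sum(s["attributes"].get("cost_usd", 0) for s in trace["spans"])
print(len(trace["spans"]), f"{total_cost:.3f}")  # 2 0.031
```

Queries like "sessions with repeated tool arguments" or "cost per step over time" fall out of the attribute layer; without it, you are back to reconstructing behavior from logs.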
Cost Circuit Breakers: Why Alerts Are Not Enforcement
The $47,000 incident is instructive precisely because the team was not negligent. They had billing alerts. The alerts worked as designed. The problem is that budget alerts report what already happened; they do not prevent the next API call from happening.
Effective cost control for agents requires synchronous enforcement at the API call level. Before each LLM request or tool invocation, the execution layer checks the session's remaining budget. If the budget is exhausted, the call does not happen. This is a fundamentally different architecture from alerting on aggregate spend.
The practical implementation sits at the proxy or middleware layer, independent of agent framework code. Keeping enforcement out of agent prompts is not optional — any cost limit expressed in a prompt is a soft constraint that the model can reason around, override in edge cases, or simply ignore under adversarial input. Enforcement belongs in code that runs before the model sees the request.
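A minimal sketch of synchronous enforcement, assuming the execution layer can estimate cost before each call; all class and function names here are illustrative:

```python
class BudgetExceeded(Exception):
    pass

class SessionBudget:
    """Hard per-session cap checked *before* each call, not after the bill."""
    def __init__(self, limit_usd: float):
        self.limit = limit_usd
        self.spent = 0.0

    def charge(self, estimated_cost: float):
        if self.spent + estimated_cost > self.limit:
            raise BudgetExceeded(
                f"session at ${self.spent:.2f} of ${self.limit:.2f}; call blocked")
        self.spent += estimated_cost

def guarded_llm_call(budget, prompt, est_cost, llm):
    budget.charge(est_cost)      # raises before the request is ever sent
    return llm(prompt)

budget = SessionBudget(limit_usd=0.50)   # operational-agent default above
calls = 0
try:
    while True:                  # a runaway loop...
        guarded_llm_call(budget, "retry the search", 0.03, lambda p: "ok")
        calls += 1
except BudgetExceeded as e:
    print(calls, e)              # stops after 16 calls, not eleven days
```

The key property is that `charge` runs in the middleware, before the model sees the request; nothing the model generates can talk its way past the cap.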
A reasonable default budget structure by agent type:
- Research agents (unbounded exploration tasks): $5 per session
- Operational agents (specific task execution): $0.50 per session
- Conversational agents (user-facing dialogue): $0.10 per turn
These are starting points, not universal values. The important property is that they are per-session hard caps, not daily aggregates. A runaway agent that hits its per-session ceiling at 11pm costs you the session budget, not eleven days of API charges.
Layered defenses add resilience: per-task ceilings inside per-session budgets inside per-agent-type fleet ceilings. The fleet ceiling is the circuit breaker that prevents a single misconfigured agent deployment from affecting the broader system.
Writing Runbooks That Actually Cover Agent Incidents
Traditional runbook structure maps well to infrastructure failures: symptom, diagnosis commands, remediation steps, escalation path. Agent incidents require the same structure but different content.
Symptom detection for agent incidents:
- Cost alert fires with no corresponding error rate increase → investigate active sessions for loops
- Session duration exceeds P99 for agent type → check step count and argument similarity in traces
- Tool error rate spikes on specific endpoint → check for schema drift against current API spec
- User goal completion rate drops while technical success rate holds → semantic regression; audit recent prompt, model, or tool-description changes
Diagnosis for infinite loops:
Pull the span trace for the session. Look for repeated (tool, arguments) pairs with argument similarity above a threshold. Check whether state has changed between steps — if the agent's understanding of the world has not updated across five consecutive steps, it is in a soft loop. Cost-per-step visualization across the session will show the characteristic upward slope of context window inflation.
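The state-change check can be made concrete by fingerprinting the agent's working state after each step; the `stalled` helper and the five-step window here are illustrative:

```python
import hashlib
import json

def state_fingerprint(state: dict) -> str:
    """Stable hash of the agent's working state after a step."""
    return hashlib.sha256(
        json.dumps(state, sort_keys=True).encode()).hexdigest()

def stalled(fingerprints, window=5):
    """True if the last `window` steps produced no state change at all."""
    return (len(fingerprints) >= window
            and len(set(fingerprints[-window:])) == 1)

history = []
state = {"findings": [], "queries_tried": 0}
for step in range(8):
    state["queries_tried"] += 1 if step < 3 else 0   # progress stops at step 3
    history.append(state_fingerprint(state))
    if stalled(history):
        print(f"soft loop detected at step {step}")
        break
```

What counts as "state" is a design decision: findings, filled slots in a plan, or rows written, but not the raw transcript, which grows every step regardless of progress.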
Remediation:
The immediate action for a runaway session is termination via the execution layer — session kill commands should be in the runbook alongside service restart commands. After termination, the task enters a dead-letter queue for human review. Do not automatically retry without first understanding why the loop formed.
Escalation thresholds:
Unlike infrastructure incidents where you escalate based on error rate or user impact, agent incidents escalate on: budget ceiling hit (immediate), step count above 3x the agent type's P99 (investigate within 15 minutes), tool error rate above 5% for schema-related errors (investigate for drift within one hour). Human-in-the-loop gates should be mandatory for any agent action that is irreversible — file deletions, external API writes, payment operations — regardless of the agent's confidence level.
The Infrastructure Layer Your Agent Needs
Most agent reliability problems are not model problems. They are infrastructure problems that look like model problems because the infrastructure layer does not exist yet.
The components that prevent the failure modes above:
Durable execution frameworks (such as Temporal or LangGraph's persistence layer) enable checkpointing — saving execution state between steps so an interrupted agent can resume from the last known-good state rather than restarting from scratch. This is table stakes for any agent workflow longer than a single LLM call.
Tool schema validation on startup — not at call time — catches drift before it causes incidents. At initialization, the execution layer should verify each tool's schema against the live API specification. Schema mismatches block the agent from starting rather than manifesting as cryptic errors during execution.
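A startup check can be as simple as diffing the agent's cached parameter lists against the live spec. `validate_tool_schemas` and the drift scenario below are hypothetical:

```python
def validate_tool_schemas(cached_tools: dict, live_spec: dict) -> list:
    """Compare the agent's cached tool definitions against a live spec.

    Returns human-readable mismatches; a non-empty list should block
    startup rather than surface later as cryptic call failures.
    """
    problems = []
    for tool, cached_params in cached_tools.items():
        live_params = live_spec.get(tool)
        if live_params is None:
            problems.append(f"{tool}: no longer exists in live API")
            continue
        for missing in set(live_params) - set(cached_params):
            problems.append(f"{tool}: live API now requires '{missing}'")
        for stale in set(cached_params) - set(live_params):
            problems.append(f"{tool}: cached param '{stale}' was removed")
    return problems

# Hypothetical drift: `customer_id` renamed to `account_id` upstream.
cached = {"get_invoice": ["customer_id", "month"]}
live = {"get_invoice": ["account_id", "month"]}
for p in validate_tool_schemas(cached, live):
    print(p)
# Refuse to start the agent if the list is non-empty.
```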
Idempotent tool design enables safe retries. If calling a tool twice with identical parameters produces identical results with no side effects, you can retry on failure without risk. This requires deliberate API design, but it is the foundation that makes circuit breakers and automatic recovery possible without creating duplicate state.
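The pattern can be sketched with a request-derived idempotency key; the in-process dict here stands in for server-side idempotency storage:

```python
import hashlib
import json

_results = {}   # stand-in for the server-side idempotency store

def idempotent_call(tool: str, params: dict, execute):
    """Return the cached result for a repeated (tool, params) request.

    The key is derived from the canonicalized request, so a retry after a
    timeout cannot create duplicate state downstream.
    """
    key = hashlib.sha256(
        json.dumps([tool, params], sort_keys=True).encode()).hexdigest()
    if key not in _results:
        _results[key] = execute(params)
    return _results[key]

counter = {"writes": 0}
def create_ticket(params):
    counter["writes"] += 1
    return {"ticket_id": "T-1", **params}

first = idempotent_call("create_ticket", {"title": "cert expiring"}, create_ticket)
retry = idempotent_call("create_ticket", {"title": "cert expiring"}, create_ticket)
print(first == retry, counter["writes"])   # True 1 -- the retry was a no-op
```

Real APIs implement this server-side (an `Idempotency-Key` header is a common convention), but the client-side shape is the same: identical request, identical result, one side effect.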
Staged context compaction keeps long-running workflows from hitting context limits. Rather than loading the entire conversation history into every request, effective implementations use memory pointers, summarization at checkpoints, and priority-ranked eviction to keep context within budget while preserving the information the agent actually needs.
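A priority-ranked eviction pass might look like the following sketch, where the roles, token counts, and budget are all illustrative; a production version would summarize evicted spans rather than drop them outright:

```python
def compact_context(messages, budget_tokens):
    """Keep high-priority messages; evict old tool outputs first.

    `messages` are dicts with 'role', 'text', and an approximate 'tokens'
    count. Roles rank system > goal > recent_tool > old_tool.
    """
    priority = {"system": 0, "goal": 1, "recent_tool": 2, "old_tool": 3}
    # Consider messages highest-priority first, then oldest first.
    order = sorted(range(len(messages)),
                   key=lambda i: (priority[messages[i]["role"]], i))
    kept, used = set(), 0
    for i in order:
        if used + messages[i]["tokens"] <= budget_tokens:
            kept.add(i)
            used += messages[i]["tokens"]
    return [messages[i] for i in sorted(kept)]   # restore original order

history = [
    {"role": "system", "text": "You are a research agent.", "tokens": 50},
    {"role": "goal", "text": "Summarize Q3 market data.", "tokens": 40},
    {"role": "old_tool", "text": "<huge search dump>", "tokens": 20000},
    {"role": "recent_tool", "text": "<latest result>", "tokens": 800},
]
compacted = compact_context(history, budget_tokens=1000)
print([m["role"] for m in compacted])  # ['system', 'goal', 'recent_tool']
```

The greedy pass guarantees the system prompt and goal always survive; what gets summarized or evicted is the accumulated tool output that caused the overflow in the first place.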
The transition from "AI features" to "AI infrastructure" is where most teams are right now. Ninety percent of agent projects fail within 30 days, and the failure mode is not the model — it is the absence of the surrounding engineering that makes models reliable. Budget enforcement, behavioral observability, checkpointing, and schema validation are not advanced topics. They are prerequisites for operating agents in production.
A Starting Oncall Checklist
Before your next agent deployment goes live, verify these properties exist:
- Per-session budget caps enforced at the execution layer, not in prompts or alerts
- Span-level tracing with tool argument capture and step-count tracking
- Soft loop detection: halt after N consecutive steps without state change
- Context window utilization alerts at 70% and 90% capacity
- Dead-letter queue and human review flow for sessions terminated by circuit breaker
- Schema validation against live API specs on agent startup
- Irreversible action gates requiring human approval regardless of model confidence
- Documented halt reason taxonomy so oncall can distinguish timeout from loop from cost ceiling
The $47,000 loop ran for eleven days because none of these existed. Each item on this list is a specific failure mode that has occurred in production at real organizations. Building them is not premature optimization — it is the baseline for responsible operation.
At 3am, you want a runbook that tells you what to look at and what to do. The work to make that possible happens now.
