SRE for AI Agents: What Actually Breaks at 3am
A market research pipeline ran uninterrupted for eleven days. Two LangChain agents, an Analyzer and a Verifier, passed requests back and forth, made no progress on the original task, and accumulated $47,000 in API charges before anyone noticed. The system never returned an error. No alert fired. The billing dashboard finally caught it, days after the damage was done.
This is not an edge case. It is the canonical AI agent incident. And if you are running agents in production today, your existing SRE runbooks almost certainly do not cover it.
Traditional operations practice is built around a core assumption: failures produce signals. Services crash, timeouts fire, error rates spike. You write alerts that catch those signals, runbooks that explain what they mean, and playbooks that tell oncall how to respond. The system works because broken code tends to announce itself.
AI agents break this assumption. The failure mode is not a crash — it is a system that runs correctly while doing the wrong thing. The LLM responds, the tool calls execute, the loop continues. From an infrastructure perspective, everything is healthy. From a business perspective, you are burning money and accomplishing nothing. The gap between those two perspectives is where agent incidents live.
The Failure Modes That Traditional Monitoring Misses
Agent failures cluster into four categories, each invisible to standard error-rate monitoring.
Infinite tool loops are the most expensive. An agent assigned a goal with no completion criteria will continue issuing tool calls indefinitely. The variant that causes the most damage is the soft loop: the agent is not calling the exact same tool repeatedly (which deduplication would catch), but rather re-phrasing the same query, adding one word to a search, or shifting a parameter slightly, while making no actual progress toward its goal. Each iteration consumes tokens. The context window grows. Cost per call increases as the window fills. Teams typically discover these through billing data, not monitoring.
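One lightweight detector for this pattern compares recent tool arguments by similarity rather than exact match, so rephrased queries still register. The sketch below assumes tool calls are logged as (tool name, argument string) pairs; `is_soft_loop` and its 0.8 similarity threshold are illustrative choices, not part of any framework:

```python
from difflib import SequenceMatcher

def is_soft_loop(calls, window=5, threshold=0.8):
    """Flag a session whose last `window` tool calls are near-duplicates.

    `calls` is a list of (tool_name, argument_string) tuples. The check is
    a similarity ratio, not exact-match deduplication, so rephrased
    queries and one-word tweaks still count as repeats.
    """
    if len(calls) < window:
        return False
    recent = calls[-window:]
    if len({tool for tool, _ in recent}) != 1:
        return False              # different tools: likely real progress
    args = [a for _, a in recent]
    sims = [SequenceMatcher(None, args[i], args[i + 1]).ratio()
            for i in range(len(args) - 1)]
    return min(sims) >= threshold

calls = [("web_search", "ai agent market size 2025"),
         ("web_search", "ai agent market size in 2025"),
         ("web_search", "the ai agent market size 2025"),
         ("web_search", "ai agent market size 2025 usa"),
         ("web_search", "ai agent market size 2025 us")]
print(is_soft_loop(calls))  # True: five rephrasings of one query
```

The threshold needs tuning per tool; search queries tolerate a lower bar than structured API arguments.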
Context overflow in long-running workflows appears when a multi-step task accumulates tool outputs, prior reasoning, and conversation history until the context window fills. Unlike a memory allocation failure, this does not crash the process — the agent simply starts losing track of earlier steps, producing increasingly incoherent reasoning while continuing to execute. One documented materials science workflow consumed 20 million tokens before reaching this state; re-implementing it with memory pointers (short identifiers replacing full data) reduced token usage by over 99%.
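The memory-pointer pattern can be sketched in a few lines: large tool outputs live in a store, and only a short identifier enters the context window. `MemoryStore` and its `mem:` key format are hypothetical names for illustration:

```python
import hashlib

class MemoryStore:
    """Replace bulky tool outputs with short pointers in the agent context."""
    def __init__(self):
        self._data = {}

    def put(self, payload: str) -> str:
        key = "mem:" + hashlib.sha256(payload.encode()).hexdigest()[:8]
        self._data[key] = payload
        return key            # only this short id enters the context window

    def get(self, key: str) -> str:
        return self._data[key]

store = MemoryStore()
raw = "x" * 200_000          # a large tool output that would bloat context
ptr = store.put(raw)
print(ptr, len(ptr))         # e.g. mem:3f1a9c2b 12
savings = 1 - len(ptr) / len(raw)
print(f"{savings:.2%}")      # 99.99%, the order of reduction reported above
```

The agent dereferences a pointer only when a step actually needs the full payload, which is what keeps long workflows from dragging every prior output into every request.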
Hallucinated API calls occur when an agent constructs a request that looks syntactically valid but uses parameters that do not exist in the actual API schema. The specific failure mode that causes runaway cost is this: the agent cannot distinguish "the API rejected my request" from "the task is impossible," so it retries the same malformed call, with a growing context window, hundreds of times. One reported incident cost $2,000 from 847 retries of the same failed call.
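A guard that caps retries of an identical failing call is straightforward to sketch. `RetryGuard` and its failure threshold are illustrative, not a real library API:

```python
class RetryGuard:
    """Stop an agent from re-issuing the same failing call indefinitely.

    Tracks failures per (tool, frozen-arguments) key; after max_failures
    the call is escalated instead of retried, which is the difference
    between a few dollars of waste and a $2,000 incident.
    """
    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = {}

    def check(self, tool, args):
        key = (tool, tuple(sorted(args.items())))
        if self.failures.get(key, 0) >= self.max_failures:
            raise RuntimeError(
                f"escalate: {tool} failed {self.max_failures}x with same args")
        return key

    def record_failure(self, key):
        self.failures[key] = self.failures.get(key, 0) + 1

guard = RetryGuard(max_failures=3)
for attempt in range(5):
    try:
        key = guard.check("get_report", {"report_idd": "q3"})  # hallucinated param
        # ... the real API rejects the unknown parameter ...
        guard.record_failure(key)
    except RuntimeError as e:
        print(attempt, e)   # fires on the 4th attempt, not the 847th
        break
```

Freezing the arguments into the key is what distinguishes "same malformed call again" from a legitimate retry with corrected parameters.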
Schema drift and credential rot are the most common silent production failures. When a tool's API schema changes — parameter names renamed, required fields added, authentication flows updated — the agent continues operating against its cached understanding of the schema. Calls fail with validation errors the agent does not interpret correctly, leading to confused behavior rather than clean error escalation. Certificate expirations discovered only after months-long silent renewal failures are another variant of this pattern.
The compound math is brutal. An agent with 85% per-step accuracy achieves roughly 20% end-to-end success on a ten-step workflow. That number assumes failures produce clean signals. When failures produce silent loops instead, the effective success rate looks fine while cost and time spiral.
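The compound figure follows directly from treating each step as an independent trial:

```python
per_step = 0.85
steps = 10
# End-to-end success is the product of per-step success probabilities.
print(f"{per_step ** steps:.1%}")  # 19.7%
```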
What Your SRE Stack Is Not Measuring
Most observability stacks track availability, latency, and error rates. These metrics are necessary but insufficient for agents. A healthy-looking system can simultaneously be:
- Burning 10x the expected token budget on a single session
- Making no progress on the user's actual goal
- Running a tool loop that has been active for six hours
- Executing against a schema that has been deprecated for three months
The observability gap is structural. Token cost is an asynchronous metric — by the time a budget alert fires on a daily spend threshold, additional API cycles have already completed. Progress toward goal is a semantic metric — it does not appear in HTTP status codes or database write counts. Tool argument similarity is a per-call behavioral metric that requires instrumentation most teams have not built.
The industry is converging on OpenTelemetry as the standard framework for agent tracing. The key insight driving adoption is the distinction between infrastructure observability (is the service running?) and behavioral observability (is the agent doing the right thing?). You need both.
What behavioral observability requires:
- Per-step token consumption tracked against per-session budgets, not daily aggregates
- Tool invocation history with argument deduplication to detect soft loops
- Step count toward goal with a halt trigger after N steps without measurable state change
- Context window utilization tracked as a percentage, with alerts at 70% capacity
- Halt reason codes that distinguish timeout, max_steps, cost ceiling, and error from intentional completion
A three-tier tracing architecture — trace level for full user interactions, span level for individual LLM calls and tool invocations, attribute level for structured metadata — gives you the granularity to debug agent incidents without reconstructing behavior from logs after the fact.
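The three tiers can be illustrated with plain dictionaries; in a real deployment these would map onto OpenTelemetry traces, spans, and attributes rather than a custom structure, and the field names here are assumptions for the sketch:

```python
import time
import uuid

def new_trace(session_id):
    """Trace level: one record per full user interaction."""
    return {"trace_id": uuid.uuid4().hex, "session": session_id, "spans": []}

def record_span(trace, kind, name, **attributes):
    """Span level: one record per LLM call or tool invocation,
    carrying attribute-level structured metadata."""
    span = {"span_id": uuid.uuid4().hex[:16], "kind": kind, "name": name,
            "start": time.time(), "attributes": attributes}
    trace["spans"].append(span)
    return span

trace = new_trace("sess-042")
record_span(trace, "llm_call", "plan_step",
            tokens_in=1850, tokens_out=240, cost_usd=0.031)
record_span(trace, "tool_call", "web_search",
            arguments="ai agent market size 2025",
            halt_reason=None, context_utilization=0.62)

# Attribute-level metadata is what makes behavioral queries possible:
total_cost = sum(s["attributes"].get("cost_usd", 0) for s in trace["spans"])
print(len(trace["spans"]), f"{total_cost:.3f}")  # 2 0.031
```

Queries like "sessions with repeated tool arguments" or "cost per step over time" fall out of the attribute layer; without it, you are back to reconstructing behavior from logs.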
Cost Circuit Breakers: Why Alerts Are Not Enforcement
The $47,000 incident is instructive precisely because the team was not negligent. They had billing alerts. The alerts worked as designed. The problem is that budget alerts report what already happened; they do not prevent the next API call from happening.
Effective cost control for agents requires synchronous enforcement at the API call level. Before each LLM request or tool invocation, the execution layer checks the session's remaining budget. If the budget is exhausted, the call does not happen. This is a fundamentally different architecture from alerting on aggregate spend.
The practical implementation sits at the proxy or middleware layer, independent of agent framework code. Keeping enforcement out of agent prompts is not optional — any cost limit expressed in a prompt is a soft constraint that the model can reason around, override in edge cases, or simply ignore under adversarial input. Enforcement belongs in code that runs before the model sees the request.
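A minimal sketch of synchronous enforcement, assuming the execution layer can estimate cost before each call; all class and function names here are illustrative:

```python
class BudgetExceeded(Exception):
    pass

class SessionBudget:
    """Hard per-session cap checked *before* each call, not after the bill."""
    def __init__(self, limit_usd: float):
        self.limit = limit_usd
        self.spent = 0.0

    def charge(self, estimated_cost: float):
        if self.spent + estimated_cost > self.limit:
            raise BudgetExceeded(
                f"session at ${self.spent:.2f} of ${self.limit:.2f}; call blocked")
        self.spent += estimated_cost

def guarded_llm_call(budget, prompt, est_cost, llm):
    budget.charge(est_cost)      # raises before the request is ever sent
    return llm(prompt)

budget = SessionBudget(limit_usd=0.50)   # operational-agent default above
calls = 0
try:
    while True:                  # a runaway loop...
        guarded_llm_call(budget, "retry the search", 0.03, lambda p: "ok")
        calls += 1
except BudgetExceeded as e:
    print(calls, e)              # stops after 16 calls, not eleven days
```

The key property is that `charge` runs in the middleware, before the model sees the request; nothing the model generates can talk its way past the cap.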
A reasonable default budget structure by agent type:
- Research agents (unbounded exploration tasks): $5 per session
- Operational agents (specific task execution): $0.50 per session
- Conversational agents (user-facing dialogue): $0.10 per turn
These are starting points, not universal values. The important property is that they are per-session hard caps, not daily aggregates. A runaway agent that hits its per-session ceiling at 11pm costs you the session budget, not eleven days of API charges.
Layered defenses add resilience: per-task ceilings inside per-session budgets inside per-agent-type fleet ceilings. The fleet ceiling is the circuit breaker that prevents a single misconfigured agent deployment from affecting the broader system.
Writing Runbooks That Actually Cover Agent Incidents
Traditional runbook structure maps well to infrastructure failures: symptom, diagnosis commands, remediation steps, escalation path. Agent incidents require the same structure but different content.
Symptom detection for agent incidents:
- Cost alert fires with no corresponding error rate increase → investigate active sessions for loops
- Session duration exceeds P99 for agent type → check step count and argument similarity in traces
- Tool error rate spikes on specific endpoint → check for schema drift against current API spec
- User goal completion rate drops while technical success rate holds → semantic regression; audit recent prompt, model, or tool-description changes
Diagnosis for infinite loops:
Pull the span trace for the session. Look for repeated (tool, arguments) pairs with argument similarity above a threshold. Check whether state has changed between steps — if the agent's understanding of the world has not updated across five consecutive steps, it is in a soft loop. Cost-per-step visualization across the session will show the characteristic upward slope of context window inflation.
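The state-change check can be made concrete by fingerprinting the agent's working state after each step; the `stalled` helper and the five-step window here are illustrative:

```python
import hashlib
import json

def state_fingerprint(state: dict) -> str:
    """Stable hash of the agent's working state after a step."""
    return hashlib.sha256(
        json.dumps(state, sort_keys=True).encode()).hexdigest()

def stalled(fingerprints, window=5):
    """True if the last `window` steps produced no state change at all."""
    return (len(fingerprints) >= window
            and len(set(fingerprints[-window:])) == 1)

history = []
state = {"findings": [], "queries_tried": 0}
for step in range(8):
    state["queries_tried"] += 1 if step < 3 else 0   # progress stops at step 3
    history.append(state_fingerprint(state))
    if stalled(history):
        print(f"soft loop detected at step {step}")
        break
```

What counts as "state" is a design decision: findings, filled slots in a plan, or rows written, but not the raw transcript, which grows every step regardless of progress.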
Remediation:
The immediate action for a runaway session is termination via the execution layer — session kill commands should be in the runbook alongside service restart commands. After termination, the task enters a dead-letter queue for human review. Do not automatically retry without first understanding why the loop formed.
Escalation thresholds:
Unlike infrastructure incidents where you escalate based on error rate or user impact, agent incidents escalate on: budget ceiling hit (immediate), step count above 3x the agent type's P99 (investigate within 15 minutes), tool error rate above 5% for schema-related errors (investigate for drift within one hour). Human-in-the-loop gates should be mandatory for any agent action that is irreversible — file deletions, external API writes, payment operations — regardless of the agent's confidence level.
The Infrastructure Layer Your Agent Needs
Most agent reliability problems are not model problems. They are infrastructure problems that look like model problems because the infrastructure layer does not exist yet.
The components that prevent the failure modes above:
Durable execution frameworks (such as Temporal or LangGraph's persistence layer) enable checkpointing — saving execution state between steps so an interrupted agent can resume from the last known-good state rather than restarting from scratch. This is table stakes for any agent workflow longer than a single LLM call.
Tool schema validation on startup — not at call time — catches drift before it causes incidents. At initialization, the execution layer should verify each tool's schema against the live API specification. Schema mismatches block the agent from starting rather than manifesting as cryptic errors during execution.
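A startup check can be as simple as diffing the agent's cached parameter lists against the live spec. `validate_tool_schemas` and the drift scenario below are hypothetical:

```python
def validate_tool_schemas(cached_tools: dict, live_spec: dict) -> list:
    """Compare the agent's cached tool definitions against a live spec.

    Returns human-readable mismatches; a non-empty list should block
    startup rather than surface later as cryptic call failures.
    """
    problems = []
    for tool, cached_params in cached_tools.items():
        live_params = live_spec.get(tool)
        if live_params is None:
            problems.append(f"{tool}: no longer exists in live API")
            continue
        for missing in set(live_params) - set(cached_params):
            problems.append(f"{tool}: live API now requires '{missing}'")
        for stale in set(cached_params) - set(live_params):
            problems.append(f"{tool}: cached param '{stale}' was removed")
    return problems

# Hypothetical drift: `customer_id` renamed to `account_id` upstream.
cached = {"get_invoice": ["customer_id", "month"]}
live = {"get_invoice": ["account_id", "month"]}
for p in validate_tool_schemas(cached, live):
    print(p)
# Refuse to start the agent if the list is non-empty.
```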
Idempotent tool design enables safe retries. If calling a tool twice with identical parameters produces identical results with no side effects, you can retry on failure without risk. This requires deliberate API design, but it is the foundation that makes circuit breakers and automatic recovery possible without creating duplicate state.
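The pattern can be sketched with a request-derived idempotency key; the in-process dict here stands in for server-side idempotency storage:

```python
import hashlib
import json

_results = {}   # stand-in for the server-side idempotency store

def idempotent_call(tool: str, params: dict, execute):
    """Return the cached result for a repeated (tool, params) request.

    The key is derived from the canonicalized request, so a retry after a
    timeout cannot create duplicate state downstream.
    """
    key = hashlib.sha256(
        json.dumps([tool, params], sort_keys=True).encode()).hexdigest()
    if key not in _results:
        _results[key] = execute(params)
    return _results[key]

counter = {"writes": 0}
def create_ticket(params):
    counter["writes"] += 1
    return {"ticket_id": "T-1", **params}

first = idempotent_call("create_ticket", {"title": "cert expiring"}, create_ticket)
retry = idempotent_call("create_ticket", {"title": "cert expiring"}, create_ticket)
print(first == retry, counter["writes"])   # True 1 -- the retry was a no-op
```

Real APIs implement this server-side (an `Idempotency-Key` header is a common convention), but the client-side shape is the same: identical request, identical result, one side effect.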
Staged context compaction keeps long-running workflows from hitting context limits. Rather than loading the entire conversation history into every request, effective implementations use memory pointers, summarization at checkpoints, and priority-ranked eviction to keep context within budget while preserving the information the agent actually needs.
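A priority-ranked eviction pass might look like the following sketch, where the roles, token counts, and budget are all illustrative; a production version would summarize evicted spans rather than drop them outright:

```python
def compact_context(messages, budget_tokens):
    """Keep high-priority messages; evict old tool outputs first.

    `messages` are dicts with 'role', 'text', and an approximate 'tokens'
    count. Roles rank system > goal > recent_tool > old_tool.
    """
    priority = {"system": 0, "goal": 1, "recent_tool": 2, "old_tool": 3}
    # Consider messages highest-priority first, then oldest first.
    order = sorted(range(len(messages)),
                   key=lambda i: (priority[messages[i]["role"]], i))
    kept, used = set(), 0
    for i in order:
        if used + messages[i]["tokens"] <= budget_tokens:
            kept.add(i)
            used += messages[i]["tokens"]
    return [messages[i] for i in sorted(kept)]   # restore original order

history = [
    {"role": "system", "text": "You are a research agent.", "tokens": 50},
    {"role": "goal", "text": "Summarize Q3 market data.", "tokens": 40},
    {"role": "old_tool", "text": "<huge search dump>", "tokens": 20000},
    {"role": "recent_tool", "text": "<latest result>", "tokens": 800},
]
compacted = compact_context(history, budget_tokens=1000)
print([m["role"] for m in compacted])  # ['system', 'goal', 'recent_tool']
```

The greedy pass guarantees the system prompt and goal always survive; what gets summarized or evicted is the accumulated tool output that caused the overflow in the first place.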
The transition from "AI features" to "AI infrastructure" is where most teams are right now. Ninety percent of agent projects fail within 30 days, and the failure mode is not the model — it is the absence of the surrounding engineering that makes models reliable. Budget enforcement, behavioral observability, checkpointing, and schema validation are not advanced topics. They are prerequisites for operating agents in production.
A Starting Oncall Checklist
Before your next agent deployment goes live, verify these properties exist:
- Per-session budget caps enforced at the execution layer, not in prompts or alerts
- Span-level tracing with tool argument capture and step-count tracking
- Soft loop detection: halt after N consecutive steps without state change
- Context window utilization alerts at 70% and 90% capacity
- Dead-letter queue and human review flow for sessions terminated by circuit breaker
- Schema validation against live API specs on agent startup
- Irreversible action gates requiring human approval regardless of model confidence
- Documented halt reason taxonomy so oncall can distinguish timeout from loop from cost ceiling
The $47,000 loop ran for eleven days because none of these existed. Each item on this list is a specific failure mode that has occurred in production at real organizations. Building them is not premature optimization — it is the baseline for responsible operation.
At 3am, you want a runbook that tells you what to look at and what to do. The work to make that possible happens now.
