
Silent Async Agent Failures: Why Your AI Jobs Die Without Anyone Noticing

· 9 min read
Tian Pan
Software Engineer

Async AI jobs have a problem that traditional background workers don't: they fail silently and confidently. A document processing agent returns HTTP 200, logs a well-formatted result, and moves on — while the actual output is subtly wrong, partially complete, or based on a hallucinated fact three steps back. Your dashboards stay green. Your on-call engineer sleeps through it. Your customers eventually notice.

This is not an edge case. It's the default behavior of async AI systems that haven't been deliberately designed for observability. The tools that keep background job queues reliable in conventional distributed systems — dead letter queues, idempotency keys, saga logs — also work for AI agents. But the failure modes are different enough that they require some translation.

Why Async AI Failures Are Different From Normal Job Failures

A conventional background job fails in one of a few recognizable ways: exception thrown, timeout exceeded, queue message unacknowledged. Each failure has a clear signal. The retry logic kicks in, the dead-letter queue captures it after exhaustion, and an engineer can replay or inspect the failed payload.

AI agent jobs fail in all of those ways, plus several new ones:

  • Semantic failures: The agent completes successfully from the queue's perspective — it returned a result, consumed tokens, updated a database — but the result is wrong. An inventory enrichment agent invents a product SKU. A summarization agent drops the most important clause. A multi-hop research job "answers" a question by confidently citing the wrong source.
  • Silent partial completion: The agent processes steps 1 through 4 of a 6-step workflow, then stalls mid-way. The job is marked complete because the final callback fired. Steps 5 and 6 never ran.
  • Retry-induced duplication: An LLM call times out after 28 seconds. The caller retries. The original call completes 3 seconds later. Now two invocations of the same action have both fired, and unlike a keyed database upsert, an email send or an external API call is rarely idempotent by default.
  • Ghost loops: A retry storm driven by a subtly malformed prompt burns through $15,000 in API spend over a weekend. Nobody finds out until the billing invoice arrives.

The thread connecting all of these is that AI jobs conflate execution success with correctness. Infrastructure-level metrics — job completion rate, queue depth, error count — don't capture whether the agent actually did the right thing.
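One way to separate the two signals is to run a correctness check after the job reports success. The sketch below is illustrative only: `JobResult`, `correctness_problems`, the step names, and the SKU catalog are all hypothetical, and a real check would depend on what the agent is supposed to produce.

```python
from dataclasses import dataclass, field


@dataclass
class JobResult:
    """What the worker reports back. All field names are assumptions."""
    status: str                                   # what the queue sees
    steps_run: list[str] = field(default_factory=list)
    output: dict = field(default_factory=dict)


def correctness_problems(result: JobResult, expected_steps: list[str],
                         known_skus: set[str]) -> list[str]:
    """Return problems that the queue-level status does not reveal."""
    problems = []
    # Silent partial completion: compare the steps that ran against the plan.
    missing = [s for s in expected_steps if s not in result.steps_run]
    if missing:
        problems.append(f"steps never ran: {missing}")
    # Semantic failure: a SKU produced by the agent must exist in the catalog.
    sku = result.output.get("sku")
    if sku is not None and sku not in known_skus:
        problems.append(f"unknown SKU: {sku}")
    return problems


# The queue considers this job done, yet two things are wrong with it.
result = JobResult(
    status="completed",
    steps_run=["extract", "classify", "enrich", "transform"],
    output={"sku": "SKU-9999"},
)
problems = correctness_problems(
    result,
    expected_steps=["extract", "classify", "enrich", "transform", "store", "notify"],
    known_skus={"SKU-1001", "SKU-1002"},
)
for problem in problems:
    print(problem)
```

The job's `status` stays "completed" throughout; only the second check surfaces the skipped steps and the invented SKU.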

Dead Letter Queues as a Control Plane, Not a Trash Can

Most engineers reach for dead letter queues as a disposal mechanism: messages land there when everything else has failed, and someone eventually looks at them. This framing is too passive for AI agent systems.

A well-designed DLQ for AI jobs is a control plane. It captures the original payload, the full retry history, the failure reason, and any relevant entity IDs. It exposes a replay interface. It routes different failure classes to different handlers.
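As a rough sketch of that shape, the record and queue below hold the fields listed above and expose routing and replay. Everything here is an assumption for illustration: the class names, the failure class strings, and the in-memory storage stand in for whatever broker and schema a real system uses.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DeadLetter:
    payload: dict                  # original job payload, unmodified
    failure_class: str             # e.g. "context_overflow", "policy_violation"
    failure_reason: str
    entity_ids: list[str] = field(default_factory=list)
    retry_history: list[str] = field(default_factory=list)  # one entry per attempt


class DeadLetterQueue:
    """In-memory stand-in for a real DLQ: routes by failure class, supports replay."""

    def __init__(self) -> None:
        self.channels: dict[str, list[DeadLetter]] = {}
        self.handlers: dict[str, Callable[[DeadLetter], None]] = {}

    def register(self, failure_class: str,
                 handler: Callable[[DeadLetter], None]) -> None:
        self.handlers[failure_class] = handler

    def capture(self, letter: DeadLetter) -> None:
        # Store the letter in its channel, then notify the handler if one exists.
        self.channels.setdefault(letter.failure_class, []).append(letter)
        handler = self.handlers.get(letter.failure_class)
        if handler is not None:
            handler(letter)

    def replay(self, failure_class: str,
               submit: Callable[[dict], None]) -> int:
        """Resubmit every payload in a channel; return how many were replayed."""
        letters = self.channels.pop(failure_class, [])
        for letter in letters:
            submit(letter.payload)
        return len(letters)


dlq = DeadLetterQueue()
review_queue: list[DeadLetter] = []
dlq.register("policy_violation", review_queue.append)
dlq.capture(DeadLetter(
    payload={"doc_id": "d-1"},
    failure_class="policy_violation",
    failure_reason="blocked by content policy",
    entity_ids=["d-1"],
    retry_history=["attempt 1: rejected"],
))

resubmitted: list[dict] = []
replayed = dlq.replay("policy_violation", resubmitted.append)
```

Because the payload is stored untouched, a replay after a prompt fix resubmits exactly what failed the first time.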

The distinction between failure classes matters more for AI than for conventional jobs. Transient failures — rate limit hits, upstream timeouts, transient network errors — deserve automatic retries with exponential backoff. Permanent failures — malformed prompts, policy violations, context window overflows — don't benefit from retrying and should go directly to human review.
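A minimal sketch of that split, assuming the worker can raise two hypothetical exception types, `TransientError` and `PermanentError`, to signal which class a failure belongs to. The attempt limit, base delay, and cap are placeholder values, and the jitter is an addition that the text above does not mention.

```python
import random
import time


class TransientError(Exception):
    """Rate limits, upstream timeouts, network blips: worth retrying."""


class PermanentError(Exception):
    """Malformed prompts, policy violations, context overflows: do not retry."""


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def run_with_retries(job, max_attempts: int = 4, sleep=time.sleep):
    """Return (outcome, value), where outcome is 'ok', 'human_review', or 'dead_letter'."""
    last_error = ""
    for attempt in range(max_attempts):
        try:
            return ("ok", job())
        except PermanentError as exc:
            # Retrying cannot help, so skip the retry budget entirely.
            return ("human_review", str(exc))
        except TransientError as exc:
            last_error = str(exc)
            if attempt < max_attempts - 1:
                sleep(backoff_delay(attempt))
    return ("dead_letter", last_error)


calls = {"n": 0}


def flaky_job():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("rate limited")
    return "summary"


def overflowing_job():
    raise PermanentError("context window overflow")


# sleep is stubbed out so the example runs instantly.
outcome = run_with_retries(flaky_job, sleep=lambda seconds: None)
print(outcome)
print(run_with_retries(overflowing_job, sleep=lambda seconds: None))
```

The permanent failure costs one attempt instead of four, which is where the token savings come from.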

A naive retry policy that treats all failures the same will exhaust its retries on unrecoverable errors, burning tokens and money on calls that can never succeed. Worse, it does nothing about semantic errors, the kind that return a success response code but corrupt downstream state, because they never register as failures in the first place.

Routing AI-specific failure types to specific DLQ channels also creates useful telemetry: a sudden spike in context-overflow failures tells you something changed about the input distribution. A cluster of policy-violation failures might indicate a prompt injection attempt. The dead-letter queue becomes a passive sensor for production drift.
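A simple version of that sensor compares per-class failure counts in the current window against a baseline window. This is a sketch under assumptions: the function name, the `factor` and `min_count` thresholds, and the failure class labels are all hypothetical, and a production system would likely use its existing metrics and alerting stack instead.

```python
from collections import Counter


def spiking_classes(baseline: Counter, current: Counter,
                    factor: float = 3.0, min_count: int = 5) -> list[str]:
    """Return failure classes whose current count exceeds factor times the baseline."""
    spikes = []
    for failure_class, count in current.items():
        # min_count filters out noise from classes that rarely fail at all.
        if count >= min_count and count > factor * baseline.get(failure_class, 0):
            spikes.append(failure_class)
    return sorted(spikes)


# Counts per DLQ channel for last week's window and for the current one.
baseline = Counter({"context_overflow": 2, "rate_limit": 40})
current = Counter(
    ["context_overflow"] * 11 + ["rate_limit"] * 42 + ["policy_violation"] * 6
)
spikes = spiking_classes(baseline, current)
print(spikes)
```

Here the steady stream of rate limit failures is ignored, while the jump in context overflows and the newly appearing policy violations are both flagged.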

Idempotency Keys at Every Step, Not Just the Entry Point

Most teams bolt idempotency keys onto the API layer — generate a key per incoming request, deduplicate at the queue intake. This is necessary but not sufficient for multi-step AI workflows.

The problem surfaces in the middle of a workflow. Consider a 5-step document processing pipeline: extract, classify, enrich, transform, and store. Steps 1 through 3 complete. Step 4 times out. The orchestrator retries from the last checkpoint. If that checkpoint predates step 3 and step 3 wasn't given its own idempotency key, the enrichment call fires again. Depending on the tool being called, this might be benign (a pure lookup) or catastrophic (an action that creates a record, sends a message, or bills a customer).

The correct model is per-step idempotency: every action that has side effects gets its own key derived from the workflow execution ID and the step index. The orchestrator checks whether this key has already been recorded as complete before calling the action. If it has, the step is skipped. If it hasn't, the action fires and the result is stored atomically.
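The check-then-record loop can be sketched with SQLite from the Python standard library. The `StepLedger` class, the table name, and the `execution_id:step_index` key format are assumptions made for illustration, not a prescribed schema.

```python
import json
import sqlite3


def step_key(execution_id: str, step_index: int) -> str:
    """Derive the per-step key from the workflow execution ID and the step index."""
    return f"{execution_id}:{step_index}"


class StepLedger:
    """Records completed steps so that a retried workflow skips them."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS completed_steps "
            "(key TEXT PRIMARY KEY, result TEXT NOT NULL)"
        )

    def run_once(self, execution_id: str, step_index: int, action):
        key = step_key(execution_id, step_index)
        row = self.conn.execute(
            "SELECT result FROM completed_steps WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0])  # already complete: skip the side effect
        result = action()
        # The connection context manager commits on success, rolls back on error.
        with self.conn:
            self.conn.execute(
                "INSERT INTO completed_steps (key, result) VALUES (?, ?)",
                (key, json.dumps(result)),
            )
        return result


ledger = StepLedger(sqlite3.connect(":memory:"))
sent = []


def enrich():
    sent.append("enrichment call")  # stands in for a call with side effects
    return {"enriched": True}


first = ledger.run_once("exec-42", 3, enrich)
second = ledger.run_once("exec-42", 3, enrich)  # retry after step 4 timed out
print(len(sent))
```

One limitation of this sketch: if the process crashes after `action()` returns but before the insert commits, the step will run again on retry. Closing that gap usually means passing the same key to the downstream tool so that it can deduplicate on its side as well.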
