
Event-Driven Agent Scheduling: Why Cron + REST Calls Fail for Recurring AI Workloads

· 11 min read
Tian Pan
Software Engineer

The most common way teams schedule recurring AI agent jobs is also the most dangerous: a cron entry that fires a REST call every N minutes, which kicks off an LLM workflow, which either finishes or silently doesn't. This pattern feels fine in staging. In production, it creates a class of failures that are uniquely hard to detect, recover from, and reason about.

Cron was designed in 1975 for sysadmin scripts. The assumptions it encodes—short runtime, stateless execution, fire-and-forget outcomes—are wrong for LLM workloads in every dimension. Recurring AI agent jobs are long-running, stateful, expensive, and fail in ways that compound across retries. Using cron to schedule them is not just a reliability risk. It's a visibility risk. When things go wrong, you often won't know.

The Specific Ways Cron Breaks for LLM Jobs

The failure modes aren't theoretical. They follow from the mismatch between what cron assumes and what LLM jobs actually do.

Silent failure. Cron captures exit codes, not business outcomes. An agent that times out waiting on an LLM API, catches the exception, and exits 0 registers as a success. Without active instrumentation, you have no idea whether the job produced output, consumed tokens, or partially updated downstream state. Teams discover these silent failures when customers complain—not from dashboards.

Overlapping execution under load. Cron schedules jobs at fixed intervals with no awareness of whether the previous run is still active. If an LLM job normally takes 45 seconds but an API provider slows down during peak hours and the job runs for 90 seconds, the next cron tick fires into a still-running worker. Now you have two instances in flight against the same data, potentially writing conflicting results to the same records. For multi-step agent workflows, this is particularly destructive because intermediate state gets corrupted by the second instance before the first completes.

Thundering herd. At scale, cron creates synchronization problems. If you have 200 tenant accounts each with their own scheduled agent job set to run at the top of the hour, all 200 fire simultaneously. API rate limits get hit. Token budgets exhaust. The LLM provider returns 429s across the board. Now every job is retrying at the same time, making the problem worse. Regular background jobs can absorb this because they're fast and cheap; LLM jobs are slow and expensive, which means the recovery window is longer and the cost during that window is higher.

Provider outage amplification. When an LLM API experiences a sustained outage—not an immediate 500 but a slow timeout—cron jobs wait out their full timeout window before failing. If your agent has a 3-minute timeout and the provider is down for 20 minutes, you get six or seven full-timeout-duration failures stacked back to back, each consuming a cron slot. The entire scheduling window becomes blocked. Jobs that should have run during the outage are simply missed, with no mechanism for catch-up.

Zero cost attribution. Cron has no concept of which tenant, workflow, or request caused a particular token spend. When a large language model bill arrives and a specific agent is consuming 10x what you projected, there's nothing in the cron infrastructure to help you trace it. Every production team that has scaled LLM workloads has discovered this problem—the hard way.

What Event-Driven Architecture Gives You Instead

The fix isn't about a specific tool. It's about adopting the right model: producers push work onto a queue; workers pull from it asynchronously; the queue itself becomes the source of truth for job state.

This model solves each failure mode above. Workers can use distributed locking or idempotency checks to ensure only one instance processes a given job at a time, eliminating overlapping execution. Messages can be distributed across time using jitter, breaking the thundering herd synchronization. Dead-letter queues capture persistent failures for inspection and replay rather than silently discarding them. And because every message carries metadata—tenant ID, workflow type, triggering event—cost attribution becomes possible.

There's a second benefit that's less obvious: decoupling intake from execution. With cron, the "decision to do work" and the "doing of work" happen in the same process at the same time. With a queue, job creation is synchronous and fast; job execution is asynchronous and can absorb backpressure. When your LLM provider goes down, new jobs keep queuing. When it comes back, workers drain the backlog in a controlled way, with retry policies and rate limits in place.
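This decoupling can be sketched in a few lines. A minimal in-memory illustration, assuming Python's standard-library `queue.Queue` as a stand-in for a real broker; the function names (`enqueue_job`, `drain`) are illustrative, not any specific library's API:

```python
import queue

jobs = queue.Queue()

def enqueue_job(tenant_id, payload):
    """Intake: fast and synchronous -- just record the unit of work."""
    jobs.put({"tenant_id": tenant_id, "payload": payload})

def drain(process, max_jobs=None):
    """Execution: workers pull at their own pace, so a slow or down
    provider only grows the backlog instead of dropping work."""
    done = 0
    while not jobs.empty() and (max_jobs is None or done < max_jobs):
        process(jobs.get())
        done += 1
    return done

# Intake keeps succeeding even if no worker is currently draining.
enqueue_job("acct-1", {"workflow": "daily_summary"})
enqueue_job("acct-2", {"workflow": "daily_summary"})
processed = drain(lambda job: None)
```

The point of the sketch is the shape, not the mechanics: job creation never waits on job execution, and the backlog is visible state you can measure and alert on.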

The Four Architectural Primitives You Need

Any production-grade agent scheduling system needs these four things. Tools vary; the primitives don't.

Idempotency keys. Every job must carry a unique identifier derived from the business operation it represents—not a random UUID generated at enqueue time. Use natural business keys: account_id + workflow_type + time_window. Before processing a message, the worker checks whether that key has already been successfully executed in persistent storage. If yes, it acknowledges the message and exits. This is what protects you from duplicate execution when the message broker delivers a message twice (which all message brokers will do under at-least-once delivery semantics).
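A minimal sketch of the check described above, assuming a plain dict stands in for persistent storage (in production this would be a database table with a unique constraint on the key); all names are illustrative:

```python
completed = {}  # stand-in for durable storage of finished keys

def idempotency_key(account_id, workflow_type, time_window):
    # Natural business key, not a random UUID generated at enqueue time.
    return f"{account_id}:{workflow_type}:{time_window}"

def handle(message, execute):
    key = idempotency_key(message["account_id"],
                          message["workflow_type"],
                          message["time_window"])
    if key in completed:
        return "skipped"          # already done: ack the message and exit
    result = execute(message)
    completed[key] = result       # record success before acknowledging
    return "executed"

msg = {"account_id": "acct-42", "workflow_type": "digest",
       "time_window": "2025-06-01T00"}
first = handle(msg, lambda m: "ok")   # first delivery: work runs
second = handle(msg, lambda m: "ok")  # duplicate delivery: no-op
```

Under at-least-once delivery, the second branch is not an edge case; it is the normal path you are designing for.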

Dead-letter queues. When a job exhausts its retry attempts, it should not be silently dropped. It should move to a separate queue with the original payload, the full retry history, and the reason for each failure. Dead-letter queues are where you do post-mortems, write replay scripts, and discover systemic issues. A DLQ with no messages means either everything is working or failures are being silently swallowed—and you can tell which by whether you have alerting on DLQ depth.
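The DLQ handoff can be sketched as follows, assuming an in-memory list stands in for the dead-letter queue and the retry limit and names are illustrative:

```python
MAX_ATTEMPTS = 3
dead_letter = []  # stand-in for a real dead-letter queue

def process_with_dlq(message, execute):
    """Retry up to MAX_ATTEMPTS; on exhaustion, preserve the payload
    and the full failure history instead of dropping the job."""
    history = []
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return execute(message)
        except Exception as exc:
            history.append({"attempt": attempt, "error": str(exc)})
    dead_letter.append({"payload": message, "retries": history})
    return None

def always_fails(msg):
    raise RuntimeError("provider timeout")

process_with_dlq({"job": "summarize"}, always_fails)
```

The retry history is the part teams usually omit and later wish they had: it tells you whether a job died three different ways or the same way three times.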

Exponential backoff with jitter. Retrying immediately after a failure makes most failure modes worse. The correct pattern: double the wait interval after each failure (1s → 2s → 4s → 8s), then add a random jitter factor. The jitter prevents the retry storm where all failed jobs retry at the same moment. Analyses of backoff strategies, notably AWS's, show that adding jitter substantially reduces retry-induced load spikes compared to fixed-interval or jitter-free retries.
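The delay calculation fits in one function. A sketch using the "full jitter" variant, where the delay is drawn uniformly between zero and the capped exponential interval; the base and cap values are illustrative:

```python
import random

def backoff_delay(attempt, base=1.0, cap=60.0):
    """Exponential backoff with full jitter: the ceiling doubles with
    each attempt (capped), and the actual delay is random below it so
    failed jobs don't all retry at the same instant."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)

delay = backoff_delay(3)  # somewhere between 0 and 8 seconds
```

Full jitter trades a slightly longer expected wait for much lower synchronization, which is exactly the trade you want after a provider-wide failure.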

Circuit breakers. When an LLM provider is consistently returning errors, individual circuit breakers per worker aren't enough—you want a shared state that detects "the provider is down" and stops all workers from attempting calls during the outage window, fast-failing instead of waiting out full timeouts. This transforms a 20-minute provider outage from "20 minutes of full-timeout failures burning cron slots" into "20 minutes of fast failures that drain quickly once the provider recovers."
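A minimal breaker sketch. The threshold and cooldown values are illustrative, and timestamps are passed in explicitly to keep it deterministic; in production the open/closed state would live in shared storage (e.g. Redis) so every worker sees the same view of the provider:

```python
class CircuitBreaker:
    def __init__(self, threshold=5, cooldown=30.0):
        self.threshold = threshold   # consecutive failures before opening
        self.cooldown = cooldown     # seconds of fast-failing while open
        self.failures = 0
        self.opened_at = None

    def allow(self, now):
        if self.opened_at is None:
            return True
        if now - self.opened_at >= self.cooldown:
            # Half-open: let a probe call through after the cooldown.
            self.opened_at = None
            self.failures = 0
            return True
        return False  # open: fast-fail instead of waiting out a timeout

    def record_failure(self, now):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = now

    def record_success(self):
        self.failures = 0
        self.opened_at = None

cb = CircuitBreaker(threshold=2, cooldown=10.0)
cb.record_failure(now=1.0)
cb.record_failure(now=2.0)      # threshold hit: breaker opens
blocked = cb.allow(now=5.0)     # within cooldown: fast-fail
reopened = cb.allow(now=13.0)   # cooldown elapsed: probe allowed
```

The fast-fail path is what converts "every worker burns a full 3-minute timeout" into "every worker returns in milliseconds until the probe succeeds."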

Checkpoint-Resume for Multi-Step Agents

Standard message queue patterns assume jobs are short and can be retried from the beginning on failure. AI agents often can't be. A 10-step research agent that has already fetched five URLs, parsed three documents, and written intermediate results to a database can't simply restart from step one—that means re-incurring those API costs and potentially corrupting state that was already written.

The correct primitive for multi-step agent workflows is checkpoint-resume: at each step boundary, the execution state is persisted. On failure, the job resumes from the last checkpoint rather than from the beginning.
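The mechanics can be sketched without any framework. A toy illustration, assuming a dict stands in for durable checkpoint storage keyed by workflow ID; the step names and the simulated crash are illustrative, and real systems like Temporal handle this transparently:

```python
checkpoints = {}  # stand-in for durable per-workflow state

def run_workflow(workflow_id, steps):
    """Run steps in order, checkpointing after each one. On a retry,
    steps recorded as done are skipped rather than re-executed."""
    state = checkpoints.setdefault(workflow_id, {"done": [], "results": {}})
    for name, fn in steps:
        if name in state["done"]:
            continue                        # already completed: skip
        state["results"][name] = fn(state["results"])
        state["done"].append(name)          # checkpoint the step boundary
    return state["results"]

calls = []
def fetch(results):
    calls.append("fetch")                   # expensive API work
    return ["url-1"]
def parse(results):
    calls.append("parse")
    if len(calls) < 3:
        raise RuntimeError("worker died")   # simulated mid-workflow crash
    return len(results["fetch"])

steps = [("fetch", fetch), ("parse", parse)]
try:
    run_workflow("wf-1", steps)             # crashes during parse
except RuntimeError:
    pass
results = run_workflow("wf-1", steps)       # resumes: fetch is skipped
```

The property to notice: `fetch` ran exactly once despite the crash, so its API cost was paid once and its side effects were not repeated.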

Temporal implements this natively through durable execution. The workflow is written as sequential imperative code, and the platform handles transparent state checkpointing and deterministic replay under the hood. When a worker dies mid-workflow, a new worker picks up the workflow ID and replays from the last checkpoint, skipping already-completed steps. AWS Lambda Durable Functions (released in 2025) provides a similar model in a serverless context—write sequential code, the platform handles checkpointing.

The difference from Celery or BullMQ is significant. Both of those tools are stateless task queues: they can retry a failed task, but they don't know that the failed task was step 5 of a 10-step workflow and that steps 1–4 already completed. You have to build that tracking yourself, which most teams don't, which means agent failures result in full restarts or manual intervention.

The Lindy AI case study from 2025 illustrates the cost of not having this. They used BullMQ initially and discovered silent failures whenever their underlying services timed out, no durable recovery when pod shutdowns hit mid-workflow, and no visibility into why agents had failed. After migrating to Temporal Cloud, they reached 2.5 million Temporal actions per day with dramatically fewer silent failures. The operational complexity of maintaining BullMQ at that scale exceeded the cost of adopting purpose-built infrastructure.

The Observability Layer You Can't Skip

One underrated aspect of this architecture is that a proper message queue setup forces you to instrument things you'd otherwise leave invisible.

Every message should carry structured metadata that flows through the entire execution: tenant or account identifier, workflow type, model tier, triggering event, enqueue timestamp. Workers emit structured logs at each step with token counts, latency, and cost estimates. The aggregate becomes a real-time view of your agent's operational profile.
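One way to make that metadata concrete. A sketch of a message schema and a per-step structured log entry; the field names are illustrative, not a standard schema:

```python
from dataclasses import dataclass, field
import time

@dataclass
class AgentJobMessage:
    tenant_id: str
    workflow_type: str
    model_tier: str
    triggering_event: str
    enqueued_at: float = field(default_factory=time.time)

def record_step(log, message, step, tokens_in, tokens_out, cost_usd):
    """One structured log entry per step, carrying the message metadata
    so cost can later be grouped by tenant or workflow."""
    log.append({
        "tenant_id": message.tenant_id,
        "workflow_type": message.workflow_type,
        "step": step,
        "tokens": tokens_in + tokens_out,
        "cost_usd": cost_usd,
    })

log = []
msg = AgentJobMessage("acct-7", "research", "large", "schedule")
record_step(log, msg, "fetch", tokens_in=1200, tokens_out=300,
            cost_usd=0.012)

# Aggregate into the per-tenant cost view the article describes.
cost_by_tenant = {}
for entry in log:
    cost_by_tenant[entry["tenant_id"]] = (
        cost_by_tenant.get(entry["tenant_id"], 0.0) + entry["cost_usd"])
```

Because the metadata rides on the message itself, attribution works even when the worker that executes a job is several hops removed from the code that enqueued it.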

This matters because AI workloads have variable costs in ways that traditional software doesn't. Two identical-looking jobs can have 10x cost differences depending on input complexity, context size, and how many tool calls the model decides to make. Without per-job cost attribution, you're flying blind on unit economics. Teams processing high LLM request volumes have reported saving significant operational costs simply by adding monitoring that revealed which specific workflows were consuming disproportionate token budgets.

Pair this with DLQ alerting—trigger a notification when DLQ depth exceeds a threshold—and you get a failure signal that's far more actionable than "agent job X failed" buried in a log file somewhere.

Choosing the Right Tool

For teams already in the AWS ecosystem, SQS plus Lambda provides a managed, serverless queue-worker architecture with DLQ support built in. Lambda Durable Functions adds checkpoint-resume capability for multi-step workflows. The operational overhead is low; the primitives are correct.

For teams running Python ML workloads and needing something simpler, Celery with Redis or RabbitMQ covers the basics—though Celery lacks native multi-step state tracking and requires external tooling (Flower) for observability. It's appropriate for short, stateless agent tasks where full restart is acceptable.

For teams running complex, long-duration, multi-step agent workflows at scale, Temporal is the current production-grade answer. It provides durable execution, deterministic replay, built-in UI for workflow inspection, and fine-grained control over retry policies at the step level. The Temporal $146M raise in 2025 was specifically framed around agentic AI demand, which reflects where the market has recognized the need.

Kafka is sometimes proposed for agent scheduling because of its replayability, but it's generally overkill for this problem. Kafka excels at high-throughput event streaming; agent job scheduling needs reliability, state tracking, and operational simplicity more than it needs throughput. Starting with Kafka for agent scheduling is a common overengineering mistake.

The Mental Model Shift

The underlying shift is from thinking about "running a job" to thinking about "processing a unit of work." A job is a process that runs on a schedule. A unit of work is a message that exists until it is successfully processed—and the infrastructure around it ensures that eventually, it will be.

Cron cannot guarantee that. A cron job that fails at 2am is gone. There is no record of what it was supposed to do, what it attempted, or what partial state it left behind. The message queue model makes "work exists until it's done" a first-class guarantee that the infrastructure enforces, rather than something your application code has to implement itself.

That guarantee is what production AI agents actually need. The models are good enough now. The infrastructure around them is where most production failures happen—and where the engineering investment delivers the most return.

The analysis from 1,200 production agent deployments in 2025 found a consistent pattern: teams that prioritized infrastructure guardrails over prompt iteration shipped more reliable systems. The agents that survived production weren't the ones with the smartest prompts. They were the ones that couldn't silently fail.
