The Idempotency Crisis: LLM Agents as Event Stream Consumers

· 11 min read
Tian Pan
Software Engineer

Every event streaming system eventually delivers the same message twice. Network hiccups, broker restarts, offset commit failures — at-least-once delivery is not a bug; it's the contract. Traditional consumers handle this gracefully because they're deterministic: process the same event twice, get the same result, write the same record. The second write is a no-op.

LLMs are not deterministic processors. The same prompt with the same input produces different outputs on each run. Even with temperature=0, floating-point arithmetic, batch composition effects, and hardware scheduling variations introduce variance. Research measuring "deterministic" LLM settings found accuracy differences up to 15% across naturally occurring runs, with best-to-worst performance gaps reaching 70%. At-least-once delivery plus a non-deterministic processor does not give you at-most-once behavior. It gives you unpredictable behavior — and that's a crisis waiting to happen in production.

If your AI agent consumes from Kafka, SQS, or any at-least-once queue, you are already running with this risk. The duplicate might never materialize in dev. It will find your production system at the worst possible moment: during a broker failover, right after your agent approved a loan, or mid-way through a multi-step customer onboarding workflow.

Why Traditional Idempotency Fails for LLMs

Classical idempotency is built on a simple guarantee: f(f(x)) = f(x). Processing an already-processed input changes nothing, because a deterministic replay produces the same output and the same write. This is trivially achievable for deterministic business logic. If your consumer function updates a database row based on a fixed rule, you can safely replay it. The database write either completes or is already done.

LLM inference breaks this in two ways.

First, the output itself is non-deterministic. An LLM deciding "should this transaction be flagged as fraudulent?" might answer "yes" on the first pass and "no" on the second. The event is identical. The environmental state is identical. The answer is different. A deduplication strategy that simply checks whether an event ID was processed before cannot help here, because the question is not whether the event was processed — it's what was decided when it was processed.

Second, LLM side effects are often hard to reverse. A deterministic consumer that double-processes an event might insert a duplicate database row, which you can detect and ignore with a unique constraint. An LLM-powered consumer might send an email, trigger a downstream API call, update a customer's credit tier, or emit a follow-on event. These actions are not naturally idempotent. The constraint you'd normally lean on — replay safety — requires knowing in advance what the LLM will do, which you can't.

The architecture implications are significant: you cannot treat LLM inference as just another function inside an at-least-once pipeline without additional machinery.

The Deduplication Window: Stop Reprocessing, Start Caching

The first pattern to add is a deduplication window — a persistent record of which events have already been processed and what decision was made.

The implementation is straightforward but requires discipline. Before invoking the LLM, your consumer checks a state store (a database table, Redis hash, or Kafka Streams state store) for the incoming event's ID. If the event has been seen before, it retrieves and replays the stored decision rather than rerunning the LLM. If the event is new, it runs the LLM, stores the result, and then performs the side effects.

The storage record needs three fields at minimum: the event ID (or a composite idempotency key), the stored output, and a timestamp. The timestamp is critical because you cannot store event results forever — disk pressure is real. Industry practice for Kafka-based systems gravitates toward 32-day deduplication windows, which covers virtually all realistic broker retry windows while keeping storage manageable.
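The check-then-store flow above can be sketched in a few lines. This is a minimal illustration, assuming an in-memory dict stands in for the durable state store (Redis, Postgres, or a Kafka Streams state store) and `llm_call` is whatever inference client you use:

```python
import time

# 32-day deduplication window, per the retention discussion above.
DEDUP_WINDOW_SECONDS = 32 * 24 * 3600

# event_id -> (stored decision, timestamp). A real system uses a durable store.
_store: dict[str, tuple[str, float]] = {}

def process_event(event_id: str, payload: str, llm_call) -> str:
    """Run the LLM at most once per event ID; replay the stored decision otherwise."""
    now = time.time()
    cached = _store.get(event_id)
    if cached and now - cached[1] < DEDUP_WINDOW_SECONDS:
        return cached[0]          # duplicate delivery: replay, never re-infer
    decision = llm_call(payload)  # first sighting: invoke the model exactly once
    _store[event_id] = (decision, now)
    return decision
```

A duplicate delivery then becomes a cache hit rather than a second, possibly different, inference:

```python
calls = []
def fake_llm(payload):
    calls.append(payload)
    return "approve"

first = process_event("evt-1", "claim", fake_llm)
second = process_event("evt-1", "claim", fake_llm)  # redelivered by the broker
assert first == second and len(calls) == 1          # model ran exactly once
```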

Composite idempotency keys often work better than raw event IDs. An event ID tells you whether you've seen this exact message before; a composite key like customerId:orderId:actionType tells you whether you've performed this logical operation before, regardless of whether the same message was re-emitted with a different envelope. For business operations, logical deduplication is usually what you actually want.

The atomic write is where most implementations go wrong. You must store the LLM decision and mark the event as processed in the same transaction as the downstream state change. If you store the decision, then crash before updating downstream state, you've created an inconsistent system. On retry, you'll replay the stored decision rather than reinvoking the LLM — but downstream state was never updated. The solution is to write the idempotency record and the business state change atomically, using a database transaction or the transactional outbox pattern.
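A sketch of the atomic write, using SQLite as a stand-in for your database. The primary key on the idempotency table doubles as the duplicate guard: if the event was already recorded, the insert fails and the whole transaction, including the business write, rolls back together:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE idempotency (event_id TEXT PRIMARY KEY, decision TEXT, ts REAL);
    CREATE TABLE claims (claim_id TEXT PRIMARY KEY, status TEXT);
""")

def apply_decision(event_id: str, claim_id: str, decision: str) -> None:
    # One transaction: the idempotency record and the business state change
    # both commit, or neither does.
    with conn:
        conn.execute(
            "INSERT INTO idempotency VALUES (?, ?, ?)",
            (event_id, decision, time.time()),
        )
        conn.execute(
            "INSERT INTO claims VALUES (?, ?) "
            "ON CONFLICT(claim_id) DO UPDATE SET status = excluded.status",
            (claim_id, decision),
        )
```

Replaying the same event ID raises an integrity error before the business write commits, so downstream state can never diverge from the idempotency record.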

Separating the Decision Log from the Read Model

Once you have deduplication, you need to think carefully about what you're actually storing and how consumers downstream use it.

The pattern that works best is separating your decision log from your read model. The decision log is an append-only record of every LLM inference: the event ID, the timestamp, the model version, the raw output, and a confidence score if your pipeline produces one. It never changes. The read model is a denormalized projection of current business state that downstream systems query.

When a duplicate event arrives and you retrieve the cached decision, you replay it into the read model idempotently. The read model update logic is deterministic — it takes a stored decision and applies it to current state using fixed rules. The LLM's non-determinism is now isolated to the initial inference, which happens exactly once.
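The split can be sketched with two in-memory structures; `Decision`, `decision_log`, and `read_model` are illustrative names, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Decision:
    event_id: str
    model_version: str
    output: str  # raw LLM output, recorded once, never mutated

decision_log: list[Decision] = []  # append-only record of every inference
read_model: dict[str, str] = {}    # claim_id -> current status projection

def apply(claim_id: str, d: Decision) -> None:
    # Deterministic projection: a fixed rule maps the stored decision to
    # business state. Replaying it any number of times yields the same state.
    read_model[claim_id] = "approved" if d.output == "approve" else "flagged"
```

The non-determinism lives entirely in producing `Decision.output`; everything after that point is a pure function of recorded data.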

This separation also gives you audit and replay capability for free. If a model is rolled back, you can re-process the decision log with the new model behavior, recompute read models, and compare outputs before promoting the change. If a decision is disputed, you have a permanent record of exactly what the model decided and when.

The tradeoff is operational complexity. You're running two writes per event (decision log + read model) and enforcing transactional consistency between them. For high-volume pipelines, this matters. Batch your writes, use connection pooling, and consider whether your state store (Postgres, DynamoDB, Cassandra) is sized for the write amplification. At 100K events per second, even small per-event latency costs accumulate.

Compensating Transactions for Multi-Step Workflows

Single-event consumers are straightforward once you have deduplication and decision logging. Multi-step agentic workflows are harder, because a crash or duplicate mid-workflow can leave the system in a partially applied state that neither continues nor fully unwinds.

The Saga pattern addresses this directly. Instead of one atomic transaction spanning multiple services, a saga is a sequence of local transactions, each paired with a compensating transaction that undoes its effects. If step 3 of a 5-step workflow fails, the saga coordinator emits compensating events for steps 1 and 2.

For LLM-driven workflows, the key constraint is that compensating transactions must be designed before the forward workflow executes. You cannot design a compensating transaction after the LLM decides what it wants to do, because the compensating logic needs to be deterministic and pre-coded. The LLM makes decisions; your infrastructure codes the reversals.

A practical example: an LLM agent processing insurance claims decides to approve a claim and request a payout. The forward events are: mark-claim-approved, schedule-payout, notify-customer. If notification fails and the saga rolls back, compensating events are: cancel-payout, mark-claim-pending, log-rollback. The compensating events are simple rule-based actions that your engineers write. They do not re-invoke the LLM.
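The claim example above can be sketched as a small saga runner. The compensation table is written by engineers ahead of time; the `execute` callable stands in for whatever actually performs each step:

```python
FORWARD = ["mark-claim-approved", "schedule-payout", "notify-customer"]

# Pre-coded, deterministic reversals -- designed before the workflow runs.
COMPENSATION = {
    "mark-claim-approved": "mark-claim-pending",
    "schedule-payout": "cancel-payout",
}

def run_saga(execute) -> list[str]:
    """Run forward steps in order; on failure, emit compensating events
    for completed steps in reverse, then log the rollback."""
    done: list[str] = []
    emitted: list[str] = []
    for step in FORWARD:
        try:
            execute(step)
        except Exception:
            for completed in reversed(done):
                emitted.append(COMPENSATION[completed])  # rule-based, no LLM
            emitted.append("log-rollback")
            break
        done.append(step)
        emitted.append(step)
    return emitted
```

If notification fails, the emitted sequence matches the rollback described above: the two completed steps are compensated in reverse order, then the rollback is logged.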

This pattern demands explicit workflow state. You need to know, at any point in time, which steps have executed and which have been compensated. Store this as a workflow record in a durable store, updated atomically with each step. Tools like Temporal or AWS Step Functions provide this state machine bookkeeping out of the box — their "activity" model maps cleanly to saga steps.

Event Sourcing as Idempotency Foundation

For systems where full auditability and replay are first-class requirements, event sourcing gives you idempotency as a structural property rather than a bolt-on mechanism.

In an event-sourced system, the source of truth is not current state but the ordered log of events that produced that state. Current state is a projection, derived by replaying events from the log. This means you can replay the same sequence of events as many times as you want and always arrive at the same current state — as long as your projection logic is deterministic.

The catch for LLM agents is that the LLM output must itself be recorded as an event. "LLM decided to approve this loan" is an event in the log, not just a step in a function. When you rebuild state by replaying events, you replay the stored LLM decision rather than rerunning the LLM. The projection logic is deterministic because it consumes the recorded decision, not the live model.
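A minimal projection sketch, assuming events are plain dicts and the LLM decision is itself one of the recorded events. Because the projection reads the stored decision rather than calling the model, replaying the log is fully deterministic:

```python
def project(events: list[dict]) -> dict:
    """Rebuild current state by folding over the event log."""
    state: dict = {}
    for e in events:
        if e["type"] == "llm-decision":
            # The model's output was recorded as an event; we consume the
            # record, never the live model.
            state[e["claim_id"]] = e["output"]
        elif e["type"] == "payout-scheduled":
            state[e["claim_id"] + ":payout"] = "scheduled"
    return state
```

Replaying the same log twice necessarily produces identical state, which is exactly the property the live model cannot give you.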

This is essentially the decision log pattern implemented at the infrastructure layer. Event sourcing makes it explicit and enforced by architecture rather than by developer discipline.

The operational costs are real: event logs grow indefinitely and require snapshotting strategies to keep startup replay time bounded. But for AI systems where compliance, auditability, and model rollback are genuine requirements — fraud detection, credit decisioning, medical triage — the properties event sourcing provides are difficult to replicate any other way.

What to Build First

The patterns above stack on each other. If you're adding LLM processing to an existing event-driven system, the pragmatic rollout order is:

  1. Deduplication window first. Add an idempotency key check before every LLM invocation. Store the event ID (or composite key) and the LLM output atomically with your business write. This is the minimum viable protection and takes a day to implement.

  2. Decision log second. Separate your LLM output storage from your business state storage. This costs another day but enables model rollbacks and the audit capability you'll eventually need.

  3. Compensating transactions for any multi-step workflows. If your agent produces a chain of actions, define the compensating events before you build the forward flow. This is harder — it requires thinking through every failure mode — but it's significantly cheaper to design upfront than to retrofit.

  4. Event sourcing for high-compliance domains. Only reach for full event sourcing if your correctness and auditability requirements genuinely demand it. It adds infrastructure complexity that most teams don't need on day one.

The throughput argument for skipping these patterns doesn't hold up at scale. LLM inference already dominates your per-event latency by orders of magnitude. A database write for idempotency tracking adds a few milliseconds to an operation that takes hundreds. The risk/cost ratio is strongly in favor of building this infrastructure early.

The Shift That Changes Everything

Traditional at-least-once resilience is built on the assumption that reprocessing is safe because outputs are reproducible. LLMs dissolve that assumption. The system design response is to make the LLM inference happen exactly once by separating it from the delivery guarantee: at-least-once delivery, exactly-once inference. That's the reframe.

Exactly-once inference means the LLM runs once per logical event, its output is durably stored, and all subsequent processing uses the stored output. The delivery layer can retry as aggressively as it needs to. The inference is isolated behind a deduplication gate. What crosses that gate never touches the model again.

Teams building production AI systems in 2025-2026 are discovering this the hard way. Analysis of over 1,200 LLM deployments found that essentially all mature systems implement some form of message queue with retry logic and circuit breakers — but the majority still treat LLM inference as just another service call rather than as a non-deterministic state transition that requires explicit idempotency contracts. The teams that get this right early ship more reliably, debug incidents faster, and spend less time explaining to stakeholders why the AI gave different answers to the same question.
