The Idempotency Crisis: LLM Agents as Event Stream Consumers
Every event streaming system eventually delivers the same message twice. Network hiccups, broker restarts, offset commit failures — at-least-once delivery is not a bug; it's the contract. Traditional consumers handle this gracefully because they're deterministic: process the same event twice, get the same result, write the same record. The second write is a no-op.
LLMs are not deterministic processors. The same prompt with the same input can produce different outputs on each run. Even at temperature=0, floating-point arithmetic, batch composition effects, and hardware scheduling variations introduce variance. Research measuring "deterministic" LLM settings found accuracy differences of up to 15% across naturally occurring runs, with best-to-worst performance gaps reaching 70%. At-least-once delivery plus a deterministic processor can be hardened into effectively exactly-once behavior; at-least-once delivery plus a non-deterministic processor gives you unpredictable behavior. That is a crisis waiting to happen in production.
If your AI agent consumes from Kafka, SQS, or any at-least-once queue, you are already running with this risk. The duplicate might never materialize in dev. It will find your production system at the worst possible moment: during a broker failover, right after your agent approved a loan, or midway through a multi-step customer onboarding workflow.
Why Traditional Idempotency Fails for LLMs
Classical idempotency is built on a simple guarantee: applying the same operation twice leaves the system in the same state as applying it once, i.e. f(f(x)) = f(x). Process the same event twice, end up with the same state. This is trivially achievable for deterministic business logic. If your consumer function updates a database row based on a fixed rule, you can safely replay it: the database write either completes or is already done.
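For contrast, here is a minimal sketch of a deterministic consumer that is safe under replay. SQLite, the `order_totals` table, and the event field names are illustrative choices, not part of any particular system: the point is that the primary-key constraint turns a second delivery into a no-op.

```python
import sqlite3

conn = sqlite3.connect("orders.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS order_totals (
        order_id    TEXT PRIMARY KEY,  -- unique constraint makes replays no-ops
        total_cents INTEGER NOT NULL
    )
""")

def handle_event(event: dict) -> None:
    """Deterministic consumer: same event in, same row out.
    Replaying the event hits the primary-key constraint and is ignored."""
    conn.execute(
        "INSERT INTO order_totals (order_id, total_cents) VALUES (?, ?) "
        "ON CONFLICT(order_id) DO NOTHING",
        (event["order_id"], event["quantity"] * event["unit_price_cents"]),
    )
    conn.commit()
```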
LLM inference breaks this in two ways.
First, the output itself is non-deterministic. An LLM deciding "should this transaction be flagged as fraudulent?" might answer "yes" on the first pass and "no" on the second. The event is identical. The environmental state is identical. The answer is different. A deduplication strategy that simply checks whether an event ID was processed before cannot help here, because the question is not whether the event was processed — it's what was decided when it was processed.
Second, LLM side effects are often hard to reverse. A deterministic consumer that double-processes an event might insert a duplicate database row, which you can detect and ignore with a unique constraint. An LLM-powered consumer might send an email, trigger a downstream API call, update a customer's credit tier, or emit a follow-on event. These actions are not naturally idempotent. The constraint you'd normally lean on — replay safety — requires knowing in advance what the LLM will do, which you can't.
The architectural implication is significant: you cannot treat LLM inference as just another function inside an at-least-once pipeline without additional machinery.
The Deduplication Window: Stop Reprocessing, Start Caching
The first pattern to add is a deduplication window — a persistent record of which events have already been processed and what decision was made.
The implementation is straightforward but requires discipline. Before invoking the LLM, your consumer checks a state store (a database table, Redis hash, or Kafka Streams state store) for the incoming event's ID. If the event has been seen before, it retrieves and replays the stored decision rather than rerunning the LLM. If the event is new, it runs the LLM, stores the result, and then performs the side effects.
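A minimal sketch of that check-then-run flow, assuming Redis as the state store; `call_llm`, `perform_side_effects`, and the event field names are placeholders invented for illustration:

```python
import json

import redis  # assumes a reachable Redis instance; any persistent store works

r = redis.Redis(decode_responses=True)
WINDOW_SECONDS = 32 * 24 * 3600  # the 32-day deduplication window discussed below

def call_llm(event: dict) -> dict:
    """Placeholder for the real model call (hypothetical)."""
    raise NotImplementedError

def perform_side_effects(decision: dict) -> None:
    """Placeholder for downstream actions: emails, API calls, follow-on events."""
    raise NotImplementedError

def process(event: dict) -> dict:
    key = f"dedup:{event['event_id']}"
    cached = r.get(key)
    if cached is not None:
        # Seen before: replay the stored decision, never rerun the LLM.
        return json.loads(cached)
    decision = call_llm(event)
    # Store the decision, then act on it. Note: these two steps are not
    # crash-safe on their own -- see the atomic-write discussion below.
    r.set(key, json.dumps(decision), ex=WINDOW_SECONDS)
    perform_side_effects(decision)
    return decision
```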
The storage record needs three fields at minimum: the event ID (or a composite idempotency key), the stored output, and a timestamp. The timestamp is critical because you cannot store event results forever — disk pressure is real. Industry practice for Kafka-based systems gravitates toward 32-day deduplication windows, which covers virtually all realistic broker retry windows while keeping storage manageable.
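In table form, a minimal schema for that record might look like the following sketch (SQLite syntax; the table and column names are illustrative), with a periodic cleanup statement enforcing the window:

```python
import sqlite3

conn = sqlite3.connect("idempotency.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS llm_decisions (
        idempotency_key TEXT PRIMARY KEY,  -- event ID or composite key
        decision        TEXT NOT NULL,     -- the stored LLM output, e.g. JSON
        processed_at    TEXT NOT NULL      -- drives expiry of the window
    )
""")
# Run periodically: evict records older than the 32-day window.
conn.execute(
    "DELETE FROM llm_decisions WHERE processed_at < datetime('now', '-32 days')"
)
conn.commit()
```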
Composite idempotency keys often work better than raw event IDs. An event ID tells you whether you've seen this exact message before; a composite key like customerId:orderId:actionType tells you whether you've performed this logical operation before, regardless of whether the same message was re-emitted with a different envelope. For business operations, logical deduplication is usually what you actually want.
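A key builder along those lines might look like this sketch (the event field names are assumptions for illustration):

```python
def idempotency_key(event: dict) -> str:
    # Logical identity: the same customer + order + action counts as "already
    # done", even if the broker re-emitted the message under a new event ID.
    return f"{event['customer_id']}:{event['order_id']}:{event['action_type']}"
```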
The atomic write is where most implementations go wrong. You must store the LLM decision and mark the event as processed in the same transaction as the downstream state change. If you store the decision, then crash before updating downstream state, you've created an inconsistent system. On retry, you'll replay the stored decision rather than reinvoking the LLM — but downstream state was never updated. The solution is to write the idempotency record and the business state change atomically, using a database transaction or the transactional outbox pattern.
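Here is one way the atomic write can look, sketched with SQLite and the `llm_decisions` table from above; the `customers` and `outbox` tables and the credit-tier update are invented for illustration:

```python
import json
import sqlite3

def record_decision_atomically(conn: sqlite3.Connection,
                               key: str, decision: dict) -> None:
    """Write the idempotency record, the business state change, and an
    outbox row in one transaction. Either all three land or none do."""
    try:
        with conn:  # sqlite3 context manager commits or rolls back as a unit
            conn.execute(
                "INSERT INTO llm_decisions (idempotency_key, decision, processed_at) "
                "VALUES (?, ?, datetime('now'))",
                (key, json.dumps(decision)),
            )
            conn.execute(
                "UPDATE customers SET credit_tier = ? WHERE customer_id = ?",
                (decision["credit_tier"], decision["customer_id"]),
            )
            # Transactional outbox: a relay publishes this row to the broker
            # later, so the side effect survives a crash between commit and send.
            conn.execute(
                "INSERT INTO outbox (payload) VALUES (?)",
                (json.dumps(decision),),
            )
    except sqlite3.IntegrityError:
        # Duplicate key: another replay already committed -- safe to ignore.
        pass
```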
Separating the Decision Log from the Read Model
Once you have deduplication, you need to think carefully about what you're actually storing and how consumers downstream use it.
The pattern that works best is separating your decision log from your read model. The decision log is an append-only record of every LLM inference: the event ID, the timestamp, the model version, the raw output, and a confidence score if your pipeline produces one. It never changes. The read model is a denormalized projection of current business state that downstream systems query.
When a duplicate event arrives and you retrieve the cached decision, you replay it into the read model idempotently. The read model update logic is deterministic — it takes a stored decision and applies it to current state using fixed rules. The LLM's non-determinism is now isolated to the initial inference, which happens exactly once.
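A sketch of that split, again in SQLite: the `decision_log` columns follow the fields listed above, while the `fraud_status` read model and its fields are assumptions for illustration.

```python
import json
import sqlite3

def apply_decision(conn: sqlite3.Connection, event_id: str, decision: dict) -> None:
    """Deterministic projection: given a stored decision, the read-model
    update is a fixed rule, so replaying it any number of times is safe."""
    with conn:
        # Append-only decision log; the primary key makes replays no-ops.
        conn.execute(
            "INSERT OR IGNORE INTO decision_log "
            "(event_id, model_version, raw_output, decided_at) "
            "VALUES (?, ?, ?, datetime('now'))",
            (event_id, decision["model_version"], json.dumps(decision)),
        )
        # Idempotent upsert into the read model that downstream systems query.
        conn.execute(
            "INSERT INTO fraud_status (transaction_id, flagged) VALUES (?, ?) "
            "ON CONFLICT(transaction_id) DO UPDATE SET flagged = excluded.flagged",
            (decision["transaction_id"], decision["flagged"]),
        )
```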
This separation also gives you audit and replay capability for free. If the model version changes, whether an upgrade or a rollback, you can reprocess the decision log under the new behavior, recompute the read models, and compare outputs before promoting the change. If a decision is disputed, you have a permanent record of exactly what the model decided and when.
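A replay might look like the sketch below, with the extra assumption that the decision log also retains the original event payload (or can join back to it); `candidate_model` and the shadow table are hypothetical:

```python
import json
import sqlite3

def shadow_replay(conn: sqlite3.Connection, candidate_model) -> None:
    """Re-run logged events through a candidate model into a shadow read
    model, so old and new behavior can be diffed before promotion."""
    rows = conn.execute(
        "SELECT event_id, event_payload FROM decision_log ORDER BY decided_at"
    ).fetchall()
    for event_id, payload in rows:
        decision = candidate_model(json.loads(payload))  # hypothetical hook
        conn.execute(
            "INSERT INTO fraud_status_shadow (transaction_id, flagged) "
            "VALUES (?, ?) "
            "ON CONFLICT(transaction_id) DO UPDATE SET flagged = excluded.flagged",
            (decision["transaction_id"], decision["flagged"]),
        )
    conn.commit()
```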
The tradeoff is operational complexity. You're running two writes per event (decision log + read model) and enforcing transactional consistency between them. For high-volume pipelines, this matters. Batch your writes, use connection pooling, and consider whether your state store (Postgres, DynamoDB, Cassandra) is sized for the write amplification. At 100K events per second, even small per-event latency costs accumulate.
