
When Your AI Agent Consumes from Kafka: The Design Assumptions That Break

· 11 min read
Tian Pan
Software Engineer

The standard mental model for AI agents assumes HTTP: a client sends a request, the agent processes it, returns a response. Clean, synchronous, easy to reason about. When an LLM-powered function fails, you get an error code. When it succeeds, you move on.

Once you swap that HTTP interface for a Kafka topic or SQS queue, every one of those assumptions starts to crack. The queue guarantees at-least-once delivery. Your agent is stochastic. That combination produces failure modes that don't exist in deterministic systems—and the fixes aren't the same ones that work for traditional microservices.

This post covers what actually changes when AI agents consume from message queues: idempotency, ordering, backpressure, dead-letter handling, and the specific failure mode where a replayed message triggers different agent behavior the second time around.

The Problem with "At-Least-Once" and a Stochastic Consumer

In classical distributed systems, at-least-once delivery is manageable because deterministic functions are idempotent. If your function computes hash(message), running it twice on the same message produces the same output. The duplicate is harmless.

LLMs are not deterministic. Even with temperature=0, floating-point non-determinism across hardware configurations, batching effects, and provider-side implementation details can produce different outputs on repeated invocations of the same prompt. At non-zero temperatures, divergence is a design feature, not a bug.

This breaks the idempotency assumption that queue-based systems are built on. Consider a concrete scenario:

  1. Your agent consumes a message, calls the LLM, and writes "Summary A" to the database.
  2. Before committing its offset, the consumer crashes.
  3. The broker redelivers the message.
  4. The agent calls the LLM again and produces "Summary B."
  5. You now have two summaries for one message, from the same source, and no way to know which is authoritative.

With a deterministic function, both runs would have written the same value. The duplicate write would be harmless. With an LLM, you have a data consistency problem.

The fix is to move idempotency to the layer above the model. Before invoking the LLM, check a deduplication store keyed by the message's idempotency identifier (SQS provides MessageId natively; for Kafka, you'll construct one from topic + partition + offset). If a result already exists, return the cached value. Only call the model if no result is stored, and persist the output atomically before committing the consumer offset.

This pattern—check, invoke, persist, commit—is non-negotiable for any LLM consumer that produces side effects.
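The pattern can be sketched in a few lines. This is a minimal illustration, assuming an in-memory dict stands in for a durable dedup store (Redis, DynamoDB, a database table) and `call_llm` is a stand-in for your model client; in production the persist step and the offset commit would need to be durable and ordered exactly as described above.

```python
# Check -> invoke -> persist -> commit, sketched with an in-memory store.
dedup_store: dict[str, str] = {}

def idempotency_key(topic: str, partition: int, offset: int) -> str:
    # Kafka has no native MessageId; derive a stable key from coordinates.
    return f"{topic}:{partition}:{offset}"

def process_once(key: str, prompt: str, call_llm) -> str:
    # 1. Check: if a result already exists for this message, reuse it.
    cached = dedup_store.get(key)
    if cached is not None:
        return cached
    # 2. Invoke: only now do we pay for (and risk) a stochastic call.
    result = call_llm(prompt)
    # 3. Persist the output before the caller commits the consumer offset.
    dedup_store[key] = result
    return result
```

On redelivery, the second call returns the cached result instead of re-invoking the model, so the stochastic "Summary A / Summary B" split never reaches the database.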

Ordering Ambiguity Is Worse Than You Think

Kafka guarantees ordering within a partition. Across partitions, all bets are off. For deterministic consumers, this is manageable: you partition by entity key (e.g., user_id), and all events for a given user land on the same partition, processed in sequence.

For AI agents, partition key selection becomes a semantic decision, not just a performance one. If your agent maintains conversational state—tracking what the user said in message 3 to inform message 7—then messages for the same conversation must land on the same partition. Use a stable conversation or session identifier as the partition key.
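The property you are relying on is that identical keys always map to the same partition. Kafka's default partitioner uses murmur2; the hash below is only an illustration of the same invariant, with an assumed partition count:

```python
import hashlib

NUM_PARTITIONS = 12  # illustrative; matches whatever the topic is created with

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Any stable hash of the key gives the ordering guarantee we need:
    # every event with the same conversation ID lands on one partition.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```

With a producer client, this corresponds to publishing every event with `key=conversation_id`, never with a per-message or random key.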

The failure mode when you get this wrong is subtle. Say an agent is generating a multi-step workflow plan. Task events arrive in order: step 1, step 2, step 3. If steps land on different partitions, step 3 might be processed before step 2. The agent generates a plan based on incomplete context. No error is thrown. The pipeline completes successfully. The output is wrong.

This is also why replay semantics deserve explicit testing. Many latent bugs surface during replays, not during steady-state operation. If you replay from an earlier offset to recover from a bug, your agent will reprocess old events. Any agent that isn't idempotent at the task level will produce duplicate side effects during replay—duplicate tool calls, duplicate writes, duplicate emails sent.

Test your replay path in staging before it becomes an incident response scenario in production.

Backpressure Is Consumer Lag, Not HTTP 429s

In an HTTP-based system, backpressure comes from the server: you return 429 when you're at capacity. The client sees the status code and backs off.

In an event-driven system, backpressure works through consumer lag. The broker doesn't know or care how fast your agent processes messages. It just appends events. Consumers pull at their own pace and commit offsets when they're done. The difference between the latest offset and the consumer's committed offset is the lag.

For LLM workloads, lag is the primary health signal. When your agent is processing within SLA, lag stays bounded. When the LLM API slows down—provider latency spike, rate limit, cold-start—consumers fall behind and lag grows. Upstream producers don't need to know this is happening; they just keep publishing. The queue absorbs the burst.

This is actually an architectural advantage over HTTP. You don't need coordinated backpressure signaling across services. The queue buffers naturally.

The problem is that token cost doesn't slow down automatically. A consumer processing messages faster than your LLM rate limit generates 429 errors from the provider and spends time in exponential backoff—but the queue keeps filling. You need token-aware rate limiting at the consumer level, not just request-level rate limiting.

Concretely: before invoking the LLM, check a sliding-window counter that tracks tokens consumed per minute. If you're near the provider's TPM (tokens per minute) limit, pause consumption and wait. Commit offsets only for messages you've fully processed. Let the lag grow temporarily rather than burning tokens on retries.

Consumer Group Design for Agent Workers

Kafka's consumer group model assigns partitions to consumers dynamically. When a new consumer joins the group, the broker rebalances, reassigning partitions. This gives you horizontal scalability: add more agent workers and the broker automatically distributes load.

For LLM consumers, rebalancing has a catch. If your agent is mid-inference when the poll deadline passes (a slow LLM call takes 15 seconds while max.poll.interval.ms allows only 10), the broker treats the consumer as failed and evicts it from the group. The partition gets reassigned to another worker, which picks up the message from the last committed offset, processes it again, and you're back to the stochastic-duplicate problem.

Two mitigations:

First, set max.poll.interval.ms to a value that accounts for worst-case LLM latency. If your 99th-percentile LLM call takes 30 seconds, set the interval to 60 seconds with a margin.

Second, fetch small batches per poll. Instead of fetching 500 messages and processing them serially, fetch 5–10, process them, commit, then fetch more. This keeps the commit cadence high and reduces the window of exposure during rebalancing.
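Both mitigations are plain consumer configuration. A sketch using the standard Kafka consumer property names (the values are illustrative, sized for a 30-second p99 LLM call):

```python
# Consumer settings for the two mitigations above.
consumer_config = {
    "group.id": "agent-workers",
    "enable.auto.commit": False,     # commit only after the result is persisted
    "max.poll.interval.ms": 60_000,  # ~2x the p99 LLM latency, with margin
    "max.poll.records": 5,           # small batches -> frequent commits,
                                     # small exposure window on rebalance
}
```

Note that `max.poll.records` is a Java-client property; other clients expose the same knob under a different name, or you cap the batch size in your own poll loop.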

The SQS equivalent is visibility timeout. If your agent takes 45 seconds to process a message but the visibility timeout is 30 seconds, the message becomes visible again before you're done, and another consumer picks it up. Set visibility timeout to at least twice your maximum expected processing time, and extend it programmatically for long-running tasks.

Dead-Letter Queues for AI Failures

Traditional DLQ handling is binary: a message fails N times, it goes to the dead-letter queue. Operations teams investigate.

AI agent failures are more nuanced. An LLM might:

  • Time out because the provider is slow (transient—safe to retry)
  • Return a malformed JSON response that fails schema validation (may be retried safely with a different temperature)
  • Call a tool with hallucinated parameters that don't exist (data issue—needs human review)
  • Book a flight, get a timeout confirmation, then retry and book another flight (irreversible action—needs compensation)

Treating all of these with uniform retry-then-DLQ logic produces bad outcomes. What's emerging in practice is a two-stage approach: classify the failure before routing to DLQ.

Some teams now use an LLM itself to triage DLQ messages—provide the original message, the error, and the agent's output, and ask for a routing decision: REPROCESS (transient), HUMAN_REVIEW (logic failure), COMPENSATE (irreversible action taken). The classification gets written to an audit log alongside the original message.

This is architecturally sound, but introduces a new dependency: your failure-handling path now also invokes an LLM, which can fail. Ensure your DLQ classifier has hard timeouts and a deterministic fallback (route to HUMAN_REVIEW by default) when the classifier itself fails.
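The hard-timeout-plus-fallback shape looks like this. A sketch, assuming `llm_classify` is a hypothetical callable wrapping your classifier prompt; everything outside it is deterministic:

```python
from concurrent.futures import ThreadPoolExecutor

ROUTES = {"REPROCESS", "HUMAN_REVIEW", "COMPENSATE"}

def triage(message: str, error: str, agent_output: str,
           llm_classify, timeout_s: float = 10.0) -> str:
    """Ask an LLM for a DLQ routing decision, with a hard timeout and a
    deterministic fallback: anything unexpected routes to HUMAN_REVIEW."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(llm_classify, message, error, agent_output)
    try:
        decision = str(future.result(timeout=timeout_s)).strip().upper()
    except Exception:                 # timeout, provider error, anything at all
        return "HUMAN_REVIEW"
    finally:
        pool.shutdown(wait=False)
    # Reject hallucinated routes the classifier might invent.
    return decision if decision in ROUTES else "HUMAN_REVIEW"
```

The key property: no output of the classifier, including silence, garbage, or a crash, can route a message anywhere other than the three known destinations.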

The Specific Failure: Stochastic Transform Poisoning Downstream State

The failure mode that bites teams most in practice combines all of the above: a stochastic agent transform, at-least-once delivery, and a downstream system that can't distinguish between duplicate processing and a legitimate second event.

Here's the full scenario:

  1. An agent consumes a product review and generates tags: ["helpful", "shipping"]. These are written to a search index.
  2. The consumer crashes before committing its offset. The broker redelivers the message.
  3. The agent processes again and generates different tags: ["detailed", "packaging"].
  4. Both sets of tags are now in the index. The product shows up in searches for all four terms, none of which accurately represent the review.

The search index has no way to know that the second write was a duplicate—because at the data level, it wasn't. It was a different output from a different model invocation.

The canonical fix is a three-step write protocol:

  • Before invoking the LLM, write a "processing" sentinel to the dedup store with the message's idempotency key.
  • After invoking the LLM, write the result with a compare-and-swap: only update if the key is still in "processing" state.
  • On retry, find the sentinel in "processing" state and either wait (if the original invocation is still running) or treat it as an abandoned invocation and restart from scratch—but with the understanding that the downstream write must be idempotent (upsert, not append).

The underlying invariant: your downstream writes need to be idempotent regardless of which LLM output they come from. For search indexes, that means overwrite-by-document-id, not append. For databases, use upsert semantics. For emails and irreversible actions, your idempotency check must happen before the LLM is invoked—if the action is already recorded as taken, return the recorded result without calling the model again.
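The three-step protocol above can be sketched end to end. This is a single-process illustration with an in-memory dict as the dedup store; a real deployment would get the same compare-and-swap semantics from Redis `SET NX` or a DynamoDB conditional write, and would add a lease expiry for abandoned invocations. `call_llm` and `upsert` are stand-ins:

```python
PROCESSING = object()                 # the "processing" sentinel
store: dict[str, object] = {}

def claim(key: str) -> bool:
    # Step 1: write the sentinel; False means the key is already claimed.
    if key in store:
        return False
    store[key] = PROCESSING
    return True

def commit_result(key: str, result: str) -> bool:
    # Step 2: compare-and-swap -- record only if still in "processing" state.
    if store.get(key) is PROCESSING:
        store[key] = result
        return True
    return False

def handle(key: str, prompt: str, call_llm, upsert) -> str:
    if claim(key):
        result = call_llm(prompt)
        commit_result(key, result)
    elif store.get(key) is PROCESSING:
        # Step 3: original invocation in flight (or abandoned); a real
        # implementation waits, or takes over after a lease expiry.
        raise RuntimeError("in flight; retry later")
    else:
        result = store[key]           # completed earlier; reuse the output
    upsert(key, result)               # downstream write: upsert, never append
    return result
```

Replaying the same message now overwrites the same document with the same tags instead of appending a second, different set.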

Kafka vs. SQS: What the Choice Implies for Agents

The queue technology shapes the architectural patterns available to you.

Kafka retains messages for configurable durations (hours to weeks) and supports arbitrary replay from any offset. Multiple consumer groups can read the same event independently. This is the right substrate for agents that need event sourcing semantics—where the full history of events is the authoritative state, and agents can reconstruct their working context by replaying from a known-good offset.

The cost is operational complexity: partition management, offset tracking, consumer group rebalancing, and replay testing. Kafka also doesn't support per-message visibility timeout or automatic retry—you build that yourself or delegate to a framework.

SQS is simpler. Delivery is at-least-once: a message goes to one consumer at a time and reappears if processing exceeds the visibility timeout, which handles in-flight exclusion, and DLQ routing is built in. For task queue semantics—where one agent picks up one job, processes it, and marks it done—SQS is the right tool. The trade-off is no replay and no fan-out (unless you pair SQS with SNS).

A common pattern: use Kafka for event streaming (many agents need to react to the same business event), and SQS for task dispatch (an agent needs to execute a specific unit of work exactly once). Kafka handles observation; SQS handles execution.

What to Audit Before Moving an Agent to a Queue

If you're migrating an existing HTTP-based agent to an event-driven model, these are the questions to answer before the migration:

  • Does your agent produce side effects? If yes, you need an idempotency check before every LLM invocation and idempotent downstream writes.
  • Does your agent maintain state across invocations? If yes, you need stable partition keys and an explicit state store that survives consumer rebalancing.
  • What's your worst-case LLM latency? Your consumer poll interval and queue visibility timeout must exceed it with margin.
  • Have you tested replay? Deliberately replay events from an earlier offset in staging. Verify the agent produces consistent, non-duplicated results.
  • Do you have token-aware rate limiting? Per-minute request counts are insufficient. Measure TPM against your provider's limit before deploying at scale.

Event-driven architectures give AI agents the decoupling and scalability that synchronous HTTP can't. The queue absorbs traffic spikes, enables fan-out to multiple agent consumers, and provides a durable audit trail of every event an agent processed. But those benefits only materialize if the agent is designed for the delivery semantics of the queue—not the semantics of an HTTP server that happens to be reading from a topic.

The difference between a well-designed event-driven agent and a fragile one often comes down to a single design decision: whether idempotency is enforced at the model invocation boundary, not just at the message processing boundary.
