
When Your AI Agent Consumes from Kafka: The Design Assumptions That Break

· 11 min read
Tian Pan
Software Engineer

The standard mental model for AI agents assumes HTTP: a client sends a request, the agent processes it, returns a response. Clean, synchronous, easy to reason about. When an LLM-powered function fails, you get an error code. When it succeeds, you move on.

Once you swap that HTTP interface for a Kafka topic or SQS queue, every one of those assumptions starts to crack. The queue guarantees at-least-once delivery. Your agent is stochastic. That combination produces failure modes that don't exist in deterministic systems—and the fixes aren't the same ones that work for traditional microservices.

This post covers what actually changes when AI agents consume from message queues: idempotency, ordering, backpressure, dead-letter handling, and the specific failure mode where a replayed message triggers different agent behavior the second time around.

The Problem with "At-Least-Once" and a Stochastic Consumer

In classical distributed systems, at-least-once delivery is manageable because deterministic functions are idempotent. If your function computes hash(message), running it twice on the same message produces the same output. The duplicate is harmless.

LLMs are not deterministic. Even with temperature=0, floating-point non-determinism across hardware configurations, batching effects, and provider-side implementation details can produce different outputs on repeated invocations of the same prompt. At non-zero temperatures, divergence is a design feature, not a bug.

This breaks the idempotency assumption that queue-based systems are built on. Consider a concrete scenario:

  1. Your agent consumes a message, calls the LLM, and writes "Summary A" to the database.
  2. Before committing its offset, the consumer crashes.
  3. The broker redelivers the message.
  4. The agent calls the LLM again and produces "Summary B."
  5. You now have two summaries for one message, from the same source, and no way to know which is authoritative.

With a deterministic function, both runs would have written the same value. The duplicate write would be harmless. With an LLM, you have a data consistency problem.

The fix is to move idempotency to the layer above the model. Before invoking the LLM, check a deduplication store keyed by the message's idempotency identifier (SQS provides MessageId natively; for Kafka, you'll construct one from topic + partition + offset). If a result already exists, return the cached value. Only call the model if no result is stored, and persist the output atomically before committing the consumer offset.

This pattern—check, invoke, persist, commit—is non-negotiable for any LLM consumer that produces side effects.
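A minimal sketch of the check-invoke-persist-commit pattern, using an in-memory dict as a stand-in for a real deduplication store (e.g., Redis or a database table); `call_llm` and `commit_offset` are hypothetical callables you would wire to your provider client and consumer:

```python
# In-memory stand-in for a durable dedup store (Redis, Postgres, etc.).
dedup_store: dict[str, str] = {}  # idempotency key -> persisted result


def idempotency_key(topic: str, partition: int, offset: int) -> str:
    # For Kafka, build the key from the message's coordinates;
    # for SQS, use the native MessageId instead.
    return f"{topic}:{partition}:{offset}"


def process_message(topic, partition, offset, payload, call_llm, commit_offset):
    key = idempotency_key(topic, partition, offset)

    # 1. Check: if a result already exists, skip the model entirely.
    if key in dedup_store:
        commit_offset(topic, partition, offset)
        return dedup_store[key]

    # 2. Invoke: only call the model when no cached result exists.
    result = call_llm(payload)

    # 3. Persist: store the output before committing the offset, so a
    #    crash between these two steps causes a redelivery that hits the
    #    cache instead of producing a second, different summary.
    dedup_store[key] = result

    # 4. Commit: acknowledge the message only after the result is durable.
    commit_offset(topic, partition, offset)
    return result
```

In the crash scenario above, the redelivered message finds "Summary A" already persisted under its key and never reaches the model, so "Summary B" is never generated.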

Ordering Ambiguity Is Worse Than You Think

Kafka guarantees ordering within a partition. Across partitions, all bets are off. For deterministic consumers, this is manageable: you partition by entity key (e.g., user_id), and all events for a given user land on the same partition, processed in sequence.

For AI agents, partition key selection becomes a semantic decision, not just a performance one. If your agent maintains conversational state—tracking what the user said in message 3 to inform message 7—then messages for the same conversation must land on the same partition. Use a stable conversation or session identifier as the partition key.
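A toy illustration of why a stable key gives per-conversation ordering; the hash here is a stand-in for Kafka's default partitioner (which applies murmur2 to the key bytes), and the partition count is an assumption:

```python
import hashlib

NUM_PARTITIONS = 12  # assumed partition count for the topic


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Stand-in for Kafka's default partitioner: any stable hash of the
    # key gives the same property — identical keys, identical partition.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```

Because the key is stable, every event keyed by the same conversation ID maps to the same partition, so one consumer sees that conversation's events in order. Key by something volatile (a per-message UUID, a timestamp) and ordering evaporates.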

The failure mode when you get this wrong is subtle. Say an agent is generating a multi-step workflow plan. Task events arrive in order: step 1, step 2, step 3. If steps land on different partitions, step 3 might be processed before step 2. The agent generates a plan based on incomplete context. No error is thrown. The pipeline completes successfully. The output is wrong.

This is also why replay semantics deserve explicit testing. Most Kafka issues surface during replays, not during steady-state operation. If you replay from an earlier offset to recover from a bug, your agent will reprocess old events. Any agent that isn't idempotent at the task level will produce duplicate side effects during replay—duplicate tool calls, duplicate writes, duplicate emails sent.

Test your replay path in staging before it becomes an incident response scenario in production.

Backpressure Is Consumer Lag, Not HTTP 429s

In an HTTP-based system, backpressure comes from the server: you return 429 when you're at capacity. The client sees the status code and backs off.

In an event-driven system, backpressure works through consumer lag. The broker doesn't know or care how fast your agent processes messages. It just appends events. Consumers pull at their own pace and commit offsets when they're done. The difference between the latest offset and the consumer's committed offset is the lag.

For LLM workloads, lag is the primary health signal. When your agent is processing within SLA, lag stays bounded. When the LLM API slows down—provider latency spike, rate limit, cold-start—consumers fall behind and lag grows. Upstream producers don't need to know this is happening; they just keep publishing. The queue absorbs the burst.

This is actually an architectural advantage over HTTP. You don't need coordinated backpressure signaling across services. The queue buffers naturally.

The catch is that token spend doesn't throttle itself. A consumer that pulls messages faster than your LLM rate limit allows generates 429 errors from the provider and burns time in exponential backoff—while the queue keeps filling. You need token-aware rate limiting at the consumer level, not just request-level rate limiting.

Concretely: before invoking the LLM, check a sliding-window counter that tracks tokens consumed per minute. If you're near the provider's TPM (tokens per minute) limit, pause consumption and wait. Commit offsets only for messages you've fully processed. Let the lag grow temporarily rather than burning tokens on retries.
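A sketch of that sliding-window counter; the TPM limit and the idea of pre-estimating a request's token count are assumptions to adapt to your provider and tokenizer:

```python
import time
from collections import deque
from typing import Optional


class TokenWindow:
    """Tracks tokens consumed over a sliding 60-second window."""

    def __init__(self, tpm_limit: int):
        self.tpm_limit = tpm_limit
        self.events: deque = deque()  # (timestamp, tokens) pairs

    def _prune(self, now: float) -> None:
        # Drop token events older than the one-minute window.
        while self.events and now - self.events[0][0] > 60:
            self.events.popleft()

    def used(self, now: Optional[float] = None) -> int:
        now = time.monotonic() if now is None else now
        self._prune(now)
        return sum(tokens for _, tokens in self.events)

    def try_acquire(self, tokens: int, now: Optional[float] = None) -> bool:
        # Returns False when the estimated request would exceed the TPM
        # limit; the caller should pause consumption (let lag grow)
        # rather than invoke the model and eat a 429.
        now = time.monotonic() if now is None else now
        if self.used(now) + tokens > self.tpm_limit:
            return False
        self.events.append((now, tokens))
        return True
```

The consumer loop checks `try_acquire(estimated_tokens)` before each LLM call and sleeps when it returns `False`—lag grows, but no tokens are wasted on doomed requests.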

Consumer Group Design for Agent Workers

Kafka's consumer group model assigns partitions to consumers dynamically. When a new consumer joins the group, the broker rebalances, reassigning partitions. This gives you horizontal scalability: add more agent workers and the broker automatically distributes load.

For LLM consumers, rebalancing has a catch. Modern clients send heartbeats from a background thread, so what evicts a slow consumer is the gap between polls: if your agent spends too long mid-inference—say a run of slow LLM calls keeps it from polling for longer than max.poll.interval.ms—the broker considers the consumer failed and removes it from the group. Its partitions are reassigned to another worker, which picks up from the last committed offset, processes the message again, and you're back to the stochastic-duplicate problem.

Two mitigations:

First, set max.poll.interval.ms to a value that accounts for worst-case LLM latency. If your 99th-percentile LLM call takes 30 seconds, set the interval to 60 seconds with a margin.

Second, fetch small batches per poll. Instead of fetching 500 messages and processing them serially, fetch 5–10, process them, commit, then fetch more. This keeps the commit cadence high and reduces the window of exposure during rebalancing.
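Both mitigations can be expressed as consumer configuration plus loop structure. A sketch using confluent-kafka-style config keys; the broker address, group name, and timeout values are illustrative assumptions:

```python
# Illustrative settings for slow LLM consumers (confluent-kafka config
# key names); adapt the values to your own latency profile.
consumer_config = {
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "group.id": "agent-workers",            # assumed consumer group
    # Mitigation 1: allow up to 60s between polls so a p99 ~30s LLM
    # call doesn't get the consumer evicted and its partitions reassigned.
    "max.poll.interval.ms": 60_000,
    # Commit explicitly after each small processed batch, not on a
    # background auto-commit timer.
    "enable.auto.commit": False,
}

# Mitigation 2: fetch small batches — e.g., pull 5-10 messages, process
# them through the idempotency check, commit, then fetch more. This keeps
# the commit cadence high and shrinks the replay window after a rebalance.
BATCH_SIZE = 10
```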
