
Feature Interaction Failures in AI Systems: When Two Working Pieces Break Together

· 10 min read
Tian Pan
Software Engineer

Your streaming works. Your retry logic works. Your safety filter works. Your personalization works. Deploy them together, and something strange happens: a rate-limit error mid-stream leaves the user staring at a truncated response that the system records as a success. The retry mechanism fires, but the stream is already gone. The personalization layer serves a customized response that the safety filter would have blocked — except the filter saw a sanitized version of the prompt, not the one the personalization layer acted on.

Each feature passed every test you wrote. The system failed the user anyway.

This is a feature interaction failure, and interaction failures are the most underdiagnosed class of production bug in AI systems today.

Why Individual Tests Don't Find Interaction Failures

The standard mental model for testing an AI system looks like this: write tests for each capability, make sure each one passes, ship. This model works reasonably well when capabilities are independent. It breaks down badly when capabilities share state, execute in sequence, or make implicit assumptions about what the other has already done.

The math is unforgiving. Five components each with 99% reliability produce a system that is only 95% reliable. Add five more, and you're at 90%. But reliability isn't the real problem — silent failure is. In a conventional software system, a 5% failure rate shows up in your error logs. In a composed AI system, that 5% often shows up as outputs that look correct: coherent sentences, valid JSON, zero exceptions thrown. The system believes it succeeded. The user received something wrong.
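The compounding is easy to verify yourself. A two-line calculation, assuming independent failures across components:

```python
# Independent components, each 99% reliable: probability the whole chain succeeds.
def system_reliability(per_component: float, n_components: int) -> float:
    return per_component ** n_components

print(round(system_reliability(0.99, 5), 3))   # 0.951 -> roughly 95% end to end
print(round(system_reliability(0.99, 10), 3))  # 0.904 -> roughly 90% end to end
```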

Research on multi-agent systems has found that naive compositions produce roughly 17 times more errors than single-agent systems, most of them invisible at the component level. The 42% of failures classified as "specification failures" — where the system's behavior diverged from intent without any component technically malfunctioning — are nearly all interaction failures in disguise.

Three Collision Patterns That Bite Production Teams

The catalog of interaction failures is long, but three patterns recur often enough that every team building a composed AI system will eventually encounter at least one.

Streaming + Retry Logic

Streaming and retries make orthogonal assumptions about how a request completes. Retry logic assumes you can re-execute a request if it fails. Streaming assumes that partial output has already been delivered to the client.

The collision happens at the seam. A rate-limit error at token 400 of a 1000-token response means partial content is already displayed in the user's browser. The retry mechanism sees a 429 and re-executes the full request — correctly — but the client now receives a duplicate beginning appended to a truncated first response. Or the retry mechanism doesn't fire at all because the request technically returned 200 before the error surfaced in the stream.

This failure mode is invisible in unit tests because unit tests for retry logic mock a request that fails before any tokens are returned. The streaming integration test verifies that tokens arrive. Neither test constructs the scenario where an error occurs mid-stream during a live retry cycle.
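To make the gap concrete, here is a minimal sketch of the missing scenario, with a fake async generator standing in for a real model client (the names are illustrative, not from any particular SDK):

```python
import asyncio

class RateLimitError(Exception):
    """Stands in for a 429 that surfaces mid-stream."""

async def fake_stream(n_tokens: int, fail_at: int):
    """Simulates a model stream that dies partway through the response."""
    for i in range(n_tokens):
        if i == fail_at:
            raise RateLimitError("rate limited mid-stream")
        yield f"token-{i} "
        await asyncio.sleep(0)

async def naive_handler() -> list[str]:
    """Forwards tokens straight to the client. By the time the error fires,
    partial output is already on the user's screen, so a blind retry of the
    full request would duplicate everything delivered so far."""
    delivered: list[str] = []
    try:
        async for token in fake_stream(n_tokens=1000, fail_at=400):
            delivered.append(token)  # already rendered client-side
    except RateLimitError:
        pass  # a naive retry would re-execute from token 0 right here
    return delivered

print(len(asyncio.run(naive_handler())))  # 400 tokens shown before the "failure"
```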

The fix is architectural: decouple generation from the client connection. The LLM call lives in a service layer that can independently retry; the client connection subscribes to the output queue. This means the retry mechanism and the streaming mechanism never touch the same state.
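A minimal sketch of that decoupling, with an in-process asyncio.Queue standing in for whatever buffer or pub/sub layer you actually run, and a fake flaky_model in place of a real client (all names are illustrative):

```python
import asyncio

class RateLimitError(Exception):
    pass

async def flaky_model(prompt: str, attempt: int):
    """Fake model call: the first attempt dies mid-stream, the retry succeeds."""
    for i in range(20):
        if attempt == 0 and i == 8:
            raise RateLimitError("rate limited mid-stream")
        yield f"tok{i} "

async def generate_with_retry(prompt: str, queue: asyncio.Queue, max_attempts: int = 3) -> None:
    """Generation layer: owns the model call and the retry policy. It only
    publishes tokens from an attempt that ran to completion, so a failed
    attempt never leaks partial output to the client."""
    for attempt in range(max_attempts):
        buffered: list[str] = []
        try:
            async for tok in flaky_model(prompt, attempt):
                buffered.append(tok)                 # held back until the attempt completes
            break
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise
            await asyncio.sleep(0.1 * 2 ** attempt)  # back off, then re-execute cleanly
    for tok in buffered:
        await queue.put(tok)
    await queue.put(None)                            # sentinel: stream complete

async def stream_to_client(queue: asyncio.Queue) -> None:
    """Client connection layer: subscribes to the queue and knows nothing about
    retries; it only ever sees one coherent stream."""
    while (tok := await queue.get()) is not None:
        print(tok, end="")
    print()

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    await asyncio.gather(generate_with_retry("hello", queue), stream_to_client(queue))

asyncio.run(main())
```

Buffering a whole attempt trades away time-to-first-token; a production version would publish tokens as they arrive and, on retry, resume from the last offset the client acknowledged. The invariant is the same either way: the retry loop and the client connection never share state.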

Caching + Freshness Requirements

Caching improves latency and reduces cost. Freshness requirements mean some data must reflect the current state of the world. These two goals are in direct tension, and they interact badly when neither feature knows the other exists.

A support bot with semantic caching answers "what's the status of order #12345?" from a cache entry that was valid 47 minutes ago. The order shipped in the interim. The response is coherent, specific, and wrong. No exception is thrown. The accuracy metric doesn't catch it because it measures whether the response is well-formed, not whether the underlying data is current.

The failure is worse in multi-hop scenarios. An agent retrieves a cached user preference ("ship to office address"), then retrieves a cached address record, then generates a confirmation. Each cached lookup looks correct independently. The user changed their address this morning and is now waiting for a package that will arrive at the wrong building.

RAG systems face a related variant: vector indexes that were accurate at embedding time silently become stale as source documents change. The approximate nearest neighbor retrieval has no concept of freshness — it finds semantically similar documents, not currently accurate ones. Teams building real-time applications discover their nightly re-indexing schedule creates a 24-hour staleness gap only when a user asks about something that changed yesterday.
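A cheap guard is to record when each chunk was embedded and compare that to the source document's last-modified time at query time. A rough sketch, assuming your document store can answer a last-modified lookup by doc_id (both functions here are hypothetical):

```python
def split_by_freshness(retrieved_chunks, get_last_modified):
    """Separate retrieved chunks whose source changed after they were embedded.

    Each chunk is a dict with 'doc_id', 'text', and 'indexed_at' (a datetime);
    get_last_modified(doc_id) returns the datetime of the source's latest change.
    """
    fresh, stale = [], []
    for chunk in retrieved_chunks:
        if get_last_modified(chunk["doc_id"]) > chunk["indexed_at"]:
            stale.append(chunk)   # re-fetch the source or answer without it
        else:
            fresh.append(chunk)
    return fresh, stale
```

Whether you re-embed on the fly or just fall back to a live lookup for the stale hits, the point is that retrieval results carry enough metadata for a freshness decision to be made at all.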

The interaction audit here asks one question of every cache layer: what happens when a user's query depends on data that was cached before a state-changing event? Enumerate those events (a record was updated, a policy changed, a price changed), then trace every feature that touches those records.
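One way to operationalize that audit is to tag every cache entry with the entity keys its answer depends on, and evict matching entries when one of those events fires. A rough sketch, assuming you can hook whatever event stream carries the updates (the class and method names are made up for illustration):

```python
import time
from collections import defaultdict

class EntityAwareCache:
    """Cache whose entries are invalidated by entity-level events, not just TTL.
    Each entry records which entities (order IDs, addresses, policies) its
    answer depends on, so an update event evicts exactly the affected queries."""

    def __init__(self, ttl_seconds: float = 3600.0):
        self.ttl = ttl_seconds
        self.entries = {}                  # query -> (response, created_at, entity_keys)
        self.by_entity = defaultdict(set)  # entity_key -> queries depending on it

    def put(self, query: str, response: str, entity_keys: set[str]) -> None:
        self.entries[query] = (response, time.time(), entity_keys)
        for key in entity_keys:
            self.by_entity[key].add(query)

    def get(self, query: str):
        hit = self.entries.get(query)
        if hit is None:
            return None
        response, created_at, _ = hit
        return response if time.time() - created_at <= self.ttl else None

    def on_entity_updated(self, entity_key: str) -> None:
        """Call from the 'order shipped' / 'address changed' event handler."""
        for query in self.by_entity.pop(entity_key, set()):
            self.entries.pop(query, None)

cache = EntityAwareCache()
cache.put("status of order #12345?", "Order #12345 is being packed.", {"order:12345"})
cache.on_entity_updated("order:12345")               # the order shipped
assert cache.get("status of order #12345?") is None  # forces a fresh lookup
```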

Personalization + Safety Filtering

Personalization and safety filtering are usually implemented as separate middleware layers, and each works correctly in isolation. Personalization customizes responses based on user history or stated preferences. Safety filtering blocks responses that violate policy.

The collision occurs when the two layers operate on different representations of the same request. A personalization layer might transform a request ("answer like a pirate, be casual") before passing it to the model. The safety filter may evaluate the original request, the transformed request, or the final response — and in many implementations, it evaluates whichever representation is cheapest to inspect.

The result: a user's personalization preferences can inadvertently route around safety filters that were evaluated at the wrong point in the pipeline. Or, in the opposite direction, a safety filter evaluates a request that personalization has modified and blocks a response that would have been appropriate for the original intent.
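The defensive pattern is to make explicit which representation the filter evaluates: check the final response the user will actually see, and, if budget allows, the untransformed request as well. A rough sketch with placeholder personalize, generate, and is_safe callables, not any particular moderation API:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class PipelineResult:
    response: str
    blocked: bool
    blocked_at: Optional[str] = None   # "request" or "response"

def handle_request(
    original_prompt: str,
    personalize: Callable[[str], str],
    generate: Callable[[str], str],
    is_safe: Callable[[str], bool],
) -> PipelineResult:
    """Safety is evaluated on the untransformed request and on the final
    personalized response, so personalization can neither route around the
    filter nor get a benign request blocked on its behalf."""
    if not is_safe(original_prompt):
        return PipelineResult(response="", blocked=True, blocked_at="request")

    personalized_prompt = personalize(original_prompt)  # e.g. "answer like a pirate"
    response = generate(personalized_prompt)

    if not is_safe(response):  # check what the user will actually see
        return PipelineResult(response="", blocked=True, blocked_at="response")
    return PipelineResult(response=response, blocked=False)
```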
