AI Feature Dependency Graphs: Resilience Engineering When Your Services Share a Model
Your embedding model goes down at 3 PM on a Tuesday. Within thirty seconds, your support chat stops answering questions, your personalized recommendation engine starts returning empty results, your document search returns nothing, and your onboarding assistant stops working. Your on-call engineer opens the incident channel and sees fifteen simultaneous alerts from features that have no visible relationship to each other. There is no stack trace pointing to the root cause. It looks like a distributed systems outage — but it isn't. It's a single shared dependency failing, and you didn't know fifteen features shared it.
This is the AI feature dependency problem: the infrastructure layer underneath your product features is deeply interconnected, but your architecture diagrams show each feature as an isolated box. When the coupling is invisible, failure propagation is invisible too — until it isn't.
The Hidden Coupling Problem
Traditional distributed systems have well-understood coupling patterns. Microservices call each other via HTTP or message queues, and service maps make the relationships explicit. When service A depends on service B, the dependency is in the code and the config.
AI features couple through infrastructure, not through code. Your search feature, your chat assistant, and your content recommendations might have no shared code — separate repos, separate teams, separate deploys. But if they all call the same embedding model endpoint to vectorize queries, they are coupled. When the embedding provider degrades, all three degrade together.
The coupling is compounded by how AI features are assembled. A typical AI feature stack looks like this: user input → embedding model → vector store → reranker → LLM → response. Fifteen product features sharing this stack means fifteen failure blast radii converging on the same points: one embedding endpoint, one vector database cluster, one LLM provider. An incident at any of those points propagates to all fifteen features simultaneously.
What makes this worse is the economics. LLM requests are expensive and slow. When a provider degrades, retry logic kicks in automatically — and each retry replays multi-thousand-token requests. In traditional systems, retries add load to a struggling service. In AI systems, retries also burn your monthly token budget while producing nothing. Cascading failures in AI systems cost money, not just uptime.
Mapping What You Own
The first step to managing AI dependencies is making them visible. This is harder than it sounds because the dependencies live in configuration and runtime state, not in call graphs.
Start by inventorying every AI component your product uses: embedding model providers and versions, vector store clusters, LLM providers, model-specific API keys, and reranking services. Then map which product features touch each component.
The result is an AI feature dependency graph. At the center are the AI components; at the edges are product features. A well-drawn graph will immediately reveal your highest-risk components — the nodes with the most feature connections. Those are your single points of failure.
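One lightweight way to keep that graph usable is to store it as plain data next to your runbooks and query it during incidents. A minimal sketch in Python; the component and feature names are invented for illustration, not a prescribed schema:

```python
from collections import defaultdict

# Map each shared AI component to the product features that depend on it.
# Names are illustrative placeholders, not a required naming convention.
DEPENDENCIES = {
    "embedding:text-embedding-v3": ["semantic_search", "support_chat", "recommendations", "onboarding_assistant"],
    "vectorstore:prod-cluster-1":  ["semantic_search", "support_chat"],
    "llm:provider-a/large":        ["support_chat", "doc_summarization", "onboarding_assistant"],
    "reranker:cross-encoder-v2":   ["semantic_search"],
}

def blast_radius(deps: dict[str, list[str]]) -> list[tuple[str, int]]:
    """Rank components by how many features fail if they fail."""
    return sorted(((c, len(f)) for c, f in deps.items()), key=lambda x: -x[1])

def features_for(deps: dict[str, list[str]], component: str) -> list[str]:
    """During an incident: which features should be degrading right now?"""
    return deps.get(component, [])

if __name__ == "__main__":
    for component, count in blast_radius(DEPENDENCIES):
        print(f"{component}: {count} dependent features")
```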
The graph will likely surprise you. Features owned by different teams, never previously considered related, will turn out to share infrastructure components. A helpdesk summarization feature and a search ranking feature probably have no code in common but may depend on the same embedding model endpoint.
Keep the graph current. The most common error teams make is treating the dependency graph as a one-time architecture exercise. As features are added and providers change, the graph becomes stale and the visibility it provides disappears. Treat it as living documentation, updated whenever a new feature is added or a provider changes.
Circuit Breakers at the Right Layer
Circuit breakers are the standard pattern for preventing cascading failures in distributed systems. The core mechanism is simple: track failure rates on calls to a downstream dependency; when failures exceed a threshold, stop sending traffic to that dependency and return failures immediately until the dependency recovers.
Applying circuit breakers to AI infrastructure requires identifying the right layer for each breaker. There are three distinct layers in a typical AI feature stack, and they need separate circuit breakers.
The inference layer handles calls to the LLM. Open the circuit when error rates exceed five percent or when P95 latency exceeds your SLA, typically around thirty seconds. A key difference from traditional circuit breakers: track token consumption on failed requests. When the inference circuit is open, you want to fail fast specifically to avoid token waste. Per-model breakers are better than per-provider breakers; different models from the same provider can have independent failure characteristics.
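A sketch of what such a per-model breaker might look like, using the thresholds above as illustrative defaults. The class, window size, and cooldown are assumptions for the sake of the example, not any library's API:

```python
import time

class InferenceBreaker:
    """Per-model circuit breaker: opens on error rate or P95 latency breach."""

    def __init__(self, max_error_rate=0.05, p95_sla_s=30.0, window=50, cooldown_s=60.0):
        self.max_error_rate = max_error_rate
        self.p95_sla_s = p95_sla_s
        self.window = window            # number of recent calls considered
        self.cooldown_s = cooldown_s    # how long to stay open before retrying
        self.calls = []                 # list of (success, latency_s)
        self.wasted_tokens = 0          # tokens burned by failed requests
        self.opened_at = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at < self.cooldown_s:
            return False                # open: fail fast instead of burning tokens
        self.opened_at = None           # cooldown elapsed: let traffic through again
        self.calls = []
        return True

    def record(self, success: bool, latency_s: float, prompt_tokens: int = 0) -> None:
        self.calls = (self.calls + [(success, latency_s)])[-self.window:]
        if not success:
            self.wasted_tokens += prompt_tokens   # failed calls still consume budget
        if len(self.calls) < self.window:
            return
        error_rate = sum(1 for ok, _ in self.calls if not ok) / len(self.calls)
        p95 = sorted(lat for _, lat in self.calls)[int(0.95 * (len(self.calls) - 1))]
        if error_rate > self.max_error_rate or p95 > self.p95_sla_s:
            self.opened_at = time.monotonic()     # trip the breaker
```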
The retrieval layer handles calls to vector stores. This layer has different failure modes: timeouts, stale index errors, and quality degradation (returning low-relevance results). Open the retrieval circuit independently from inference. A retrieval timeout should not block an inference call — the LLM can often produce a useful response with base knowledge when retrieval fails. Treat retrieval and inference as independently breakable.
The embedding layer handles calls to the embedding model. Embedding failures are particularly dangerous because they affect both ingestion (adding new content) and query-time retrieval. When the embedding circuit opens, cached embeddings from recent queries can continue serving most retrieval traffic. The circuit should open quickly because embedding failures are the root cause that takes down both retrieval and any downstream inference that depends on retrieval.
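A hedged sketch of that cached-embedding fallback: the `breaker` object follows the interface from the inference sketch above, and `client.embed` is a hypothetical stand-in for your embedding provider's SDK, not a real API.

```python
import hashlib

embedding_cache: dict[str, list[float]] = {}   # query hash -> embedding from recent traffic

def embed_query(text: str, breaker, client) -> list[float] | None:
    """Use the live embedding model when the circuit is closed; otherwise fall back
    to cached embeddings from recent identical queries. Returning None tells the
    caller to skip semantic retrieval and degrade (for example, to keyword search)."""
    key = hashlib.sha256(text.encode()).hexdigest()
    if breaker.allow_request():
        try:
            vector = client.embed(text)               # hypothetical embedding client call
            breaker.record(success=True, latency_s=0.0)
            embedding_cache[key] = vector
            return vector
        except Exception:
            breaker.record(success=False, latency_s=0.0)
    return embedding_cache.get(key)                   # None -> caller should degrade
```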
The half-open state matters more for AI systems than for traditional services. When the open-circuit timeout expires and the breaker enters the half-open state, let through only one or two test requests before fully closing. LLM providers in partial degradation can return successful responses intermittently; a breaker that closes prematurely under partial failure will just keep flapping open again.
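A small sketch of that probing logic, kept separate from the breaker itself. The probe budget of two is an example, not a recommendation from any particular library:

```python
class HalfOpenGate:
    """Gate in front of the half-open state: after the open-circuit cooldown,
    admit at most `probe_budget` test requests; close only if all of them succeed."""

    def __init__(self, probe_budget: int = 2):
        self.probe_budget = probe_budget
        self.in_flight = 0
        self.successes = 0

    def admit(self) -> bool:
        if self.in_flight >= self.probe_budget:
            return False              # enough probes out; everything else keeps failing fast
        self.in_flight += 1
        return True

    def report(self, success: bool) -> str:
        """Return the breaker's next state: 'closed', 'open', or 'half_open'."""
        if not success:
            return "open"             # any failed probe re-opens immediately
        self.successes += 1
        if self.successes >= self.probe_budget:
            return "closed"           # all probes succeeded: safe to fully close
        return "half_open"
```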
Graceful Degradation: Fail Features, Not Outputs
The principle that separates well-designed AI systems from fragile ones: when a dependency fails, fail explicitly rather than corrupt silently.
A search feature that returns no results is a failed feature. A search feature that returns wrong results confidently is corrupted output. Users can recover from a failed feature by trying again later or switching to a manual workflow. Corrupted output causes them to make decisions based on wrong information, and they won't know to correct it.
Design a degradation chain for each AI feature. The chain has multiple levels, and each level degrades capability while preserving output integrity.
For a RAG-backed chat assistant, the chain might look like this: at the top, full retrieval plus inference with complete context. One level down, skip retrieval and run inference with base model knowledge only — appropriate when the vector store is unavailable. Another level down, return a static fallback response acknowledging the capability is temporarily reduced. At the bottom, disable the feature entirely with a clear user-visible message.
For a search feature: full semantic search at the top, keyword search as a fallback when the embedding service is down, static popular results as a further fallback, and feature disable at the bottom.
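Both chains reduce to the same shape: an ordered list of levels, each of which either produces output or falls through to the next. A minimal sketch, with placeholder handlers standing in for real retrieval and inference calls:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Level:
    name: str
    handler: Callable[[str], Optional[str]]   # return None (or raise) to fall through
    degraded: bool                            # if True, the UI must say capability is reduced

def answer(question: str, chain: list[Level]) -> tuple[str, str, bool]:
    """Walk the degradation chain top to bottom; return (level_name, text, degraded)."""
    for level in chain:
        try:
            result = level.handler(question)
        except Exception:
            result = None                     # a failing level just falls to the next one
        if result is not None:
            return level.name, result, level.degraded
    return "disabled", "This feature is temporarily unavailable.", True

# Placeholder handlers standing in for real retrieval + inference calls.
def rag_answer(q: str) -> Optional[str]:
    raise TimeoutError("vector store unavailable")      # simulate a retrieval outage

def base_model_answer(q: str) -> Optional[str]:
    return f"(answered without retrieval) {q}"

chain = [
    Level("rag",        rag_answer,        degraded=False),
    Level("base_model", base_model_answer, degraded=True),
    Level("static",     lambda q: "Answers are temporarily limited.", degraded=True),
]

print(answer("How do I reset my API key?", chain))
# -> ('base_model', '(answered without retrieval) How do I reset my API key?', True)
```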
The critical rule: never let a fallback level return confident output when it cannot guarantee quality. If inference is running with incomplete context because retrieval failed, either surface that uncertainty to the user or disable the feature. The moment you return an answer that looks complete but is actually missing information, you've converted a failure into a corruption.
Treating degradation paths as first-class concerns means instrumenting them. Count how often each fallback level activates. Alert when fallback usage is elevated. High fallback usage is an early signal that something is wrong at the dependency level — often before the primary circuit opens.
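The instrumentation can start as simply as counting activations per level and alerting on the ratio; the five percent threshold below is illustrative, and the level names match the chain sketched above.

```python
from collections import Counter

fallback_counts = Counter()          # level name -> activations in the current window
TOTAL = "total_requests"

def record_level(level_name: str) -> None:
    fallback_counts[TOTAL] += 1
    if level_name != "rag":          # anything below the top level counts as a fallback
        fallback_counts[level_name] += 1

def fallback_ratio() -> float:
    total = fallback_counts[TOTAL] or 1
    return sum(v for k, v in fallback_counts.items() if k != TOTAL) / total

# Alert well before the primary circuit opens; 5% is an example threshold.
if fallback_ratio() > 0.05:
    print("WARN: elevated fallback usage - check upstream AI dependencies")
```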
Avoiding Hidden Coupling Through Isolation
The most durable solution to AI feature dependency failures is reducing shared coupling in the first place. Some coupling is unavoidable — you cannot eliminate shared providers — but you can reduce its scope.
Per-feature provider quotas prevent one feature's traffic spike from exhausting shared provider rate limits. Assign explicit quota allocations across features: forty percent to search, forty percent to chat, twenty percent for everything else. When a feature is over-quota, apply back pressure — queue or drop requests from that feature — rather than allowing it to consume capacity from other features.
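One way to express those allocations is a token bucket per feature, refilled at the feature's share of the shared provider limit. The provider limit and the 40/40/20 split mirror the example above and are illustrative, not recommended values:

```python
import time

# Illustrative per-feature shares of a shared provider rate limit (requests per minute).
PROVIDER_LIMIT_RPM = 1000
QUOTAS = {"search": 0.40, "chat": 0.40, "other": 0.20}

class FeatureQuota:
    """Token bucket per feature so one feature's spike cannot starve the others."""

    def __init__(self, feature: str):
        self.capacity = PROVIDER_LIMIT_RPM * QUOTAS.get(feature, QUOTAS["other"])
        self.tokens = self.capacity
        self.updated = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at this feature's share.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) / 60.0 * self.capacity)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False      # over quota: queue or drop this feature's request

quota = FeatureQuota("search")
if not quota.try_acquire():
    pass  # apply back pressure: enqueue, shed, or return a degraded response
```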
Provider abstraction layers allow routing to secondary providers when primary ones fail. Tools like LiteLLM expose a unified API across multiple LLM providers. A secondary provider fallback requires advance setup — model selection, capability mapping, cost modeling — but pays off immediately during any provider incident. The fallback is already configured; the gateway switches automatically when the circuit opens.
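A hedged sketch of manual primary/secondary routing on top of LiteLLM's `completion()` call. The model identifiers are illustrative, and LiteLLM also ships its own router and fallback configuration, which you would likely prefer to a hand-rolled loop in production:

```python
import litellm

# Ordered preference list; model identifiers are illustrative examples.
PRIMARY = "openai/gpt-4o"
SECONDARY = "anthropic/claude-3-5-sonnet-20240620"

def complete_with_fallback(messages: list[dict]) -> str:
    """Try the primary provider; on failure, route the same request to the secondary."""
    for model in (PRIMARY, SECONDARY):
        try:
            response = litellm.completion(model=model, messages=messages)
            return response.choices[0].message.content
        except Exception:
            continue      # in production, consult the circuit breaker instead of a bare except
    raise RuntimeError("all configured providers failed")

# Requires provider API keys in the environment.
print(complete_with_fallback([{"role": "user", "content": "Summarize our refund policy."}]))
```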
Dedicated endpoints for critical features trade cost for isolation. Most managed inference providers offer dedicated endpoints that provide reserved capacity and SLA guarantees, separate from shared capacity pools. For features where degradation is particularly costly — customer-facing chat, payment workflows, anything time-sensitive — dedicated endpoints provide hardware-level isolation that shared endpoints cannot.
Embedding model version pinning prevents a subtle failure mode: if you allow the embedding model to update without controlling when it rolls out, new embeddings become incompatible with existing indexed embeddings, causing silent retrieval quality degradation. Pin model versions explicitly. When upgrading, re-embed your corpus, validate retrieval quality with test queries, then cut over atomically. Treat embeddings as derived data that can be regenerated from source — storing them as primary data makes model upgrades permanently risky.
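A small guard that makes the pin explicit: record the embedding model alongside each index and refuse queries when the versions diverge. The index and model names here are placeholders.

```python
# Pin the embedding model per index; refuse mixed-version reads and writes.
PINNED_MODEL = "text-embedding-3-small"        # illustrative model identifier
INDEX_METADATA = {"prod-docs-v1": {"embedding_model": "text-embedding-3-small"}}

def check_compatibility(index_name: str, query_model: str = PINNED_MODEL) -> None:
    indexed_model = INDEX_METADATA[index_name]["embedding_model"]
    if indexed_model != query_model:
        # Mixed versions don't raise errors on their own; they silently return bad neighbors.
        raise ValueError(
            f"index {index_name} was built with {indexed_model}, "
            f"but queries would use {query_model}; re-embed and cut over first"
        )

check_compatibility("prod-docs-v1")
```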
Building the Incident Response Muscle
Dependency graph visibility and circuit breakers change the shape of AI infrastructure incidents. When an embedding provider degrades, the on-call engineer can look at the dependency graph, see which features depend on the embedding layer, confirm that those circuits have opened, and verify that each feature's degradation chain has activated correctly. The investigation starts with the dependency graph, not with fifteen simultaneous alert notifications.
After each incident, update the dependency graph to reflect what you learned. The incidents that reveal the most hidden coupling are the ones where features you didn't expect to share a dependency failed together. Capturing those relationships in the graph prevents the same surprise on the next incident.
Regularly test your degradation chains. Circuit breakers and fallback logic only work if they have been exercised. Add chaos testing that deliberately opens each circuit and verifies that the correct fallback activates. Features that have never had their fallback triggered have almost certainly drifted out of correctness — the fallback code exists but hasn't been tested against the current state of the feature.
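A chaos-style test along those lines, reusing the `Level` and `answer` helpers from the degradation-chain sketch earlier in this post: it forces the top level to fail and asserts that the intended fallback activates and is flagged as degraded.

```python
def test_chat_fallback_when_retrieval_circuit_is_open():
    def broken_rag(q):                        # simulate an open retrieval circuit
        raise TimeoutError("vector store unreachable")

    test_chain = [
        Level("rag", broken_rag, degraded=False),
        Level("base_model", lambda q: "answer without retrieval", degraded=True),
    ]
    level, text, degraded = answer("any question", test_chain)
    assert level == "base_model"              # the base-model fallback took over
    assert degraded is True                   # and the output is marked as degraded

test_chat_fallback_when_retrieval_circuit_is_open()   # also runs under pytest
```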
The Broader Pattern
The AI feature dependency problem is a new instance of a pattern that distributed systems engineers have dealt with for decades: invisible coupling creates failure cascades, and explicit dependency management is the only durable solution.
What's new with AI infrastructure is the economics. Token costs make retry storms more expensive. Non-determinism makes failure signatures harder to identify. Model version drift creates silent quality degradation that doesn't show up as errors. The standard distributed systems playbook — circuit breakers, bulkheads, fallbacks, observability — applies, but it needs to be applied at layers that didn't exist before: embedding generation, vector retrieval, prompt assembly, inference routing.
Teams that survive AI infrastructure incidents are the ones who mapped their dependencies before an incident forced them to, built degradation chains with explicit failure modes rather than optimistic error handling, and tested their fallback logic before they needed it. The dependency graph is where that work starts. Build it now, not after the first time fifteen features go down simultaneously with no stack trace.
