Pipeline Attribution in Compound AI Systems: Finding the Weakest Link Before It Finds You

· 10 min read
Tian Pan
Software Engineer

Your retrieval precision went up. Your reranker scores improved. Your generator faithfulness metrics look better than last quarter. And yet your users are complaining that the system is getting worse.

This is one of the more disorienting failure modes in production AI engineering, and it happens more often than teams expect. When you build a compound AI system — one where retrieval feeds a reranker, which feeds a generator, which feeds a validator — you inherit a fundamental attribution problem. End-to-end quality is the only metric that actually matters, but it's the hardest one to act on. You can't fix "the system is worse." You need to fix a specific component. And in a four-stage pipeline, that turns out to be genuinely hard.

The Attribution Problem Is Not What You Think It Is

Most engineers approach compound system debugging the way they'd approach a code bug: isolate the problem, fix it, verify. But compound AI pipelines violate a basic assumption that makes this approach work — the assumption that components are independent.

In a retrieval-augmented generation system, the generator's behavior is conditioned on the retriever's output. If you improve the retriever to surface more semantically precise documents, you may also change the distribution of context the generator receives. A generator calibrated on noisier retrieval will behave differently — sometimes worse — when suddenly given cleaner input. You haven't broken anything. You've shifted the joint distribution.

This is why per-component metric improvements can cause end-to-end regressions. It's not a bug in your evaluation methodology. It's a structural property of composed systems: optimizing a component in isolation optimizes for a fixed input distribution that no longer exists after you ship the change.

The research literature on compound AI system optimization has started to formalize this. Studies find that end-to-end performance is often monotonic in how well each module performs — but only when all other modules are held fixed. The moment you start changing multiple things, the interactions dominate.

Why Standard Observability Doesn't Solve This

The typical response to compound system debugging problems is to add more logging. Instrument every stage. Track latency per hop. Build dashboards. This is necessary but not sufficient.

The fundamental issue is that most observability tools are designed to answer operational questions: is the system up, is it fast, is it using resources efficiently? They're not designed to answer attribution questions: which component is responsible for this degradation in output quality?

Quality attribution requires a different data model. You need:

  • A trace ID that connects a user query to every intermediate state: the raw query, the retrieved chunks, the reranked ordering, the final prompt assembled from context, and the generated response.
  • Per-stage quality signals computed against that trace, not just latency and error rates.
  • A way to reproduce any historical pipeline state so you can isolate components in post-hoc analysis.
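The trace data model above can be sketched as a single record type. This is a minimal illustration, not a recommended schema: the field names (`retrieved_chunks`, `reranked_order`, `stage_metrics`) are hypothetical, and the point is simply that every intermediate state hangs off one `trace_id`.

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class PipelineTrace:
    """One record per user query, carrying every intermediate state."""
    query: str
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    retrieved_chunks: list = field(default_factory=list)
    reranked_order: list = field(default_factory=list)  # indices into retrieved_chunks
    assembled_prompt: str = ""
    response: str = ""
    stage_metrics: dict = field(default_factory=dict)   # e.g. {"retrieval_recall": 0.8}

trace = PipelineTrace(query="How do I rotate an API key?")
trace.retrieved_chunks = ["doc A ...", "doc B ..."]
trace.stage_metrics["retrieval_recall"] = 1.0
```

With a record like this persisted per query, post-hoc analysis can replay any historical pipeline state instead of reconstructing it from scattered logs.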

Without the full trace, you can tell that something went wrong. You cannot tell where. And in a live system processing thousands of queries a day, "something went wrong somewhere" is not actionable.

Per-Stage Metrics That Actually Attribute Failures

The goal of per-stage evaluation is to answer a precise question at each stage: given what this component received as input, did it produce good output?

Retrieval stage. The retriever fails when it doesn't surface the documents needed to answer the query. Relevant metrics here are retrieval recall (did the right documents appear at all?), mean reciprocal rank (how high in the list were they?), and context relevance (are retrieved chunks semantically appropriate to the query?). A retrieval failure looks like: high answer quality on queries where relevant documents happen to rank first, low answer quality elsewhere. The smoking gun is low recall on a known golden set of queries with ground-truth relevant documents.
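Recall and mean reciprocal rank are straightforward to compute against a golden set. A minimal sketch, assuming retrieved results are ordered lists of document IDs and ground truth is a set of relevant IDs per query:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of ground-truth relevant docs appearing in the top-k results."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant doc; 0.0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

# Golden-set evaluation: average the per-query values.
golden = [(["d3", "d1", "d7"], {"d1"}),   # first relevant doc at rank 2
          (["d2", "d4"], {"d9"})]          # relevant doc never retrieved
mrr = sum(reciprocal_rank(r, g) for r, g in golden) / len(golden)
# → 0.25: (0.5 + 0.0) / 2
```

Tracking both together is what separates "the right documents never show up" (low recall) from "they show up but too low in the list" (decent recall, poor MRR).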

Reranking stage. The reranker's job is to improve precision at the top of the list. Its failure mode is subtler: it can appear to improve average ranking scores while actually hurting performance on the long tail of unusual queries, because most rerankers are trained on distribution-typical examples. Per-stage metric: rank correlation between reranker scores and downstream answer quality. If this is near zero, your reranker is scoring something other than what matters.
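The rank-correlation diagnostic can be computed with a plain Spearman coefficient. A rough sketch (no tie handling; in practice `scipy.stats.spearmanr` does this properly), where `answer_quality` stands in for whatever downstream quality label you have per query:

```python
def spearman(xs, ys):
    """Spearman rank correlation, ignoring ties — a rough diagnostic only."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    mean = (len(xs) - 1) / 2
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)
    return cov / var

reranker_scores = [0.91, 0.42, 0.77, 0.13]
answer_quality  = [0.88, 0.35, 0.80, 0.20]  # e.g. human ratings per query
rho = spearman(reranker_scores, answer_quality)  # identical orderings → 1.0
```

A coefficient near zero across a representative query set is the signal described above: the reranker's scores are not tracking the quality that matters downstream.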

Generation stage. The generator fails in two distinct ways that are often conflated. Hallucination failure: it generates content not supported by the provided context. Instruction-following failure: it ignores the context entirely and falls back on parametric knowledge. These require different fixes — one is a prompting problem, the other is a model calibration problem. Faithfulness metrics (does the answer cite the provided context?) and groundedness metrics (can every factual claim be traced to a source chunk?) separate them.
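A toy version of a groundedness check illustrates the shape of the metric. This lexical-overlap proxy is purely illustrative; production groundedness checks typically use NLI models or LLM judges rather than word overlap:

```python
def naive_groundedness(answer_sentences, context_chunks, threshold=0.5):
    """Toy proxy: a sentence counts as grounded if enough of its words
    appear in some retrieved chunk. Returns the grounded fraction."""
    def words(text):
        return {w.strip(".,").lower() for w in text.split()}
    def supported(sentence):
        ws = words(sentence)
        return any(len(ws & words(chunk)) / len(ws) >= threshold
                   for chunk in context_chunks)
    grounded = [supported(s) for s in answer_sentences]
    return sum(grounded) / len(grounded)

ctx = ["The API key rotates every 90 days by default."]
score = naive_groundedness(
    ["The key rotates every 90 days.",   # supported by the chunk
     "It was introduced in 2018."],      # parametric knowledge, not in context
    ctx)
# → 0.5: one of two sentences traceable to a source chunk
```

Even this crude version separates the two failure modes: a low groundedness score with high fluency points at parametric fallback, not a prompting problem.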

Validation stage. If your pipeline includes a validator or self-critique step, it can introduce its own failure mode: false negatives that suppress correct answers, or false positives that approve hallucinated content. Validator accuracy against a human-labeled evaluation set is the relevant metric here, and it's frequently skipped.
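Measuring the validator against human labels is a small confusion-matrix computation. A sketch, where `predictions` are the validator's approve/reject decisions and `labels` are human judgments of whether the answer was actually correct:

```python
def validator_confusion(predictions, labels):
    """Accuracy plus the two failure modes that matter for a validator:
    false approvals (hallucinations let through) and false rejections
    (correct answers suppressed)."""
    fp = sum(p and not l for p, l in zip(predictions, labels))
    fn = sum(l and not p for p, l in zip(predictions, labels))
    acc = sum(p == l for p, l in zip(predictions, labels)) / len(labels)
    return {"accuracy": acc, "false_approvals": fp, "false_rejections": fn}

report = validator_confusion(
    predictions=[True, True, False, False],
    labels=[True, False, True, False])
# → {'accuracy': 0.5, 'false_approvals': 1, 'false_rejections': 1}
```

Reporting the two error types separately matters because their costs differ: a false rejection degrades availability, a false approval ships a hallucination.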

The Ablation Methodology

Once you have per-stage metrics, you can run targeted ablations to assign credit for end-to-end quality changes.

The basic approach: for a fixed query set, run the pipeline with one component replaced by a gold-standard oracle. If swapping out retrieval for perfect retrieval (i.e., manually selected ground-truth documents) eliminates most of the quality gap, retrieval is your bottleneck. If it doesn't, retrieval isn't the problem — even if its metrics look bad.
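The oracle-swap loop can be sketched in a few lines. All stage functions here are hypothetical stand-ins for your real components; `judge` is whatever end-to-end quality scorer you have (exact match, human label lookup, LLM judge):

```python
def run_pipeline(query, retrieve, rerank, generate):
    docs = retrieve(query)
    docs = rerank(query, docs)
    return generate(query, docs)

def ablate(queries, gold_answers, stages, oracles, judge):
    """Quality gain from replacing each stage with its gold-standard oracle.
    `stages` is a (retrieve, rerank, generate) tuple; `oracles` maps stage
    name to an oracle version of that stage."""
    def score(pipeline):
        outs = [run_pipeline(q, *pipeline) for q in queries]
        return sum(judge(o, g) for o, g in zip(outs, gold_answers)) / len(queries)
    baseline = score(stages)
    gains = {}
    for name, i in (("retrieve", 0), ("rerank", 1), ("generate", 2)):
        swapped = list(stages)
        swapped[i] = oracles[name]
        gains[name] = score(tuple(swapped)) - baseline
    return baseline, gains
```

The stage whose oracle closes the largest share of the quality gap is the bottleneck, regardless of what its standalone metrics say.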

This oracle ablation pattern is borrowed directly from controlled experiments in systems research. It's more expensive than just reading dashboards, but it's the only method that produces unambiguous attribution. The key practical requirement is that you need a held-out evaluation set with human-verified quality labels at the end-to-end level. Without ground truth, you're comparing metrics to metrics, and Goodhart's Law applies: every metric will be optimized eventually, including the wrong ones.

A more systematic version of this approach involves testing components under perturbation. Apply controlled degradations to one component's output — inject irrelevant documents into the retrieval results, lower the reranker confidence scores by a fixed amount, introduce generation errors at a specific rate — and measure how much end-to-end quality moves. The component with the highest sensitivity is where engineering effort returns the most value.
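As a concrete instance of the perturbation approach, here is a sketch that degrades retrieval by swapping in irrelevant documents at a controlled rate and measures the end-to-end quality lost. The `retrieve`, `answer`, and `judge` functions are hypothetical stand-ins for your pipeline:

```python
import random

def inject_noise(docs, distractors, rate, rng):
    """Replace a fraction `rate` of retrieved docs with irrelevant ones."""
    noisy = list(docs)
    for i in range(len(noisy)):
        if rng.random() < rate:
            noisy[i] = rng.choice(distractors)
    return noisy

def retrieval_sensitivity(queries, retrieve, answer, judge, distractors,
                          rate=0.3, seed=0):
    """Average end-to-end quality drop per query under retrieval noise."""
    rng = random.Random(seed)
    clean = sum(judge(q, answer(q, retrieve(q))) for q in queries)
    noisy = sum(judge(q, answer(q, inject_noise(retrieve(q), distractors,
                                                rate, rng)))
                for q in queries)
    return (clean - noisy) / len(queries)
```

Running the analogous perturbation per component and comparing the drops gives the sensitivity ranking the paragraph describes: put engineering effort where the curve is steepest.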

The Component-Improvement Paradox

There is a non-obvious failure mode that trips up teams with sophisticated per-stage metrics: you correctly identify the weakest component, improve it, and end-to-end quality still doesn't improve — or gets worse.

This happens for several reasons.

Distribution shift. As described above, improving one component changes the input distribution for the next. A generator trained or prompted for a specific context quality level will behave differently when that quality level changes.

Evaluation set mismatch. Your per-stage metrics may be measuring performance on a representative sample of queries, but the queries where end-to-end quality actually degraded are in a different part of the distribution. You improved the component on average while making it worse on the tail that matters.

Interaction effects. Some component combinations produce emergent behaviors. A specific retriever-reranker pair may work well because the reranker was implicitly compensating for retrieval errors. Improve the retriever and the reranker's learned compensation becomes a liability.

The practical response to this is to always validate component-level improvements by measuring end-to-end quality on a diverse evaluation set before shipping. Per-stage metrics are diagnostic tools, not shipping criteria. A component improvement that doesn't move the end-to-end needle is a candidate for deprioritization, regardless of how good the component metrics look.

A Practical Attribution Workflow

Given the above, a workable attribution workflow for a compound AI system in production looks like this:

  1. Detect the regression. An end-to-end quality signal drops — either a user-facing metric or an automated evaluation. This is your trigger.

  2. Segment by pipeline path. Different queries may take different paths through the system (e.g., some bypass reranking, some use different retrieval indices). Segment the affected queries by which pipeline configuration they used. If one segment is disproportionately affected, you've narrowed the search space immediately.
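The segmentation step is a simple group-by over traces. A sketch, assuming each trace records which pipeline path it took and whether it passed the quality check (both names are illustrative):

```python
from collections import defaultdict

def segment_regression(traces):
    """Failure rate per pipeline path. A disproportionately bad segment
    narrows the search space before any ablation is run."""
    counts = defaultdict(lambda: [0, 0])   # path -> [failures, total]
    for path, ok in traces:
        counts[path][0] += 0 if ok else 1
        counts[path][1] += 1
    return {path: fails / total for path, (fails, total) in counts.items()}

rates = segment_regression([
    ("retrieve>rerank>generate", True),
    ("retrieve>rerank>generate", True),
    ("retrieve>generate", False),   # reranker-bypass path is the one failing
    ("retrieve>generate", False),
])
# → {'retrieve>rerank>generate': 0.0, 'retrieve>generate': 1.0}
```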

  3. Run oracle ablations. For the affected query set, replace one component at a time with a gold-standard version. Measure end-to-end quality after each replacement. Rank components by how much quality recovers when they're replaced.

  4. Measure per-stage metrics on the affected queries. Compare per-stage metrics for the affected query set against the baseline. Look for stage-level metrics that diverged from the baseline more than others.

  5. Cross-reference. If oracle ablations and per-stage metrics point to the same component, you have high-confidence attribution. If they disagree, you likely have an interaction effect — investigate the component boundary between the two stages that are implicated.

  6. Ship the fix with end-to-end validation. After fixing the attributed component, validate on the held-out evaluation set before releasing. Track both per-stage metrics and end-to-end quality simultaneously to catch distribution shift early.

What Teams Get Wrong

The most common mistake is treating per-stage metrics as a substitute for end-to-end evaluation. They're not. They're a debugging tool that helps you understand which component to look at after end-to-end quality degrades. Running per-stage metrics in isolation, without a corresponding end-to-end signal, tells you how components perform in the laboratory conditions of your evaluation set, not in the wild.

The second mistake is building attribution infrastructure after the first serious incident. At that point, you're doing forensics on logs that weren't designed to answer attribution questions. The traces are incomplete, the evaluation set doesn't exist, and the per-stage metrics were never instrumented. The right time to build attribution infrastructure is when the system is working well and you have time to think carefully about what you'd need to know if it broke.

The third mistake is assuming that the component with the worst individual metrics is always the highest-priority fix. In a composed system, the value of improving a component depends on what happens to the components downstream of it. A component that looks mediocre but whose output is robust to variation may be less important to fix than a component with good metrics but high sensitivity to input quality.

Building for Attributability from Day One

The engineering implication is that attributability should be a first-class design requirement for compound AI systems, not an afterthought. This means generating trace IDs at query ingestion and propagating them through every component, logging intermediate state (retrieved chunks, reranker scores, assembled prompts) alongside final outputs, building and maintaining a held-out evaluation set with end-to-end quality labels, and instrumenting per-stage quality metrics in production from the first deployment.
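Trace-ID propagation does not require threading an ID argument through every component signature. One common pattern in Python services is a context variable set at ingestion, which every stage can read when logging. A minimal sketch with a placeholder retrieval stage:

```python
import contextvars
import uuid

# Set once at query ingestion; readable from any component in the same context.
trace_id_var = contextvars.ContextVar("trace_id", default=None)

def ingest(query):
    trace_id_var.set(uuid.uuid4().hex)
    return query

def log_stage(stage, payload):
    """Every log line carries the trace ID without it being passed around."""
    print(f"trace={trace_id_var.get()} stage={stage} {payload}")

def retrieve(query):
    chunks = ["chunk-1", "chunk-2"]   # placeholder for real retrieval
    log_stage("retrieval", {"n_chunks": len(chunks)})
    return chunks

retrieve(ingest("How do I rotate an API key?"))
```

The same pattern extends to async pipelines, since `contextvars` values follow task context, which is exactly the propagation property a compound system needs.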

None of this is technically complex. It's discipline. The systems that are hardest to debug in production are rarely the ones with the most sophisticated components — they're the ones where the infrastructure to ask "which part broke?" was never built.

When your end-to-end metrics move, you want to spend your time fixing the problem, not figuring out where to look.
