Data Lineage for AI Systems: Tracking the Path from Source to Response
A user files a support ticket: "Your AI assistant told me the contract renewal deadline was March 15th. It was February 28th. We missed it." You pull up the logs. The response was generated. The model didn't error. Every metric is green. But you have no idea which document it retrieved, what the model read, or whether the date came from the context or was hallucinated entirely.
This is the data lineage gap. And it's not a monitoring problem — it's an architecture problem baked in from the start.
Most production RAG systems instrument the easy things: latency per request, token counts, API error rates. None of that tells you what the model actually read before generating an answer. Without retrieval provenance — knowing which chunks were fetched, from which documents, with what relevance scores — you cannot debug semantic failures, cannot satisfy compliance audits, and cannot systematically improve retrieval quality.
The good news is that adding lineage doesn't require rebuilding your stack. The patterns are lightweight, the tooling has matured significantly, and OpenTelemetry now has standardized semantic conventions for generative AI. The hard part is recognizing that you need lineage before you need it, rather than after a production failure forces the issue.
The Three Places Lineage Actually Gets Lost
Understanding where provenance disappears tells you where to instrument.
The retrieval boundary is the most critical and most often skipped. When your vector database returns five document chunks, your application typically extracts just the text, drops the source metadata, and passes a concatenated string into the LLM prompt. The document IDs, relevance scores, timestamps, and source metadata that the vector DB returned are silently discarded. From this point on, even if you have perfect tracing of the LLM call, you cannot connect any part of the response back to a specific source document.
The context assembly step is where multiple retrieved chunks get merged, deduplicated, truncated, and formatted before reaching the model. If this step isn't traced, you lose visibility into what was actually fed into the context window — not what you intended to feed, but what actually arrived. This matters because truncation order and chunk ordering affect model behavior, and you need to know the exact context to reproduce a failure.
The generation-to-claim mapping is the hardest problem. The model may synthesize information from three different retrieved chunks into a single sentence. Without claim-level attribution — mapping each fact in the output back to its source evidence — you cannot tell whether a specific claim was grounded in retrieved context or fabricated. This is precisely the gap that makes "the model hallucinated" the end of the debugging chain rather than the beginning.
What Lightweight Lineage Actually Looks Like
The minimal viable lineage implementation adds three things: document identifiers in the context, a retrieval span that records what was fetched, and post-generation grounding verification.
Document tagging in the context prompt is the simplest change. Instead of inserting raw chunk text, wrap each chunk with a source identifier:
[DOC-ID: doc_4821, chunk 3, score: 0.87]
The contract renewal deadline for enterprise accounts is February 28th.
[/DOC-ID]
This doesn't require external tooling. When a user or auditor asks "where did that come from?", you can search the trace for the cited document ID and pull the original source.
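A minimal sketch of this tagging step, assuming the chunk fields most vector DBs return; the `Chunk` shape and `tag_chunks` name are illustrative, not any library's API:

```python
# Wrap each retrieved chunk with its source metadata so the identifiers
# survive into the prompt instead of being silently discarded.
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    chunk_index: int
    score: float
    text: str

def tag_chunks(chunks: list[Chunk]) -> str:
    """Concatenate chunks into a context string that preserves provenance."""
    blocks = []
    for c in chunks:
        blocks.append(
            f"[DOC-ID: {c.doc_id}, chunk {c.chunk_index}, score: {c.score:.2f}]\n"
            f"{c.text}\n"
            f"[/DOC-ID]"
        )
    return "\n\n".join(blocks)
```

The join replaces the usual bare-text concatenation; nothing downstream changes except that the model (and your traces) now see the source IDs.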
Retrieval spans in your tracing pipeline record what the retrieval step actually returned. A retrieval span should capture: the query sent to the vector DB, the number of results returned, document IDs and relevance scores, and the metadata filters applied. OpenTelemetry's GenAI semantic conventions now standardize attributes for model name, token counts, and finish reason — retrieval steps should be instrumented as child spans within the same trace, maintaining parent-child relationships across the full pipeline.
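As a sketch, here is the payload such a retrieval span might carry, built as a plain dict you could attach via your tracer's equivalent of `set_attributes`. The `retrieval.*` key names are an assumption in the spirit of the GenAI conventions, not part of the published spec:

```python
# Build the attribute dict for a retrieval child span: query, result
# count, document IDs, relevance scores, and the filters applied.
import json
import time

def retrieval_span_attributes(query, results, filters=None):
    """Attributes for a retrieval span.

    `results` is a list of (doc_id, score) pairs as returned by the vector DB.
    Lists are JSON-encoded because most tracing backends want flat scalars.
    """
    return {
        "retrieval.query": query,
        "retrieval.result_count": len(results),
        "retrieval.document_ids": json.dumps([doc_id for doc_id, _ in results]),
        "retrieval.scores": json.dumps([score for _, score in results]),
        "retrieval.filters": json.dumps(filters or {}),
        "retrieval.timestamp_unix": int(time.time()),
    }
```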
Tools like Langfuse, Arize Phoenix, and LangSmith each support this pattern. Langfuse's @observe() decorator takes three steps to set up: install the SDK, set API keys, and wrap your RAG function. The trace then shows a nested structure: the top-level request contains a retrieval span, a context assembly span, and one or more LLM spans, each with timing and payload data. This is what lets you open a failed trace and see exactly what documents were retrieved, how they were assembled, and what the model received.
Post-generation grounding verification breaks the response into individual claims and checks each against the retrieved context. This is the approach that makes hallucination detection systematic rather than anecdotal. For each generated claim, the grounding check asks: is this supported by the retrieved text? If not, flag it as a potential hallucination. Token Probability Attribution (TPA) takes this further by mathematically attributing each output token's probability to seven sources — query, context, past tokens, self tokens, FFN, LayerNorm, and initial embedding — enabling fine-grained localization of where unsupported content originates.
You don't need TPA to start. A simpler LLM-based grounding check — pass each claim and the retrieved context to a small model and ask "is this claim supported by the provided text?" — gets you most of the practical value at minimal latency cost.
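A sketch of that check, with the judge injected as a function so the wiring is visible and testable; in production the judge would be the small-LLM call described above, and all names here are illustrative:

```python
# Grounding verification: collect the claims the judge cannot support
# with the retrieved context, and flag them as potential hallucinations.
def check_grounding(claims, context, judge):
    """Return the claims the judge could not ground in the retrieved context."""
    return [claim for claim in claims if not judge(claim, context)]

# A deliberately naive judge for demonstration: substring support only.
# A real deployment would prompt a small model with the claim and context.
def substring_judge(claim, context):
    return claim.lower() in context.lower()
```

Swapping `substring_judge` for an LLM call changes nothing about the pipeline shape; the flagged claims attach to the trace as grounding-verification results.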
Using Lineage to Debug "Where Did That Answer Come From?"
Lineage shifts debugging from "the model hallucinated" (a dead end) to a structured diagnostic workflow.
Step one is correlating the failure to a specific trace. When a user reports a wrong answer, you need to reconstruct the exact execution: what was the query, which chunks were retrieved, what was in the context window, and what the model returned. Most observability platforms let you search traces by user ID, session, or a semantic fingerprint of the query. Without this, you're relying on application logs that almost certainly don't include the retrieval payload.
Step two is distinguishing retrieval failure from generation failure. These are different problems with different fixes.
A retrieval failure means the correct document wasn't in the top-k results. Indicators: the grounding check shows the wrong answer is unsupported by any retrieved chunk; the relevant document exists in your corpus but wasn't returned. Fix: tune embedding similarity thresholds, adjust metadata filters, or improve chunking so semantically related content isn't split across chunk boundaries.
A generation failure means the correct information was retrieved but the model didn't use it accurately — it contradicted the source, averaged across conflicting chunks, or defaulted to a prior belief. Indicators: the grounding check shows the wrong answer does appear in retrieved context, but it's misattributed, misread, or contradicted by another retrieved chunk. Fix: reduce context noise (fewer irrelevant chunks), adjust chunk ordering, or add explicit instructions about how to handle conflicting information.
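Once lineage data is available, this triage can be made mechanical. A sketch, with illustrative labels:

```python
# Route a wrong answer to the right fix using two facts from the failing
# trace: was the correct document retrieved, and is the model's answer
# supported by anything in the retrieved context?
def classify_failure(correct_doc_retrieved: bool,
                     answer_grounded_in_context: bool) -> str:
    if not correct_doc_retrieved:
        # The right document never reached the model: tune retrieval.
        return "retrieval_failure"
    if answer_grounded_in_context:
        # Right content was present but misread or misattributed:
        # reduce context noise, reorder chunks, handle conflicts.
        return "generation_misuse"
    # Content was present but the claim matches nothing retrieved:
    # the model fabricated on top of adequate context.
    return "generation_fabrication"
```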
Step three is root-causing retrieval quality issues systematically. A one-off debugging session shows you one failure. Lineage-powered analytics show you patterns: which document sources have disproportionately low relevance scores, which query patterns consistently retrieve poor results, which chunk sizes correlate with higher grounding verification failures. Embedding drift — where retrieval quality degrades gradually as your corpus grows — is invisible without per-retrieval relevance score tracking over time.
The W&B Wandbot team ran this playbook on their LLM documentation assistant. By adding span-level tracing via W&B Weave decorators and examining the data flow through intermediate steps, they identified retrieval quality issues that weren't visible in aggregate metrics. Accuracy improved from 72% to 81%, and end-to-end latency dropped 84% — not by changing the model, but by understanding what the model was actually receiving.
Compliance Audit Trails Are a Different Problem Than Debugging
Debugging lineage and compliance lineage overlap in implementation but differ in requirements.
Debugging lineage needs to be fast and queryable: you're looking at a specific failure, you need to pull the trace for a specific session, and you need to drill into the retrieval payload. Retention can be short — a few weeks covers most post-incident investigations.
Compliance lineage needs to be immutable and comprehensive: you're proving to an auditor (or a regulator) that a specific response was generated from specific sources on a specific date, that no sensitive data was accessed without authorization, and that the system operated within documented policy. The EU AI Act's requirements for high-risk AI systems include explainability, risk management, and audit trail provisions. GDPR's right to erasure creates a harder problem: if a user's document is deleted, can you trace every response that was generated from that document and flag it for review?
The practical answer is that compliance lineage requires storing document IDs with responses, not just retrieval traces. You need a mapping from response ID to the set of source document IDs that contributed to it, with timestamps. This lets you run a query: "given that document X was deleted, which responses between these dates may have included content from it?" Without this mapping, you cannot answer that question without rerunning every query from the period, which is computationally intractable at any real scale.
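A sketch of that mapping and the erasure query it enables, using SQLite for illustration; the schema and function names are assumptions, not a prescribed format:

```python
# Response-to-source mapping: one row per (response, contributing document),
# plus the query that answers "which responses may include content from
# a now-deleted document?"
import sqlite3

def init_lineage_db(conn):
    conn.execute("""
        CREATE TABLE IF NOT EXISTS response_sources (
            response_id TEXT NOT NULL,
            document_id TEXT NOT NULL,
            created_at  TEXT NOT NULL  -- ISO-8601 timestamp
        )""")

def record_response(conn, response_id, document_ids, created_at):
    conn.executemany(
        "INSERT INTO response_sources VALUES (?, ?, ?)",
        [(response_id, d, created_at) for d in document_ids])

def responses_touching_document(conn, document_id, start, end):
    """Responses between start and end that drew on this document."""
    rows = conn.execute(
        "SELECT DISTINCT response_id FROM response_sources "
        "WHERE document_id = ? AND created_at BETWEEN ? AND ?",
        (document_id, start, end)).fetchall()
    return [r[0] for r in rows]
```

The erasure query runs in milliseconds against this table; without it, the same question means replaying every request from the period.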
Agentic systems make this harder. A single agent run may spawn multiple sub-agents, make a dozen retrieval calls, execute code, and chain results across steps. The audit trail must capture the full decision tree — not just the final response, but which retrieval results influenced which tool calls, which intermediate reasoning steps were based on which sources. This is why distributed tracing with parent-child span relationships is the right foundation: each tool call, retrieval step, and LLM invocation is a child span, and the full trace graph is the lineage record.
What to Build First
The priority order for adding lineage to a production RAG system is driven by which failures are most expensive.
If hallucination debugging is your most pressing problem, instrument retrieval spans and add document tagging in the context prompt this week. Run grounding verification asynchronously on a sample of production traffic. You don't need real-time grounding checks on every request — a 5% sample gives you enough signal to identify systematic problems without adding latency to the critical path.
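Deterministic sampling keeps that 5% sample reproducible: hash the trace ID so a given trace is always in or out of the sample, with no coordination between workers. A sketch (names illustrative):

```python
# Hash-based sampling for async grounding checks: stable per trace ID,
# uniformly distributed, and tunable via a single rate parameter.
import hashlib

def in_grounding_sample(trace_id: str, sample_percent: float = 5.0) -> bool:
    digest = hashlib.sha256(trace_id.encode()).digest()
    # Map the first 8 bytes to [0, 100) with 0.01 granularity.
    bucket = int.from_bytes(digest[:8], "big") % 10000 / 100.0
    return bucket < sample_percent
```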
If compliance is the driver, the document-ID-to-response mapping table is the first thing to build, before any trace visualization. A simple append-only log: {response_id, document_ids[], timestamp, query_hash} gives you the core of the audit trail. Trace visualization can come later; the provenance mapping needs to be there from the start because you cannot reconstruct it retroactively.
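That log can be as simple as one JSON object per line. A write-path sketch following the schema above (the function name and file handling are illustrative):

```python
# Append-only lineage log: one JSON line per response, never rewritten.
import json

def append_lineage_record(path, response_id, document_ids, timestamp, query_hash):
    record = {
        "response_id": response_id,
        "document_ids": document_ids,
        "timestamp": timestamp,
        "query_hash": query_hash,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```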
If retrieval quality analytics is the goal, add relevance score logging to your retrieval spans and build a dashboard over rolling time windows. Embed quality metrics — Precision@k, Recall@k, MRR — into your CI pipeline as part of eval runs so you detect drift before it reaches production.
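These metrics are a few lines each. A sketch suitable for an eval run, computing them from ranked document IDs against a per-query set of relevant documents:

```python
# Standard retrieval metrics over ranked doc IDs and relevant sets.
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top-k."""
    if not relevant:
        return 0.0
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def mrr(ranked_lists, relevant_sets):
    """Mean reciprocal rank of the first relevant result per query."""
    total = 0.0
    for retrieved, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)
```

Run these over a fixed eval set on every index or embedding change; a drop between runs is the drift signal the text describes, caught before production.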
The one thing that's genuinely difficult to retrofit is claim-level attribution across long-lived agent conversations. If your agent has a 90-minute session history and you want to trace a specific claim back to a specific retrieval event from turn 7, you need the full session trace with retrieval payloads stored. Start capturing this early; reconstructing it from sparse logs after the fact is rarely possible.
The Cost of Waiting
The teams that build lineage from the start don't spend much time thinking about it — it's just part of the pipeline. The teams that skip it spend their incident response time staring at logs that tell them that something went wrong but not why.
"Where did that answer come from?" is a question that every production AI system will eventually have to answer. The architecture decision is whether you build that answer into your observability layer from the start, or spend weeks of engineering time reconstructing it after a user complaint that becomes a legal issue.
Data lineage isn't the most glamorous part of building AI systems. But it's the part that determines whether you can actually improve them systematically once they're live — and whether you can defend them when something goes wrong.
- https://opentelemetry.io/docs/specs/semconv/gen-ai/
- https://www.datadoghq.com/blog/llm-otel-semantic-convention/
- https://opentelemetry.io/blog/2024/otel-generative-ai/
- https://www.langchain.com/langsmith/observability
- https://langfuse.com/blog/2025-10-28-rag-observability-and-evals
- https://phoenix.arize.com/llm-tracing-and-observability-with-arize-phoenix/
- https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/
- https://towardsdatascience.com/detecting-hallucination-in-rag-ecaf251a6633/
- https://arxiv.org/html/2512.07515
- https://wandb.ai/site/articles/llm-observability/
- https://www.solidatus.com/blog/why-data-lineage-is-essential-for-ai-7-governance-challenges-solved-by-ai-ready-lineage/
- https://www.evidentlyai.com/llm-guide/rag-evaluation
- https://unstructured.io/insights/how-to-use-metadata-in-rag-for-better-contextual-results/
- https://towardsdatascience.com/agentic-rag-failure-modes-retrieval-thrash-tool-storms-and-context-bloat-and-how-to-spot-them-early/
- https://developers.redhat.com/articles/2026/04/06/distributed-tracing-agentic-workflows-opentelemetry
- https://blog.langchain.com/end-to-end-opentelemetry-langsmith/
- https://uptrace.dev/blog/opentelemetry-ai-systems
