
Your AI Feature Is Only As Reliable As The ETL Pipeline Nobody Owns

10 min read
Tian Pan
Software Engineer

The AI feature has the dashboard. The prompt has the version control. The eval suite has the on-call rotation. And then there is the upstream cron job, written in 2022, owned by a team that rotated out of analytics two reorgs ago, that produces the CSV your retrieval index is built from. That cron job has no SLA. That CSV has no schema contract. The team that owns it does not know it feeds an AI feature. When it changes — and it will change — the AI team will spend three weeks debugging a prompt that did nothing wrong.

The AI quality regression you are about to chase is almost never an AI problem. It is an ETL problem wearing an AI costume. The discipline that has to land is the seam between the two — the contract, the lineage, the freshness signal, the paired on-call — and the team that does not formalize it ships an AI feature whose reliability is bounded by the least-loved cron job in the company.

The invisible dependency

An AI feature is a pipeline. The last 20 percent of that pipeline — the prompt, the model, the eval harness — is where the engineering investment is concentrated, and where every retrospective begins. The first 80 percent — the ingestion job, the normalization step, the deduplication pass, the daily snapshot, the column that gets joined in from a third upstream system — was built before anyone said the word "agent" in a meeting. It is owned by people who think they own a data warehouse for analytics consumers. They do not know an LLM is downstream.

This invisibility is the failure mode. The data team treats their pipeline like a Tableau backend: a column rename is a routine cleanup, a daily run that slips to every-other-day is an acceptable degradation, a schema "improvement" is a Slack announcement to the analytics channel. None of those communications reach the AI team, because the AI team was never enrolled as a consumer. There is no contract that says the embedding pipeline depends on customer_segment being a string and not an integer. There is no consumer registry that pages the data team when a downstream RAG index is reading their output.

The AI team, meanwhile, treats the upstream as ground truth. Their evals are run against a snapshot of the data taken at some point in the past. Their retrieval works because the columns are where they expect them to be. Their fine-tune was trained on a distribution that they assume is stationary. Every single one of those assumptions is a contract that was never signed, and the upstream team is free to violate every one of them this afternoon, because nobody told them they had agreed to anything.

The failure modes that get logged as "model regressions"

The pattern repeats so often it is almost a genre. The AI team notices that quality is down four points on the weekly eval. Latency is fine, error rates are fine, the model version did not change, the prompt did not change. They spend a week tuning the prompt. They spend another week trying a different chunking strategy. They spend a third week running ablations against the retrieval pipeline. Eventually somebody traces a sample of bad outputs back to a specific document, finds the document, looks at when it was last ingested, and discovers that the upstream pipeline started filtering out a category of records two weeks ago because of a "harmless cleanup" that removed records flagged as "internal." The AI feature was relying on those records.

A second pattern: the upstream pipeline begins emitting a column with a different precision. A timestamp that used to be milliseconds is now seconds. The retrieval layer was using the timestamp to break ties on relevance ranking. Suddenly the tie-breaking is non-deterministic, the same query returns different documents on different days, and the eval suite begins to oscillate. The model is fine. The retrieval is fine. The data is one decimal off.

A third: the upstream pipeline's run cadence drops from hourly to daily, because the cost-cutting initiative deprioritized non-critical jobs. The RAG index is now stale by up to 24 hours. The AI feature begins answering questions about "the latest" with information that is a day behind. No alert fires anywhere — the pipeline ran successfully, the index was updated successfully, the model responded successfully — and the only signal is that customer satisfaction quietly drops over a quarter.

A fourth: the upstream pipeline silently truncates a long string field at 256 characters because somebody changed the warehouse column type. The RAG index now contains chunks that are missing the second half of every long document. Retrieval still returns chunks. The chunks are just incomplete. The model answers based on incomplete context. Hallucination rates rise. The team blames the model.

In every one of these cases, the AI team's first three theories were about the model. None of those theories were correct. The fourth theory, eventually, was about the data. The data is almost always the answer, and almost never the first place anyone looks.

The contract layer that has to exist

The seam between the upstream pipeline and the AI feature has to become a first-class artifact, and the artifact is a data contract. A contract is not a Slack message and not a wiki page; it is a versioned, machine-checkable specification that names the schema, the freshness, the completeness, and the change policy that the producer commits to and the consumer depends on.
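As a concrete sketch, here is what such a contract might contain for the daily_customer_facts table mentioned later in this post, written as plain Python for illustration. The field names, thresholds, and team names are assumptions, not the format of any particular contract tool.

```python
# A minimal, machine-checkable contract sketch. Real teams often encode this in
# YAML and validate it in CI; the clauses map one-to-one onto the ones below.
# All names and numbers are illustrative.
DAILY_CUSTOMER_FACTS_CONTRACT = {
    "name": "daily_customer_facts",
    "version": "2.1.0",
    "producer": "data-platform-team",
    "consumers": ["ai-support-agent", "weekly-revenue-dashboard"],
    "schema": {
        "customer_id":      {"type": "string",       "nullable": False},
        "customer_segment": {"type": "string",       "nullable": False},
        "updated_at":       {"type": "timestamp_ms", "nullable": False},
        "notes":            {"type": "string",       "nullable": True},
    },
    "freshness": {"max_age_hours": 4},
    "completeness": {"min_fraction_of_source": 0.98},
    "change_policy": {
        "breaking_changes_notice_days": 30,
        "additive_changes": "any time, minor version bump",
    },
}
```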

The schema clause names every column the AI feature reads, with its type, its allowed values, and its semantics. A renamed column is a contract violation. A widened type is a contract violation. A new nullable column is fine; a column that flips from non-nullable to nullable is a contract violation. The producer cannot make these changes without bumping the contract version, and bumping the contract version triggers a notification to every registered consumer.
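A minimal check for the schema clause could look like the sketch below. The helper and its type mapping are hypothetical; a real pipeline would enforce this at the warehouse or ingestion layer, but the idea is the same: every column the AI feature reads is named, typed, and nullability-checked before a batch is accepted.

```python
from typing import Any

def check_schema(records: list[dict[str, Any]], schema: dict[str, dict]) -> list[str]:
    """Return a list of schema-clause violations for a batch of records (sketch)."""
    python_types = {"string": str, "integer": int, "timestamp_ms": int}
    violations = []
    for i, record in enumerate(records):
        for column, spec in schema.items():
            if column not in record:
                violations.append(f"record {i}: missing column '{column}'")
                continue
            value = record[column]
            if value is None:
                if not spec["nullable"]:
                    violations.append(f"record {i}: '{column}' is null, contract says non-nullable")
                continue
            if not isinstance(value, python_types[spec["type"]]):
                violations.append(
                    f"record {i}: '{column}' is {type(value).__name__}, contract says {spec['type']}"
                )
    return violations

# Example: customer_segment arrives as an integer, exactly the silent change
# that breaks an embedding pipeline expecting a string.
schema = {
    "customer_id":      {"type": "string", "nullable": False},
    "customer_segment": {"type": "string", "nullable": False},
}
print(check_schema([{"customer_id": "c-42", "customer_segment": 3}], schema))
# -> ["record 0: 'customer_segment' is int, contract says string"]
```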

The freshness clause names the maximum age of the data the AI feature is allowed to see. "Updated at least every four hours" is a freshness SLA. The pipeline that violates it is a pipeline that has broken its contract, regardless of whether the records inside the file are still correct. The consumer is allowed — required — to refuse to serve context that exceeds the freshness threshold, and to surface the staleness to the user rather than silently answer over old data.
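A minimal consumer-side guard, assuming the index refresh job knows the timestamp of the last successful upstream run, might look like this:

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=4)  # the freshness SLA from the contract (illustrative)

def assert_fresh(last_successful_run: datetime, max_age: timedelta = MAX_AGE) -> None:
    """Refuse to rebuild or serve from an index whose source data is past the SLA.

    Raising here is the point: a stale index should fail loudly at refresh time,
    not answer quietly over day-old data. Expects a timezone-aware timestamp.
    """
    age = datetime.now(timezone.utc) - last_successful_run
    if age > max_age:
        raise RuntimeError(
            f"daily_customer_facts is {age} old, freshness SLA is {max_age}; "
            "refusing to rebuild the retrieval index from stale data"
        )
```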

The completeness clause names the percentage of expected records that must arrive. "At least 98 percent of source records must appear in the output, with the missing 2 percent attributable to documented filters" is a completeness SLA. A run that delivers only 95 percent has violated the contract even though it produced output. Silent truncations are the most common upstream-induced AI regression, and they are exactly what a completeness clause exists to catch.
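The check itself is small; what matters is that it fails loudly. A sketch, with the counts and threshold purely illustrative:

```python
def check_completeness(source_count: int, output_count: int,
                       min_fraction: float = 0.98) -> None:
    """Fail the run if too many source records are missing from the output.

    Silent truncation is the failure this catches: a run that "succeeds" while
    dropping 5 percent of records should be treated as a broken run.
    """
    fraction = output_count / source_count if source_count else 0.0
    if fraction < min_fraction:
        raise RuntimeError(
            f"completeness violation: {output_count}/{source_count} records "
            f"({fraction:.1%}) arrived, contract requires at least {min_fraction:.0%}"
        )

check_completeness(source_count=1_000_000, output_count=950_000)  # raises: 95.0% < 98%
```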

The change-policy clause names how the producer is allowed to evolve the contract. "Backwards-incompatible changes require 30 days notice and a contract version bump; minor additions can be made any time" is a change policy. Without it, every change is a coordination problem solved over Slack, and the coordination eventually fails because the AI team is not on the right channels.
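One way to make the change policy machine-checkable is to classify a proposed schema diff before it ships. The rules below are a sketch of the policy described above, not an exhaustive list:

```python
def classify_change(old_schema: dict, new_schema: dict) -> str:
    """Classify a proposed schema change as 'breaking', 'minor', or 'none' (sketch).

    Removed or renamed columns, type changes, and non-nullable -> nullable flips
    require the notice period and a major version bump; additive nullable
    columns are minor.
    """
    for column, old_spec in old_schema.items():
        new_spec = new_schema.get(column)
        if new_spec is None:
            return "breaking"                                  # column removed or renamed
        if new_spec["type"] != old_spec["type"]:
            return "breaking"                                  # type changed
        if not old_spec["nullable"] and new_spec["nullable"]:
            return "breaking"                                  # nullability loosened
    added = set(new_schema) - set(old_schema)
    if any(not new_schema[c]["nullable"] for c in added):
        return "breaking"                                      # new required column
    return "minor" if added else "none"
```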

Lineage, ownership, and the paired on-call

A contract is not enough on its own. A contract is a promise; lineage is the map of who has promised what to whom. A data lineage map traces every byte feeding the AI feature back to the upstream pipeline that produced it, and names the team that owns that pipeline as the responsible party. Without lineage, the AI team cannot even identify whose pipeline broke when their feature regresses; they have to forensically reconstruct the path from "bad output" to "bad column" to "bad job" to "bad team," and each step takes hours.
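A lineage map does not need to start as a product. Even a hand-maintained registry that names the producing job and the owning team for each dataset turns hours of forensics into a single lookup. The dataset, job, and team names below are illustrative:

```python
# A toy lineage registry: each dataset names the pipeline that produced it, the
# team that owns that pipeline, and the dataset it was derived from.
LINEAGE = {
    "support_agent_rag_index": {"producer": "embedding-refresh-job", "owner": "ai-platform",   "upstream": "support_docs_snapshot"},
    "support_docs_snapshot":   {"producer": "nightly-snapshot-job",  "owner": "data-platform", "upstream": "daily_customer_facts"},
    "daily_customer_facts":    {"producer": "legacy-etl-cron",       "owner": "data-platform", "upstream": None},
}

def trace(dataset: str) -> list[tuple[str, str, str]]:
    """Walk upstream from a dataset, returning (dataset, producing job, owning team)."""
    path = []
    while dataset is not None:
        node = LINEAGE[dataset]
        path.append((dataset, node["producer"], node["owner"]))
        dataset = node["upstream"]
    return path

for step in trace("support_agent_rag_index"):
    print(step)
```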

Lineage turns the AI feature into a registered, visible consumer. The upstream team can see, on a dashboard, that their daily_customer_facts table feeds three downstream consumers, one of which is the AI support agent. The upstream team is now socially and operationally aware that "harmless cleanups" are not harmless. The consumer is no longer invisible.

The next step is paired on-call. When the AI feature regresses and the root cause is upstream, the upstream team is paged. Not in addition to the AI team — instead of, when the root cause is on their side. The AI team's runbook for quality regressions begins with "check the upstream contract status" and routes the page to the upstream owner if a contract is violated. This sounds bureaucratic until it has happened twice and the upstream team has internalized that their cron job has a customer who pays in pages, not just in dashboards. The behavior change is fast.
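The routing rule itself is almost trivially small; the team names below stand in for whatever the paging system actually uses:

```python
def route_quality_regression_page(contract_violations: list[str]) -> str:
    """First step of the quality-regression runbook: check contract status,
    and page the upstream owner when the contract is broken (sketch)."""
    if contract_violations:
        return "page:data-platform-oncall"   # upstream broke the contract
    return "page:ai-platform-oncall"         # contract holds; look at model, prompt, retrieval

print(route_quality_regression_page(["completeness violation: 95.0% < 98%"]))
```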

A freshness-aware retrieval policy is the consumer-side complement. The retrieval layer knows the timestamp on every document it serves, knows the freshness SLA of the source, and refuses to serve context that has gone stale. It is allowed to surface staleness to the user — "the most recent data I have for this is from 36 hours ago; I can answer with that, but flag if you need fresher" — rather than silently degrade. This converts a silent failure into a loud one, which is the only kind of failure a team can act on.
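A sketch of that policy, assuming every retrieved document carries an ingested_at timestamp (the function name and structure are illustrative):

```python
from datetime import datetime, timedelta, timezone

FRESHNESS_SLA = timedelta(hours=4)

def serve_context(ranked_docs: list[dict], now: datetime | None = None) -> tuple[list[dict], str | None]:
    """Filter retrieved documents by age and return a staleness notice when needed.

    If everything in the candidate set is past the SLA, we still answer, but we
    attach a notice the feature can surface instead of degrading silently.
    """
    now = now or datetime.now(timezone.utc)
    if not ranked_docs:
        return [], None
    fresh = [d for d in ranked_docs if now - d["ingested_at"] <= FRESHNESS_SLA]
    if fresh:
        return fresh, None
    newest = max(ranked_docs, key=lambda d: d["ingested_at"])
    age_hours = (now - newest["ingested_at"]).total_seconds() / 3600
    notice = (f"The most recent data I have for this is from {age_hours:.0f} hours ago; "
              "I can answer with that, but flag if you need fresher.")
    return ranked_docs, notice
```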

The architectural realization

The AI feature is the product of two pipelines stitched together at a seam that almost no organization makes visible. The upstream pipeline is a 2010s artifact: built for analytics, owned by a data team, operated like infrastructure. The downstream feature is a 2020s artifact: built for inference, owned by an AI team, operated like a product. The seam between them is a 2025 problem that nobody owned in either decade, and the team that does not formalize it ships an AI feature whose reliability ceiling is set by the least-instrumented step in the upstream chain.

The discipline that has to land is not glamorous. It is data contracts with teeth, lineage maps that name owners, freshness signals that propagate through the retrieval layer, and on-call rotations that follow the data flow rather than the org chart. None of this is research. All of it is the unglamorous engineering that sits between a working AI demo and a working AI product. The teams that have shipped reliable AI features at scale are the teams that have a contract on the table when the upstream data team proposes a "minor cleanup," and a paged on-call when the upstream team forgets the contract exists.

The next time your AI feature regresses, do not start by tuning the prompt. Start by reading the lineage map and asking what changed upstream this week. The answer is almost always there. And the prompt you were about to spend three weeks rewriting was, almost certainly, doing exactly what it was supposed to do — over data that had quietly stopped being what your eval suite assumed it was.
