
Upstream Data Quality Is Your AI Agent's Real Bottleneck

Tian Pan · Software Engineer · 9 min read

A team spent three months tuning prompts for their knowledge agent. They tried GPT-4, then Claude, then a fine-tuned model. They rewrote the system prompt six times. They hired a prompt engineer. The agent kept hallucinating — confidently, fluently, and wrong. The actual problem turned out to be a Confluence export from 2023 sitting in the vector store alongside a Slack archive full of contradictory, casual half-opinions about the same topics. The model was doing exactly what it was supposed to do: synthesizing the information it was given. The information was garbage.

By most industry accounts, well over half of production AI project failures trace to data quality, context problems, or governance failures, not model limitations. Yet when agents misbehave, the first instinct is almost always to touch the prompt. The second instinct is to switch models. The third might be to add a reranker. The upstream database that feeds the whole pipeline rarely makes the troubleshooting list until months of work have been wasted.

Why the Model Gets the Blame

LLMs are probabilistic and their failures look like reasoning errors. When a customer support agent cites a pricing policy from two quarters ago, it feels like a hallucination. When a document assistant conflates two different products with similar names, it looks like a context window problem. When an extraction agent returns null for a required field 15% of the time, it reads like model inconsistency.

These diagnoses are usually wrong. The pricing policy failure happens because three versions of the same document exist in the vector store with no timestamp metadata. The conflation happens because two products were merged last year and the description fields were never reconciled. The null extraction happens because 15% of the source records have missing values that were never surfaced in any monitoring dashboard.

The model is synthesizing its inputs faithfully. The inputs are the problem.

This matters because the remediation paths are completely different. If the model is at fault, you tune the prompt, add examples, or upgrade. If the data is at fault, you need freshness validation, deduplication, schema enforcement, and completeness monitoring — none of which a better model will give you.

The Failure Modes That Look Like Model Errors

Null propagation. A source database has 15% null values for a product category field. The agent retrieves records, encounters the nulls, and either fabricates plausible-sounding category names or silently drops those records. The eval catches neither: the fabricated values pass format checks, and the dropped records don't surface as errors. You discover the problem when a downstream report shows an unexplained gap.

Duplicate records with conflicting state. An order management system has a customer appearing in three records — one from a legacy import, one from a self-service signup, and one created by a support agent. The agent tasked with summarizing customer history synthesizes all three, producing a contradictory profile. The model isn't confused; it was handed contradictory source material.

Stale documents in retrieval corpora. A RAG system indexes documentation. The source documents get updated, but the index refresh runs nightly and only picks up new files — not edits to existing ones. Support tickets start reflecting last year's features. Adding metadata recency filters doesn't help because the files don't have reliable last-modified fields in the index.

Schema drift between training and inference time. An agent was built against a database schema where status is an enum with four values. A migration added a fifth value and a backfill partially completed — meaning 40% of recent records have the new value and 60% have one of the old ones. The agent was never updated to understand the new value and silently misclassifies 40% of recent items.
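A cheap guard against this failure mode is to monitor the share of records carrying values the agent was never built to handle. A minimal sketch, assuming a status field and an illustrative enum (the value names here are hypothetical):

```python
# Enum values the agent was built against; illustrative names.
KNOWN_STATUSES = frozenset({"pending", "active", "suspended", "closed"})

def unknown_value_rate(records, field="status", known=KNOWN_STATUSES):
    """Share of records carrying a value the agent was never taught to handle.

    A nonzero rate after a migration is a signal that the agent's
    understanding of the schema has drifted from reality.
    """
    if not records:
        return 0.0
    unknown = sum(1 for r in records if r.get(field) not in known)
    return unknown / len(records)
```

Alerting when this rate moves off zero catches the partial-backfill scenario above before the agent silently misclassifies 40% of recent items.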

Inconsistent field definitions across sources. The word "revenue" appears in four different tables and means something slightly different in each — one is gross, one is net, one excludes refunds, one is collected vs. recognized. An analytics agent asked to report "revenue" picks whichever source ranks highest in semantic search. Which one that is depends on which recently indexed documents discussed revenue in a context similar to the query.

How 15% Bad Records Translate to Agent Behavior

The math here is unforgiving because agents compound errors across steps. A single-step extraction task with a 15% bad-data rate produces a 15% error rate. But a five-step agentic workflow where each step has independent exposure to those bad records can see its chance of touching a bad record climb past 55% — a number that surprises teams used to thinking in terms of individual model accuracy.
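The compounding is just the complement of every step staying clean, assuming independent exposure per step:

```python
def workflow_failure_rate(bad_record_rate: float, steps: int) -> float:
    """Probability that at least one step of a multi-step workflow
    touches a bad record, assuming independent exposure per step."""
    return 1 - (1 - bad_record_rate) ** steps

# A 15% bad-data rate stays at 15% for a single step...
one_step = workflow_failure_rate(0.15, 1)
# ...but across five steps the chance of touching bad data at
# least once climbs to roughly 56%.
five_steps = workflow_failure_rate(0.15, 5)
print(f"{one_step:.2f}, {five_steps:.2f}")  # 0.15, 0.56
```

Real workflows are not fully independent per step, so the true number sits somewhere between 15% and 56% — but the direction is always up.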

The failure mode also depends on the type of badness. Missing fields cause different downstream behavior than malformed values, which cause different behavior than contradictory values. Missing fields often produce graceful-but-wrong outputs — the agent fills the gap with a plausible answer. Malformed values may cause schema validation failures that are at least visible. Contradictory values are the hardest: the agent picks one version, usually without any signal that a conflict existed.

The visibility problem is compounded by the fact that LLMs rarely say "I don't know" or "this data is inconsistent." They produce fluent, confident outputs regardless of input quality. A human analyst seeing conflicting data would flag the conflict. An agent produces a synthesis and moves on.

What Actually Works: Four Patterns

Data contracts at the source

A data contract is a formal agreement about what a dataset will contain: required fields and their types, acceptable null rates per field, value ranges and enumerated constraints, maximum allowable staleness, and semantic definitions. Contracts are checked at ingestion time, before data enters any pipeline that feeds an LLM.

The key is treating null rates as first-class metrics rather than incidental properties. A field like customer_segment being 5% null is a business problem. Being 30% null is a data problem. Being 80% null means the field doesn't actually exist in a meaningful sense and downstream systems should stop treating it as a reliable signal. Contracts make these thresholds explicit and enforced.
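An ingestion-time contract check can be quite small. A minimal sketch with plain Python — the field names, thresholds, and enum values below are hypothetical, not from any specific system:

```python
# Illustrative contract: required fields, tolerated null rates,
# and enumerated value constraints, checked before data enters
# any pipeline that feeds an LLM.
CONTRACT = {
    "required": {"order_id", "status", "customer_segment"},
    "max_null_rate": {"customer_segment": 0.05},  # 5% nulls tolerated
    "allowed_values": {"status": {"open", "paid", "shipped", "refunded"}},
}

def check_contract(records, contract):
    """Return a list of human-readable contract violations (empty if clean)."""
    violations = []
    n = len(records)
    for field in contract["required"]:
        nulls = sum(1 for r in records if r.get(field) is None)
        limit = contract["max_null_rate"].get(field, 0.0)  # default: no nulls
        if n and nulls / n > limit:
            violations.append(
                f"{field}: null rate {nulls / n:.0%} exceeds budget {limit:.0%}"
            )
    for field, allowed in contract["allowed_values"].items():
        bad = {r[field] for r in records
               if r.get(field) is not None and r[field] not in allowed}
        if bad:
            violations.append(f"{field}: unexpected values {sorted(bad)}")
    return violations
```

The point is not the specific checks but where they run: at ingestion, so a batch that violates its null budget is rejected before an agent ever retrieves it.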

Freshness monitoring with circuit breakers

Stale data is especially dangerous for agents because it produces confident, period-specific wrong answers. An agent hallucinating a feature that doesn't exist is bad. An agent citing an actual feature that was deprecated 18 months ago is worse, because it sounds authoritative.

Freshness monitoring tracks when data assets were last updated and compares that against expected cadences. The useful addition for AI pipelines is a circuit breaker: when a data source exceeds its maximum staleness threshold, retrieval workflows that depend on it pause or reroute to a fallback rather than silently serving degraded data. This requires knowing which agent workflows depend on which data sources — which turns out to be a worthwhile exercise in itself.
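The breaker itself is simple once the source-to-workflow mapping exists. A sketch with hypothetical source names and staleness budgets:

```python
from datetime import datetime, timedelta, timezone

# Per-source staleness budgets; names and values are illustrative.
STALENESS_BUDGET = {
    "product_docs": timedelta(hours=24),
    "pricing_table": timedelta(hours=1),
}

def is_fresh(source, last_refreshed, now=None):
    """True if the source is within its staleness budget."""
    now = now or datetime.now(timezone.utc)
    return now - last_refreshed <= STALENESS_BUDGET[source]

def guard_retrieval(source, last_refreshed, now=None):
    """Circuit breaker: raise instead of silently serving stale data.

    Callers catch this to pause the workflow or reroute to a fallback.
    """
    if not is_fresh(source, last_refreshed, now):
        raise RuntimeError(f"{source} exceeded its staleness budget; retrieval paused")
```

Raising loudly is the design choice that matters: a stale source should fail a request visibly, not degrade it invisibly.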

Deduplication as a continuous process

Most deduplication happens once, at migration time, and then stops. Records continue to accumulate duplicates from imports, integrations, and user-created entries. For agents, duplicates are especially problematic because semantic search returns all instances of a concept, and the agent must synthesize across them. If those instances conflict, the synthesis is arbitrary.

Treating deduplication as a continuous monitoring concern — tracking entity collision rates, running periodic dedup checks against new records, alerting when the duplicate rate exceeds a threshold — prevents the index from drifting into an incoherent state over months of production use.
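The monitoring half of this can be a scheduled job rather than a full entity-resolution system. A minimal sketch that tracks collision rate on a canonical key — the key choice and threshold here are illustrative:

```python
from collections import Counter

def duplicate_rate(records, key="email"):
    """Fraction of records whose canonical key collides with another record.

    Canonicalization here is deliberately crude (strip + lowercase);
    real systems normalize more aggressively.
    """
    counts = Counter((r.get(key) or "").strip().lower() for r in records)
    colliding = sum(c for c in counts.values() if c > 1)
    return colliding / len(records) if records else 0.0

# Illustrative threshold: alert when more than 2% of records collide.
DUPLICATE_ALERT_THRESHOLD = 0.02

def should_alert(records, key="email"):
    return duplicate_rate(records, key) > DUPLICATE_ALERT_THRESHOLD
```

Run nightly against new records, this catches the drift toward incoherence while it is still a handful of collisions rather than a corrupted index.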

Root-cause triage that starts with data

When an agent produces bad output, the default debugging flow in most teams goes: look at the prompt, look at the retrieved context, look at the model output. The upstream data sources are inspected last, if at all.

Reversing this order is faster in practice. The first diagnostic question should be: is the answer that would be correct for this query actually present and accurate in the source data? If the source data is wrong, the agent is doing its job correctly given its inputs, and no amount of prompt work will fix the underlying problem.

This sounds obvious but requires instrumentation: the ability to trace a specific agent output back to the specific records that contributed to it, and then inspect those records directly. Without that traceability, teams are guessing.

The Root-Cause Test

A simple heuristic: take 20 cases where the agent produced wrong output. For each, pull the source records that were used. Ask whether a human analyst with those exact records would have produced the correct answer, or whether the records themselves were insufficient, wrong, or contradictory.

If the human would also get it wrong, the problem is upstream data quality. If the human would get it right but the agent got it wrong, the problem is somewhere in the retrieval or generation pipeline.
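The bookkeeping for this test is trivial, which is part of its appeal. A sketch, assuming each failing case has been hand-labeled into the two buckets above:

```python
def summarize_triage(labels):
    """Tally root-cause labels assigned by a human reviewer who inspected
    the exact source records behind each failing agent output.

    Labels are "data" (records insufficient, wrong, or contradictory)
    or "pipeline" (records were fine; retrieval/generation failed).
    """
    data = sum(1 for label in labels if label == "data")
    return {
        "data": data,
        "pipeline": len(labels) - data,
        "data_share": data / len(labels) if labels else 0.0,
    }
```

If data_share comes back above 0.5 — which, per the pattern below, it usually does — prompt iteration is not where the next week should go.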

In practice, teams that run this test find that the first category — data problems — dominates. The exact number varies by domain and data maturity, but it's rarely less than half of failures and often significantly more.

This test is worth running before any prompt iteration cycle. It prevents the pattern where months of prompt engineering produce marginal gains because the actual constraint is a database that needs cleaning, not a model that needs better instructions.

Instrumentation to Add Now

If you're operating AI agents in production and don't have these in place, they're worth adding before any other work:

  • Per-field null rate tracking for all fields used as agent inputs. Alert when null rates exceed defined thresholds.
  • Document freshness metadata in all retrieval corpora. Index creation date, source last-modified date, and a staleness flag that updates automatically.
  • Entity collision monitoring — a count of distinct entities that resolve to the same canonical identifier. Alert when this grows faster than the underlying entity count.
  • Source record traceability — the ability to map an agent output back to the specific source records that contributed to it. This is necessary for any meaningful failure triage.
  • Freshness circuit breakers — automatic workflow pauses when a data dependency exceeds its maximum staleness threshold.
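Of these, source record traceability is the one teams most often lack, and the simplest version is just an append-only log written at answer time. A sketch with hypothetical field names:

```python
import time

def log_agent_output(trace_log, answer, source_record_ids):
    """Append one trace entry mapping an agent answer to the source
    records that contributed to it. trace_log stands in for whatever
    durable store you actually use."""
    trace_log.append({
        "ts": time.time(),
        "answer": answer,
        "source_records": list(source_record_ids),
    })

def records_behind(trace_log, index):
    """Pull the exact source record IDs to inspect during failure triage."""
    return trace_log[index]["source_records"]
```

With this in place, the root-cause test above goes from an afternoon of archaeology per case to a single lookup.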

None of these are glamorous. They don't appear in benchmark scores or model cards. But in production AI systems, they're often the difference between an agent that's reliable enough to expand and one that's perpetually one support ticket away from being shut down.

The underlying model matters. Prompt quality matters. Retrieval architecture matters. But if 15% of your source records have malformed or missing fields, the ceiling on everything built on top of that data is 85% — regardless of which model you use or how well you've tuned the prompt.
