
Knowledge Graph Staleness Has a Different SLA Than Vector Staleness

10 min read
Tian Pan
Software Engineer

The vector index is wrong by approximately ten percent and nobody panics. The knowledge graph is wrong by one missing edge and somebody ships a wrong answer to a regulator. The two failure modes look identical from the data engineering org chart — both are "the index is stale" — and they sit behind the same change-data-capture pipeline with the same lag tolerance. The pipeline was sized for the vector workload because that was the louder consumer. The graph silently inherited those defaults, and the silence is the bug.

Vector retrieval and graph retrieval fail differently under staleness, and treating them as the same kind of lag problem is how you end up with a system that scores well on RAG benchmarks and is silently wrong on multi-hop queries — the silently-wrong case being, of course, the one users notice last. The fix is not faster pipelines. The fix is recognizing that "stale" means two different things, designing freshness tiers per edge class, and building the eval that catches the difference before a regulator does.

Vectors Degrade Continuously, Graphs Degrade Discontinuously

A vector index is a soft index. An embedding from yesterday and an embedding from today usually point in nearly the same direction in the latent space. If the underlying document was lightly edited, the new embedding's cosine similarity to the query drifts by a few percentage points. The retrieved document is still mostly right. The answer the model produces is still mostly right. The user notices nothing.

This is what people mean when they say vectors degrade gracefully. The failure surface is smooth: a small change in input produces a small change in retrieval rank, which produces a small change in answer quality. You can be a day behind, sometimes a week behind, and the system stays in the "approximately correct" basin of attraction. That is why teams happily run weekly re-embedding cycles on million-document corpora and call it a freshness strategy.

A knowledge graph is a hard index. A node either exists or doesn't. An edge either connects two nodes or doesn't. A multi-hop query that traverses Customer → Subscription → Plan → Region either finds a path or returns nothing. There is no "approximately right" path. The failure surface is a step function — one missing edge flips the answer from "yes" to "no," from "the customer is in EU jurisdiction" to "no records found." A deleted node makes every query that traversed it silently return an empty result, which the LLM then describes confidently as "no such relationship exists."
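
To make the step function concrete, here is a minimal sketch in Python. The node IDs and relation names are hypothetical, and the dictionary stands in for a real graph store:

```python
# A toy adjacency map standing in for a graph store.
# Node IDs and relation names are hypothetical.
edges = {
    ("cust:42", "SUBSCRIBES_TO"): "sub:7",
    ("sub:7", "ON_PLAN"): "plan:enterprise",
    ("plan:enterprise", "IN_REGION"): "region:eu",
}

def hop(node, relation):
    return edges.get((node, relation))

def region_of(customer):
    sub = hop(customer, "SUBSCRIBES_TO")
    plan = hop(sub, "ON_PLAN") if sub else None
    return hop(plan, "IN_REGION") if plan else None

print(region_of("cust:42"))  # region:eu

# One edge missing from the snapshot -- say, the CDC batch hasn't
# landed yet -- and the answer flips from a path to nothing at all.
del edges[("sub:7", "ON_PLAN")]
print(region_of("cust:42"))  # None: not "slightly worse", just absent
```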

The asymmetry compounds because LLMs treat empty graph results the way they treat any empty retrieval — as ground truth. The vector path will at least surface a stale document for the model to reason about. The graph path returns nothing, the model assumes nothing exists, and the user gets a fluent paragraph asserting the negative. Stale vectors yield approximately right answers. Stale graphs yield confidently wrong answers. These are not the same SLA.

The Pipeline Inherits the Wrong Default

Most production architectures evolved by stacking. Vector retrieval shipped first because it was easier — chunk, embed, upsert, query. The change-data-capture pipeline was built around its needs: tolerable lag of hours to a day, batched re-embedding, eventual consistency. Then somebody added a graph layer for multi-hop queries because vectors couldn't follow relationships, and the new layer plugged into the existing CDC stream. Same Kafka topic, same lag SLA, same monitoring.

This works fine until you ship a feature that depends on the graph being current. The marketing team launches a new pricing tier. The pricing edge takes seven hours to land in the graph because that's when the next batch runs. For seven hours, every query asking "what plans are available for enterprise customers in Germany" returns the old set. The vector path also returns the old marketing PDF, but the model can hedge — "based on the available documents, the plans are X, but pricing may have updated recently." The graph path returns a deterministic list and the model presents it as fact.

The org failure mode here is structural. The data engineering team owns "the pipeline" as a unit. The AI team owns "retrieval" as a unit. Neither team has a name for the seam between vector freshness and graph freshness, so neither team owns it. When the marketing pricing case finally breaks, the postmortem blames "stale data" and proposes shortening the lag for everyone — which is expensive and still wrong, because not every edge needs sub-minute freshness.

Freshness Tiers, by Edge Class

The first move is to stop treating the graph as monolithically stale. Edges have different criticality, and the right SLA depends on what the edge represents (a minimal classification sketch follows the list):

  • Existence edges — does this entity exist, is this customer subscribed, does this employee have access. These flip answers from "yes" to "no" with no graceful degradation. Target SLA: seconds to minutes. These are the edges where bitemporal models earn their keep — record both the valid time (when the fact became true in the world) and the transaction time (when the system learned about it), so a query at any point can ask "what did we believe was true, when?"
  • Descriptive edges — what is this entity's name, address, current pricing, current owner. Stale values produce wrong-but-recoverable answers. The user will notice "the contact is John" when it should be "Jane," but they'll notice it as a correctable error, not a silent failure. Target SLA: minutes to hours.
  • Derived analytics edges — aggregate counts, computed scores, materialized rollups. Stale values produce slightly-off numbers in dashboards and explanatory paragraphs. Target SLA: hours to a day.
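
A minimal version of that classification, assuming a registry keyed by edge type. The edge names and the SLA numbers are illustrative, not a standard:

```python
from enum import Enum

class EdgeClass(Enum):
    EXISTENCE = "existence"      # flips answers from yes to no
    DESCRIPTIVE = "descriptive"  # wrong-but-recoverable values
    DERIVED = "derived"          # slightly-off aggregates

# Illustrative lag budgets in seconds -- tune per business risk.
SLA_SECONDS = {
    EdgeClass.EXISTENCE: 60,
    EdgeClass.DESCRIPTIVE: 3_600,
    EdgeClass.DERIVED: 86_400,
}

# Hypothetical edge-type registry; in practice this lives next to
# the graph schema and has a named owner.
EDGE_REGISTRY = {
    "SUBSCRIBES_TO": EdgeClass.EXISTENCE,
    "HAS_ACCESS": EdgeClass.EXISTENCE,
    "PRICED_AT": EdgeClass.DESCRIPTIVE,
    "CONTACT_IS": EdgeClass.DESCRIPTIVE,
    "MONTHLY_ACTIVE": EdgeClass.DERIVED,
}

def freshness_budget(edge_type: str) -> int:
    """Maximum tolerated CDC lag for a mutation on this edge type."""
    return SLA_SECONDS[EDGE_REGISTRY[edge_type]]
```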

The architectural implication is that the CDC pipeline needs at least two lanes. The fast lane carries existence-edge mutations on a tight SLA, ideally near-real-time stream processing rather than batched ETL. The slow lane carries descriptive and analytics edges, batched aggressively for cost. Vector reindexing usually fits in the slow lane — embedding mutations rarely flip an answer from yes to no.
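
The routing decision itself is small; the hard part is owning the classification. A sketch, with hypothetical edge types and topic names:

```python
# Edge types classified as existence-class (see the registry above).
EXISTENCE_EDGE_TYPES = {"SUBSCRIBES_TO", "HAS_ACCESS"}

def route(mutation: dict) -> str:
    """Pick a CDC lane by edge class; topic names are hypothetical."""
    if mutation["edge_type"] in EXISTENCE_EDGE_TYPES:
        return "graph-mutations-fast"  # stream-processed, seconds of lag
    return "graph-mutations-slow"      # batched hourly or daily for cost

assert route({"edge_type": "SUBSCRIBES_TO"}) == "graph-mutations-fast"
assert route({"edge_type": "MONTHLY_ACTIVE"}) == "graph-mutations-slow"
```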

The reason most teams resist this is that it triples the operational complexity of the pipeline. They are right that it does. The trade is pipeline complexity for measurable correctness on the queries that matter most. You only commit to the dual-lane architecture when staleness becomes a measurable business problem — wrong access decisions, wrong jurisdictional routing, wrong regulatory disclosures. Until then, you can fake it with a single pipeline and a tighter global SLA, but you'll pay for that in re-embedding costs that scale with the largest blob in the corpus.

The Eval That Tells You the Graph Is Lying

Most RAG eval suites grade on the answer string. They ask: did the model return text that semantically matches the gold answer? This works for vectors because the failure mode is a shifted answer, not an absent one. It does not work for graphs, because the failure mode is "no result, model confabulated something plausible from context" — and a confabulation that sounds plausible can still semantically match the gold answer often enough to clear the eval bar.

The eval discipline for graph staleness is structural, not lexical. It needs three pieces:

First, a benchmark that simulates controlled missing edges. Recent community benchmarks for topological incompleteness construct queries whose gold reasoning path crosses edges deliberately absent from the snapshot under test, then measure whether the system flags uncertainty rather than producing an answer. These TI-style benchmarks force the question: when the graph is incomplete, does the model say "I don't have enough information," or does it produce a fluent wrong answer?
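
The ablation step is the only mechanical part; a sketch, assuming you can enumerate the snapshot's edges:

```python
import random

def ablate(edges: set, fraction: float, seed: int = 0) -> set:
    """Remove a controlled fraction of edges to simulate a topologically
    incomplete snapshot. Deterministic under `seed` so runs are comparable."""
    rng = random.Random(seed)
    ordered = sorted(edges)
    rng.shuffle(ordered)
    cut = int(len(ordered) * fraction)
    return set(ordered[cut:])

# For every query whose gold reasoning path crosses an ablated edge,
# score whether the system flags insufficient coverage. A fluent
# answer on a known-broken path is exactly the failure being measured.
```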

Second, snapshot-based regression tests. Take a known-good graph state, take a deliberately-stale state (e.g., 24 hours behind real CDC), run the same multi-hop query against both, and assert the system either returns the same answer or surfaces the staleness. If both snapshots produce the same fluent answer, your system has no staleness signal at all — it is open-loop.
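
A minimal shape for that assertion, with `answer_fn` standing in for your retrieval-plus-generation pipeline and `flags_staleness` for whatever staleness signal it exposes (both names are assumptions):

```python
from dataclasses import dataclass

@dataclass
class Answer:
    text: str
    flags_staleness: bool = False  # your system's staleness signal, if any

def staleness_is_surfaced(answer_fn, fresh_graph, stale_graph, query) -> bool:
    """Passes only if the stale snapshot either agrees with the fresh
    one or explicitly flags its own staleness."""
    fresh = answer_fn(fresh_graph, query)
    stale = answer_fn(stale_graph, query)
    return stale.text == fresh.text or stale.flags_staleness
```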

Third, refusal calibration. The metric you want is the rate at which the model refuses or hedges when the graph is missing the relevant subgraph, conditioned on whether a human would call the query unanswerable. This is the "unanswerable, uncheatable, multi-hop" eval class — the harness has to reward saying "I don't know" when the graph genuinely doesn't know. Most off-the-shelf RAG evals do not score refusal at all, which is why graph-staleness regressions ship undetected.
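
Computing the conditioned rates is a few lines; the record format here is an assumption:

```python
def refusal_calibration(records):
    """records: iterable of (model_refused, human_says_unanswerable) bools.
    Returns refusal rates conditioned on ground-truth answerability."""
    records = list(records)  # allow generators; we iterate twice
    unans = [refused for refused, unanswerable in records if unanswerable]
    ans = [refused for refused, unanswerable in records if not unanswerable]
    return {
        # want near 1.0: refuse when the graph genuinely doesn't know
        "refusal_given_unanswerable": sum(unans) / max(len(unans), 1),
        # want near 0.0: don't refuse queries the graph can answer
        "refusal_given_answerable": sum(ans) / max(len(ans), 1),
    }
```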

The tooling for this is still rough. Bitemporal graph stores, including the temporal-knowledge-graph approaches emerging in the agent-memory community, are getting closer to native support — they record valid time and transaction time on every edge, which means a query at any historical point is reproducible and a stale snapshot is queryable as a first-class object. Without that, you end up rebuilding snapshots from CDC logs, and the eval cost dominates.
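
The core of the bitemporal idea fits in a few lines. This is a sketch of the data model, not any particular store's API:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class BitemporalEdge:
    src: str
    rel: str
    dst: str
    valid_from: datetime   # when the fact became true in the world
    recorded_at: datetime  # when the system learned about it

def believed_edges(edges, valid_at: datetime, known_by: datetime):
    """'What did we believe was true at `valid_at`, as of `known_by`?'
    A 24h-stale snapshot is just this query with an earlier `known_by`."""
    return [
        e for e in edges
        if e.valid_from <= valid_at and e.recorded_at <= known_by
    ]
```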

How the Org Drifts Into the Failure

A useful way to predict whether your team will ship the regulator-facing wrong answer is to look at three structural questions:

  • Who owns the freshness-tier decision per edge class? If the answer is "the pipeline team" or "the AI team," nobody owns it — the pipeline team thinks of edges as rows, and the AI team thinks of retrieval as a black box. The owner has to be someone who understands which queries are existence-edge-conditional and routes the SLA accordingly.
  • Does the eval suite distinguish "model hallucinated despite correct retrieval" from "model confabulated because retrieval returned nothing"? If the suite scores the answer string and not the retrieval state, both failure modes are summed into one accuracy number, and the graph regression hides under the noise floor.
  • Is there a forced refusal class in the eval? If every test expects a positive answer, the model that always answers wins on the eval and loses on the unanswerable production traffic. Refusal coverage is the canary.

When all three are unowned, the system enters the failure state quietly. The eval scores stay flat. Customer complaints arrive in a long tail — one wrong answer per day, hard to attribute, easy to dismiss as "model hallucination." The pattern only becomes visible when somebody traces a specific wrong answer back to a specific missing edge with a specific lag, and by then the system has been wrong for a quarter.

Treat the Graph as a Different System

The architectural realization to internalize is that a knowledge graph is not "another retrieval index." It is a system with different failure semantics, different freshness requirements, and a different relationship to model uncertainty. The vector store is a soft cache of meaning. The graph is a hard claim about reality. Pipelines and SLAs that treat them the same are running one of them at the wrong tolerance — usually the graph, because vector workloads dominated the original design and nobody renegotiated when the graph showed up.

The practical version of this discipline: classify your edges, split your CDC lanes by class, build a refusal-aware eval that grades structural correctness rather than lexical match, and put one named owner on the question of "what is the freshness tolerance of this edge type." That last one is the cheapest of the four and the one most teams skip. The cost of skipping it is paid in the answers nobody noticed were wrong.
