
GraphRAG in Production: When Vector Search Fails at Multi-Hop Reasoning

· 9 min read
Tian Pan
Software Engineer

Your RAG pipeline returns confident, well-formatted answers. The embeddings are tuned, the chunk size is optimized, and retrieval scores look great. Then a user asks "Which suppliers affected by the port strike also have contracts expiring this quarter?" and the system returns irrelevant fragments about port logistics and contract management — separately, never connecting them. This is the multi-hop reasoning gap, and it's where vector search quietly fails.

The failure isn't a tuning problem — it's architectural. Vector similarity finds documents that look like the query but cannot traverse relationships between entities scattered across different documents. GraphRAG — retrieval-augmented generation backed by knowledge graphs — addresses this by making entity relationships first-class retrieval objects. But shipping it to production is harder than the demos suggest.

Why Embeddings Hit a Ceiling

Vector search works by converting text into high-dimensional points and finding the nearest neighbors. This is powerful for semantic similarity: "How do I reset my password?" matches "Steps to change your login credentials" even though no words overlap. For single-hop factual retrieval, it's often sufficient.

The breakdown happens when answering a question requires connecting facts from multiple documents. Consider a compliance question: "Which of our European vendors have failed security audits in the past year and are also processing PII?" Answering this requires:

  1. Identifying entities tagged as European vendors
  2. Linking those to security audit results
  3. Cross-referencing against data processing agreements

No single chunk contains this answer. The relevant information lives across vendor profiles, audit reports, and contracts. Vector similarity retrieves chunks that individually discuss European vendors, security audits, or PII processing — but it has no mechanism to intersect these sets.
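The join the question requires is trivial to write down once entities are explicit — the problem is that chunk-level similarity never performs it. A toy sketch with made-up vendor sets (all names hypothetical):

```python
# Each facet of the compliance question retrieves its own chunks, but
# the answer is an *intersection of entities*, which top-k similarity
# over chunks never computes. Vendor sets below are illustration data.
european_vendors = {"VendorA", "VendorB", "VendorC"}
failed_audits = {"VendorB", "VendorC", "VendorD"}
pii_processors = {"VendorC", "VendorE"}

# What the question actually asks for: an entity-level join.
answer = european_vendors & failed_audits & pii_processors
print(answer)  # {'VendorC'}
```

Vector retrieval returns chunks about each facet separately; nothing in the ranking function expresses the `&` above.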

Benchmarks confirm the gap. On the MultiHop-RAG dataset, GraphRAG achieves 71.17% accuracy versus standard RAG's 65.77%. On HotpotQA, graph-based retrieval improves precision by up to 35% over vector-only approaches. The more hops a question requires, the wider the gap becomes.

How GraphRAG Actually Works

GraphRAG adds a knowledge graph layer between your documents and the LLM. The pipeline has three phases: extraction, indexing, and retrieval.

Extraction is the expensive part. An LLM reads each document chunk and extracts entities (people, organizations, concepts, products) and the relationships between them. "Acme Corp signed a three-year contract with Globex for cloud infrastructure" produces entities (Acme Corp, Globex, cloud infrastructure) and relations (signed_contract, contract_duration: 3 years). This runs at indexing time, not query time — but it means every document passes through an LLM, not just an embedding model.

Indexing builds the graph structure. Extracted entities are deduplicated (entity resolution), relationships are normalized, and the graph is stored in a database that supports traversal queries. Some implementations — notably Microsoft's GraphRAG — go further, detecting community structures within the graph and generating summaries for clusters of closely related entities.

Retrieval is where the payoff happens. When a query arrives, the system extracts entities from the question, locates them in the graph, and traverses relationships to gather connected context. Instead of returning the top-k most similar chunks, it returns a subgraph: the entities relevant to the query, their relationships, and the source documents that established those relationships.

This means the retrieval for "Which suppliers affected by the port strike also have contracts expiring this quarter?" can follow edges from port-strike-affected suppliers to their contract entities, filter by expiration date, and return a coherent answer that no single document chunk could provide.
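That traversal can be sketched with a plain adjacency-list graph — entity names, relation types, and dates below are all invented for illustration:

```python
from datetime import date

# Toy subgraph: edges carry a relation type; contracts carry expirations.
edges = {
    "port_strike": [("affects", "SupplierA"), ("affects", "SupplierB")],
    "SupplierA": [("has_contract", "contract_1")],
    "SupplierB": [("has_contract", "contract_2")],
    "SupplierC": [("has_contract", "contract_3")],
}
contract_expiry = {
    "contract_1": date(2025, 3, 15),
    "contract_2": date(2025, 11, 30),
    "contract_3": date(2025, 2, 1),
}

def expiring_suppliers_hit_by(event: str, q_start: date, q_end: date) -> set:
    """Two-hop traversal: event -> affected suppliers -> contracts,
    filtered by expiration window."""
    hit = set()
    for rel, supplier in edges.get(event, []):
        if rel != "affects":
            continue
        for rel2, contract in edges.get(supplier, []):
            if rel2 == "has_contract" and q_start <= contract_expiry[contract] <= q_end:
                hit.add(supplier)
    return hit

result = expiring_suppliers_hit_by("port_strike", date(2025, 1, 1), date(2025, 3, 31))
# SupplierA qualifies; SupplierB's contract expires outside the quarter,
# and SupplierC was never linked to the strike.
```

A real system would run the same logic as a graph-database query (e.g. Cypher or Gremlin), but the shape of the retrieval — follow edges, then filter by attributes — is exactly this.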

The Entity Resolution Problem Nobody Warns You About

The hardest part of building a production GraphRAG system isn't choosing a graph database or writing traversal queries. It's entity resolution — ensuring that "IBM," "International Business Machines," "IBM Corp," and "Big Blue" all point to the same node.

LLM-based entity extraction is inconsistent. The same entity surfaces with different names, abbreviations, and descriptions across documents. Without aggressive deduplication, your graph fills with near-duplicate nodes that fragment relationships. A question about IBM's contracts might miss half the relevant data because some contracts are filed under "International Business Machines."

Production systems address this with multiple strategies layered together:

  • Hash-based deduplication on normalized entity names catches exact matches
  • Embedding similarity between entity descriptions catches semantic duplicates
  • LLM-based merging for ambiguous cases, where the model decides whether two entity descriptions refer to the same real-world object
  • Domain-specific rules like canonical name mappings maintained by data teams

Each layer adds latency and cost to the indexing pipeline. The LLM-based merging step alone can double your extraction costs. Teams that skip this step invariably regret it — a knowledge graph with fragmented entities is worse than no knowledge graph, because it creates a false sense of completeness.
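The first two layers can be sketched in a few lines. This is a minimal stand-in: string similarity substitutes for the embedding comparison a production system would use, and the alias table plays the role of the data team's canonical mappings:

```python
import re
from difflib import SequenceMatcher

# Layer 4 stand-in: domain-maintained canonical name mappings.
ALIASES = {"big blue": "ibm", "international business machines": "ibm"}

def normalize(name: str) -> str:
    """Layer 1: lowercase, strip punctuation and corporate suffixes,
    then apply canonical aliases. Catches exact-match duplicates."""
    n = re.sub(r"[^\w\s]", "", name.lower()).strip()
    n = re.sub(r"\b(corp|corporation|inc|ltd|llc)\b", "", n).strip()
    return ALIASES.get(n, n)

def same_entity(a: str, b: str, threshold: float = 0.85) -> bool:
    """Layer 2 stand-in: string similarity where production systems
    would compare description embeddings, falling back to an LLM
    judgment for ambiguous pairs."""
    na, nb = normalize(a), normalize(b)
    if na == nb:
        return True
    return SequenceMatcher(None, na, nb).ratio() >= threshold
```

With this, "IBM Corp", "International Business Machines", and "Big Blue" all resolve to one node; the threshold and suffix list are knobs each domain has to tune.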

The Cost Math That Changes Your Architecture Decision

GraphRAG's construction cost is the elephant in the room. Systematic benchmarks show the numbers clearly:

  • Index construction time: GraphRAG takes 5,500–7,700 seconds versus 135 seconds for standard RAG on equivalent corpora. That's 40–57x slower.
  • Per-query latency: Knowledge-graph-based GraphRAG averages 14,434ms per query due to LLM-based entity expansion, versus 1,724ms for vector RAG. Community-based GraphRAG fares better at 1,249ms.
  • Token consumption: A complex retrieval that costs ~100 tokens in standard RAG can require 610,000+ tokens in naive GraphRAG implementations.

These numbers make vanilla GraphRAG impractical for most production workloads. This is why LightRAG — which eliminates expensive community clustering in favor of dual-level retrieval with lightweight graph structures — has gained traction. LightRAG achieves comparable accuracy with 99% fewer tokens, processing documents for ~$0.15 that would cost $4–7 with full GraphRAG. It won best paper at EMNLP 2025 for a reason.
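The token figures translate directly into per-query dollars. A back-of-envelope estimator using the benchmark numbers in this post — the per-million-token price is an assumption; substitute your provider's actual rate:

```python
# Assumed blended price per million tokens, USD. Not any provider's
# real rate; plug in your own.
PRICE_PER_M_TOKENS = 3.00

def query_cost(tokens: int) -> float:
    """Dollar cost of one retrieval at the assumed token price."""
    return tokens / 1_000_000 * PRICE_PER_M_TOKENS

vector_rag  = query_cost(3_631)    # avg retrieved tokens, vector-only
hybrid      = query_cost(13_401)   # hybrid vector + graph fusion
naive_graph = query_cost(610_000)  # worst-case naive GraphRAG retrieval
```

At these rates a naive GraphRAG retrieval costs two orders of magnitude more than vector-only — per query, before you pay anything for generation.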

The practical decision framework:

  • Use vector RAG when queries are single-hop, your documents are self-contained, and latency matters more than reasoning depth
  • Use GraphRAG when queries require connecting entities across documents, your domain has rich entity relationships, and you can afford the indexing cost
  • Use LightRAG or hybrid approaches when you need multi-hop reasoning but can't justify full GraphRAG's compute budget — which is most teams

Building the Query Router

Production GraphRAG systems rarely run graph retrieval on every query. Most questions don't need it, and the latency penalty makes it wasteful. The solution is a query router that classifies incoming questions and selects the retrieval strategy.

A practical routing approach uses three signals:

Entity density: Parse the query for named entities. Queries with multiple distinct entities ("Show me all Project Alpha deliverables reviewed by the security team") are graph candidates. Single-entity or conceptual queries ("What is our refund policy?") go to vector search.

Hop detection: Look for relational language — "connected to," "related to," "also," "which of X are Y" — that implies traversal. Questions asking for intersections of properties across entity types almost always require graph retrieval.

Fallback with escalation: Start with vector retrieval. If the answer confidence is low or the response fails coherence checks, re-route to graph retrieval. This adds latency for complex queries but avoids the graph overhead for simple ones.
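The first two signals compose into a small heuristic. A sketch, assuming a named-entity list is supplied by an upstream NER pass (the cue list and thresholds are illustrative, and the confidence-based escalation is left out):

```python
import re

# Relational language that implies traversal. Illustrative, not exhaustive.
HOP_CUES = re.compile(r"\b(connected to|related to|also|which of|both)\b", re.I)

def route(query: str, entities: list[str]) -> str:
    """Heuristic router: entity density plus relational cues.
    Returns 'graph' or 'vector'."""
    distinct = len(set(entities))
    if distinct >= 2 and HOP_CUES.search(query):
        return "graph"      # multiple entities + traversal language
    if distinct >= 3:
        return "graph"      # dense queries usually need a join
    return "vector"

route("Which suppliers affected by the port strike also have "
      "contracts expiring this quarter?",
      ["port strike", "suppliers", "contracts"])   # -> "graph"
route("What is our refund policy?", [])            # -> "vector"
```

The point is cheapness: this classification costs microseconds, so it can sit in front of every query without adding meaningful latency.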

The hybrid integration strategy — running both vector and graph retrieval and merging results with reciprocal rank fusion — shows a 6.4% accuracy improvement on multi-hop benchmarks. But it doubles token consumption (13,401 average retrieved tokens versus 3,631 for vector-only). Reserve it for use cases where accuracy justifies the cost.
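Reciprocal rank fusion itself is a few lines — each result's score is the sum of 1/(k + rank) across the rankings it appears in, with k = 60 from the original RRF paper. Chunk IDs below are placeholders:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked result lists: score(d) = sum over lists of 1/(k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["chunk_a", "chunk_b", "chunk_c"]
graph_hits  = ["chunk_b", "chunk_c", "chunk_d"]
fused = reciprocal_rank_fusion([vector_hits, graph_hits])
# chunk_b ranks first: it appears high in both lists.
```

Because RRF only uses ranks, it merges vector scores and graph traversal results without having to put their incomparable relevance scores on a common scale — which is why it is the default fusion choice here.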

Keeping the Graph Alive

A knowledge graph that isn't updated is a liability. Unlike vector indexes where you can incrementally add new embeddings, graph updates create cascading complexity. A new document might introduce an entity that should merge with an existing node, or update a relationship that invalidates downstream community summaries.

Production maintenance requires:

  • Incremental extraction: Process only new or modified documents, not the full corpus. LightRAG's architecture supports this natively with ~50% faster updates than full reindexing.
  • Conflict resolution: When new information contradicts existing graph relationships (a vendor's status changes from "active" to "suspended"), the system needs rules for which source wins.
  • Staleness detection: Relationships have temporal validity. A "current supplier" relationship from two years ago may no longer hold. Graph edges need timestamps and expiration logic.
  • Monitoring coverage: Track the percentage of query entities that exist in the graph. Research shows only 65.8% of answer entities typically exist in constructed graphs — meaning roughly a third of the facts queries need are simply absent, so those retrievals hit dead ends. Monitor this metric and use it to prioritize extraction from under-indexed document collections.
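The staleness and coverage items reduce to small, testable primitives. A sketch — the TTLs per relation type are assumptions a real deployment would set per domain:

```python
from datetime import datetime, timedelta

# Assumed temporal validity per relation type; tune per domain.
TTL = {
    "current_supplier": timedelta(days=365),
    "signed_contract": timedelta(days=5 * 365),
}

def is_stale(relation_type: str, observed_at: datetime, now: datetime) -> bool:
    """Edge staleness: past its type's TTL, the edge should be
    re-verified or expired rather than trusted."""
    ttl = TTL.get(relation_type)
    return ttl is not None and now - observed_at > ttl

def entity_coverage(query_entities: set, graph_entities: set) -> float:
    """Fraction of query entities present in the graph -- the
    dead-end metric to monitor."""
    if not query_entities:
        return 1.0
    return len(query_entities & graph_entities) / len(query_entities)
```

Logging `entity_coverage` per query gives the under-indexed collections a priority order almost for free.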

When to Start (and When to Wait)

The Gradient Flow analysis of production GraphRAG deployments delivers a sobering assessment: "We barely know of any examples of production deployments that are offering real business value." Most implementations remain experimental.

This doesn't mean GraphRAG is vaporware. It means the technology is at the stage where it rewards careful, incremental adoption:

  1. Get vector RAG working well first. If your basic retrieval pipeline has poor chunk quality, bad embeddings, or weak prompts, GraphRAG won't fix those problems — it'll amplify them.
  2. Identify your multi-hop queries. Audit your query logs. If fewer than 10% of questions require cross-document reasoning, the engineering investment probably isn't justified yet.
  3. Start with a narrow domain graph. Don't try to graph your entire document corpus. Pick a high-value subset — contracts, compliance documents, product specifications — where entity relationships are dense and well-defined.
  4. Benchmark aggressively. Build an evaluation set of multi-hop questions with ground-truth answers. If GraphRAG doesn't beat your vector baseline by a meaningful margin on these specific questions, it's not ready for your data.

The teams that succeed with GraphRAG in production are the ones that treat it as an augmentation to vector search, not a replacement. The graph handles the 15–20% of queries that vector search structurally cannot answer. The vector index handles everything else. The router decides. That's the architecture that ships.
