28 posts tagged with "vector-search"

When Vector Search Fails: Why Knowledge Graphs Handle Queries Embeddings Can't

April 20, 2026 · 9 min read

Software Engineer

Vector search has become the default retrieval primitive for RAG systems. Embed your documents, embed the query, find nearest neighbors — it's simple, fast, and works surprisingly well for a wide class of questions. But production deployments keep hitting the same wall: certain queries return garbage results despite high similarity scores, certain multi-document reasoning tasks fail silently, and certain entity-heavy queries degrade to random noise as complexity grows.

The issue isn't embedding quality or index size. It's that semantic similarity is the wrong abstraction for a significant class of retrieval problems. Knowledge graphs aren't a replacement for vector search — they solve a structurally different problem. Understanding which problems belong to which tool is what separates a brittle RAG pipeline from one that holds up in production.

Corpus Architecture for RAG: The Indexing Decisions That Determine Quality Before Retrieval Starts

April 19, 2026 · 12 min read

Tian Pan

Software Engineer

When a RAG system returns the wrong answer, the post-mortem almost always focuses on the same suspects: the retrieval query, the similarity threshold, the reranker, the prompt. Teams spend days tuning these components while the actual cause sits untouched in the indexing pipeline. The failure happened weeks ago when someone decided on a chunk size.

Most RAG quality problems are architectural, not operational. They stem from decisions made at index time that silently shape what the LLM will ever be allowed to see. By the time a user complains, the retrieval system is doing exactly what it was designed to do — it's just that the design was wrong.

Cross-Encoder Reranking in Practice: What Cosine Similarity Misses

April 19, 2026 · 10 min read

Tian Pan

Software Engineer

Your RAG pipeline retrieves the top 10 documents and your LLM still gives a wrong answer. You increase the retrieval count to 50. Still wrong. The frustrating part: the correct document was in your vector store the whole time—it was just ranked 23rd. This is not a recall problem. It's a ranking problem, and cosine similarity is the culprit.

Vector search does a decent job of finding semantically adjacent content. But "semantically adjacent" and "most useful for this specific query" are not the same thing. Cosine similarity measures the angle between two vectors in embedding space, and that angle only captures a coarse notion of topical proximity. What it cannot capture is the fine-grained interaction between the specific words in your query and the specific words in a document—the difference between "how to prevent buffer overflows" and "buffer overflow exploit techniques" is subtle at the vector level but critical for your retrieval system.

The Embedding Refresh Problem: Running a Vector Store Like a Database Engineer

April 19, 2026 · 10 min read

Tian Pan

Software Engineer

Your RAG pipeline is returning confident, well-formatted answers. The LLM response looks great. And yet users keep filing tickets saying the system is wrong. The product manager pulls up the document in question — the information changed six weeks ago, but the vector index still reflects the old version. No errors were thrown. No alerts fired. The system was just silently, invisibly wrong.

This is the embedding refresh problem, and it bites most production RAG systems eventually. Analysis across production deployments shows that over 60% of RAG failures trace back to stale or outdated information in the knowledge base — not bad prompts, not retrieval algorithm failures, but a simple mismatch between what's in the vector index and what's true in the source. Most AI engineers discover this the hard way. Most data engineers already know how to prevent it.

GraphRAG vs. Vector RAG: The Architecture Decision Teams Make Too Late

April 19, 2026 · 12 min read

Tian Pan

Software Engineer

Most teams discover they need GraphRAG six months too late — after they've already explained to users why the AI got the relationship wrong, why it confused two entities that share similar embeddings, or why it confidently cited a document that contradicts the actual answer. Vector RAG is genuinely good at what it does. The problem is that teams treat it as good at everything, and keep piling on retrieval hacks when the underlying architecture has hit a mathematical ceiling.

Fewer than 15% of enterprises have deployed graph-based retrieval in production as of 2025. This is not because the technology is immature. It's because the failure signals for vector-only RAG are subtle: the system runs, the LLM responds, and only careful inspection reveals that the retrieved context was plausible but wrong.

Retrieval Monoculture: Why Your RAG System Has Systematic Blind Spots

April 19, 2026 · 10 min read

Tian Pan

Software Engineer

Your RAG system's evals look fine. NDCG is acceptable. The demo works. But there's a category of failure no single-metric eval catches: the queries your retriever never even gets close on, consistently, because your entire embedding space was never equipped to handle them in the first place.

That's retrieval monoculture. One embedding model. One similarity metric. One retrieval path — and therefore one set of systematic blind spots that look like model errors, hallucination, or user confusion until you actually examine the retrieval layer.

The fix is not a bigger model or more data. It's understanding that different query structures need different retrieval mechanisms, and building a system that stops routing everything through the same funnel.

Retrieval Debt: Why Your RAG Pipeline Degrades Silently Over Time

April 18, 2026 · 10 min read

Tian Pan

Software Engineer

Six months after you shipped your RAG pipeline, something changed. Users aren't complaining loudly — they're just trusting the answers a little less. Feedback ratings dropped from 4.2 to 3.7. A few support tickets reference "outdated information." Your engineers look at the logs and see no errors, no timeouts, no obvious regression. The retrieval pipeline looks healthy by every metric you've configured.

It isn't. It's rotting.

Retrieval debt is the accumulated technical decay in a vector index: stale embeddings that no longer represent current document content, tombstoned chunks from deleted records that pollute search results, and semantic drift between the encoder version that indexed your corpus and the encoder version now computing query embeddings. Unlike code rot, retrieval debt produces no stack traces. It produces subtly wrong answers with confident-looking citations.

The Embedding Drift Problem: How Your Semantic Search Silently Degrades

April 16, 2026 · 9 min read

Tian Pan

Software Engineer

Your semantic search is probably getting worse right now, and your dashboards are not telling you.

There is no error log. No p99 spike. No failed health check. Queries still return results with high cosine similarity scores. But the relevance is quietly deteriorating, one missed term at a time, as the language your users type diverges from the language your embedding model was trained on.

This is the embedding drift problem. It is insidious precisely because it produces no visible failure signal — only a slow erosion of retrieval quality that users attribute to the product being "not that useful anymore" before they stop using it entirely.

Database-Native AI: When Your Postgres Learns to Embed

April 13, 2026 · 7 min read

Tian Pan

Software Engineer

Most RAG architectures look the same: your application reads from Postgres, ships the text to an embedding API, writes vectors to Pinecone or Weaviate, and queries both systems at read time. You maintain two data stores, two consistency models, two backup strategies, and a synchronization pipeline that is always one edge case away from letting your vector index drift weeks behind your source of truth.

What if the database just did it all? That is no longer a hypothetical. PostgreSQL extensions like pgvector, pgai, and pgvectorscale — along with managed offerings like AlloyDB AI — are collapsing the entire embedding-and-retrieval stack into the database itself. The result is not just fewer moving parts. It is a fundamentally different operational model where your vectors are always transactionally consistent with the data they represent.

GraphRAG in Production: When Vector Search Fails at Multi-Hop Reasoning

April 12, 2026 · 9 min read

Tian Pan

Software Engineer

Your RAG pipeline returns confident, well-formatted answers. The embeddings are tuned, the chunk size is optimized, and retrieval scores look great. Then a user asks "Which suppliers affected by the port strike also have contracts expiring this quarter?" and the system returns irrelevant fragments about port logistics and contract management — separately, never connecting them. This is the multi-hop reasoning gap, and it's where vector search quietly fails.

The failure isn't a tuning problem — it's architectural. Vector similarity finds documents that look like the query but cannot traverse relationships between entities scattered across different documents. GraphRAG — retrieval augmented generation backed by knowledge graphs — addresses this by making entity relationships first-class retrieval objects. But shipping it to production is harder than the demos suggest.

Why the Chunking Problem Isn't Solved: How Naive RAG Pipelines Hallucinate on Long Documents

April 10, 2026 · 9 min read

Tian Pan

Software Engineer

Most RAG tutorials treat chunking as a footnote: split your documents into 512-token chunks, embed them, store them in a vector database, and move on to the interesting parts. This works well enough on toy examples — Wikipedia articles, clean markdown docs, short PDFs. It falls apart in production.

A recent study deploying RAG for clinical decision support found that the fixed-size baseline achieved 13% fully accurate responses across 30 clinical questions. An adaptive chunking approach on the same corpus: 50% fully accurate (p=0.001). The documents were the same. The LLM was the same. Only the chunking changed. That gap is not a tuning problem or a prompt engineering problem. It is a structural failure in how most teams split documents.

Semantic Caching for LLMs: The Cost Tier Most Teams Skip

April 10, 2026 · 11 min read

Tian Pan

Software Engineer

Most teams building LLM applications know about prompt caching — the prefix-reuse mechanism that API providers offer to discount repeated input tokens. Far fewer have deployed the layer above it: semantic caching, which eliminates LLM calls entirely for queries that mean the same thing but are phrased differently. The gap isn't laziness; it's a widespread misunderstanding of what "95% accuracy" means in semantic caching vendor documentation.

That 95% figure refers to match correctness on cache hits, not to how often the cache actually gets hit. Real production hit rates range from 10% for open-ended chat to 70% for structured FAQ systems — and the math that determines which side of that range you're on should happen before you write any cache code.

About Tian Pan