33 posts tagged with "vector-search"

The Embedding Drift Problem: How Your Semantic Search Silently Degrades

April 16, 2026 · 9 min read

Software Engineer

Your semantic search is probably getting worse right now, and your dashboards are not telling you.

There is no error log. No p99 spike. No failed health check. Queries still return results with high cosine similarity scores. But the relevance is quietly deteriorating, one missed term at a time, as the language your users type diverges from the language your embedding model was trained on.

This is the embedding drift problem. It is insidious precisely because it produces no visible failure signal — only a slow erosion of retrieval quality that users attribute to the product being "not that useful anymore" before they stop using it entirely.

Database-Native AI: When Your Postgres Learns to Embed

April 13, 2026 · 7 min read

Tian Pan

Software Engineer

Most RAG architectures look the same: your application reads from Postgres, ships the text to an embedding API, writes vectors to Pinecone or Weaviate, and queries both systems at read time. You maintain two data stores, two consistency models, two backup strategies, and a synchronization pipeline that is always one edge case away from letting your vector index drift weeks behind your source of truth.

What if the database just did it all? That is no longer a hypothetical. PostgreSQL extensions like pgvector, pgai, and pgvectorscale — along with managed offerings like AlloyDB AI — are collapsing the entire embedding-and-retrieval stack into the database itself. The result is not just fewer moving parts. It is a fundamentally different operational model where your vectors are always transactionally consistent with the data they represent.

GraphRAG in Production: When Vector Search Fails at Multi-Hop Reasoning

April 12, 2026 · 9 min read

Tian Pan

Software Engineer

Your RAG pipeline returns confident, well-formatted answers. The embeddings are tuned, the chunk size is optimized, and retrieval scores look great. Then a user asks "Which suppliers affected by the port strike also have contracts expiring this quarter?" and the system returns irrelevant fragments about port logistics and contract management — separately, never connecting them. This is the multi-hop reasoning gap, and it's where vector search quietly fails.

The failure isn't a tuning problem — it's architectural. Vector similarity finds documents that look like the query but cannot traverse relationships between entities scattered across different documents. GraphRAG — retrieval augmented generation backed by knowledge graphs — addresses this by making entity relationships first-class retrieval objects. But shipping it to production is harder than the demos suggest.

Why the Chunking Problem Isn't Solved: How Naive RAG Pipelines Hallucinate on Long Documents

April 10, 2026 · 9 min read

Tian Pan

Software Engineer

Most RAG tutorials treat chunking as a footnote: split your documents into 512-token chunks, embed them, store them in a vector database, and move on to the interesting parts. This works well enough on toy examples — Wikipedia articles, clean markdown docs, short PDFs. It falls apart in production.

A recent study deploying RAG for clinical decision support found that the fixed-size baseline achieved 13% fully accurate responses across 30 clinical questions. An adaptive chunking approach on the same corpus: 50% fully accurate (p=0.001). The documents were the same. The LLM was the same. Only the chunking changed. That gap is not a tuning problem or a prompt engineering problem. It is a structural failure in how most teams split documents.

Semantic Caching for LLMs: The Cost Tier Most Teams Skip

April 10, 2026 · 11 min read

Tian Pan

Software Engineer

Most teams building LLM applications know about prompt caching — the prefix-reuse mechanism that API providers offer to discount repeated input tokens. Far fewer have deployed the layer above it: semantic caching, which eliminates LLM calls entirely for queries that mean the same thing but are phrased differently. The gap isn't laziness; it's a widespread misunderstanding of what "95% accuracy" means in semantic caching vendor documentation.

That 95% figure refers to match correctness on cache hits, not to how often the cache actually gets hit. Real production hit rates range from 10% for open-ended chat to 70% for structured FAQ systems — and the math that determines which side of that range you're on should happen before you write any cache code.

Embedding Models in Production: Selection, Versioning, and the Index Drift Problem

April 9, 2026 · 10 min read

Tian Pan

Software Engineer

Your RAG answered correctly yesterday. Today it contradicts itself. Nothing obvious changed — except your embedding provider quietly shipped a model update and your index is now a Frankenstein of mixed vector spaces.

Embedding models are the unsexy foundation of every retrieval-augmented system, and they fail in ways that are uniquely hard to diagnose. Unlike a prompt change or a model parameter tweak, embedding model problems surface slowly, as silent quality degradation that your evals don't catch until users start complaining. This post covers three things: how to pick the right embedding model for your domain (MTEB scores mislead more than they help), what actually happens when you upgrade a model, and the versioning patterns that let you swap models without rebuilding from scratch.

The Production Retrieval Stack: Why Pure Vector Search Fails and What to Do Instead

April 9, 2026 · 12 min read

Tian Pan

Software Engineer

Most RAG systems are deployed with a vector database, a few thousand embeddings, and the assumption that semantic similarity is close enough to correctness. It is not. That gap between "semantically similar" and "actually correct" is why 73% of RAG systems fail in production, and almost all of those failures happen at the retrieval stage — before the LLM ever generates a word.

The standard playbook of "embed your documents, query with cosine similarity, pass top-k to the LLM" works in demos because demo queries are designed to work. Production queries are not. Users search for product IDs, invoice numbers, regulation codes, competitor names spelled wrong, and multi-constraint questions that a single embedding vector cannot geometrically satisfy. Dense vector search is not wrong — it is incomplete. Building a retrieval stack that actually works in production requires understanding why, and layering in the components that compensate.

Memory Architectures for Production AI Agents

October 21, 2025 · 10 min read

Tian Pan

Software Engineer

Most teams add memory to their agents as an afterthought — usually after a user complains that the agent forgot something it was explicitly told three sessions ago. At that point, the fix feels obvious: store conversations somewhere and retrieve them later. But this intuition leads to systems that work in demos and fall apart in production. The gap between a memory system that stores things and one that reliably surfaces the right things at the right time is where most agent projects quietly fail.

Memory architecture is not a peripheral concern. For any agent handling multi-session interactions — customer support, coding assistants, research tools, voice interfaces — memory is the difference between a stateful assistant and a very expensive autocomplete. Getting it wrong doesn't produce crashes; it produces agents that feel subtly broken, that contradict themselves, or that confidently repeat outdated information the user corrected two weeks ago.

Beyond RAG: Hybrid Search, Agentic Retrieval, and the Database Design Decisions That Actually Matter

October 2, 2025 · 8 min read

Tian Pan

Software Engineer

Most teams ship RAG and call it a retrieval strategy. They chunk documents, embed them, store the vectors, and run nearest-neighbor search at query time. It works well enough in demos. In production, users start reporting that the system can't find an article they know exists, misses error codes verbatim in the docs, or returns semantically similar but factually wrong passages.

The problem isn't RAG. The problem is treating retrieval as a one-dimensional problem when it's always been multi-dimensional.

About Tian Pan