
15 posts tagged with "ai-infrastructure"


Database-Native AI: When Your Postgres Learns to Embed

Tian Pan · Software Engineer · 7 min read

Most RAG architectures look the same: your application reads from Postgres, ships the text to an embedding API, writes vectors to Pinecone or Weaviate, and queries both systems at read time. You maintain two data stores, two consistency models, two backup strategies, and a synchronization pipeline that is always one edge case away from letting your vector index drift weeks behind your source of truth.

What if the database just did it all? That is no longer a hypothetical. PostgreSQL extensions like pgvector, pgai, and pgvectorscale — along with managed offerings like AlloyDB AI — are collapsing the entire embedding-and-retrieval stack into the database itself. The result is not just fewer moving parts. It is a fundamentally different operational model where your vectors are always transactionally consistent with the data they represent.
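To make the single-store model concrete, here is a minimal sketch of what it can look like from Python with pgvector and pgai. Everything below is illustrative: the connection string, table, and column names are invented for this sketch, and pgai's ai.openai_embed signature and key configuration can vary across versions, so check the extension docs rather than treating this as the post's implementation.

```python
# A minimal sketch of database-native embedding and retrieval.
# Assumes: a Postgres instance with the pgvector ("vector") and pgai
# ("ai") extensions installed and an OpenAI key configured for pgai.
# Table/column names (docs, body, embedding) are illustrative.
import psycopg

conn = psycopg.connect("postgresql://localhost/mydb", autocommit=True)

with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS docs (
            id        bigserial PRIMARY KEY,
            body      text NOT NULL,
            embedding vector(1536)
        )
    """)

    # The embedding call runs inside the INSERT, so the vector is
    # written in the same transaction as the text it represents:
    # no sync pipeline, no drift.
    cur.execute("""
        INSERT INTO docs (body, embedding)
        VALUES (%s, ai.openai_embed('text-embedding-3-small', %s))
    """, ("Postgres can embed your data itself.",) * 2)

    # Retrieval is plain SQL; <=> is pgvector's cosine-distance operator.
    cur.execute("""
        SELECT body
        FROM docs
        ORDER BY embedding <=> ai.openai_embed('text-embedding-3-small', %s)
        LIMIT 5
    """, ("how do database-native embeddings work?",))
    print(cur.fetchall())
```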

Provider Lock-In Anatomy: The Seven Coupling Points That Make Switching LLM Providers a 6-Month Project

Tian Pan · Software Engineer · 10 min read

Every team that ships an LLM-powered feature eventually has the same conversation: "What if we need to switch providers?" The standard answer, "we'll just swap the API key," reveals a dangerous misunderstanding of where coupling actually lives. In practice, teams that attempt a provider migration discover that the API endpoint is the least of their problems. The real lock-in hides in seven distinct coupling points, each capable of turning a "quick swap" into a months-long project.

Migration efforts routinely consume 20–50% of the original development time. Enterprise teams that treat model switching as plug-and-play grapple with broken outputs, ballooning token costs, and shifts in reasoning quality that take weeks to diagnose. Understanding where these coupling points are, before you need to migrate, is the difference between a controlled transition and an emergency scramble.
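As a sketch of what decoupling a few of those points can look like, here is a hypothetical provider-agnostic interface in Python. None of these names come from the post; they are invented to show where message formats, tool schemas, sampling parameters, and token accounting would have to be normalized behind an adapter.

```python
# A hypothetical provider adapter illustrating a few coupling points:
# message format, tool schemas, sampling knobs, and token accounting
# all differ between providers, so "swap the API key" is never the
# whole story. Every name here is invented for the sketch.
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Completion:
    text: str
    input_tokens: int    # token accounting differs per provider
    output_tokens: int

class LLMProvider(Protocol):
    def complete(
        self,
        system: str,               # some providers take a top-level system
        messages: list[dict],      # param, others a "system" message role
        tools: list[dict] | None,  # tool/function schemas are not portable
        max_tokens: int,           # sampling knobs have different names
    ) -> Completion: ...

def answer(provider: LLMProvider, question: str) -> str:
    # Application code depends only on the Protocol, so switching
    # providers means writing one adapter, not touching every call site.
    result = provider.complete(
        system="You are a support assistant.",
        messages=[{"role": "user", "content": question}],
        tools=None,
        max_tokens=512,
    )
    return result.text
```

The design choice this buys you is that provider-specific translation lives in exactly one place, which is where a migration starts instead of where it ends.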

The Hidden Token Tax: How Overhead Silently Drains Your LLM Context Window

Tian Pan · Software Engineer · 8 min read

Most teams know how many tokens their users send. Almost none know how many tokens they spend before a user says anything at all.

In a typical production LLM pipeline, system prompts, tool schemas, chat history, safety preambles, and RAG prologues silently consume 30–60% of your context window before the actual user query arrives. For agentic systems with dozens of registered tools, that overhead can hit 45% of a 128k window, roughly 58,000 tokens, on tool definitions that never get called.

This is the hidden token tax. It inflates costs, increases latency, and degrades output quality — yet it never shows up in any user-facing metric.
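One rough way to surface the tax is to count the overhead tokens directly. The sketch below uses tiktoken's cl100k_base encoding as an approximation; real providers serialize tool schemas in their own formats, and all the prompt strings here are placeholders rather than anything from the post.

```python
# A rough audit of the token tax: measure how much of the context
# window is spent before the user says anything. Prompt strings and
# tool schemas below are placeholders; the encoding is an approximation
# of how your provider actually tokenizes these inputs.
import json
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

CONTEXT_WINDOW = 128_000

system_prompt = "You are a helpful assistant. ..."  # your real preamble
tool_schemas = [
    {"name": "search_orders", "description": "...", "parameters": {}},
    # ...dozens more in a real agentic system
]
rag_prologue = "Use only the following retrieved passages: ..."

overhead = (
    len(enc.encode(system_prompt))
    + len(enc.encode(json.dumps(tool_schemas)))  # schemas end up serialized into the prompt
    + len(enc.encode(rag_prologue))
)

print(f"overhead: {overhead} tokens "
      f"({overhead / CONTEXT_WINDOW:.1%} of a {CONTEXT_WINDOW:,}-token window)")
```

Running a count like this per request is cheap enough to log alongside latency, which turns the invisible tax into a metric you can actually track.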