722 posts tagged with "insider"

GraphRAG vs. Vector RAG: When Knowledge Graphs Beat Embeddings

April 17, 2026 · 9 min read

Software Engineer

Most teams reach for vector embeddings when building RAG pipelines. It's the obvious default: embed documents, embed queries, find the nearest neighbors, feed results to the LLM. It works well enough on the demos. Then they deploy to a compliance team or a scientific literature corpus, and accuracy falls off a cliff. Not gradually — abruptly. On queries involving five or more entities, vector RAG accuracy in enterprise analytics benchmarks drops to zero. Not 50%. Not 20%. Zero.

This isn't a configuration problem. It's an architectural mismatch. Vector retrieval treats documents as points in semantic space. Knowledge graphs treat them as nodes in a relational structure. When your queries require traversing relationships — not just finding similar content — the topology of your retrieval architecture is what determines whether you get the right answer.

Where to Put the Human: Placement Theory for AI Approval Gates

April 17, 2026 · 12 min read

Tian Pan

Software Engineer

Most teams add human-in-the-loop review as an afterthought: the agent finishes its chain of work, the result lands in a review queue, and a human clicks approve or reject. This feels like safety. It is mostly theater.

By the time a multi-step agent reaches end-of-chain review, it has already sent the API requests, mutated the database rows, drafted the customer email, and scheduled the follow-up. The "review" is approving a done deal. Declining it means explaining to the agent — and often to the user — why nothing that happened for the past 10 minutes will stick.

The damage from misplaced approval gates isn't always dramatic. Often it's subtler: reviewers who approve everything because the real decisions have already been made, engineers who add more checkpoints after incidents and watch trust in the product crater, and organizations that oscillate between "too much friction" and "not enough oversight" without ever solving the underlying placement problem.

The Insider Threat You Created When You Deployed Enterprise AI

April 17, 2026 · 10 min read

Tian Pan

Software Engineer

Most enterprise security teams have a reasonably well-developed model for insider threats: a disgruntled employee downloads files to a USB drive, emails a spreadsheet to a personal account, or walks out with credentials. The detection playbook is known — DLP rules, egress monitoring, UEBA baselines. What those playbooks don't account for is the scenario where you handed every one of your employees a tool that can plan, execute, and cover multi-stage operations at machine speed. That's what deploying AI coding assistants and RAG-based document agents actually does.

The problem isn't that these tools are insecure in isolation. It's that they dramatically amplify what a compromised or malicious insider can accomplish in a single session. The average cost of an insider incident has reached $17.4 million per organization annually, and 83% of organizations experienced at least one insider attack in the past year. AI tools don't introduce a new threat category — they multiply the capability of every threat category that already exists.

The Instruction Complexity Cliff: Why LLMs Follow 5 Rules Reliably but Not 15

April 17, 2026 · 10 min read

Tian Pan

Software Engineer

There's a pattern that shows up in almost every production AI system: the team starts with a focused system prompt, ships the feature, and then iterates. A new edge case surfaces, so they add a rule. Another ticket comes in, another rule. Six months later the system prompt has grown to 2,000 tokens and covers 20 distinct behavioral requirements. The AI still sounds coherent on most requests. But subtle compliance failures have been creeping in for weeks — formatting ignored here, a tone requirement skipped there, an escalation rule quietly bypassed. Nobody flagged it because no individual failure was dramatic enough to page anyone.

This isn't a model quality problem. It's a fundamental architectural characteristic of how transformer-based language models process instructions, and there's a substantial body of empirical research that makes the failure modes predictable. Understanding it changes how you should write system prompts.

The Jagged Frontier: Why AI Fails at Easy Things and What It Means for Your Product

April 17, 2026 · 10 min read

Tian Pan

Software Engineer

A common assumption in AI product development goes something like this: if a model can handle a hard task, it can definitely handle an easier one nearby. This assumption is wrong, and it's responsible for a category of production failures that no amount of benchmark reading prepares you for.

The research term for the underlying phenomenon is the "jagged frontier" — AI's capability boundary isn't a smooth line that hard tasks sit outside of and easy tasks sit inside. It's a ragged, unpredictable shape. AI systems can write production-grade database query optimizers and still miscalculate whether two line segments on a diagram intersect. They can pass PhD-level science exams and fail children's riddle questions that involve spatial relationships. They can synthesize 50-page documents and then confidently hallucinate a summary of a paragraph they just read.

When LLMs Beat Rule-Based Systems for Data Normalization (And When They Don't)

April 17, 2026 · 11 min read

Tian Pan

Software Engineer

A team I know spent three months building a rule-based address normalizer. It handled the top twenty formats, used a USPS API for verification, and worked great on the data they'd seen. Then they got a new enterprise customer. The first week of data had addresses embedded in freeform notes fields, postal codes missing country prefixes, and cross-border formats their rules had never seen. The normalizer failed silently on 31% of records. They threw an LLM at it as a quick fix, expecting 80% accuracy. They got 94%. The surprise wasn't that the LLM worked — it was that nothing in their evaluation framework had predicted this.

![](https://opengraph-image.blockeden.xyz/api/og-tianpan-co?title=When%20LLMs%20Beat%20Rule-Based%20Systems%20for%20Data%20Normalization%20(And%20When%20They%20Don't%29)

This is the shape of the problem. Rule-based normalization is predictable, fast, and cheap. It works well when the data distribution stays in-bounds. LLMs handle the long tail — the weird formats, the implicit domain knowledge, the edge cases that rules never enumerate. But LLMs are also expensive, slow, and inconsistent in ways that break production pipelines if you're not careful. The right answer, for almost every team, is a hybrid that uses each approach on the inputs it's actually good at.

Why LLMs Make Confident Mistakes When Analyzing Your Product Data

April 17, 2026 · 11 min read

Tian Pan

Software Engineer

Product teams have started routing analytical questions directly to LLMs: "What's causing the churn spike?" "Why did conversion drop after the redesign?" "Which cohort should we focus retention spend on?" The outputs land in executive decks, drive roadmap decisions, and get presented to investors. The models answer confidently, in polished prose, with specific numbers. And a significant fraction of those answers are wrong in ways that don't announce themselves.

This isn't a general criticism of LLMs for data work. There are tasks where they genuinely help. The problem is that the failure modes are invisible — the model doesn't hedge, doesn't caveat, and doesn't distinguish between "I computed this from your data" and "I generated something that sounds like what this number should be." Practitioners who understand where the breakdowns happen can capture the genuine value and route around the landmines.

The Hidden Switching Costs of LLM Vendor Lock-In

April 17, 2026 · 11 min read

Tian Pan

Software Engineer

Most engineering teams believe they've insulated themselves from LLM vendor lock-in. They use LiteLLM to unify API calls. They avoid fine-tuning on hosted platforms. They keep raw data in their own storage. They feel safe. Then a provider announces a deprecation — or a competitor's pricing drops 40% — and the team discovers that the abstraction layer they built handles roughly 20% of the actual switching cost.

The other 80% is buried in places no one looked: system prompts written around a model's formatting quirks, eval suites calibrated to one model's refusal thresholds, embedding indexes that become incompatible the moment you change models, and user expectations shaped by behavioral patterns that simply don't transfer.

The Minimal Footprint Principle: Least Privilege for Autonomous AI Agents

April 17, 2026 · 10 min read

Tian Pan

Software Engineer

A retail procurement agent inherited vendor API credentials "during initial testing." Nobody ever restricted them before the system went to production. When a bug caused an off-by-one error, the agent had full ordering authority — permanently, with no guardrails. By the time finance noticed, $47,000 in unauthorized vendor orders had gone out. The code was fine. The model performed as designed. The blast radius was a permissions problem.

This is the minimal footprint principle: agents should request only the permissions the current task requires, avoid persisting sensitive data beyond task scope, clean up temporary resources, and scope tool access to present intent. It is the Unix least-privilege principle adapted for a world where your code makes runtime decisions about what it needs to do next.

The reason teams get this wrong is not negligence. It is a category error: they treat agent permissions as a design-time exercise when agentic AI makes them a runtime problem.

Multi-Region LLM Serving: The Cache Locality Problem Nobody Warns You About

April 17, 2026 · 10 min read

Tian Pan

Software Engineer

When you run a stateless HTTP API across multiple regions, the routing problem is essentially solved. Put a global load balancer in front, distribute requests by geography, and the worst thing that happens is a slightly stale cache entry. Any replica can serve any request with identical results.

LLM inference breaks every one of these assumptions. The moment you add prompt caching — which you will, because the cost difference between a cache hit and a cache miss is roughly 10x — your service becomes stateful in ways that most infrastructure teams don't anticipate until they're staring at degraded latency numbers in their second region.

The Multi-Tenant LLM Problem: Noisy Neighbors, Isolation, and Fairness at Scale

April 17, 2026 · 12 min read

Tian Pan

Software Engineer

Your SaaS product launches with ten design customers. Everything works beautifully. Then you onboard a hundred tenants, and one of them — a power user running 200K-token context windows on a complex research workflow — causes every other customer's latency to spike. Support tickets start arriving. You look at your dashboards and see nothing obviously wrong: your model is healthy, your API returns 200s, and your p50 latency looks fine. Your p95 has silently tripled.

This is the noisy neighbor problem, and it hits LLM infrastructure harder than almost any other shared system. Here's why it's harder to solve than it is in databases — and the patterns that actually work.

The Multi-Turn Session State Collapse Problem

April 17, 2026 · 10 min read

Tian Pan

Software Engineer

Your per-request error rates look clean. Latency is within SLO. The LLM judge is scoring outputs at 87%. And then a user files a support ticket: "I told the bot my account number three times. It just asked me again." A different user: "It agreed to a refund, then two turns later denied the policy existed."

Single-turn failures are visible. The request comes in, the model hallucinates or refuses, your eval catches it, you fix the prompt. The feedback loop is tight. Multi-turn failures work differently: the session starts fine, degrades gradually turn by turn, and your monitoring never fires because each individual response is technically coherent. The problem is the session as a whole — and almost no team instruments for that.

Research across major frontier models (Claude 3.7 Sonnet, GPT-4.1, Gemini 2.5 Pro) shows an average 39% performance drop when moving from single-turn to multi-turn conversations. That number hides the real story: only about 16% of the drop is capability loss. The other 23 points are a reliability crisis — the gap between a model's best and worst performance on the same task doubles as conversation length grows. You're not just getting worse outputs; you're getting inconsistent ones.

About Tian Pan