Skip to main content

907 posts tagged with "insider"

View all tags

The Prompt Made Sense Last Year: Institutional Knowledge Decay in AI Systems

· 10 min read
Tian Pan
Software Engineer

There's a specific kind of dread that hits when you inherit an AI system from an engineer who just left. The system prompts are hundreds of lines long. There's a folder called evals/ with 340 test cases and no README. A comment in the code says # DO NOT CHANGE THIS — ask Chen and Chen is no longer reachable.

You don't know why the customer support bot is forbidden from discussing pricing on Tuesdays. You don't know which eval cases were written to catch a regression from six months ago versus which ones are just random examples. You don't know if the guardrail blocking certain product categories was a legal requirement, a compliance experiment, or something someone added because a VP saw one bad output.

The system still works. For now. But you can't safely change anything.

When Vector Search Fails: Why Knowledge Graphs Handle Queries Embeddings Can't

· 9 min read
Tian Pan
Software Engineer

Vector search has become the default retrieval primitive for RAG systems. Embed your documents, embed the query, find nearest neighbors — it's simple, fast, and works surprisingly well for a wide class of questions. But production deployments keep hitting the same wall: certain queries return garbage results despite high similarity scores, certain multi-document reasoning tasks fail silently, and certain entity-heavy queries degrade to random noise as complexity grows.

The issue isn't embedding quality or index size. It's that semantic similarity is the wrong abstraction for a significant class of retrieval problems. Knowledge graphs aren't a replacement for vector search — they solve a structurally different problem. Understanding which problems belong to which tool is what separates a brittle RAG pipeline from one that holds up in production.

The Last-Mile Reliability Problem: Why 95% Accuracy Often Means 0% Usable

· 9 min read
Tian Pan
Software Engineer

You built an AI feature. You ran evals. You saw 95% accuracy on your test set. You shipped it. Six weeks later, users hate it and your team is quietly planning to roll it back.

This is the last-mile reliability problem, and it is probably the most common cause of AI feature failure in production today. It has nothing to do with your model being bad and everything to do with how average accuracy metrics hide the distribution of failures — and how certain failures are disproportionately expensive regardless of their statistical frequency.

The Latency Perception Gap: Why a 3-Second Stream Feels Faster Than a 1-Second Batch

· 11 min read
Tian Pan
Software Engineer

Your users don't have a stopwatch. They have feelings. And those feelings diverge from wall-clock reality in ways that matter enormously for how you build AI interfaces. A response that appears character-by-character over three seconds will consistently feel faster to users than a response that materializes all at once after one second — even though the batch system is objectively faster. This isn't irrational or a bug in human cognition. It's a well-documented perceptual phenomenon, and if you're building AI products without accounting for it, you're optimizing for the wrong metric.

This post breaks down the psychology behind latency perception, the metrics that actually predict user satisfaction, the frontend patterns that exploit these perceptual quirks, and when streaming adds more complexity than it's worth.

LLM-Powered Data Migrations: What Actually Works at Scale

· 10 min read
Tian Pan
Software Engineer

The pitch is compelling: feed your legacy records into an LLM, describe the target schema, and let the model figure out the mapping. No hand-written parsers, no months of transformation logic, no domain expert bottlenecks. Teams have run this and gotten to 70–97% accuracy in a fraction of the time it would take traditional ETL. The problem is that the remaining 3–30% of failures don't look like failures. They look like correct data.

That asymmetry—where wrong outputs are structurally valid and plausible—is what makes LLM-powered data migrations genuinely dangerous without the right validation architecture. This post covers what the teams that have done this successfully actually built: when LLMs earn their place in the pipeline, where they silently break, and the validation layer that catches errors traditional tools cannot.

What Model Cards Don't Tell You: The Production Gap Between Published Benchmarks and Real Workloads

· 9 min read
Tian Pan
Software Engineer

A model card says 89% accuracy on code generation. Your team gets 28% on the actual codebase. A model card says 100K token context window. Performance craters at 32K under your document workload. A model card passes red-team safety evaluation. A prompt injection exploit ships to your users within 72 hours of launch.

This gap isn't rare. It's the norm. In a 2025 analysis of 1,200 production deployments, 42% of companies abandoned their AI initiatives at the production integration stage — up from 17% the previous year. Most of them had read the model cards carefully.

The problem isn't that model cards lie. It's that they measure something different from what you need to know. Understanding that gap precisely — and building the internal benchmark suite to close it — is what separates teams that ship reliable AI from teams that ship regrets.

The Model Portability Tax: How to Architect AI Systems You Can Actually Migrate

· 9 min read
Tian Pan
Software Engineer

You inherited an AI feature built on GPT-4-turbo. The model is being deprecated. Your manager wants to cut costs by switching to a newer, cheaper model. You run a quick test, metrics look passable, you ship it — and a week later, accuracy on your core use case drops 22%. Support tickets climb. You're now in a crisis migration rather than a planned one.

This is the model portability tax: the hidden engineering cost that accumulates every time you couple your application logic tightly to a specific foundation model. Every team pays it. Most don't realize how large the bill has gotten until the invoice arrives.

Multi-User AI Sessions: The Context Ownership Problem Nobody Designs For

· 9 min read
Tian Pan
Software Engineer

In August 2024, security researchers discovered that Slack AI would pull both public and private channel content into the same context window when answering a query. An attacker in a public channel could craft a message that, when ingested by Slack AI, would inject instructions into a victim's session — and since Slack AI doesn't cite its sources, the resulting data exfiltration was nearly untraceable. The attack could leak API keys embedded in private DMs. Slack patched it after responsible disclosure.

This wasn't a bug in the traditional sense. It was a consequence of treating context as a shared mutable resource with no per-user access control. And it's a mistake that most teams building shared AI assistants are making right now, just more quietly.

The Multilingual Token Tax: What Building AI for Non-English Users Actually Costs

· 11 min read
Tian Pan
Software Engineer

Your product roadmap says "expand to Japan and Brazil." Your finance model says the LLM API line item is $X per month. Both of those numbers are wrong, and you won't discover it until the international rollout is weeks away.

Tokenization — the step that turns user text into integers your model can process — is profoundly biased toward English. A sentence in Japanese might require 2–8× as many tokens as the same sentence in English. That multiplier feeds directly into API costs, context window headroom, and response latency. Teams that model their AI budget on English benchmarks and then flip on a language flag are routinely surprised by bills 3–5× higher than expected.

Organizational Antibodies: Why AI Projects Die After the Pilot

· 11 min read
Tian Pan
Software Engineer

The demo went great. The pilot ran for six weeks, showed clear results, and the stakeholders in the room were impressed. Then nothing happened. Three months later the project was quietly shelved, the engineer who built it moved on to something else, and the company's AI strategy became a slide deck that said "exploring opportunities."

This is the pattern that kills AI initiatives. Not technical failure. Not insufficient model capability. Not even budget. The technology actually works — research consistently shows that around 80% of AI projects that reach production meet or exceed their stated expectations. The problem is the 70-90% that never get there.

The Silent Corruption Problem in Parallel Agent Systems

· 12 min read
Tian Pan
Software Engineer

When a multi-agent system starts behaving strangely — giving inconsistent answers, losing track of tasks, making decisions that contradict earlier reasoning — the instinct is to blame the model. Tweak the prompt. Switch to a stronger model. Add more context.

The actual cause is often more mundane and more dangerous: shared state corruption from concurrent writes. Two agents read the same memory, both compute updates, and one silently overwrites the other. The resulting state is technically valid — no exceptions thrown, no schema violations — but semantically wrong. Every agent that reads it afterward reasons correctly over incorrect information.

This failure mode is invisible at the individual operation level, hard to reproduce in test environments, and nearly impossible to distinguish from model error by looking at outputs alone. O'Reilly's 2025 research on multi-agent memory engineering found that 36.9% of multi-agent system failures stem from interagent misalignment — agents operating on inconsistent views of shared information. It's not a theoretical concern.

The Precision-Recall Tradeoff Hiding Inside Your AI Safety Filter

· 10 min read
Tian Pan
Software Engineer

When teams deploy an AI safety filter, the conversation almost always centers on what it catches. Did it block the jailbreak? Does it flag hate speech? Can it detect prompt injection? These are the right questions for recall. They are almost never paired with the equally important question: what does it block that it shouldn't?

The answer is usually: a lot. And because most teams ship with the vendor's default threshold and never instrument false positives in production, they don't find out until users start complaining—or until they stop complaining, because they stopped using the product.