578 posts tagged with "insider"

What Model Cards Don't Tell You: The Production Gap Between Published Benchmarks and Real Workloads

· 9 min read
Tian Pan
Software Engineer

A model card says 89% accuracy on code generation. Your team gets 28% on the actual codebase. A model card says 100K token context window. Performance craters at 32K under your document workload. A model card passes red-team safety evaluation. A prompt injection exploit ships to your users within 72 hours of launch.

This gap isn't rare. It's the norm. In a 2025 analysis of 1,200 production deployments, 42% of companies abandoned their AI initiatives at the production integration stage — up from 17% the previous year. Most of them had read the model cards carefully.

The problem isn't that model cards lie. It's that they measure something different from what you need to know. Understanding that gap precisely — and building the internal benchmark suite to close it — is what separates teams that ship reliable AI from teams that ship regrets.

The Model Portability Tax: How to Architect AI Systems You Can Actually Migrate

· 9 min read
Tian Pan
Software Engineer

You inherited an AI feature built on GPT-4-turbo. The model is being deprecated. Your manager wants to cut costs by switching to a newer, cheaper model. You run a quick test, metrics look passable, you ship it — and a week later, accuracy on your core use case drops 22%. Support tickets climb. You're now in a crisis migration rather than a planned one.

This is the model portability tax: the hidden engineering cost that accumulates every time you couple your application logic tightly to a specific foundation model. Every team pays it. Most don't realize how large the bill has gotten until the invoice arrives.
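To make the coupling concrete, here is a minimal sketch of one way to keep the tax down: application code talks to a narrow, provider-agnostic interface, and each vendor SDK lives behind its own adapter. The names (`CompletionProvider`, `summarize_ticket`) are illustrative, not from any particular codebase, and the OpenAI adapter assumes the current (`openai>=1.0`) client.

```python
from typing import Protocol

class CompletionProvider(Protocol):
    """Narrow seam between application logic and any one vendor's SDK."""
    def complete(self, prompt: str, max_tokens: int) -> str: ...

class OpenAIChatProvider:
    """Adapter for the OpenAI chat completions API."""
    def __init__(self, client, model: str):
        self.client = client
        self.model = model

    def complete(self, prompt: str, max_tokens: int) -> str:
        resp = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
        )
        return resp.choices[0].message.content

def summarize_ticket(provider: CompletionProvider, ticket_text: str) -> str:
    # Call sites depend only on the interface, so a model migration means
    # one new adapter plus a re-run of the eval suite, not a rewrite of
    # every place the model is invoked.
    return provider.complete(
        f"Summarize this support ticket in two sentences:\n{ticket_text}",
        max_tokens=200,
    )
```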

Multi-User AI Sessions: The Context Ownership Problem Nobody Designs For

· 9 min read
Tian Pan
Software Engineer

In August 2024, security researchers discovered that Slack AI would pull both public and private channel content into the same context window when answering a query. An attacker in a public channel could craft a message that, when ingested by Slack AI, would inject instructions into a victim's session — and since Slack AI doesn't cite its sources, the resulting data exfiltration was nearly untraceable. The attack could leak API keys embedded in private DMs. Slack patched it after responsible disclosure.

This wasn't a bug in the traditional sense. It was a consequence of treating context as a shared mutable resource with no per-user access control. And it's a mistake that most teams building shared AI assistants are making right now, just more quietly.
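As a rough illustration of what per-user scoping looks like at context-assembly time (the names `Message` and `build_context` are hypothetical, and this is only one layer of a real fix), the idea is that nothing enters a user's prompt unless that user is authorized to read it directly, regardless of what the retriever surfaced:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Message:
    channel_id: str
    text: str

def build_context(readable_channels: set[str], candidates: list[Message]) -> str:
    """Assemble context per user: only content the requesting user can read
    is eligible, even if the retriever ranked other content higher."""
    visible = [m for m in candidates if m.channel_id in readable_channels]
    return "\n\n".join(m.text for m in visible)

# A query from a user who can only read the public engineering channel
# never ingests content from someone else's private DMs.
context = build_context(
    readable_channels={"C_ENG_PUBLIC"},
    candidates=[
        Message("C_ENG_PUBLIC", "Deploy moved to 3pm."),
        Message("D_PRIVATE_DM", "Rotating the staging API key tonight."),  # excluded
    ],
)
```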

The Multilingual Token Tax: What Building AI for Non-English Users Actually Costs

· 11 min read
Tian Pan
Software Engineer

Your product roadmap says "expand to Japan and Brazil." Your finance model says the LLM API line item is $X per month. Both of those numbers are wrong, and you won't discover it until the international rollout is weeks away.

Tokenization — the step that turns user text into integers your model can process — is profoundly biased toward English. A sentence in Japanese might require 2–8× as many tokens as the same sentence in English. That multiplier feeds directly into API costs, context window headroom, and response latency. Teams that model their AI budget on English benchmarks and then flip on a language flag are routinely surprised by bills 3–5× higher than expected.
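You can measure the multiplier for your own traffic before the rollout. Here is a quick sketch using OpenAI's open-source tiktoken tokenizer; the exact ratio depends on the tokenizer, the language, and the text, so treat the output as a measurement to run over your own corpus rather than a constant.

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

english = "Please reset my password and send me a confirmation email."
japanese = "パスワードをリセットして、確認メールを送ってください。"

en_count = len(enc.encode(english))
ja_count = len(enc.encode(japanese))

# Same request, same meaning, a very different token bill.
print(f"English: {en_count} tokens, Japanese: {ja_count} tokens, "
      f"ratio {ja_count / en_count:.1f}x")
```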

Organizational Antibodies: Why AI Projects Die After the Pilot

· 11 min read
Tian Pan
Software Engineer

The demo went great. The pilot ran for six weeks, showed clear results, and the stakeholders in the room were impressed. Then nothing happened. Three months later the project was quietly shelved, the engineer who built it moved on to something else, and the company's AI strategy became a slide deck that said "exploring opportunities."

This is the pattern that kills AI initiatives. Not technical failure. Not insufficient model capability. Not even budget. The technology actually works — research consistently shows that around 80% of AI projects that reach production meet or exceed their stated expectations. The problem is the 70–90% that never get there.

The Silent Corruption Problem in Parallel Agent Systems

· 12 min read
Tian Pan
Software Engineer

When a multi-agent system starts behaving strangely — giving inconsistent answers, losing track of tasks, making decisions that contradict earlier reasoning — the instinct is to blame the model. Tweak the prompt. Switch to a stronger model. Add more context.

The actual cause is often more mundane and more dangerous: shared state corruption from concurrent writes. Two agents read the same memory, both compute updates, and one silently overwrites the other. The resulting state is technically valid — no exceptions thrown, no schema violations — but semantically wrong. Every agent that reads it afterward reasons correctly over incorrect information.

This failure mode is invisible at the individual operation level, hard to reproduce in test environments, and nearly impossible to distinguish from model error by looking at outputs alone. O'Reilly's 2025 research on multi-agent memory engineering found that 36.9% of multi-agent system failures stem from interagent misalignment — agents operating on inconsistent views of shared information. It's not a theoretical concern.
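One common mitigation is to make the conflict visible instead of silent. Below is a toy sketch of optimistic, version-checked writes over a shared memory store; the names are illustrative, and a real system would lean on the versioning or transaction support of whatever store actually backs agent memory.

```python
import threading

class VersionedMemory:
    """Shared agent memory where every write must name the version it read.

    A write against a stale version is rejected, turning a silent lost
    update into an explicit, retryable conflict."""
    def __init__(self):
        self._lock = threading.Lock()
        self._state: dict = {}
        self._version = 0

    def read(self) -> tuple[int, dict]:
        with self._lock:
            return self._version, dict(self._state)

    def write(self, expected_version: int, new_state: dict) -> bool:
        with self._lock:
            if expected_version != self._version:
                return False  # another agent wrote first; caller must re-read
            self._state = new_state
            self._version += 1
            return True

def agent_update(memory: VersionedMemory, key: str, value: str) -> None:
    # Read-modify-write with retry instead of a blind overwrite.
    while True:
        version, state = memory.read()
        state[key] = value
        if memory.write(version, state):
            return
```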

The Precision-Recall Tradeoff Hiding Inside Your AI Safety Filter

· 10 min read
Tian Pan
Software Engineer

When teams deploy an AI safety filter, the conversation almost always centers on what it catches. Did it block the jailbreak? Does it flag hate speech? Can it detect prompt injection? These are the right questions for recall. They are almost never paired with the equally important question: what does it block that it shouldn't?

The answer is usually: a lot. And because most teams ship with the vendor's default threshold and never instrument false positives in production, they don't find out until users start complaining — or until they stop complaining, because they stopped using the product.
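Instrumenting the other side is not much code. A minimal sketch, assuming you hand-label a sample of production requests for whether they were actually unsafe and log whether the filter blocked them (the function and field names are illustrative):

```python
def filter_metrics(samples: list[tuple[bool, bool]]) -> dict:
    """Each sample is (was_blocked, was_actually_unsafe)."""
    tp = sum(blocked and unsafe for blocked, unsafe in samples)
    fp = sum(blocked and not unsafe for blocked, unsafe in samples)
    fn = sum(not blocked and unsafe for blocked, unsafe in samples)
    tn = sum(not blocked and not unsafe for blocked, unsafe in samples)
    return {
        "recall": tp / (tp + fn) if tp + fn else None,               # harm caught
        "precision": tp / (tp + fp) if tp + fp else None,            # blocks that were justified
        "false_positive_rate": fp / (fp + tn) if fp + tn else None,  # benign traffic blocked
    }

# Because real traffic is overwhelmingly benign, even a small false positive
# rate can mean that most of what the filter blocks was perfectly fine.
```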

The Production Distribution Gap: Why Your Internal Testers Can't Find the Bugs Users Do

· 11 min read
Tian Pan
Software Engineer

Your AI feature passed internal testing with flying colors. Engineers loved it, product managers gave the thumbs up, and the eval suite showed 94% accuracy on its benchmarks. Then you shipped it, and within two weeks users were hitting failure modes you'd never seen — wrong answers, confused outputs, edge cases that made the model look embarrassingly bad.

This is the production distribution gap. It's not a new problem, but it's dramatically worse for AI systems than for deterministic software. Understanding why — and having a concrete plan to address it — is the difference between an AI feature that quietly erodes user trust and one that improves with use.

RAG Knowledge Base Freshness: The Staleness Problem Teams Solve Last

· 11 min read
Tian Pan
Software Engineer

Most RAG teams spend months tuning chunk sizes, experimenting with embedding models, and debating hybrid search configurations. Then they ship to production, declare success, and move on. Six months later, users start complaining that the system gives wrong answers — and the team discovers that the index they so carefully built has quietly rotted.

Index freshness is the problem that gets solved last, usually after a customer incident rather than before. Unlike retrieval quality failures that show up immediately in evals, staleness degrades silently: latency stays flat, retrieval appears functional, and standard RAG metrics like context recall and faithfulness score well — right up until the moment your system confidently returns a policy that was updated months ago.
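A first line of defense is embarrassingly simple: compare when each source document last changed against when it was last indexed, and alert on the gap. A sketch, assuming both systems expose timestamps (the field and function names here are hypothetical):

```python
from datetime import datetime

def stale_doc_ids(source_updated_at: dict[str, datetime],
                  index_indexed_at: dict[str, datetime]) -> list[str]:
    """Doc IDs whose source copy changed after it was indexed, or that were
    never indexed at all."""
    stale = []
    for doc_id, updated in source_updated_at.items():
        indexed = index_indexed_at.get(doc_id)
        if indexed is None or indexed < updated:
            stale.append(doc_id)
    return stale

# Run on a schedule and alert when the stale count or the oldest gap crosses
# a threshold; deletions in the source system need the reverse check too.
```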

RAG Position Bias: Why Chunk Order Changes Your Answers

· 8 min read
Tian Pan
Software Engineer

You've spent weeks tuning your embedding model. Your retrieval precision looks solid. Chunk size, overlap, metadata filters — all dialed in. And yet users keep reporting that the system "ignores" information it clearly has access to. The relevant passage is in the top-5 retrieved results every time. The model just doesn't seem to use it.

The culprit is often position bias: a systematic tendency for language models to over-rely on information at the beginning and end of their context window, while dramatically under-attending to content in the middle. In controlled experiments, moving a relevant passage from position 1 to position 10 in a 20-document context produces accuracy drops of 30–40 percentage points. Your retriever found the right content. The ordering killed it.
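One widely used mitigation is to reorder retrieved chunks so the strongest results sit at the edges of the context and the weakest end up in the middle. A minimal sketch, assuming the input is already sorted best-first by retrieval score:

```python
def reorder_for_position_bias(chunks_best_first: list[str]) -> list[str]:
    """Alternate chunks between the front and the back of the context so the
    top-ranked content lands where models attend most: the start and the end."""
    front: list[str] = []
    back: list[str] = []
    for i, chunk in enumerate(chunks_best_first):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# Ranks 1-5 come out ordered 1, 3, 5, 4, 2: the best chunk first,
# the second best last, and the weakest buried in the middle.
print(reorder_for_position_bias(["r1", "r2", "r3", "r4", "r5"]))
```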

Testing the Retrieval-Generation Seam: The Integration Test Gap in RAG Systems

· 11 min read
Tian Pan
Software Engineer

Your retriever returns the right documents 94% of the time. Your LLM correctly answers questions given good context 96% of the time. Ship it. What could go wrong?

Multiply those numbers: 0.94 × 0.96 = 0.90. You've lost 10% of your queries before accounting for any edge cases, prompt formatting issues, token truncation, or the distractor documents your retriever surfaces alongside the correct ones. But the deeper problem isn't the arithmetic — it's that your unit tests will never catch this. The retriever passes its tests in isolation. The generator passes its tests in isolation. The thing that fails is the composition, and most teams have no tests for that.

This is the retrieval-generation seam: the interface between what your retriever hands off and what your generator can actually use. It's the most under-tested boundary in production RAG systems, and it's where most failures originate.
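What a test at that seam can look like, sketched with hypothetical fixtures (`seed_index`, `retriever`, `generator`, and the document names are all illustrative): seed the index with the right document plus realistic distractors, run the composed pipeline, and assert on the final answer rather than on either component alone.

```python
def answer(query: str, retriever, generator, top_k: int = 5) -> str:
    """The composition under test: retrieval output becomes generation input."""
    chunks = retriever.search(query, top_k=top_k)
    context = "\n\n".join(chunk.text for chunk in chunks)
    return generator.answer(query=query, context=context)

def test_seam_survives_distractors():
    # Illustrative fixture data: the deprecated policy is a plausible
    # retrieval hit, so this only passes if the *composed* system still
    # answers from the current document.
    seed_index(["refund_policy_2025.md", "refund_policy_2019_deprecated.md"])
    result = answer("How long is the refund window?", retriever, generator)
    assert "30 days" in result
```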

RBAC Is Not Enough for AI Agents: A Practical Authorization Model

· 11 min read
Tian Pan
Software Engineer

Most teams building AI agents today treat authorization as an afterthought. They wire up an OAuth token, give the agent the same scopes as the human user who triggered it, and call it done. Then, months later, they discover that a manipulated prompt caused the agent to exfiltrate files, or that a compromised workflow had been silently escalating privileges across connected services.

The problem is not that RBAC is bad. It is that RBAC was designed for humans with stable job functions, and AI agents are neither stable nor human. An agent's "role" can shift from read-only research to write-capable code execution within a single conversation turn. Static roles cannot express this, and the mismatch creates a predictable vulnerability surface.
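One way to make that concrete, sketched here with illustrative names: a short-lived, task-scoped grant that every tool call is checked against, so an agent's effective permissions cannot silently drift to the full scope set of the user who triggered it.

```python
from dataclasses import dataclass, field

class AgentAuthorizationError(Exception):
    pass

@dataclass
class TaskGrant:
    """Short-lived, task-scoped permissions, narrower than the user's roles."""
    task_id: str
    allowed_actions: set[str] = field(default_factory=set)  # e.g. {"repo:read"}

def authorize_tool_call(grant: TaskGrant, action: str) -> None:
    # Evaluated on every tool invocation, not once at session start, so a
    # mid-conversation shift from read-only research to write-capable code
    # execution requires an explicit new grant.
    if action not in grant.allowed_actions:
        raise AgentAuthorizationError(
            f"task {grant.task_id}: agent not granted '{action}'")

grant = TaskGrant(task_id="t-123", allowed_actions={"repo:read", "ticket:comment"})
authorize_tool_call(grant, "repo:read")      # allowed
# authorize_tool_call(grant, "repo:write")   # raises AgentAuthorizationError
```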