907 posts tagged with "insider"

Retrieval Monoculture: Why Your RAG System Has Systematic Blind Spots

· 10 min read
Tian Pan
Software Engineer

Your RAG system's evals look fine. NDCG is acceptable. The demo works. But there's a category of failure no single-metric eval catches: the queries your retriever never even gets close on, consistently, because your entire embedding space was never equipped to handle them in the first place.

That's retrieval monoculture. One embedding model. One similarity metric. One retrieval path — and therefore one set of systematic blind spots that look like model errors, hallucination, or user confusion until you actually examine the retrieval layer.

The fix is not a bigger model or more data. It's understanding that different query structures need different retrieval mechanisms, and building a system that stops routing everything through the same funnel.
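As a rough illustration of not sending every query down the same path, the sketch below routes on query shape. The excerpt does not say which mechanisms the post recommends, so everything here is an assumption: the `lexical` and `dense` path names, the `classify` and `route` helpers, and the regex heuristic for identifier-like tokens are all hypothetical.

```python
import re
from typing import Callable, Dict, List

# A retriever is any callable that maps a query to a ranked list of documents.
Retriever = Callable[[str], List[str]]

# Hypothetical heuristic: ticket-style IDs (PAY-4012), version strings (v2.3.1),
# and hex literals (0x1F) are tokens where exact matching tends to matter
# more than semantic similarity.
_IDENTIFIER = re.compile(
    r"\b(?:[A-Z]{2,}-\d+|v?\d+\.\d+(?:\.\d+)?|0x[0-9a-fA-F]+)\b"
)


def classify(query: str) -> str:
    """Return the name of the retrieval path this query should take."""
    if _IDENTIFIER.search(query):
        return "lexical"
    return "dense"


def route(query: str, retrievers: Dict[str, Retriever]) -> List[str]:
    """Dispatch the query to the retriever chosen by classify()."""
    return retrievers[classify(query)](query)


if __name__ == "__main__":
    # Stand-in retrievers; real ones would call a keyword index or a vector store.
    retrievers = {
        "lexical": lambda q: [f"keyword hit for: {q}"],
        "dense": lambda q: [f"embedding hit for: {q}"],
    }
    print(route("What does error PAY-4012 mean?", retrievers))
    print(route("how do refunds work", retrievers))
```

A real router would likely be driven by measured failure categories rather than a hand-written regex; the point of the sketch is only that the choice of retrieval path becomes an explicit, testable decision.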

Sandboxing Agents That Can Write Code: Least Privilege Is Not Optional

· 12 min read
Tian Pan
Software Engineer

Most teams ship their first code-executing agent with exactly one security control: API key scoping. They give the agent a GitHub token with repo:read and a shell with access to a working directory, and they call it "sandboxed." This is wrong in ways that become obvious only after an incident.

The threat model for an agent that can write and execute code is categorically different from the threat model for a web server or a CLI tool. The attack surface isn't the protocol boundary anymore — it's everything the agent reads. That includes git commits, documentation pages, API responses, database records, and any file it opens. Any of those inputs can contain a prompt injection that turns your research agent into a data exfiltration pipeline.

Shadow Traffic for AI Systems: The Safest Way to Validate Model Changes Before They Ship

· 10 min read
Tian Pan
Software Engineer

Most teams ship LLM changes the way they shipped web changes in 2005 — they run some offline evals, convince themselves the numbers look fine, and push. The surprise comes on Monday morning when a system prompt tweak that passed every benchmark silently breaks the 40% of user queries that weren't in the eval set.

Shadow traffic is the fix. The idea is simple: run your candidate model or prompt in parallel with production, feed it every real request, compare the outputs, and only expose users to the current version. Zero user exposure, real production data, and statistical confidence before anyone sees the change. But applying this to LLMs requires rethinking almost every piece of the implementation — because language models are non-deterministic, expensive to evaluate, and produce outputs that can't be compared with a simple diff.
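The request flow described above can be sketched in a few lines. This is a minimal illustration, not the post's implementation: `handle_with_shadow`, `production`, `candidate`, and `record` are hypothetical names, and the models are stand-in callables. It assumes a thread pool is acceptable for running the candidate alongside production.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable


def handle_with_shadow(
    request: Any,
    production: Callable[[Any], Any],
    candidate: Callable[[Any], Any],
    record: Callable[[Any, Any, Any, Any], None],
    executor: ThreadPoolExecutor,
) -> Any:
    """Serve the production output; run the candidate in parallel and log both."""
    # Start the candidate first so it runs while production is working.
    shadow_future = executor.submit(candidate, request)

    # The production response is the only output the user ever sees.
    response = production(request)

    def log_comparison(fut: Future) -> None:
        # A failing candidate must never affect the user-facing path,
        # so its exception is captured and recorded instead of raised.
        err = fut.exception()
        shadow_out = fut.result() if err is None else None
        record(request, response, shadow_out, err)

    shadow_future.add_done_callback(log_comparison)
    return response


if __name__ == "__main__":
    log = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        out = handle_with_shadow(
            "hello",
            production=lambda r: r.upper(),   # stand-in for the live model
            candidate=lambda r: r + "!",      # stand-in for the change under test
            record=lambda *row: log.append(row),
            executor=pool,
        )
    print(out)
    print(log)
```

The sketch only stores the output pairs. Because model outputs are non-deterministic and cannot be compared with a plain diff, the comparison itself would happen in a separate evaluation step over the recorded rows.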

The Shared Prompt Service Problem: Multi-Team LLM Platforms and the Dependency Nightmare

· 10 min read
Tian Pan
Software Engineer

On a Tuesday afternoon, the platform team at a mid-size AI startup merged a "minor improvement" to the shared system prompt. By Thursday, three separate product teams had filed bugs. One team's evaluation suite dropped from 87% to 61% accuracy. Another team's RAG pipeline started producing hallucinated citations. A third team's safety filter stopped catching a category of harmful outputs entirely. Nobody connected the dots for four days.

This is the shared prompt service problem, and it's coming for every organization that has more than one team building on a common LLM platform.

The Skill Atrophy Trap: How AI Assistance Silently Erodes the Engineers Who Use It Most

· 10 min read
Tian Pan
Software Engineer

A randomized controlled trial with 52 junior engineers found that those who used AI assistance scored 17 percentage points lower on comprehension and debugging quizzes — nearly two letter grades — compared to those who worked unassisted. Debugging, the very skill AI is supposed to augment, showed the largest gap. And this was after just one learning session. Extrapolate that across a year of daily AI assistance, and you start to understand why senior engineers at several companies quietly report that something has changed about how their team reasons through hard problems.

The skill atrophy problem with AI tooling is real, it's measurable, and it's hitting mid-career engineers hardest. Here's what the research shows and what you can do about it.

SRE for AI Agents: What Actually Breaks at 3am

· 10 min read
Tian Pan
Software Engineer

A market research pipeline ran uninterrupted for eleven days. Two LangChain agents, an Analyzer and a Verifier, passed requests back and forth, made no progress on the original task, and accumulated $47,000 in API charges before anyone noticed. The system never returned an error. No alert fired. The billing dashboard finally caught it, days after the damage was done.

This is not an edge case. It is the canonical AI agent incident. And if you are running agents in production today, your existing SRE runbooks almost certainly do not cover it.

Stateful Multi-Turn Conversation Infrastructure: Beyond Passing the Full History

· 11 min read
Tian Pan
Software Engineer

Every demo of a conversational AI feature does the same thing: pass a list of messages to the model and print the response. The happy path works, looks great in a Jupyter notebook, and gets you a green light to ship. Then you get to production, and your p99 latency starts creeping up during peak hours. A month later, a customer complains that the assistant "forgot" everything from earlier in the session. Six weeks after that, your session store hits its memory ceiling during a product launch.

The fundamental problem is that "pass the full conversation history" is not a session management strategy. It is the absence of one.

What Structured Outputs Actually Cost You: The JSON Mode Quality Tax

· 9 min read
Tian Pan
Software Engineer

Most teams adopt structured outputs because they're tired of writing brittle regex to extract data from model responses. That's a reasonable motivation. What they don't anticipate is discovering months later, when they finally measure task accuracy, that their "reliability improvement" also degraded the quality of the underlying content by 10 to 15 percent on reasoning-heavy tasks. The syntactic problem was solved. A semantic one was introduced.

This post is about understanding that tradeoff precisely — what constrained decoding actually costs, when the tax is worth paying, and how to build the evals that tell you whether it's hurting your system before you ship.

Synthetic Seed Data: Bootstrapping Fine-Tuning Before Your First Thousand Users

· 9 min read
Tian Pan
Software Engineer

Fine-tuning a model is easy when you have data. The brutal part is the moment before your product exists: you need personalization to attract users, but you need users to have personalization data. Most teams either skip fine-tuning entirely ("we'll add it later") or spend weeks collecting labeled examples by hand. Neither works well. The first produces a generic model users immediately recognize as generic. The second is slow enough that by the time you have data, the task has evolved.

Synthetic seed data solves this — but only when you understand exactly where it breaks.

Your RAG Knows the Docs. It Doesn't Know What Your Engineers Know.

· 10 min read
Tian Pan
Software Engineer

Your enterprise just deployed a RAG system. You indexed every Confluence page, every runbook, every architecture doc. Six months later, a senior engineer leaves — the one who knows why the payment service has that unusual retry pattern, why you never scale the cache past 80%, and which vendor never to call on Fridays. That knowledge was never written down. Your RAG system has no idea it existed.

This is the tacit knowledge problem, and it's why most enterprise AI systems underperform not because of retrieval quality or hallucination, but because the knowledge they need was never captured in the first place. Sixty percent of employees report that it's difficult or nearly impossible to get crucial information from colleagues. Ninety percent of organizations say departing employees cause serious knowledge loss. The documents your RAG can index are only the tip of the iceberg.

The User Adaptation Trap: Why Rolling Back an AI Model Can Break Things Twice

· 9 min read
Tian Pan
Software Engineer

You shipped a model update. It looked fine in offline evals. Then, two weeks later, you notice your power users are writing longer, more qualified prompts — hedging in ways they never used to. Your support queue fills with vague complaints like "the AI feels off." You dig in and realize the update introduced a subtle behavior shift: the model has been over-confirming user ideas, validating bad plans, and softening its pushback. You decide to roll back.

Here is where it gets worse. When you roll back, a new wave of complaints arrives. Users say the model feels cold, terse, unhelpful: the opposite of what the original complainers said. What happened? The users who interacted with the broken version long enough built new workflows around it. They learned to drive harder, push back more, frame questions more aggressively. The rollback removed the behavior they had adapted to, leaving them stranded.

This is the user adaptation trap. A subtly wrong behavior, left in production long enough, gets baked into user habits. Rolling it back doesn't restore the status quo — it creates a second disruption on top of the first.

Why Vision Models Ace Benchmarks but Fail on Your Enterprise PDFs

· 9 min read
Tian Pan
Software Engineer

A benchmark result of 97% accuracy on a document understanding dataset looks compelling until you run it against your company's actual invoice archive and realize it's quietly garbling 30% of the line items. The model doesn't throw an error. It doesn't return low confidence. It just produces output that looks plausible and is wrong.

This is the defining failure mode of production document AI: silent corruption. Unlike a crash or an exception, silent corruption propagates. The garbled table cell flows into the downstream aggregation, the aggregation feeds a report, the report drives a decision. By the time you notice, tracing the root cause is archaeology.

The gap between benchmark performance and production performance in document AI is real, persistent, and poorly understood by teams evaluating these models. Understanding why it exists — and how to defend against it — is the engineering problem this post addresses.