Skip to main content

861 posts tagged with "insider"

View all tags

The Prompt Debt Spiral: How One-Line Patches Kill Production Prompts

· 9 min read
Tian Pan
Software Engineer

Six months into production, your customer-facing LLM feature has a system prompt that began as eleven clean lines and has grown to over 400 tokens of conditional instructions, hedges, and exceptions. Quality is measurably worse than at launch, but every individual change seemed justified at the time. Nobody knows which clauses conflict with each other, or whether half of them are still necessary. Nobody wants to touch it.

This is the prompt debt spiral — and most teams in production are already inside it.

Prompt Localization Debt: The Silent Quality Tiers Hiding in Your Multilingual AI Product

· 9 min read
Tian Pan
Software Engineer

Your AI feature shipped with a 91% task success rate. You ran evals, iterated on your prompt, and tuned it until it hit your quality bar. Then you launched globally — and three months later a user in Tokyo files a support ticket that your AI "doesn't really understand" their input. Your Japanese users have been silently working around a feature that performs 15–20 percentage points worse than what your English users experience. Nobody on your team noticed because nobody was measuring it.

This is prompt localization debt: the accumulating gap between how well your AI performs in the language you built it for and every other language your users speak. It doesn't announce itself in dashboards. It doesn't cause outages. It just quietly creates second-class users.

Retrieval Monoculture: Why Your RAG System Has Systematic Blind Spots

· 10 min read
Tian Pan
Software Engineer

Your RAG system's evals look fine. NDCG is acceptable. The demo works. But there's a category of failure no single-metric eval catches: the queries your retriever never even gets close on, consistently, because your entire embedding space was never equipped to handle them in the first place.

That's retrieval monoculture. One embedding model. One similarity metric. One retrieval path — and therefore one set of systematic blind spots that look like model errors, hallucination, or user confusion until you actually examine the retrieval layer.

The fix is not a bigger model or more data. It's understanding that different query structures need different retrieval mechanisms, and building a system that stops routing everything through the same funnel.

Sandboxing Agents That Can Write Code: Least Privilege Is Not Optional

· 12 min read
Tian Pan
Software Engineer

Most teams ship their first code-executing agent with exactly one security control: API key scoping. They give the agent a GitHub token with repo:read and a shell with access to a working directory, and they call it "sandboxed." This is wrong in ways that become obvious only after an incident.

The threat model for an agent that can write and execute code is categorically different from the threat model for a web server or a CLI tool. The attack surface isn't the protocol boundary anymore — it's everything the agent reads. That includes git commits, documentation pages, API responses, database records, and any file it opens. Any of those inputs can contain a prompt injection that turns your research agent into a data exfiltration pipeline.

Shadow Traffic for AI Systems: The Safest Way to Validate Model Changes Before They Ship

· 10 min read
Tian Pan
Software Engineer

Most teams ship LLM changes the way they shipped web changes in 2005 — they run some offline evals, convince themselves the numbers look fine, and push. The surprise comes on Monday morning when a system prompt tweak that passed every benchmark silently breaks the 40% of user queries that weren't in the eval set.

Shadow traffic is the fix. The idea is simple: run your candidate model or prompt in parallel with production, feed it every real request, compare the outputs, and only expose users to the current version. Zero user exposure, real production data, and statistical confidence before anyone sees the change. But applying this to LLMs requires rethinking almost every piece of the implementation — because language models are non-deterministic, expensive to evaluate, and produce outputs that can't be compared with a simple diff.

The Shared Prompt Service Problem: Multi-Team LLM Platforms and the Dependency Nightmare

· 10 min read
Tian Pan
Software Engineer

On a Tuesday afternoon, the platform team at a mid-size AI startup merged a "minor improvement" to the shared system prompt. By Thursday, three separate product teams had filed bugs. One team's evaluation suite dropped from 87% to 61% accuracy. Another team's RAG pipeline started producing hallucinated citations. A third team's safety filter stopped catching a category of harmful outputs entirely. Nobody connected the dots for four days.

This is the shared prompt service problem, and it's coming for every organization that has more than one team building on a common LLM platform.

The Skill Atrophy Trap: How AI Assistance Silently Erodes the Engineers Who Use It Most

· 10 min read
Tian Pan
Software Engineer

A randomized controlled trial with 52 junior engineers found that those who used AI assistance scored 17 percentage points lower on comprehension and debugging quizzes — nearly two letter grades — compared to those who worked unassisted. Debugging, the very skill AI is supposed to augment, showed the largest gap. And this was after just one learning session. Extrapolate that across a year of daily AI assistance, and you start to understand why senior engineers at several companies quietly report that something has changed about how their team reasons through hard problems.

The skill atrophy problem with AI tooling is real, it's measurable, and it's hitting mid-career engineers hardest. Here's what the research shows and what you can do about it.

SRE for AI Agents: What Actually Breaks at 3am

· 10 min read
Tian Pan
Software Engineer

A market research pipeline ran uninterrupted for eleven days. Four LangChain agents — an Analyzer and a Verifier — passed requests back and forth, made no progress on the original task, and accumulated $47,000 in API charges before anyone noticed. The system never returned an error. No alert fired. The billing dashboard finally caught it, days after the damage was done.

This is not an edge case. It is the canonical AI agent incident. And if you are running agents in production today, your existing SRE runbooks almost certainly do not cover it.

Stateful Multi-Turn Conversation Infrastructure: Beyond Passing the Full History

· 11 min read
Tian Pan
Software Engineer

Every demo of a conversational AI feature does the same thing: pass a list of messages to the model and print the response. The happy path works, looks great in a Jupyter notebook, and gets you a green light to ship. Then you get to production, and your p99 latency starts creeping up during peak hours. A month later, a customer complains that the assistant "forgot" everything from earlier in the session. Six weeks after that, your session store hits its memory ceiling during a product launch.

The fundamental problem is that "pass the full conversation history" is not a session management strategy. It is the absence of one.

What Structured Outputs Actually Cost You: The JSON Mode Quality Tax

· 9 min read
Tian Pan
Software Engineer

Most teams adopt structured outputs because they're tired of writing brittle regex to extract data from model responses. That's a reasonable motivation. What they don't anticipate is discovering months later, when they finally measure task accuracy, that their "reliability improvement" also degraded the quality of the underlying content by 10 to 15 percent on reasoning-heavy tasks. The syntactic problem was solved. A semantic one was introduced.

This post is about understanding that tradeoff precisely — what constrained decoding actually costs, when the tax is worth paying, and how to build the evals that tell you whether it's hurting your system before you ship.

Synthetic Seed Data: Bootstrapping Fine-Tuning Before Your First Thousand Users

· 9 min read
Tian Pan
Software Engineer

Fine-tuning a model is easy when you have data. The brutal part is the moment before your product exists: you need personalization to attract users, but you need users to have personalization data. Most teams either skip fine-tuning entirely ("we'll add it later") or spend weeks collecting labeled examples by hand. Neither works well. The first produces a generic model users immediately recognize as generic. The second is slow enough that by the time you have data, the task has evolved.

Synthetic seed data solves this — but only when you understand exactly where it breaks.

Your RAG Knows the Docs. It Doesn't Know What Your Engineers Know.

· 10 min read
Tian Pan
Software Engineer

Your enterprise just deployed a RAG system. You indexed every Confluence page, every runbook, every architecture doc. Six months later, a senior engineer leaves — the one who knows why the payment service has that unusual retry pattern, why you never scale the cache past 80%, and which vendor never to call on Fridays. That knowledge was never written down. Your RAG system has no idea it existed.

This is the tacit knowledge problem, and it's why most enterprise AI systems underperform not because of retrieval quality or hallucination, but because the knowledge they need was never captured in the first place. Sixty percent of employees report that it's difficult or nearly impossible to get crucial information from colleagues. Ninety percent of organizations say departing employees cause serious knowledge loss. The documents your RAG can index are only the tip.