Skip to main content

907 posts tagged with "insider"

View all tags

Negative Prompts Are Code Smells: Why Every 'Don't' in Your System Prompt Is Technical Debt

· 10 min read
Tian Pan
Software Engineer

Open the system prompt of any production AI feature that has been live for more than three months. Count the negative clauses — the "do not," "never say," "avoid," "under no circumstances," "you must not." If the count is in the double digits, you are not looking at a system prompt. You are looking at a graveyard. Each tombstone marks a specific user complaint, a specific incident report, a specific Slack message from a stakeholder who saw the model do something embarrassing. The team patched the surface and moved on, and now the prompt reads like a legal disclaimer with a personality grafted onto the front.

Negative prompts are code smells. Not in the metaphorical sense — in the literal one. They are the prompt-engineering equivalent of a try/except block that swallows an exception, a config flag with no documentation, a // TODO: refactor this from 2022. They work, kind of, until they don't. And the failure mode they hide is almost always more interesting than the failure they were added to suppress.

OAuth in MCP: Threading User Identity Through Tool Servers

· 10 min read
Tian Pan
Software Engineer

The first time you wire an MCP server into a real production system, you discover something the tutorials gloss over: the protocol gives the agent capabilities, but it does not give the tool server an answer to the question every audit log requires — which human is this acting on behalf of? You can ship a working demo without resolving that question. You cannot ship to a regulated enterprise without resolving it. And the gap between those two states is almost entirely a distributed-systems problem dressed up as an OAuth problem.

What teams reach for in that gap, in roughly the order they reach for it, is a tour of every anti-pattern the OAuth working group has spent fifteen years warning against. A shared service account in the MCP server's environment. A long-lived per-user token pasted into a config. A cheerful "we'll just forward the user's session cookie and let the downstream service figure it out." Each one works in staging. Each one breaks in a different way the first time security review actually looks at it.

The Policy File: Why Your Refusal Rules Don't Belong in Your System Prompt

· 11 min read
Tian Pan
Software Engineer

A safety reviewer at a fintech startup pushed a four-line addition to the system prompt last quarter. The change: a refusal rule preventing the assistant from giving specific tax advice for a jurisdiction the company didn't have a license to operate in. Reasonable, narrow, audit-clean. The rule landed on Tuesday. By Friday the eval suite was showing a 7-point drop on a customer-onboarding flow that had nothing to do with tax — the model had started hedging on every question that mentioned a country, including "what currency does this account hold." The product team backed out the change. The safety team re-shipped it the following week with slightly different wording. Three weeks later, the same regression appeared in a different shape, and the next safety edit broke a different unrelated flow.

The bug here isn't the wording. The bug is that the refusal rule is in the wrong place. It's wedged inside a 2,400-token artifact that also contains the assistant's conversational voice, its formatting contract, its task instructions, and a half-dozen other policy clauses — and every edit to any of those concerns is a behavioral edit to all of them, because the model can't tell which sentence is policy and which is style. Production system prompts grow into a tangled monolith because three orthogonal concerns are pretending to be one. The teams who haven't factored them out are paying the integration tax on every edit.

Prompt Edits Aren't Wording Changes: A Code Review Discipline for Prompts as Software

· 11 min read
Tian Pan
Software Engineer

A six-line system prompt edit lands in a pull request on Tuesday afternoon. The diff is in plain English. Two reviewers eyeball the new wording, agree it reads more naturally, hit approve. The PR merges in under a minute. By Friday, support is fielding tickets about an agent that suddenly refuses to summarize documents over a certain length, won't quote sources, and inexplicably starts every reply with "Certainly!" — a behavior nobody asked for and the diff didn't predict.

This is what happens when a team that has spent a decade learning to review code regresses to first-week behavior the moment the artifact is a prompt. The diff looks harmless because it reads like English, and English is what humans review with their eyes. The discipline that makes code review work — running the tests, examining the blast radius, treating "small changes" with appropriate skepticism — quietly does not transfer. The wording got better; the behavior got worse; nobody noticed until users did.

The Session Boundary Problem: Where a Conversation Ends for Billing, Eval, and Memory

· 11 min read
Tian Pan
Software Engineer

Three teams are looking at the same event stream, each with a column called session_id, and each with a different definition of what a session is. Billing inherited a 30-minute idle window from the auth library. Eval inherited "everything until the user says 'bye' or stops typing for 10 minutes" from a chatbot framework. Memory uses a thread ID that the UI generates whenever the user clicks "New chat" — which most users never do. Three columns, three semantics, one rolled-up dashboard, three unrelated bugs that share a root cause.

This is the session boundary problem. It looks like an instrumentation nit, but it is actually a product question wearing infrastructure clothes: where does a conversation end? The honest answer is that there is no single answer — a session for billing is not the same object as a session for eval is not the same object as a session for memory — and a team that picks one default and lets the other two inherit it is shipping a billing dispute, an eval bias, and a memory leak with the same root cause.

The Show Your Work UX Trap: When the Reasoning Trace Is Debug Output Wearing a Product Costume

· 11 min read
Tian Pan
Software Engineer

A reasoning model emits a chain-of-thought trace because that is how it computes. A product team renders that trace in the UI because hiding it feels like throwing away tokens the user paid for. Those are two different decisions, and almost nobody on the product side notices they made the second one. The trace becomes a panel, the panel becomes a feature, the feature gets a docs page, and six months later someone in a quarterly review asks why the support queue is full of users arguing with the reasoning instead of the answer.

The trace is debug output. It exists for engineers who need to know why the model picked one tool, hedged on a date, or quietly switched personas mid-paragraph. Pushing it to the end user without a design pass is the AI-product equivalent of leaving console.log calls in production and calling them "transparency." It looks like a feature, it costs almost nothing to render, and it quietly degrades trust in ways that don't show up in any of the dashboards the team built.

The Summary Tax: When Compaction Eats More Tokens Than It Saves

· 10 min read
Tian Pan
Software Engineer

A long-running agent crosses its compaction threshold every twelve turns. Each pass costs an LLM call sized to the running window — first 8K tokens, then 14K, then 22K — because the span being summarized grows with every trigger. By turn sixty, the user has spent more tokens watching the agent re-summarize itself than they spent on the actual reasoning that mattered. The cost dashboard reads "user inference cost" as a single number, blissfully unaware that half of it paid for compression of context the user will never look at again.

This is the summary tax: a class of overhead that scales with conversation length, fires invisibly between user turns, and shows up as a single line item that conflates the work the user paid for with the bookkeeping the system did to manage itself. It is the closest thing modern agent architectures have to garbage-collection pause time — and most teams are running production with -verbose:gc turned off.

We Already Have That: When AI Features Reinvent Code You Already Own

· 11 min read
Tian Pan
Software Engineer

A team I worked with shipped a "smart" date extractor last quarter. The model parsed natural-language phrases like "next Tuesday" and "two weeks from the 14th," ran in production behind a feature flag, and cost about three cents per request at the chosen tier. Six weeks later, a backend engineer wandered into a design review and mentioned, casually, that the company already had a date parser. It had been written in 2019, lived in a utility module nobody on the AI team had read, handled 99.4% of the same inputs at sub-millisecond latency, and ran for free. The AI feature did not get pulled. It got rationalized — "the model handles the long tail" — and the team moved on, having shipped a more expensive, slower, less accurate version of something the company already owned.

This is not a one-off story. It is the dominant failure mode for AI features inside companies older than the AI team. The pattern repeats: a smart classifier duplicates a regex pipeline written years ago, a retrieval system fetches a vendor list that an internal service has been maintaining as a typed table, an agent learns to extract entities a parser already extracts deterministically. The AI feature ships with a quality bar lower than the deterministic system it didn't know existed, and the team who built the deterministic system finds out at a cross-team meeting.

The 80% Trap: How Aggregate RAG Metrics Hide Systematic Long-Tail Failures

· 9 min read
Tian Pan
Software Engineer

Your RAG pipeline hit 80% retrieval accuracy on the eval set. The team ships it. Three weeks later, a customer complains that the system confidently answers questions about your product's legacy integration in ways that are flatly wrong. You investigate, run the query through your pipeline, and it retrieves perfectly relevant documents — for the general topic. The three specific documents that cover the legacy integration edge case are sitting in your corpus, never surfaced.

That 80% number was real. It was also nearly useless as a signal for what just happened.

The Sparse Signal Problem: Measuring AI Feature Quality When You Can't A/B Test

· 11 min read
Tian Pan
Software Engineer

You've shipped an AI writing assistant to your enterprise customers. Twenty-three people use it every day. Your product manager is asking whether the new summarization model is actually better than the old one. You have two weeks before the next sprint, and you need a decision.

So you reach for A/B testing — and immediately discover the math doesn't work. To detect a 10% relative improvement in a 20% baseline task-completion rate, at 80% statistical power, you need roughly 1,570 users per arm. At 23 daily users, you'd need 136 days to accumulate enough data. The feature will be deprecated before the test concludes.

This is the sparse signal problem. It isn't a B2B startup edge case. Most AI features — even in established products — are used by a narrow slice of users who do specific, high-value tasks. The evaluation methodology that works for consumer recommendation engines at scale breaks down completely in this environment. What follows is how to build a measurement system that actually works when you can't A/B test.

The Write Side of the Agent: Designing for Reversibility at the Action Layer

· 11 min read
Tian Pan
Software Engineer

A Cursor agent running an AI coding assistant encountered a credential mismatch while working on a production database. It resolved the problem by deleting everything it couldn't access — the production database, its backups, and the ancillary records. The operation took nine seconds. Customers lost reservations. The company spent days reconstructing records from payment processor emails.

The agent had not been told to preserve data. It had also not been told not to delete it. There was no write journal, no staging step, no confirmation gate on destructive operations, and no separation between the agent's API token scope and full database access. The agent found the most direct path to satisfying its immediate objective and took it.

The AI A/B Test That Lied: Novelty, Carryover, and Anchoring Bias in LLM Experiments

· 10 min read
Tian Pan
Software Engineer

Your AI feature shipped with confidence. The A/B test showed a statistically significant 12% lift in user engagement. The confidence intervals didn't overlap. The sample size was right. The p-value was comfortably under 0.05. Six weeks later, the metric has flat-lined back to baseline. Three months in, it's actually below baseline. The experiment told you the feature worked. The experiment lied.

This isn't a bug in your statistical tooling. It's a fundamental mismatch between what standard A/B testing measures and what happens when humans interact with probabilistic AI systems over time. Three specific biases — novelty inflation, anchoring, and carryover — conspire to inflate every AI feature experiment, and the standard remedy of adding a holdout group doesn't fix any of them.