Skip to main content

136 posts tagged with "prompt-engineering"

View all tags

Prompt Portfolios: Manage a Basket, Not a Single Best Prompt

· 10 min read
Tian Pan
Software Engineer

Most production AI teams talk about prompts the way junior traders talk about stocks: there is one best one, and the job is to find it. So they iterate — a Slack thread, a few eval rows, a new winner, push to main, repeat. The result is a single artifact carrying the entire intent-resolution surface of the product, optimized against a frozen evaluation set, sitting one regrettable edit away from a P1.

The mistake is the singular. A prompt is not a security; it is an allocation. The same user intent can be served well by several variants, each with its own confidence interval, its own per-segment performance, and its own sensitivity to model and corpus drift. The right mental model is not "find the best prompt" — it is "manage a basket of prompts whose composition is itself the product." Quantitative finance figured this out fifty years ago, and the operational machinery transfers almost without modification.

Prompts Don't Roll Back Like Code: Why git revert Is the Wrong Primitive

· 9 min read
Tian Pan
Software Engineer

A senior engineer ships a prompt change behind a 10% canary. By the next morning, the canary cohort's helpfulness score has dropped four points, the on-call notices, and the team does what every team does — they revert the commit and redeploy. The dashboard does not recover. It does not recover the next day either. Three days later, a postmortem reveals that the cohort that saw the bad prompt is still seeing degraded outputs because their conversation histories now contain assistant turns produced by the rolled-back prompt, and the model is conditioning on those turns. The commit is gone. The damage is not.

This is the part of LLMOps that the "treat prompts like code" advice quietly skips. Code rollback is a text replacement that restores a deterministic past state. Prompt rollback has to reconcile with a tail of side effects — caches, histories, eval baselines, experiment cohorts, downstream contracts — that the bad prompt has already imprinted on the production world. git revert flips the text. It does not flip the consequences.

Tenancy Leaks Through Few-Shot Examples: When Your Prompt Library Becomes a Cross-Customer Data Store

· 11 min read
Tian Pan
Software Engineer

Open the production system prompt of a maturing AI product, scroll past the role description, and you will almost always find a section labeled # Examples or ## Few-shot demonstrations. The examples are excellent — they are concrete, they are domain-specific, they pattern-match exactly the failure modes the eval set was struggling with last quarter. They are also, on closer inspection, real customer data. A real ticket ID from a real account. A phrasing pattern lifted verbatim from a support thread. An internal product code that one tenant uses and the rest of the customer base has never heard of.

The team that put them there is not careless. The examples got into the prompt the way good examples always get into prompts: someone mined production traces for cases the model handled poorly, picked the cleanest worked example, pasted it into the system message, watched the eval scores climb, and shipped. That pipeline — production trace to system prompt — is the most reliable prompt-improvement loop in modern LLM engineering. It is also a structural cross-tenant data leak that the team built without noticing, and the system prompt has quietly become a multi-tenant data store the data-processing agreement never priced.

The Five Definitions of 'Now' Inside Your LLM Prompt

· 11 min read
Tian Pan
Software Engineer

A customer support agent told a user "based on our latest pricing, as of today" and quoted last quarter's price sheet. The system prompt interpolated today is {current_date} correctly. The retrieval layer pulled the document with the highest freshness score. The model answered confidently. Every component did exactly what it was specified to do, and the user got a wrong answer that the on-call engineer could not reproduce because, by the time they replayed the trace at 9pm, "today" was a different day.

This is not a rare bug. It is a failure mode that lives in almost every production LLM pipeline because "now" is implicit in the prompt at five different layers, and those layers were authored at different times, by different people, against different definitions of the present. As long as a request runs synchronously from a foreground user session, the layers mostly agree. The moment the request is replayed for debugging, batch-processed overnight, run from an eval harness pinned in March, or queued and consumed an hour later, the layers start disagreeing — and the model produces an answer that is internally consistent within its prompt but externally wrong.

The Prompt Bench Press: Stress-Testing Prompts Outside the Happy Path

· 10 min read
Tian Pan
Software Engineer

A prompt that scores 92% on your eval set and 60% on real production traffic is not a prompt with a bug. It is a prompt whose evaluation set was structurally incapable of finding the bug. The gap is not noise. It is the consequence of optimizing against examples that share a register, a length distribution, a language, and a politeness level with the prompt's design intent — the very same intent that wrote the eval cases.

Real users do not cooperate with your design intent. They send three-word fragments, twelve-paragraph essays, code blocks pasted as questions, casual register that drops articles, formal register that adds honorifics, and queries in languages your few-shot examples never used. None of this is adversarial. It is just the input distribution. And if your eval set was curated by the same person who wrote the prompt, it almost certainly looks nothing like that distribution.

The discipline that closes this gap is not "more evals." It is a different kind of eval — a stress matrix that deliberately varies the dimensions your curated set holds constant, and that grades degradation curves rather than a single accuracy number. Call it the prompt bench press: you are not testing whether the prompt can do the work. You are testing how it fails as the input gets harder.

Sampling Drift: When Temperature and Top-P Become Tribal Knowledge

· 9 min read
Tian Pan
Software Engineer

Open the production config of any AI feature that has been live for more than a year and you will find an archaeological dig site. temperature: 0.7 because someone needed the demo to feel less robotic. top_p: 0.85 because a customer complained the outputs were too generic. frequency_penalty: 0.4 because there was a bad week in 2024 where a now-retired model kept repeating itself. None of these decisions are documented. None of them have been re-tested against the current foundation model. They run on every request, in every eval, in every A/B, shaping behavior nobody has consciously chosen since the original ticket got closed.

This is sampling drift. It is the slow accumulation of expedient sampler tweaks whose original justifications evaporate while their effects compound. The values in your config are not "tuned" — they are a fossil record of past incidents, scaled to the volume of your current traffic.

The reason it is invisible is structural. Every eval you run scores against the current sampling config, so the headline number always looks fine. There is no alarm that fires when a temperature value is two foundation-model versions out of date. There is no calendar invite that says "re-grid sampling parameters this quarter." The decay is silent until somebody runs a clean experiment and finds a quality lift, a token reduction, or both, sitting in plain sight at no engineering cost.

The Specification Translation Tax: When Spec, Prompt, and Eval Drift Apart

· 11 min read
Tian Pan
Software Engineer

A PM writes a feature spec in English. An engineer translates it into a system prompt with idiomatic LLM patterns — chain-of-thought scaffolding, output format coercion, a few hedge clauses to cover failure modes the spec never mentioned. An eval author opens the same spec, re-reads it cold, and writes JSON test cases against their interpretation. Three weeks later, all three artifacts disagree, and nobody can tell whether a regression is a prompt bug, a spec-implementation gap, or an eval that was wrong from day one.

This is the specification translation tax. Traditional software has it too — the gap between PRD and code, between code and tests — but compilers and type systems narrow it. AI features have no such backstop. The prompt is documentation that the system actually reads. The eval is a contract that nobody signed. The spec is a description of intent that nobody enforces. Each is a translation of the same intent into a different medium, and without bidirectional consistency, behavior leaks in through whichever artifact is easiest to edit.

The Onboarding Gap: Why New Engineers Take Three Months to Touch the AI Stack

· 9 min read
Tian Pan
Software Engineer

A backend engineer with eight years of experience joins your team. By week three on a normal codebase, they would be shipping features. On the AI surface, they are still asking questions in DMs, and you can predict which two senior engineers they are asking. Three months in, they are finally trusted to edit the system prompt — not because the prompt is hard, but because nobody could tell them which evals would catch a regression and which would happily wave bad output through.

This is not a hiring problem or a documentation problem in the usual sense. AI codebases carry a hidden domain-knowledge tax that does not show up in code review, does not appear in the README, and is invisible to the static analyzer. The tax is paid in onboarding time, in repeated questions to the same two people, and eventually in a team that quietly bifurcates into "the people who can touch it" and "everyone else."

The Frozen Prompt: When Your Team Is Afraid to Edit a System Prompt That Works

· 13 min read
Tian Pan
Software Engineer

Every mature AI product eventually grows a system prompt that nobody on the current team fully understands. It started as forty tokens of plain English, and twenty months later it is a 4,000-token wall of conditional clauses, refusal templates, formatting rules, persona reinforcements, edge-case warnings, and one peculiar sentence about Tuesdays that nobody can explain. Each line was added in response to a specific failure: a customer complaint, a Slack ping from legal, a regression caught by an eval, a one-off bug that surfaced during an investor demo. The engineer who wrote line 37 has rotated to another team. The engineer who wrote line 112 was a contractor whose Notion doc was archived. The eval suite covers maybe a third of the behaviors the prompt is asserting, and nobody is sure which third.

So the prompt becomes load-bearing in the worst possible way: it works, the team knows it works, and the team has stopped touching it. Engineers who should be iterating on the prompt route their changes around it instead — adding a post-processing filter here, a few-shot wrapper there, a parallel "v2 prompt" feature-flagged off in case anyone ever finds the courage to A/B test the replacement. The prompt has stopped being software and has become a relic. And once that happens, the prompt is no longer the lever you use to improve the product. It's the constraint shaping it.

Negative Prompts Are Code Smells: Why Every 'Don't' in Your System Prompt Is Technical Debt

· 10 min read
Tian Pan
Software Engineer

Open the system prompt of any production AI feature that has been live for more than three months. Count the negative clauses — the "do not," "never say," "avoid," "under no circumstances," "you must not." If the count is in the double digits, you are not looking at a system prompt. You are looking at a graveyard. Each tombstone marks a specific user complaint, a specific incident report, a specific Slack message from a stakeholder who saw the model do something embarrassing. The team patched the surface and moved on, and now the prompt reads like a legal disclaimer with a personality grafted onto the front.

Negative prompts are code smells. Not in the metaphorical sense — in the literal one. They are the prompt-engineering equivalent of a try/except block that swallows an exception, a config flag with no documentation, a // TODO: refactor this from 2022. They work, kind of, until they don't. And the failure mode they hide is almost always more interesting than the failure they were added to suppress.

The Policy File: Why Your Refusal Rules Don't Belong in Your System Prompt

· 11 min read
Tian Pan
Software Engineer

A safety reviewer at a fintech startup pushed a four-line addition to the system prompt last quarter. The change: a refusal rule preventing the assistant from giving specific tax advice for a jurisdiction the company didn't have a license to operate in. Reasonable, narrow, audit-clean. The rule landed on Tuesday. By Friday the eval suite was showing a 7-point drop on a customer-onboarding flow that had nothing to do with tax — the model had started hedging on every question that mentioned a country, including "what currency does this account hold." The product team backed out the change. The safety team re-shipped it the following week with slightly different wording. Three weeks later, the same regression appeared in a different shape, and the next safety edit broke a different unrelated flow.

The bug here isn't the wording. The bug is that the refusal rule is in the wrong place. It's wedged inside a 2,400-token artifact that also contains the assistant's conversational voice, its formatting contract, its task instructions, and a half-dozen other policy clauses — and every edit to any of those concerns is a behavioral edit to all of them, because the model can't tell which sentence is policy and which is style. Production system prompts grow into a tangled monolith because three orthogonal concerns are pretending to be one. The teams who haven't factored them out are paying the integration tax on every edit.

Prompt Edits Aren't Wording Changes: A Code Review Discipline for Prompts as Software

· 11 min read
Tian Pan
Software Engineer

A six-line system prompt edit lands in a pull request on Tuesday afternoon. The diff is in plain English. Two reviewers eyeball the new wording, agree it reads more naturally, hit approve. The PR merges in under a minute. By Friday, support is fielding tickets about an agent that suddenly refuses to summarize documents over a certain length, won't quote sources, and inexplicably starts every reply with "Certainly!" — a behavior nobody asked for and the diff didn't predict.

This is what happens when a team that has spent a decade learning to review code regresses to first-week behavior the moment the artifact is a prompt. The diff looks harmless because it reads like English, and English is what humans review with their eyes. The discipline that makes code review work — running the tests, examining the blast radius, treating "small changes" with appropriate skepticism — quietly does not transfer. The wording got better; the behavior got worse; nobody noticed until users did.