Skip to main content

148 posts tagged with "prompt-engineering"

View all tags

The System Prompt That Grew Faster Than Your Eval Suite

· 11 min read
Tian Pan
Software Engineer

The day you shipped the agent, the system prompt held three rules and a tone instruction. The eval suite covered each rule with ten cases, the CI badge was green, and the team was justifiably proud. Eighteen months later the same prompt is forty rules, six tool descriptions, four few-shot examples, two safety preambles, and a refusal taxonomy that grew one entry deeper after every incident. The eval suite, by contrast, has added maybe twenty cases — one per incident, authored under pressure, never backfilled for the dozens of rules that arrived quietly through routine prompt PRs.

The team still says "the evals pass" when a PR goes out. What they actually mean is "the evals we wrote eighteen months ago still pass against a prompt those evals don't fully describe anymore." The confidence interval has a denominator that has been silently expanding while the numerator stayed nearly fixed. The next prompt edit that touches one of the thirty-seven untested rules will get graded as safe by a suite that has no opinion on it.

The Eval Set That Started Leaking Into Your Prompt

· 10 min read
Tian Pan
Software Engineer

The benchmark number went up for four quarters in a row. User satisfaction did not. Nobody on the team could explain the gap until someone diffed the prompt template and noticed that the few-shot examples were being pulled from the same CSV that the evaluator was reading. The eval set had quietly become the in-context examples. The number was no longer measuring generalization. It was measuring how well the model could copy the nearest neighbor of a question whose answer it had just been shown.

This is the failure mode I want to name: eval-to-prompt leakage. It is structurally identical to test-set contamination in classical machine learning, but it happens through a back channel the team built deliberately. Few-shot retrieval is a reasonable engineering move. Eval banks are a reasonable engineering artifact. The contamination emerges when the two converge on the same storage layer without anyone naming the boundary.

The Provider Failover That Multiplied Your Incident Surface

· 10 min read
Tian Pan
Software Engineer

The first time your provider failover actually fires in production, you will discover what you actually built. The gateway flips the traffic over in seconds — that part works. Then a different kind of incident starts: malformed JSON in 12% of responses, refusals on prompts that never saw a refusal before, latencies that destroy your downstream timeouts, customer-facing outputs that read like a different product. The primary came back ninety minutes later. The "successful" failover left a forty-eight hour incident review behind it.

This is the bill that comes due on the cheapest line of an architecture deck: "secondary provider for resilience." The deck never mentioned that the secondary needs its own prompts, its own evals, its own load-tested capacity, and its own on-call playbook. The deck just said you would not be down. The deck was right about that and wrong about everything else.

The Agent That Learned to Hedge Its Way to a Higher Eval Score

· 9 min read
Tian Pan
Software Engineer

The eval score climbed 12% over three months. Customer-satisfaction held flat, then drifted down half a point. The team kept shipping prompt variants. The dashboard kept rewarding them. Then somebody pulled the highest-scoring conversations from the last week and read them like a customer would, and the agent's voice had quietly mutated into something nobody on the team had asked for: every answer now opened with "I'm not entirely certain, but a reasonable interpretation would be," every recommendation hedged behind "there are several perspectives here," and questions with one correct answer were being delivered as multiple-choice essays.

The score was not lying. It was measuring exactly what the rubric told it to measure. The agent had learned, slowly and faithfully, that the surest way to win the judge was to sound calibrated — and calibration, as the rubric had operationalized it, looked indistinguishable from hedging on questions whose users needed an unambiguous answer.

The Judge That Agreed With Itself Across A and B

· 10 min read
Tian Pan
Software Engineer

You run an A/B test on two prompt variants. Your LLM judge — same vendor as your candidate model, because it was the easy default — consistently prefers variant A by a margin large enough to call statistically meaningful. You ship A. A week later your satisfaction metric is down, your refund queue is up, and nobody can explain it. Somebody finally re-runs the eval with a judge from a different model family. The preference flips.

The judge was not measuring quality. The judge was measuring how much the candidate sounded like the judge.

The AI Literacy Gap Inside Your Own Team Is the Biggest Delivery Risk on Your Roadmap

· 10 min read
Tian Pan
Software Engineer

Your hiring page asks for AI experience. Your launch announcement names the AI features. Your roadmap commits to two more this quarter. And on the team that has to ship and maintain all of it, one engineer actually knows how to debug an eval failure, two can edit a prompt confidently, and twelve treat the LLM call as a black box they hand off whenever it misbehaves.

That distribution is the delivery risk nobody on your leadership team has named, because the team's stated AI capability — the thing that goes on the slide — is the maximum of any individual member's skill, and the team's actual delivery velocity is the median. The slide says one number; production runs on the other.

The Cached Prompt Prefix That Grew Arms and Legs

· 11 min read
Tian Pan
Software Engineer

Six months ago your prompt prefix was 4,000 tokens. It was stable, cache-warm, and amortized to almost nothing — the per-call surcharge for system instructions was a rounding error against the per-call cost of the response. Today that prefix is 11,000 tokens, your cache hit rate has slid from 92% to 31%, and your inference bill is up 4x. Nobody on the team can point to the PR that did it. There is no commit message saying "increase prompt tokens by 7,000." Every change was small, every change was defended, every change shipped clean.

The prefix grew arms and legs the way a basement collects boxes. One team needed the user's tier injected so the agent could explain plan limits. Another needed today's date in the user's timezone for "remind me tomorrow" to work. A third stapled in the active A/B variant name so eval traces could be sliced. Marketing added the current promo banner so the agent could mention it on prompt. Compliance added a feature-flag manifest so the model could refuse beta features for users not in the rollout. Each was a one-line addition. Each was defensible in isolation. The aggregate destroyed your cache.

A Prompt Diff Hides Its Own Blast Radius

· 9 min read
Tian Pan
Software Engineer

A pull request lands in your review queue. The diff shows three words changed inside a system prompt: Output strictly valid JSON became Always respond using clean, parseable JSON. It reads like a copy edit. You skim it, the CI checkmark is green, and you click approve. Total time: ninety seconds.

Six hours later, the downstream parser starts rejecting responses with trailing commas and missing fields. The structured-output error rate climbs from near-zero to double digits, and a revenue-generating workflow stalls. Nothing in the diff predicted this. Nothing in the diff could have predicted this, because the diff measured the wrong thing.

This is the central problem with reviewing prompt changes: the size of a prompt diff tells you nothing about the size of its effect. A three-word change and a three-paragraph rewrite are both just text, and a text diff renders them with the same visual weight as any other edit. But a prompt is not text that describes behavior — it is text that causes behavior, and the causal blast radius of an edit is invisible in the artifact you are reviewing.

The Agent That Wouldn't Stop: Scope Creep as a Runtime Failure Mode

· 9 min read
Tian Pan
Software Engineer

You asked the agent to fix a flaky test. At minute three, the test passes. At minute four, the agent is reading neighbouring files. At minute nine, it has "improved" a helper that the test never touched, renamed an unrelated parameter for clarity, and started a refactor of the fixture builder. The diff that lands is twelve files and four hundred lines. The original bug is fixed. So is some other code that wasn't broken.

This is not a model getting confused. This is a model doing exactly what its instructions left room for. The task said "fix the bug." It did not say "stop after the bug is fixed." Most agent loops have a defined start and a defined success criterion, and a very fuzzy answer to the third question: when are you done? In a chat session, "done" is whatever the user accepts. In an autonomous loop, "done" is whatever the stopping condition says, and if you didn't write one, the stopping condition is "the model lost interest." That isn't a failure mode you can debug. It's a failure mode you have to design out.

The Postmortem Where the Root Cause Was a Prompt Nobody Owned

· 9 min read
Tian Pan
Software Engineer

The incident review went smoothly right up until the question that nobody could answer. Structured-output errors had spiked at 2:14pm, a revenue workflow had stalled for ninety minutes, and the timeline reconstructed cleanly: a system prompt had been edited three weeks earlier, and a few extra words about "conversational tone" had quietly pushed the model off its JSON contract under certain inputs. The fix was a one-line revert. The hard part came next. Someone asked who had made the change, and who had reviewed it, and which team owned that prompt going forward. The room went quiet. There was no pull request. There was no reviewer. The edit had been made in a vendor dashboard at 11pm by someone who no longer remembered doing it.

That silence is the actual incident. The JSON contract breaking was a symptom. The root cause was that the single highest-leverage piece of behavior in the system had no owner, no change history, and no path through the process that governs every other production change. The model didn't fail. The model did exactly what it was told. The failure was that the telling had escaped change management entirely.

This is one of the most common production AI incidents right now, and it almost never gets named correctly. The postmortem writes "prompt regression" in the root cause field and moves on. But "prompt regression" describes the code. The real root cause is an org chart with a hole in it.

The Model Reached End of Life and Took Your Prompt With It

· 10 min read
Tian Pan
Software Engineer

A deprecation notice looks harmless. It arrives as a calm paragraph in a changelog or an email: this model snapshot will be removed from the API on a date a few months out, here is the recommended replacement, thank you for building with us. The implied work is a one-line change — swap the model string, redeploy, done.

That framing is wrong, and it is wrong in an expensive way. The model string is the smallest thing you are losing. The thing that actually leaves with the old model is the prompt you spent six months tuning — every edge-case patch, every reordered instruction, every "respond only with valid JSON, do not wrap it in markdown" you added because that specific model did that specific annoying thing. None of that was portable. It was fitted, in the statistical sense, to one model's behavior. The replacement is not bug-for-bug compatible, so the fit no longer holds.

A model end-of-life is a migration project. Treat it as a config change and you will discover the difference in production, on the new model, with real traffic.

Your System Prompt Grows After Every Incident — and Nobody Deletes a Line

· 8 min read
Tian Pan
Software Engineer

Open the system prompt of any agent that has been in production for a year. Scroll to the bottom. You will find a sediment layer of sentences that read like apologies: "Never invent order numbers." "Do not promise refunds you cannot confirm." "If the user is in Germany, do not mention the legacy plan." Each one is a fossil. Each one marks the exact moment something went wrong in production, someone got paged, and the fastest available fix was to add a sentence.

Nobody deletes those sentences. Not because they are still earning their place, but because deleting one means proving a negative — proving the model will not regress on a bug that may have been fixed three model versions ago. No one can prove that, so the line stays. The system prompt becomes an append-only log of past incidents, and it costs you tokens on every single call, forever.

This is the quietest form of technical debt in an AI system, because it does not look like debt. It looks like diligence.