158 posts tagged with "prompt-engineering"

Your AI Explainer Doc Is a Runtime Dependency, Not Marketing Copy

April 28, 2026 · 12 min read

Software Engineer

A team I worked with last quarter shipped an AI assistant with a tidy stack of supporting documents: an in-product tooltip warning that the AI may produce inaccurate results, a help-center article titled "How does the assistant work," an internal support runbook for handling escalations, and a public model card listing the underlying model, the tools the assistant could call, and the data domains it covered. The launch went well. Six months later the prompt had been edited fourteen times, the model had been swapped from one tier to another with subtly different refusal behavior, two new tools had been added, one tool had been deprecated but not removed from the prompt, and the language settings had been opened from English-only to nine locales.

Every single one of those documents was wrong. Not catastrophically wrong — the kind of wrong where a sentence is half-true, a capability is described in language the model no longer matches, a refusal pattern is documented that the new model never triggers, a tool name appears in the help article that the assistant won't actually call. The kind of wrong that produces a slow drip of confused support tickets, a few customer trust regressions when the AI does something the docs say it won't, and — because the company sells into a regulated vertical — a small but real compliance gap that nobody on the AI team had thought to track.

The 'Try a Bigger Model' Reflex Is a Refactor Smell

April 28, 2026 · 10 min read

Tian Pan

Software Engineer

A regression lands in standup: the support agent answered three customer questions wrong overnight. Someone says, "let's try Opus on this route and see if it fixes it." Forty minutes later the eval pass rate ticks back up, the team closes the ticket, and the inference bill quietly tripled on that path. Six weeks later the same shape of regression appears on a different route, and the same fix is applied. Your team has just trained a Pavlovian reflex: quality regression → escalate compute. The bigger model is the most expensive debugging tool in your stack, and you're now reaching for it first.

The trouble isn't that bigger models don't help. They do — sometimes a lot. The trouble is that bigger models are a strictly dominant masking strategy. When the prompt has a conflicting instruction, the retrieval is returning stale chunks, the tool description is being misread, or the eval set doesn't cover the failing distribution, a more capable model will round the corner of the failure without fixing any of those things. The next regression has the same root cause, the bill has compounded, and the underlying system is more brittle, not less, because the slack created by the upgrade kept anyone from looking under the hood.

Eval Differential as Branch Protection: Ship Score Diffs, Not Score Floors

April 28, 2026 · 10 min read

Tian Pan

Software Engineer

A team I worked with had a clean-looking eval gate: every prompt PR had to score above 0.85 on the golden set or the merge button stayed grey. They were proud of it. Six weeks in, average quality had quietly drifted from 0.93 to 0.87 — every PR cleared the bar, every PR landed, and no individual change owned the regression because none of them broke the rule. The bar was set against a snapshot of last quarter's quality, not against last week's.

That's the failure mode of an absolute-threshold eval gate: a PR that drops the score from 0.92 to 0.86 ships green, while a PR that lifts the score from 0.80 to 0.84 fails the same gate. The team learns "ship if it clears the bar" — a quality story. The signal you actually want is "ship if this change is non-regressive on the slices that matter" — a regression-detector story.

Coverage tools figured this out a decade ago. They report the diff against the parent commit and they break it down per file. Eval gates haven't caught up.

The Model-Preference Fork: Why Your Prompt Library Has Three Versions and No One Is Tracking the Drift

April 28, 2026 · 11 min read

Tian Pan

Software Engineer

Open the prompt library of any team that has been shipping LLM features for more than a year and you will find the same thing: three slightly different versions of every prompt. One was tuned by the engineer who likes Sonnet for its instruction-following. One was rewritten by the engineer who switched to Haiku for the latency budget. One belongs to the prototype that only ever worked on Opus and never got migrated. Each version has a slightly different system message, a different way of describing the tool catalog, a different formatting nudge — and nobody is tracking how they drift.

This is not a hygiene problem. It is a coordination tax that compounds at every model upgrade, and it is silently breaking the relationship between your eval suite and your production traffic. The library is supposed to be a shared resource. In practice, every feature ships with whichever variant the author last tested, the eval suite runs against the variant the eval-author preferred, and the routing layer chooses among them based on cost rather than on which variant was actually validated against the live eval.

The team that doesn't notice is the team that's already paying.

Prompt Deprecation Contracts: Why a Wording Cleanup Is a Breaking Change

April 28, 2026 · 9 min read

Tian Pan

Software Engineer

A four-word edit on a system prompt — "respond using clean JSON" replacing "output strictly valid JSON" — once produced no eval movement, shipped on a Thursday, and was rolled back at 4am Friday after structured-output error rates went from 0.3% to 11%. The prompt did not get worse. It got different, and the parsers downstream of it had been pinned, without anyone noticing, to the literal phrase "strictly valid."

This is the failure mode that most prompt-engineering teams have not yet built tooling for: the prompt was treated as text the author owned, when it was in fact a contract with consumers the author never met. Some of those consumers are other prompts that quote the original verbatim. Some are tool descriptions whose JSON schema fields anchor on a particular adjective. Some are evals whose rubrics ask the judge to check for "the strictly valid format." And some are parsers — the most brittle category — whose regexes were calibrated to the exact preamble the model used to emit.

A "small wording cleanup" silently breaks parsers, shifts judge calibration, and invalidates weeks of eval runs. None of these failures show up on the PR. All of them show up on the dashboard a week later as drift.

Prompt Linting Is the Missing Layer Between Eval and Production

April 28, 2026 · 11 min read

Tian Pan

Software Engineer

The incident report read like a unit-test horror story. A prompt edit removed a five-line safety clause as part of a "preamble cleanup." Every eval in the suite passed. Every judge score held within tolerance. Two weeks later, a customer-facing assistant produced a response that should have been refused, the kind that triggers a Trust & Safety page at 11pm. The post-mortem traced the regression to a single deletion in a PR that nobody had flagged because the suite that was supposed to catch regressions had no opinion on whether the safety clause was present — it only had opinions on whether the model behaved well in the cases the suite remembered to ask about.

This is the gap between behavioral evals and structural correctness. Evals measure what the model produces; they do not measure what the prompt is. And prompts, like code, have a structural layer that exists independently of behavior — sections that must be present, references that must resolve, variables that must interpolate, length budgets that must hold, deprecated identifiers that must not appear. When that structural layer breaks, the behavior often stays green for a while, until the right edge case in production surfaces the failure as an incident.

Prompt Position Is Policy: The Silent Merge Conflict When Three Teams Co-Own a System Prompt

April 28, 2026 · 11 min read

Tian Pan

Software Engineer

The diff in your prompt repo says three lines changed. The behavioral diff in production says everything changed. The safety team moved a refusal rule from line 14 to line 87 to "group it with related guardrails," the product team didn't notice because the wording was identical, and a week later the eval suite is showing a 9-point drop on adversarial inputs. Nobody edited the rule. Somebody moved it. In a 2,400-token system prompt with primacy bias on guardrails and recency bias on instruction-following, moving a rule is a behavioral change as load-bearing as rewriting it — and your tooling surfaces neither.

This is the merge-conflict pattern that AI teams discover at the end of a regression review, not the beginning of one. The system prompt grew past 2K tokens sometime in late 2025. The safety team owns the top, the product team owns the middle, the agent team owns the bottom, and three months of "small edits" have silently rearranged everyone else's intent because the line-based diff tool that worked fine for code can't tell you that an instruction crossed a section boundary. The bug isn't in any single edit. The bug is that position is now policy, and you have no policy on position.

The Customer Record Hiding in Your Few-Shot Prompt Template

April 28, 2026 · 11 min read

Tian Pan

Software Engineer

The privacy auditor's question came two days before the SOC 2 renewal: "Why is the email field in your onboarding prompt's example a real customer address?" The product team rebuilt the chain in their heads. A year earlier, when they shipped the AI summarizer, someone needed a "see how this works" example for the few-shot template. They picked a representative customer record from staging, scrubbed the obvious fields — name, account ID, phone — and committed the file. The customer churned six months later. Their record was deleted from the database per the data retention policy. Their record was not deleted from the prompt template, which had been shipped to every tenant in production.

The team had assumed, like most teams, that the privacy boundary was the database. The prompt template was code. Code goes through review. Review doesn't flag PII because reviewers aren't looking for it in YAML strings labeled example_input:. The DLP scanner that catches PII in Slack messages and email attachments doesn't scan committed code, and even if it did, it wouldn't recognize a partially-scrubbed customer record as personal data because the fields it knew to look for had been removed. Everything that remained — the company size, the industry, the rare job title, the specific city — was data the scanner had no rule for.

Tool Schemas Are Prompts, Not API Contracts

April 28, 2026 · 11 min read

Tian Pan

Software Engineer

The most expensive line in your agent codebase is the one that auto-generates tool schemas from your existing OpenAPI spec. It looks like a clean engineering choice — single source of truth, no duplication, auto-sync on every API change. It is also why your agent picks searchUsersV2 when it should have picked searchUsersV3, fills limit=20 because your spec's example said so, and silently drops the tenant_id because it was buried in the seventh parameter slot.

Nothing about this shows up in unit tests. The schema validates. The endpoint exists. The agent's call is well-formed JSON. And yet the model uses the tool wrong, every time, in ways your QA pipeline never sees because it tests the API, not the agent's reading of the API.

The bug is conceptual. OpenAPI was designed to describe APIs to humans who write SDK code; tool schemas are read by an LLM at every single call as a piece of the prompt. Treating them as the same artifact is the same category mistake as auto-generating user-facing copy from your database column names.

The Two-Language Problem: Why Type Safety Stops at the Prompt Boundary

April 28, 2026 · 10 min read

Tian Pan

Software Engineer

Your codebase has two languages, and only one of them has a compiler. There is the strictly-typed code your team writes — TypeScript with strict: true, Python with mypy in CI, Go with its enforced returns — and then there is the prompt: a templated string that gets concatenated, sent to a remote model, and returns another string the runtime hopes to parse. Between those two regions, the type system goes blind. The IDE highlights nothing. The compiler complains about nothing. And the team that ships a feature on the strength of "but it typechecks" has put the load-bearing contract somewhere the contract checker cannot see.

The seam is well-disguised. From the outside it looks like a function call: generate(input: UserQuery): Promise<AgentResponse>. The signature is honest about what flows in and what flows out. The dishonest part is what happens between the call site and the response: the input is interpolated into a prompt template that references field names by string, the model is asked to produce a JSON object that conforms to a schema described in prose inside that prompt, the response comes back as a string that gets handed to a parser, and the parser returns something the type system can finally see again. Every typed expression on either side is asserting things about a region in the middle that has no static guarantees at all.

This isn't a theoretical concern. Teams report a baseline 10–20% schema-failure rate on naive structured outputs in production, and the failures concentrate on exactly the inputs where you can least afford a silent drop — long contexts, deep tool chains, edge-case users. The type system gave a false sense of correctness right up to the moment the malformed JSON came back and the runtime swallowed it.

The Two-PM Problem: When Prompt Ownership and Product Ownership Drift Apart

April 28, 2026 · 11 min read

Tian Pan

Software Engineer

A support ticket lands on Tuesday morning: a customer was given a confidently wrong answer about their refund window. Engineering pulls the trace and finds the model picked the wrong intent. The product PM looks at the dashboard and sees the new "express refund" affordance — shipped last sprint — surfaced an intent the prompt was never tuned to handle. The platform PM points at the eval suite, which is green. Both are technically right. The customer is still wrong.

This is the two-PM problem, and most AI teams have it without naming it. The product PM owns the user-facing surface — intents, success metrics, the support escalation path. The platform or ML PM owns the prompt, the model choice, the eval suite, and the cost ceiling. The roadmaps are coordinated at the quarterly-planning level and drift at the weekly-shipping level, because the two PMs are optimizing for different metrics on different dashboards with different change-control processes.

The interesting failure mode isn't that the two PMs disagree. It's that they ship correctly relative to their own scope and still produce a regression nobody owns.

The 30-Day Prompt Apprenticeship: Onboarding Engineers When 'Read the Code' Doesn't Work

April 27, 2026 · 12 min read

Tian Pan

Software Engineer

A senior engineer joins your team on Monday. By Friday they've shipped a TypeScript refactor that touches eleven files and passes review with two nits. The same engineer, two weeks later, opens the system prompt for your routing agent — 240 lines of instructions, three numbered example blocks, four "you must never" clauses, and a paragraph at the bottom that reads like an apology — and stares at it for an hour. They cannot tell you what would happen if you deleted lines 87–94. Neither can the engineer who wrote them six months ago.

This is the gap nobody puts on the onboarding doc. A prompt-heavy codebase looks like a codebase, lives in the same repo, runs through the same CI, and gets reviewed in the same PRs. But its semantics live somewhere else: in the observed behavior of a model that nobody on the team built, against a distribution of inputs nobody fully enumerated, with failure modes that surface as PRs to add a sentence rather than as bug reports. The traditional tools of code reading — types, signatures, tests, naming — do almost no work. A new hire who tries to "read the code" learns nothing about why each line is there, and a team that hands them a Notion doc and a Slack channel is implicitly outsourcing onboarding to the prompt's original author.

About Tian Pan