Skip to main content

5 posts tagged with "prompt-management"

View all tags

The AI Feature Sunset Playbook Nobody Writes

· 13 min read
Tian Pan
Software Engineer

Every AI org has a graveyard. Not of services — those get a runbook, a deprecation banner, a 30-day migration window, and a slot on the platform team's quarterly roadmap. The graveyard is of features: the smart-summary beta that never graduated, the auto-categorizer that two enterprise customers actually built workflows around, the agentic flow that demoed beautifully and shipped behind a flag that nobody flipped off. The endpoint is easy to deprecate. The four other things attached to it — the prompt, the judge, the regression set, and the incident memory — are what actually take a quarter, and nobody on the team has written the playbook because nobody has been promoted for retiring something.

This is the gap. Most of the public discourse on "model deprecation" is about vendor-side retirements: GPT-4o leaves on a date, Assistants API beta sunsets on August 26, DALL-E 3 retires on May 12, and your platform team has a notification period to migrate. That problem has playbooks because vendors publish dates, because the migration is forced, and because the work fits in a sprint. The internal version — when you decide a feature you built didn't graduate, and you have to actually take it out — has none of those forcing functions. The deprecation date is whatever you say it is. The migration path is whatever you build. And the artifacts you have to retire are not a single endpoint but a tangled stack of model-adjacent assets that your monitoring barely knows exist.

Prompt Edits Without PRs: The Velocity Metric Your AI Team Is Failing

· 9 min read
Tian Pan
Software Engineer

A head of engineering opens the velocity dashboard on a Monday morning. PRs merged per week, flat. Story points completed, flat. Lines changed, suspiciously low. The AI team is having a quiet quarter, the chart says. Two floors away, that team has rewritten the system prompt seven times in three weeks, swapped a tool description that doubled tool-call accuracy, added six new few-shot examples, and tuned the rerank instruction until the product feels like a different application. None of that work shows up in the PR graph. None of it is invisible to users.

The asymmetry between what AI teams change and what engineering dashboards measure has become the load-bearing misdiagnosis of 2026. Behavior change in an AI-heavy product is increasingly decoupled from code change, and the metrics that have governed software organizations for fifteen years — PR throughput, commit volume, lines touched — measure code change. A team can be reshaping production response distributions weekly and look idle on every chart leadership trusts.

Runtime Prompt Hot-Reload: Why Your Prompts Shouldn't Be Locked Behind a Build

· 11 min read
Tian Pan
Software Engineer

The first AI incident at most companies follows a script: a prompt-engineer notices the model is misclassifying a category that just started showing up in real traffic, opens a PR with a one-line tweak to the system prompt, and watches the build queue for the next 23 minutes while the model continues to misclassify in production. The fix is a string. The deployment is a binary. The mismatch is not a tooling oversight — it is an architectural decision the team made implicitly the day they put the system prompt in a .py file alongside the application code.

Coupling prompt changes to the deploy pipeline is a constraint you imposed on yourself. There is no law of distributed systems that says the model's behavior contract has to ship inside the same artifact as the orchestration code. The runtime prompt hot-reload pattern severs that coupling by treating prompts the way you already treat feature flags, routing rules, and pricing tables — as configuration pulled from a versioned store at request time, with a short-lived local cache and well-defined safety primitives around it. The payoff is incident-response measured in seconds rather than build minutes, and the cost is an honest accounting of a third deployment surface your release process probably ignores.

The Shared-Prompt Flag Day: When One Edit Becomes Thirty Teams' Regression

· 10 min read
Tian Pan
Software Engineer

The first edit to a shared system prompt feels like good engineering. Three teams all paste the same eighteen-line safety preamble at the top of their agents, someone notices, and an internal platform team says the obvious thing: let's centralize it. A prompts.common.safety_preamble@v1 lands in a registry. Thirty teams adopt it within a quarter because it's the path of least resistance — and because security is happy that one team owns the wording. For two quarters, this looks like a clean DRY win.

Then the security team needs a small wording change. Maybe a new compliance regulation tightens what an assistant is allowed to volunteer about a user's account. Maybe a red-team finding requires a one-sentence addition to the refusal clause. The platform team makes the edit, ships v2, and within a day the support queue fills with messages from consumer teams: our eval dropped, our format broke, our tool-call rate halved, our tone changed, our latency went up because the model started reasoning more. Each team wants the edit reverted. The security team needs it shipped. Nobody can roll forward without a re-eval, and nobody owns the re-eval. Welcome to the shared-prompt flag day.

The Prompt Ownership Problem: What Happens When Every Team Treats Prompts as Configuration

· 8 min read
Tian Pan
Software Engineer

A one-sentence change to a system prompt sat in production for 21 days before anyone noticed it was misclassifying thousands of mortgage documents. The estimated cost: $340,000 in operational inefficiency and SLA breaches. Nobody could say who made the change, when it was made, or why. The prompt lived in an environment variable that three teams had write access to, and no one considered it their responsibility to review.

This is the prompt ownership problem. As LLM-powered features proliferate across organizations, prompts have become the most consequential yet least governed artifacts in the stack. They control model behavior, shape user experience, enforce safety constraints, and define business logic — yet most teams manage them with less rigor than they'd apply to a CSS change.