When PMs, support, and sales start reading the system prompt to learn what the product does, it is both a flattering signal and a structural failure. Here is how to keep the part that works and fix the rest.
Every production system prompt has three authors — engineering, product, and ML — and none of them agree on what a change is. Here's the structural fix.
A four-word planner edit moves the verifier's pass rate three points downstream. The fix is to treat your agent's prompt portfolio like a microservice mesh — graph, edges, contracts, blast-radius PR reviews, per-edge regression evals, edge owners.
Foundation-model providers retire models on a cadence your team did not plan for. Treating each migration as a one-off project pays the same setup tax three or four times a year. Run a quarterly drill instead — DRI, candidate model, regression rerun, runbook refresh — so the next deprecation email lands inside a rhythm the team already has.
A RAG system retrieves docs about a feature you deleted four months ago and confidently walks the customer into a button that doesn't exist. The evals stay green. Here's why retrieval and grounding metrics miss this entire class of failure, and what has to change at the org level to fix it.
Rater throughput caps eval velocity in any AI system that takes human grading seriously. Here is the operational discipline — calibration cycles, queue-aware prioritization, rubric feedback loops — that treats annotation capacity as an SRE problem rather than a recruiting one.
Per-turn eval scores can stay green while users rephrase the same question three times and churn. The failure lives at the session level — and here is how to detect and grade it.
The regenerate button feels like a free UX win, but ships a behavioral retrain that teaches users to treat your model as a slot machine. The design space — pagination, branching, guided regen, reroll budgets — and how to instrument reroll rate as the highest-bandwidth quality signal your product has.
Adding citations to a RAG system looks like a one-line system-prompt change. In regulated tenants it quietly multiplies inference cost 25–40%. Here is why the tax is structural, and the architectural moves that buy most of the cost back.
A two-pass agent shape — wasteful first draft, then clean execution from a constrained context — often beats n-of-k self-critique loops on both quality and cost.
Private eval notebooks feel productive but leave the org with no rollup. The fix is a merge-gate contract: shared harness, blessed slices, named owners, and a leaderboard anyone can rerun.
The canonical examples in your system prompt are quietly teaching the model a product that no longer exists. Eval scores stay green because the eval set decayed with them.