Mature production prompts grow a list of don'ts that quietly works against itself — both leaking attack surface and increasing the rate of the very outputs it forbids.
Weekly rolling cost averages hide a cohort-mix problem every AI feature has — and the off-hours users paying 3–5x cost per active user are a structural shape, not an edge case.
Aggregated AI cost dashboards hide a power-law distribution where the top 1% of customers drive 30–50% of token spend. Build per-customer attribution, slope-based anomaly detection, and reservation-based budget enforcement before one runaway agent loop becomes a margin event.
Multi-tenant AI teams accidentally become compiler engineers the moment per-tenant prompt variance lands — and the operational bill arrives at month six. A look at why prompts at scale are build targets, not config files.
Behavior change in AI products no longer routes through PRs. The dashboards leadership trusts miss the dominant source of product change, and the misdiagnosis is reshaping how AI teams get measured.
Production prompt management treats prompts as singular winners. Treat them as a portfolio instead: weighted variants, segment-aware allocation, and weekly rebalancing.
git revert restores a deterministic past state. Prompt rollback has to reconcile with caches, conversation histories, eval baselines, and A/B cohorts the bad prompt already shaped — most teams find that out the hard way.
Quantizing an LLM from fp16 to int4 ships a different model wearing the same weights. The eval suite calibrated to the original silently grades the new one wrong — here is the capability slippage to budget for before the customers notice it first.
Per-token pricing reports the cost of the median request, not the all-in cost of the distribution your product actually serves. Routing the hard prompts to a reasoning model beats workhorse-by-default once retries, escalations, and trust damage land on the P&L.
Rerunning a failed AI prompt feels like a variance probe but acts like survivorship bias — masking deterministic bugs while burning unbudgeted tokens. Trace-first debugging and N-of-K discipline replace it.
Self-Refine, Chain-of-Verification, and reflection prompts promise big quality lifts on benchmarks — but in production they triple costs, balloon latency, and deliver a fraction of the advertised gain. Here is how to price the self-critique tax before shipping it.
Multi-turn AI features get billed by per-call dashboards but pay by per-conversation curves. The tail is super-linear, and the bill comes from there.