Web AI features iterate in minutes; mobile AI features iterate on the platform's review clock. The architectural seams that keep both surfaces honest under one eval set and two release trains.
Swapping foundation models silently invalidates your eval baseline — human-anchored scores, LLM judges, snapshots, and team intuition all need re-anchoring, and the labor bill is usually larger than the per-token savings.
The pager fires because eval-on-traffic dropped four points, not because a service crashed. Runbook patterns, alert design, and rotation discipline for the failure modes that don't trip your existing alerts.
Per-customer system prompt customizations accumulate silently until model-migration day, when a single provider deprecation becomes 47 separate re-validations. The base-plus-overlay architecture and approval discipline that prevent it.
Prompts now move more behavior than code, yet most teams review them with 2008-era tooling. Five pre-commit hooks — formatter, linter, secret-and-PII scanner, smoke eval, and cache-impact estimator — for treating prompt edits with the rigor they actually need.
When PMs, support, and sales start reading the system prompt to learn what the product does, it is both a flattering signal and a structural failure. Here is how to keep the part that works and fix the rest.
Every production system prompt has three authors — engineering, product, and ML — and none of them agree on what a change is. Here's the structural fix.
A four-word planner edit moves the verifier's pass rate three points downstream. The fix is to treat your agent's prompt portfolio like a microservice mesh — graph, edges, contracts, blast-radius PR reviews, per-edge regression evals, edge owners.
Foundation-model providers retire models on a cadence your team did not plan for. Treating each migration as a one-off project pays the same setup tax three or four times a year. Run a quarterly drill instead — DRI, candidate model, regression rerun, runbook refresh — so the next deprecation email lands inside a rhythm the team already has.
A RAG system retrieves docs about a feature you deleted four months ago and confidently walks the customer into a button that doesn't exist. The evals stay green. Here's why retrieval and grounding metrics miss this entire class of failure, and what has to change at the org level to fix it.
Rater throughput caps eval velocity in any AI system that takes human grading seriously. Here is the operational discipline — calibration cycles, queue-aware prioritization, rubric feedback loops — that treats annotation capacity as an SRE problem rather than a recruiting one.
Per-turn eval scores can stay green while users rephrase the same question three times and churn. The failure lives at the session level — and here is how to detect and grade it.