Most production agents have a degraded-mode spec — it just lives in scattered catch blocks, untested, and the customer writes the public version of it on the next bad day.
Agent runtimes hide state in places your DR runbook never named. The fix: name the state surface, generate idempotency keys at task scope, checkpoint before every tool call, and default to fail-safe abort over fail-forward replay.
When an agent issues a wrong refund, your CRO will ask what produced it — and the answer requires a captured-at-write-time tuple of prompt, model id, decode config, tool results, and conversation history. Here is the discipline that makes 'we can reconstruct it' a true statement.
AI threat models usually stop at the model and treat output as safe content. Indirect prompt injection turns rendered markdown, structured output, generated code, and tool-call arguments into attack payloads — and the boundary worth defending is downstream of the model.
A permission prompt is a security control with a measurable half-life. Track per-user approval rate, tier friction by blast radius, and stop letting a 100% click-through rate carry your safety story.
Every agent release ships a bundle of system-prompt, model, tool, rubric, and retriever changes — and a file-diff changelog tells integrators nothing about the behavior shifts they will actually parse, budget against, or get paged on.
Request-level sampling policies break for agent traces. A per-tier policy — always-trace failures, head-sample successes, tail-sample by cost percentile — turns the trace store from a budget hole into an incident-response tool.
A four-line bug fix gets three rounds of code review. A forty-line system-prompt edit ships with a single LGTM. A field guide to closing the discipline gap on AI artifacts before it ships your next regression.
The wow demo was one realization out of thousands the model would generate against the same input. The rollout craters not because polish is missing — because nobody measured variance. Here's the n-of-k sampling, worst-case input library, and distribution-shift checklist that close the gap.
AI features compose through artifacts nobody catalogs — prompt fragments, eval seeds, judge rubrics. When a shared edit lands, three other teams regress and nobody can attribute it. Here's how to draw the graph.
When the prompt changes and the help-center article doesn't, your AI feature's trust contract breaks silently — and the prompt repo can predict the gap.
User-percentage feature flags spread the hard 5% of queries evenly across cohorts, hiding tail regressions until 100%. Ramp by difficulty, token length, query slice, or tool-call depth instead — that is the axis where AI blast radius actually lives.