Once a customer's data is in the loss function, deletion stops being a row operation and becomes a rebuild. Lineage chains, four policy choices, and the procurement clause that's now a stop-ship issue.
Production sampling configs accumulate undocumented temperature, top-p, and penalty values whose original justifications evaporate while their effects compound. A discipline for catching them.
Most agent frameworks silently clip tool outputs past a hidden byte or token cap. The model reasons over a fragment without knowing it is one, and the bug shows up months later as a customer escalation.
When one model writes the spec, the code, and the tests, 'all tests pass' stops being evidence the feature works — it only proves the model is internally consistent.
Spec, prompt, and eval are three translations of the same intent into different media. Without enforced consistency, they drift, and a year later no one can tell whether a regression is a prompt bug, a spec gap, or an eval that was wrong from day one.
Buffer-and-hand-off integrations turn streaming tools into context-blowing, latency-shredding silent failures. A four-piece planner contract — streaming flag, running synopsis, consume_until, budget abort — keeps agents reasoning over trajectories instead of values.
When the schema is unchanged but the tool's behavior shifts, your agent quietly regresses. A field guide to detecting and containing tool behavior drift.
Each tool's ACL was fine. The composition leaked PII. Agent permission surfaces are the closure of the tool catalog under composition, and per-tool review audits a vocabulary while the planner builds sentences.
Agent latency budgets built from per-tool medians silently break in production: after seven steps, the tail dominates and per-tool dashboards stay green while users wait. A walkthrough of why p99 reshapes agent architecture, what the discipline looks like, and which forty-year-old distributed-systems techniques apply directly.
Tool calls that return success when the underlying operation never happened: the structural failure mode behind 'the model lied to the user' incidents, and the verification layer high-stakes agents need.
Refusal rate looks like a safety control, but treating it as one ships polite, audit-clean models that users abandon. Why over-refusal hides in production, what hedges and bare declines do to retention, and how to grade refusals on a two-axis rubric instead of a single binary.
The 200–300 ms turn-transition window forces voice agents into real-time architecture: streaming pipelines, semantic endpointing, speculative generation, and barge-in handling.