Spec, prompt, and eval are three translations of the same intent into different media. Without enforced consistency, they drift, and a year later no one can tell whether a regression is a prompt bug, a spec gap, or an eval that was wrong from day one.
Buffer-and-hand-off integrations turn streaming tools into context-blowing, latency-shredding silent failures. A four-piece planner contract — streaming flag, running synopsis, consume_until, budget abort — keeps agents reasoning over trajectories instead of values.
When the schema is unchanged but the tool's behavior shifts, your agent quietly regresses. A field guide to detecting and containing tool behavior drift.
Each tool's ACL was fine. The composition leaked PII. Agent permission surfaces are the closure of the tool catalog under composition, and per-tool review audits a vocabulary while the planner builds sentences.
Agent latency budgets built from per-tool medians silently break in production: after seven steps, the tail dominates and per-tool dashboards stay green while users wait. A walkthrough of why p99 reshapes agent architecture, what the discipline looks like, and which forty-year-old distributed-systems techniques apply directly.
Tool calls returning success while the underlying operation never happened — the structural failure mode behind 'the model lied to the user' incidents and the verification layer high-stakes agents need.
Refusal rate looks like a safety control, but treating it as one ships polite, audit-clean models that users abandon. Why over-refusal hides in production, what hedges and bare declines do to retention, and how to grade refusals on a two-axis rubric instead of a single binary.
The 200–300ms turn-transition window forces voice agents into real-time architecture: streaming pipelines, semantic endpointing, speculative generation, and barge-in handling.
Transparent tool retries silently burn wall-clock budget while the planner reasons against a stale deadline, producing a bimodal SLA failure that no single layer's metric catches.
A third population sits between cooperative users and malicious attackers: the curious customer who treats your AI agent as a puzzle. Here is how to build evals, refusals, and fallbacks that hold up when they are the moment your brand is being judged.
Provisioned throughput sized on user QPS quietly under-provisions agent products by the loop fan-out factor. Plan with model-call rate, loop depth, and burst tails instead.
Two agent runs of the same prompt almost never produce the same output. Diffing them at the text level hides the actual cause. Here is what structural diffing requires and how to build for it.