The degraded path you built to survive an outage almost never runs, so it rots quietly and makes its debut during the exact incident it was designed for.
A 94% pass rate measures how good you are at imagining success, not whether the agent works. How golden-path bias creeps into agent eval suites and how to fix it with failure-mode coverage, harvested production cases, and fault injection.
When an agent's tool call times out, it retries — and without an idempotency key, that retry charges the card again. How to make agent retries harmless instead of dangerous.
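A minimal sketch of the idea, assuming a hypothetical in-memory payment backend (`FakePayments`, `charge`, and `call_with_retry` are illustrative names, not any real payment API). The key is generated once per logical action and reused on every retry, so a charge whose response was lost is not applied a second time.

```python
import uuid


class FakePayments:
    """Hypothetical backend that deduplicates on an idempotency key."""

    def __init__(self):
        self.results = {}            # idempotency key -> stored charge result
        self.charges_made = 0        # how many times the card was actually charged
        self.fail_next_reply = True  # simulate one lost response (client-side timeout)

    def charge(self, amount_cents, idempotency_key):
        # Only charge if this key has never been seen before.
        if idempotency_key not in self.results:
            self.charges_made += 1
            self.results[idempotency_key] = {
                "charge_id": self.charges_made,
                "amount_cents": amount_cents,
            }
        if self.fail_next_reply:
            # The charge went through, but the caller never hears back.
            self.fail_next_reply = False
            raise TimeoutError("response lost")
        return self.results[idempotency_key]


def call_with_retry(backend, amount_cents, max_attempts=3):
    # One key per logical action, created before the first attempt
    # and reused unchanged on every retry.
    key = str(uuid.uuid4())
    for _ in range(max_attempts):
        try:
            return backend.charge(amount_cents, idempotency_key=key)
        except TimeoutError:
            continue
    raise RuntimeError("charge did not complete")


backend = FakePayments()
result = call_with_retry(backend, 1999)
```

The design choice that matters here is where the key is created: outside the retry loop. A key generated per attempt would make every retry look like a new charge.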
An agent deleted the wrong record and the postmortem has a hole where the cause should be. Reproducibility for AI agents is not inherited from your stack — it is something you capture, version, and replay on purpose.
An internal API stays internal only as long as you can name every caller. Wire an LLM agent to it and the contract you never wrote down becomes a liability — here's the public-API discipline you suddenly owe it.
Grade model outputs with an LLM judge and your metric now depends on a second model with behavior of its own. The day that judge changes, every historical score becomes a foreign currency, and most teams never notice.
A closed loop where one model reviews another and feeds the next eval has no ground truth anywhere — errors get laundered into high scores. Here is where to put the human back.
A model deprecation notice reads like a one-line config change, but the prompt you tuned for six months was fitted to one model's quirks and does not survive the swap. Treat model end-of-life as a recurring migration project with a re-runnable eval set.
A junior engineer accumulates context every week; an agent accumulates nothing. Why the new-hire metaphor misallocates your attention, and where to put the learning instead.
Agent products gate dangerous actions behind approval dialogs and call it oversight, but by the fortieth prompt the human clicks approve on reflex. Why prompt volume is the real safety bug, and how to fix it.
A prompt cache key is a correctness boundary, not a billing knob. Draw it to maximize hit rate and you invite cross-tenant context bleed and stale personalization.
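A rough sketch of what drawing the key for correctness could look like, assuming the answer depends on tenant, user, and a personalization version. The field names (`tenant_id`, `user_id`, `profile_version`) and both helper functions are hypothetical, chosen only for illustration.

```python
import hashlib
import json


def cache_key(tenant_id, user_id, profile_version, prompt):
    # Everything that can change the correct answer goes into the key.
    payload = json.dumps(
        {
            "tenant": tenant_id,
            "user": user_id,
            "profile_version": profile_version,
            "prompt": prompt,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def hit_rate_key(prompt):
    # Anti-pattern: keyed on prompt text alone, so every tenant and every
    # profile version shares one cache entry.
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


prompt = "Summarize my open invoices"
a = cache_key("tenant-a", "u1", 3, prompt)
b = cache_key("tenant-b", "u1", 3, prompt)
a_newer_profile = cache_key("tenant-a", "u1", 4, prompt)
```

With this key, two tenants sending the same prompt never share an entry, and bumping `profile_version` retires stale personalized responses. The cost is a lower hit rate than `hit_rate_key` would give.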
Treating prompt injection as a content-filtering problem is a losing arms race. The real vulnerability is a confused deputy: an agent acting on untrusted instructions with borrowed authority. Scope the capability instead.