Classical APM treats an agent step as one fat span and leaves on-call engineers guessing. Decompose it into seven phases, separate prefill from decode, and chase the critical path instead of total span time.
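Chasing the critical path instead of total span time can be sketched in a few lines. This is a minimal illustration, not the article's implementation: the `Span` type, the phase names, and the dependency model are all hypothetical, assuming each phase records its duration and the phases it waits on.

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    duration_ms: float
    deps: list = field(default_factory=list)  # spans that must finish first

def critical_path(span, memo=None):
    """Return (total_ms, path) for the longest dependency chain ending at `span`."""
    memo = {} if memo is None else memo
    if id(span) in memo:
        return memo[id(span)]
    if not span.deps:
        result = (span.duration_ms, [span.name])
    else:
        best_ms, best_path = max(critical_path(d, memo) for d in span.deps)
        result = (best_ms + span.duration_ms, best_path + [span.name])
    memo[id(span)] = result
    return result

# Hypothetical phases: decode dominates, so the tool call never appears
# on the critical path even though it adds to total span time.
prefill = Span("prefill", 120)
decode = Span("decode", 800, deps=[prefill])
tool = Span("tool_call", 300, deps=[prefill])
respond = Span("respond", 50, deps=[decode, tool])
```

`critical_path(respond)` reports 970 ms through prefill → decode → respond; the 300 ms tool call is latency you can ignore while decode is the bottleneck.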
Production APIs are now serving two species of caller — humans and agents — with different traffic physics, failure modes, and threat profiles. Treating them as one is the source of every flaky-endpoint investigation in 2026.
Multi-tool agent undo is a saga-pattern problem in disguise. Pre-computed inverses, residue UX, and cascade caps decide whether reversal succeeds or silently fails 40% of the time.
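The saga framing reduces to a log of steps with pre-computed compensating actions, undone in reverse, with a cap on how far a reversal is allowed to cascade. A minimal sketch, assuming each tool call ships with its own inverse (names and structure are illustrative, not the article's API):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    do: Callable[[], None]
    inverse: Callable[[], None]  # pre-computed compensating action

class UndoLog:
    def __init__(self, cascade_cap: int = 10):
        self.cascade_cap = cascade_cap
        self._log: list[Step] = []

    def run(self, step: Step) -> None:
        step.do()
        self._log.append(step)

    def undo_all(self) -> list[str]:
        """Compensate in reverse order; anything past the cap or that fails
        becomes residue surfaced to the user, never a silent failure."""
        residue = []
        for i, step in enumerate(reversed(self._log)):
            if i >= self.cascade_cap:
                residue.append(step.name)
                continue
            try:
                step.inverse()
            except Exception:
                residue.append(step.name)
        self._log.clear()
        return residue
```

The key design choice is that `undo_all` returns the residue instead of raising: reversal is best-effort by construction, and the caller owns telling the user what did not come back.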
Agent workflows can burn 50–200x the energy of a single chat completion, and procurement teams have started asking. A pragmatic guide to per-task carbon attribution, the routing decisions a carbon budget forces, and why the team that instruments first wins the room.
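The routing decision a carbon budget forces is easy to state concretely: pick the lowest-carbon model that still clears the task's quality bar and fits what is left of the budget. The model names, gCO2e figures, and quality scores below are made-up placeholders; real attribution depends on grid mix and hardware.

```python
# Hypothetical per-call footprints; real numbers come from measurement.
MODELS = [
    {"name": "small", "gco2e_per_call": 0.2, "quality": 0.6},
    {"name": "large", "gco2e_per_call": 4.0, "quality": 0.9},
]

def route(quality_floor: float, remaining_budget_gco2e: float):
    """Lowest-carbon model meeting the floor within budget, else None."""
    candidates = [
        m for m in MODELS
        if m["quality"] >= quality_floor
        and m["gco2e_per_call"] <= remaining_budget_gco2e
    ]
    if not candidates:
        return None  # degrade: defer the task, queue it, or escalate
    return min(candidates, key=lambda m: m["gco2e_per_call"])
```

Note the `None` branch: a carbon budget only changes behavior if some tasks can actually be refused or deferred when the budget runs out.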
Most cyber and E&O policies were written for breaches and bugs, not agents acting under your credentials. The coverage gap shows up at claim time, when nobody planned for it.
Leetcode screens and system-design rounds were calibrated on engineers writing deterministic code. AI engineering needs a different signal — the round that detects it is eval-design, not implementation.
Sunsetting an AI feature is not like sunsetting an API. The contract is the model's observed behavior, and users build invisible scaffolding on top of it that breaks on cutover.
Quarterly OKRs were calibrated for deterministic software. AI features have launch curves and sustain curves, and the template that treats them as deliverables produces demos that decay between planning cycles.
Every production AI feature has four artifact owners and zero owners for the integrated user experience. That gap is where seam bugs live; here is the org-design fix that closes it.
Most demos work. A meaningful fraction of shipped AI features are still task-shape mismatched: stochastic engines wired into outputs that must be deterministic. A pre-build checklist, plus the roadmap pathway for redirecting ideas that are not model-shaped.
Standard engineering interview loops select for deterministic-systems skills and miss the cluster — eval design, cost intuition, prompt debugging, recovery-mindedness — that predicts who ships LLM products. The fix is loop redesign, not another bolted-on AI round.
Pages that say "the model started lying" do not fit a runbook designed for "restart the service." Here is the five-surface triage tree, freeze button, and replay harness that make AI on-call its own discipline.