Enterprise CISOs now run AI-specific security reviews with 80+ questions on training data, prompt logs, tenant isolation, and refusal behavior. A field guide to what they actually want.
Classical A/B math assumes deterministic per-user behavior. LLM features break that assumption twice over, and the standard sample-size template ships wrong calls in both directions — here are the four shifts that fix it.
Async agents that finish 90 seconds late often deliver answers to questions the user no longer has. A delivery-time relevance gate, not faster models, is the fix.
When an agent goes off the rails, the forensic record most teams have is useless. Here are the fields a flight recorder must capture before the first incident — and the storage, sampling, and privacy disciplines that have to land alongside it.
Long-running agents drift from the world the moment they stop watching. Treat memory like a database replica: watermarks, change feeds, and lazy revalidation.
Classic SRE practice gives you uptime and latency targets that map cleanly to user happiness. Agentic features break that mapping. Here's how to write an error budget when 'success' arrives hours after the request — and why the team that copies the latency-SLO playbook will meet every quarterly target while users churn.
Classical APM treats an agent step as one fat span and leaves on-call engineers guessing. Decompose it into seven phases, separate prefill from decode, and chase the critical path instead of total span time.
Production APIs are now serving two species of caller — humans and agents — with different traffic physics, failure modes, and threat profiles. Treating them as one is the source of every flaky-endpoint investigation in 2026.
Multi-tool agent undo is a saga-pattern problem in disguise. Pre-computed inverses, residue UX, and cascade caps decide whether reversal succeeds or silently fails 40% of the time.
Agent workflows can burn 50–200x the energy of a single chat completion, and procurement teams have started asking. A pragmatic guide to per-task carbon attribution, the routing decisions a carbon budget forces, and why the team that instruments first wins the room.
Most cyber and E&O policies were written for breaches and bugs, not agents acting under your credentials. The coverage gap shows up at claim time, when nobody planned for it.
LeetCode screens and system-design rounds were calibrated on engineers writing deterministic code. AI engineering needs a different signal: the round that detects it is eval design, not implementation.