Production RAG pipelines silently assume snapshot isolation between retrieval and generation. They never enforce it — and the bug shows up as deleted-chunk citations, edited-chunk inversions, and stale-permission leaks.
Your tool catalog plus a planner forms a reachable graph of plans your evals have probably never exercised. Borrow reachability analysis from compilers to find the branches your incident channel will discover first.
Reasoning tokens look like output tokens on the bill but balloon 3–10x and have no natural ceiling. Treat thinking effort as a tunable resource — measured in yield, governed by budgets, routed by difficulty, and surfaced as its own dashboard line item before finance asks about it first.
Most agent frameworks default to exponential-backoff retry on tool errors — a pattern borrowed from stateless HTTP that's actively wrong inside a stateful planning loop. The right default is replan.
Agent-authored PRs concentrate bugs in different places than human PRs, and the reviewer instincts trained on years of human code quietly fail on them. A walkthrough of the new bug profile, why fluent diffs are dangerous, and the three artifacts every reviewer now has to read together.
AI-generated preference labels are 100x cheaper than human ones — and they teach your model to prefer the judge's aesthetic, not your users'.
Cost-aware LLM routing makes the cheap model the actual product surface for most users. If your eval discipline still points at the flagship, you are flying blind on 70% of traffic — here is the router-as-product framing that fixes it.
Agent harnesses that propagate temperature down the call tree turn the planner's creativity knob into the verifier's bug. Per-role sampling profiles, default-deny inheritance, and the disagreement-rate eval that catches the leak.
Frameworks ship session-ids; users live in tasks. The gap between them is where half of agent UX disappears, and the fix is a task-id, not longer sessions.
Production-trace eval pipelines accumulate PII no one promised users would be processed this way. The fix is sanitization at the write boundary, schema-typed spans, and tag-based retention — not regex scrubbers at read time.
MCP made it trivially cheap to wire a developer laptop into prod-adjacent systems. The artifact is a loopback socket using credentials the engineer already has — invisible to procurement, CASB, and SSO logs. The discovery and governance discipline that has to land before the first breach disclosure.
Centralizing a safety preamble looks like a clean DRY win until the first edit ships and thirty consumer teams' evals tank. Here's why shared prompts behave like distributed systems, and the governance scaffolding that survives the first flag day.