Agents replan around 503s and retry faster than any human ever will, turning a small upstream wobble into a correlated outage. A practitioner's view of the load-shedding primitives platforms need next, and the disciplines agents have to adopt to stop being the storm.
Long-context vs RAG is not a product-wide architecture choice in 2026 — it is a per-feature decision driven by four axes (freshness, attribution, tail-risk, cost). A breakdown of the discipline that keeps your AI surfaces on the right side of math that keeps moving.
Provider sunset emails arrive on a 60-day clock. The registry, calendar, n+1 evals, and contract terms that turn each migration into mechanical work — built before the email lands, not after.
Benchmark-trained routers ship a quiet quality regression: the cheap path looks fine in aggregate, then fails on a small, loud cluster of users your eval suite never sampled. Why a router is a control system, not a classifier — and what closing the loop actually requires.
Most teams shipped multimodal as a thin extension of their text product and inherited an eval discipline that systematically can't see image or audio regressions. The fix is per-modality rubrics, modality-specific gold sets, and a release gate that refuses to aggregate quality across input types.
AI features quietly broke the multi-tenant isolation playbook at four new layers — prompt cache, fine-tune, embedding index, KV-cache reuse. What changed and the discipline production teams need to put back.
A 200-line system prompt has no signature, no tests, and a diff history that says nothing about why each line is there. A 30-day curriculum — failure gallery, ablation, PR reconstruction, gated edit — that teaches new engineers to read a prompt by interrogating its behavior.
Production prompts decay silently as models, tokenizers, and product rules shift underneath them. Treat every prompt as a depreciating asset with an owner, a revalidate-by date, and an eval delta — or accept the quality regression nobody on the team intentionally shipped.
An overnight two-point eval drop and a prompt PR with seventeen edits is a binary search problem, not a guessing game. Here is how to bisect a prompt the way kernel maintainers bisect a kernel — and the commit-granularity discipline it forces on the team.
Most data classification schemes never modeled the prompt layer as a vendor egress channel. Adding a prompt-eligibility tier — and the template audit that fills it — closes a compliance gap your DLP scheme silently denies.
Prompt extraction is the quiet attack on LLM products. Treat the system prompt as public, move secrets out of context, and build an eval for it.
Pushing prompts through CDN-style rollout systems creates silent geography-split A/B tests when one region drifts ahead of another. Here is the rollout discipline, observability dimension, and rollback model that keep prompt versions globally coherent.