Every LLM has a knowledge cutoff and every product silently lies about it. Treat freshness as a designed UX surface — not a footnote — or users will calibrate trust against an answer the model should have refused.
Vector indexes degrade gracefully but knowledge graphs fail discontinuously — running them behind one CDC pipeline ships silently wrong answers on multi-hop queries.
LLM-as-judge has length, position, and format biases that silently turn prompt iteration into a Goodhart machine. Three audits and a versioned judge fix it.
The SRE postmortem template was built for code changes and infrastructure faults. For LLM incidents, the variables that actually moved are missing — prompt revision, model selection slice, judge config, retrieval index state, tool schemas, traffic mix. Here is the template fields and incident-class taxonomy that close the gap.
Agents replan around 503s and retry faster than any human ever will, turning a small upstream wobble into a correlated outage. A practitioner's view of the load-shedding primitives platforms need next, and the disciplines agents have to adopt to stop being the storm.
Long-context vs RAG is not a product-wide architecture choice in 2026 — it is a per-feature decision driven by four axes (freshness, attribution, tail-risk, cost). A breakdown of the discipline that keeps your AI surfaces on the right side of math that keeps moving.
Provider sunset emails arrive on a 60-day clock. The registry, calendar, n+1 evals, and contract terms that turn each migration into mechanical work — built before the email lands, not after.
Benchmark-trained routers ship a quiet quality regression: the cheap path looks fine in aggregate, then fails on a small, loud cluster of users your eval suite never sampled. Why a router is a control system, not a classifier — and what closing the loop actually requires.
Most teams shipped multimodal as a thin extension of their text product and inherited an eval discipline that systematically can't see image or audio regressions. The fix is per-modality rubrics, modality-specific gold sets, and a release gate that refuses to aggregate quality across input types.
AI features quietly broke the multi-tenant isolation playbook at four new layers — prompt cache, fine-tune, embedding index, KV-cache reuse. What changed and the discipline production teams need to put back.
A 200-line system prompt has no signature, no tests, and a diff history that says nothing about why each line is there. A 30-day curriculum — failure gallery, ablation, PR reconstruction, gated edit — that teaches new engineers to read a prompt by interrogating its behavior.
Production prompts decay silently as models, tokenizers, and product rules shift underneath them. Treat every prompt as a depreciating asset with an owner, a revalidate-by date, and an eval delta — or accept the quality regression nobody on the team intentionally shipped.