A four-word edit to a system prompt can break parsers, judges, and chained agents that pinned the old wording. Prompts are APIs with silent consumers — and the discipline that keeps them stable looks a lot like REST endpoint deprecation.
Behavioral evals catch what your model says; they cannot catch what your prompt is. A prompt linter (fast, deterministic, structural) closes that gap: the structural defect that ships eval-green and surfaces as an 11 p.m. production incident.
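A minimal sketch of what such a structural check might look like. The rule set is entirely hypothetical: the required section names, the `{{placeholder}}` template syntax, and the word-count limit are assumptions chosen for illustration, not rules from any particular linter.

```python
import re

# Hypothetical rules: section names and the budget below are illustrative only.
REQUIRED_SECTIONS = ("## Role", "## Output format")
MAX_WORDS = 1500  # crude word-count proxy for a token budget


def lint_prompt(prompt: str) -> list[str]:
    """Return structural findings; an empty list means the prompt passed."""
    findings = []
    # Rule 1: every required section heading must be present.
    for section in REQUIRED_SECTIONS:
        if section not in prompt:
            findings.append(f"missing section: {section}")
    # Rule 2: no unfilled template placeholders such as {{customer_name}}.
    for match in re.finditer(r"\{\{\s*(\w+)\s*\}\}", prompt):
        findings.append(f"unfilled placeholder: {match.group(1)}")
    # Rule 3: the prompt must stay under the (assumed) size budget.
    if len(prompt.split()) > MAX_WORDS:
        findings.append("prompt exceeds word budget")
    return findings
```

Because every rule is a pure function of the prompt text, a check like this can run in CI on each prompt change without calling a model.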
When a system prompt grows past 2K tokens, position bias means that moving an instruction can change behavior as much as rewriting it, and line-based diffs hide the move. How three teams silently overwrite each other's intent, and the section-ownership and eval discipline that catches it.
The 'representative customer' you pasted into your few-shot prompt six months ago is still in production — re-identifiable, re-shipped, and invisible to DLP.
Layered safety pipelines silently triple p95 latency and cost on the long tail. Treat guardrails as a budgeted resource with tiered classifiers, parallel checks, and an honest latency contract.
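One way to picture parallel checks under a latency contract is the sketch below. It assumes each guardrail is an async callable that takes the text and returns a verdict string; the name `run_guardrails` and the `"timeout"` verdict are hypothetical. Whether a timed-out check should fail open or fail closed is a policy decision this sketch leaves to the caller.

```python
import asyncio


async def run_guardrails(checks, text, budget_s):
    """Run all checks concurrently; a check that misses the budget reports 'timeout'.

    `checks` maps a check name to an async callable taking the text (assumed shape).
    """
    async def bounded(name, check):
        try:
            verdict = await asyncio.wait_for(check(text), timeout=budget_s)
            return name, verdict
        except asyncio.TimeoutError:
            # The latency contract was missed; the caller decides what that means.
            return name, "timeout"

    # gather() runs the checks concurrently, so total latency is roughly the
    # slowest check or the budget, whichever is smaller, not the sum of all checks.
    results = await asyncio.gather(*(bounded(n, c) for n, c in checks.items()))
    return dict(results)
```

Run in series, three checks cost the sum of their latencies; run this way, the worst case is bounded by `budget_s`.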
Most RAG pipelines run two models in series — retriever and reranker — but eval suites only grade the generator's output. When the reranker drifts, the dashboard shows answer quality dropping with no causal arrow. Here's how to build a reranker eval that catches the silent regressions.
Sunsetting an AI assistant breaks differently than deprecating an API — the playbook needs cohort cuts, a maintenance-cost ledger, and comms calibrated to relationships, not contracts.
Classical retry policies assume bounded cost and independent retries. LLM workloads break both — and the bill compounds on the worst inputs. A field guide to rebuilding retry budgets for token economics.
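As a rough illustration of a retry budget denominated in tokens rather than attempts, consider the sketch below. The class and function names are hypothetical, and it assumes the caller can estimate an attempt's token cost up front and that each call reports the tokens it actually used.

```python
from dataclasses import dataclass


@dataclass
class TokenRetryBudget:
    max_tokens: int  # total tokens allowed across all attempts for one request
    spent: int = 0

    def can_attempt(self, estimated_tokens: int) -> bool:
        # Allow another attempt only if its estimated cost still fits the budget.
        return self.spent + estimated_tokens <= self.max_tokens

    def record(self, used_tokens: int) -> None:
        self.spent += used_tokens


def call_with_budget(call, estimated_tokens, budget):
    """Retry `call` until it succeeds or the token budget is exhausted.

    `call()` is assumed to return a tuple (ok, tokens_used).
    Returns (succeeded, attempts_made).
    """
    attempts = 0
    while budget.can_attempt(estimated_tokens):
        ok, used = call()
        attempts += 1
        budget.record(used)
        if ok:
            return True, attempts
    return False, attempts
```

With a fixed attempt count, a long input gets the same number of retries as a short one and costs far more; capping tokens makes the expensive inputs run out of retries first.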
Adding a retrieval step to fix every model failure looks like progress until your system is a pile of retrievers gluing together a prompt that still has the original problem. A diagnostic framework, ablation discipline, and complexity budget for RAG.
Human-in-the-loop AI fails quietly: the review queue grows, latency creeps, and the safety story breaks one item at a time. A field guide to SLOs, capacity tripwires, and tiered review for AI features.
An LLM call's behavior depends on the wall clock — batch size, cache state, and routing tier shift with provider load. Evals that run at 2 AM calibrate on conditions production never sees. Five practices that close the gap between off-peak eval and peak-hour reality.
An AI feature that succeeds 70% of the time can be worse than one that fails 70% of the time — concentrated, unpredictable failures collapse user trust faster than consistent unreliability. Why aggregate accuracy lies, why users cannot self-calibrate, and how to design for the uncanny zone.