Prompt v37 was tuned against Claude 4.6. The platform team rolled the alias to 4.7. Your incident timeline says no deploys, because no one tracks the prompt and the model version as a pair.
A 34% spend reduction from a provider 'auto' alias looks like a win until customer satisfaction on your highest-volume surface slides for two quarters — and the eval that signed off on the rollout never saw the prompts the router redirected to Haiku.
When a coding agent writes 'Closes #1247' from a stale comment, GitHub treats it as a load-bearing instruction. How agent-authored PR metadata destroys human review work, and the gates that prevent it.
A silent regression in ingestion-time dedup can flood your RAG context with near-identical chunks while every retrieval metric still looks healthy. Here is why relevance scores miss it, how MMR and contract tests close the gap, and why the pipeline's quality is bounded by its weakest invariant.
When an embedding upgrade quietly shifts a reranker's score distribution, a fixed threshold becomes a different filter — and the regression hides in a number nobody changed.
Provider rate-limit headers and the actual throttler often disagree — different windows, different scopes, different units. Why the gap exists and how to design control loops that survive it.
A reserved-throughput contract looks like a fixed-rate hedge until the provider quietly changes what counts as overflow. The defenses are reconciliation, contract language that pins metering, and a per-tier dashboard that catches drift in hours, not quarters.
A retry loop that quietly delivered a 99.9% success-rate SLO and a 3x bill at the same time — why post-retry availability is the wrong number to put in front of leadership, what to measure instead, and the cost-quality knob hiding inside your reliability layer.
When a user removes an agent tool, a one-hour registry cache and a retry policy that treats denials as transient turn a routine revocation into a security incident the user discovers in their own audit log.
Aggregate cache hit rate hides the conversations whose affinity hint got dropped under load. Why per-pod KV state and best-effort routing collapse long agent sessions to cold-on-first-token, and what to instrument.
An inference-time personalization layer can lift aggregate satisfaction and still break the API contract by injecting hidden state that customers cannot read, reset, or reproduce in their own eval suites.
Streaming LLM responses look like a clean pipe from model to user until a 35-second tool call lets a reverse proxy drop the connection, your client retries the entire prompt against a stateless API, and the user pays for the same response twice.