Your provider hit the tokens-per-second SLA by shrinking chunks and your renderer paid the cost. Why streaming throughput is a co-designed property — and how to write a perceptual SLO the consumer owns.
Tool descriptions are interface contracts the team forgot to version. Here is how they rot, why the rot is silent, and the discipline that keeps your agent honest.
Prompt caching turns volatile tool results into a hidden TTL contract with your model provider. When the cache TTL outlives the data, your agent confidently serves yesterday's truth at the cache hit rate.
Browser-stamped spans in your agent trace are not comparable to gateway-stamped ones. Why client clocks lie, how the SDK propagates the lie, and the patterns that close the gap.
Ingestion-date sharded vector indexes hide a recall failure that aggregate metrics cannot see: the eval set is sampled with the same temporal bias the architecture imposes.
Your embedding pipeline fires on create but not on edit. Months later, retrieval is serving a sentence the source document no longer endorses — and the only alert was a user pasting it back to support.
When your agent's trace says 'verified X' but the verification never ran. Why self-attestation is a substrate problem, not a hallucination problem, and how to design evals and architectures that catch it.
A voice-agent latency SLO built on the LLM's time-to-first-token looks green while users hear a 600 ms gap. The SLO lives at the wrong layer; the user's ear is the boundary that matters.
Private eval sets leak the way everything leaks — through bug tickets, Slack pastes, vendor pipelines, and debug logs. Watermark them so you can tell a real capability gain from a contamination artifact when the next model upgrade lands.
p99 first-token under 800 ms looks like a promise — until a reasoning-tier swap quietly redefines what 'first token' means. The SLO measures the provider boundary; users feel a different one.
Sharing a long-term memory store across A/B variants couples the experiment to itself — variant B reads memories variant A wrote, the measured delta drifts, and the rollout regresses to a different point on the same contaminated surface.
Your contract names one vendor; your incident names a layer below it. A guide to mapping fourth-party risk, sizing real redundancy, and writing the postmortem you cannot fully own.