A synthetic eval generated by a sibling of the model under test inflates scores while users drift. Why generator-discriminator collapse hides quality regressions, and the wild-eval architecture that catches them.
Provider APIs expose per-minute rate-limit headers but never the monthly cap your fleet actually has to plan against — leaving consumers to build the meter, the tier abstraction, and the starvation rule before the 429s arrive on day 26.
Agents absorb breaking tool changes silently because their tolerance for shape variation hides the very signal that strict clients used to surface. Here are the patterns that put the brittleness back where you can see it.
Trace replay validates an LLM upgrade against a context that no longer exists. Here is why the green number lies, and which validation primitive belongs at which point on the cost-versus-signal curve.
Your distributed trace dies at the inference API edge. Here is how to instrument streaming chunks, request IDs, and provider side-channels to reclaim the most expensive minute of your pipeline.
Your RAG ingest job runs while a writer is mid-edit, and the index captures a state of the wiki that never existed. Why polling-based pipelines do dirty reads at scale, and the CDC, version-pinning, and write-quiescence patterns that close the gap.
Sparse dev fixtures hide every behavior production state actually exercises. Until your agent runs against production-shaped cardinality and ambiguity, your green tests certify the wrong world.
Cron-fired AI agents inherit four clocks — scheduler, worker, model, and tool — and most production systems silently trust the wrong one. A walk through the failure modes and the time-handoff contract that prevents them.
On turn twelve, your conversation's TTFT spikes 4x and your trace explains nothing. The KV cache you depended on was evicted by another tenant's request, and you have no telemetry that names the cause.
Teaching your agent to say 'I don't know' looks like a safety win until the human queue absorbs the bill. The end-to-end math behind LLM abstention as a cost-shifting move.
LLMs are token predictors, not string copiers. When two similar account numbers share a context, the agent can transpose digits, route a refund to the wrong customer, and leave a clean trace behind. The fix is to take identifier fidelity out of the model's job description.
A 400 is not a transient error. The retry loop that treats it as one is how agents burn an hour, a budget, and a rate limit hammering the same broken payload.