Adding a secondary LLM provider does not make your system redundant. It makes it twice as expensive to maintain — and brittle if you skip the prompt-engineering work that follows.
Rate limits calibrated for human-paced traffic collapse the first time an agent points a planning loop at the endpoint. Treat the limit as a split contract — throughput budget plus abuse ceiling — keyed on tenant and workload class.
When an agent spins up a recurring task and walks away, the schedule outlives its owner — and the orphan rate compounds quietly until somebody runs the audit.
Streaming LLMs surface partial reasoning that users read as final answers. Why contradictions inside a single response break UX and evals, and four patterns that put a commit boundary back in.
A rising task-completion rate after a model upgrade can mean the agent got better — or that it stopped attempting the hard cases. Decompose your success metric or watch churn pay the bill.
A synthetic eval generated by a sibling of the model under test inflates scores while users drift. Why generator-discriminator collapse hides quality regressions, and the wild-eval architecture that catches them.
Provider APIs expose per-minute rate-limit headers but never the monthly cap your fleet actually has to plan against — leaving consumers to build the meter, the tier abstraction, and the starvation rule before the 429s arrive on day 26.
Agents absorb breaking tool changes silently because their tolerance for shape variation hides the very signal that strict clients used to surface. Here are the patterns that put the brittleness back where you can see it.
Trace replay validates an LLM upgrade against a context that no longer exists. Here is why the green number lies, and which validation primitive belongs at which point on the cost-versus-signal curve.
Your distributed trace dies at the inference API edge. Here is how to instrument streaming chunks, request IDs, and provider side-channels to reclaim the most expensive minute of your pipeline.
Your RAG ingest job runs while a writer is mid-edit, and the index captures a state of the wiki that never existed. Why polling-based pipelines perform dirty reads at scale, and the CDC, version-pinning, and write-quiescence patterns that close the gap.
Sparse dev fixtures hide every behavior production state actually exercises. Until your agent runs against production-shaped cardinality and ambiguity, your green tests certify the wrong world.