Retries Aren't Free: The FinOps Math of LLM Retry Policies
A team I talked to last quarter found a $4,200 line item on their inference invoice that nobody could explain. The dashboard showed normal traffic. The latency graphs were flat. The cause turned out to be a single agent stuck in a polite retry loop for six hours, replaying a 40k-token tool chain with exponential backoff that capped out at thirty seconds and then started over. The retry policy was lifted verbatim from an internal SRE handbook written in 2019 for a JSON-over-HTTP service. It worked perfectly. It worked perfectly for the wrong system.
This is the bill that does not show up in capacity-planning spreadsheets. The retry-policy patterns the industry standardized on for stateless REST APIs assume three things that LLM workloads quietly violate: failures are transient, the cost of one extra attempt is bounded, and a retry has a meaningful chance of succeeding. Each assumption was load-bearing. Each one is now wrong, and the variance the cost model never captured is sitting at the bottom of every monthly invoice.
The teams that have not rebuilt their retry policy for token economics are paying a hidden tax that scales with the difficulty of the queries they were already most worried about — the long ones, the agentic ones, the ones with deep tool chains. The retry budget that classical resilience engineering hands you back as a safety net is, in an LLM stack, the rope.
