Skip to main content

One post tagged with "fault-injection"

View all tags

Chaos Engineering for AI Agents: Injecting the Failures Your Agents Will Actually Face

· 9 min read
Tian Pan
Software Engineer

Your agent works perfectly in staging. It calls the right tools, reasons through multi-step plans, and returns polished results. Then production happens: the geocoding API times out at step 3 of a 7-step plan, the LLM returns a partial response mid-sentence, and your agent confidently fabricates data to fill the gap. Nobody notices until a customer does.

LLM API calls fail 1–5% of the time in production — rate limits, timeouts, server errors. For a multi-step agent making 10–20 tool calls per task, that means a meaningful percentage of tasks will hit at least one failure. The question isn't whether your agent will encounter faults. It's whether you've ever tested what happens when it does.