Skip to main content

2 posts tagged with "fault-injection"

View all tags

The Fault You Never Inject: Feeding Your Agent a Tool That Lies

· 10 min read
Tian Pan
Software Engineer

Open the resilience suite for your agent and look at what it actually tests. You will find timeouts. You will find connection drops, 500s, rate-limit responses, malformed JSON, maybe a tool that hangs for thirty seconds before failing. All of it is fault injection in the classic mold: the tool is broken, and the question is whether your agent degrades gracefully.

Now look for the test where the tool is not broken at all. The one where the tool responds in 80 milliseconds, returns perfectly valid JSON against the schema, and the value inside is simply wrong. A balance that is stale by three days. A customer record with two fields swapped. An order quantity with two digits transposed. An empty result list for a query that should have returned forty rows.

You will not find it. Almost nobody injects that fault. And it is the one fault your agent is least equipped to survive, because every other fault announces itself and this one does not.

Chaos Engineering for AI Agents: Injecting the Failures Your Agents Will Actually Face

· 9 min read
Tian Pan
Software Engineer

Your agent works perfectly in staging. It calls the right tools, reasons through multi-step plans, and returns polished results. Then production happens: the geocoding API times out at step 3 of a 7-step plan, the LLM returns a partial response mid-sentence, and your agent confidently fabricates data to fill the gap. Nobody notices until a customer does.

LLM API calls fail 1–5% of the time in production — rate limits, timeouts, server errors. For a multi-step agent making 10–20 tool calls per task, that means a meaningful percentage of tasks will hit at least one failure. The question isn't whether your agent will encounter faults. It's whether you've ever tested what happens when it does.