
3 posts tagged with "chaos-engineering"


The Fault You Never Inject: Feeding Your Agent a Tool That Lies

· 10 min read
Tian Pan
Software Engineer

Open the resilience suite for your agent and look at what it actually tests. You will find timeouts. You will find connection drops, 500s, rate-limit responses, malformed JSON, maybe a tool that hangs for thirty seconds before failing. All of it is fault injection in the classic mold: the tool is broken, and the question is whether your agent degrades gracefully.

Now look for the test where the tool is not broken at all. The one where the tool responds in 80 milliseconds, returns perfectly valid JSON against the schema, and the value inside is simply wrong. A balance that is stale by three days. A customer record with two fields swapped. An order quantity with two digits transposed. An empty result list for a query that should have returned forty rows.

You will not find it. Almost nobody injects that fault. And it is the one fault your agent is least equipped to survive, because every other fault announces itself and this one does not.
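One way to inject this kind of fault is to wrap a tool so that its response keeps the same shape but carries a wrong value. The sketch below is only an illustration under stated assumptions: `lying_tool`, `transpose_digits`, and the `get_order` stand-in are hypothetical names, not part of any real framework, and the only corruption shown is the transposed-digits case.

```python
import copy
import random
from typing import Any, Callable, Dict

Tool = Callable[..., Dict[str, Any]]


def transpose_digits(value: int) -> int:
    """Swap the first two digits of an integer (42 -> 24)."""
    digits = list(str(abs(value)))
    if len(digits) < 2:
        return value  # nothing to transpose
    digits[0], digits[1] = digits[1], digits[0]
    swapped = int("".join(digits))
    return -swapped if value < 0 else swapped


def lying_tool(tool: Tool, field: str, rate: float, rng: random.Random) -> Tool:
    """Wrap a tool so a fraction of calls return a well-formed but wrong value.

    The response keeps its keys and types, so schema validation still passes.
    """
    def wrapper(*args: Any, **kwargs: Any) -> Dict[str, Any]:
        result = copy.deepcopy(tool(*args, **kwargs))
        if rng.random() < rate and isinstance(result.get(field), int):
            result[field] = transpose_digits(result[field])
        return result
    return wrapper


def get_order(order_id: str) -> Dict[str, Any]:
    # Hypothetical stand-in for a real tool call.
    return {"order_id": order_id, "quantity": 42}


# rate=1.0 corrupts every call, which keeps the demonstration deterministic.
faulty_get_order = lying_tool(get_order, field="quantity", rate=1.0, rng=random.Random(0))
print(faulty_get_order("A-1"))  # {'order_id': 'A-1', 'quantity': 24}
```

Handing `faulty_get_order` to the agent in place of `get_order` lets a test check whether anything downstream notices that the quantity changed while every structural check still passed.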

Fallback Path Atrophy: Your Graceful Degradation Stopped Working Three Months Ago

· 9 min read
Tian Pan
Software Engineer

The fallback path you wrote nine months ago — the one that catches model timeouts, swaps to a cheaper provider, returns a templated message when both are down — has not actually run in production for the last twelve weeks. It was exercised once during the original launch, the integration tests still pass against it, and the runbook still references it. None of that means it works. A refactor in week six changed the shape of the upstream context object. A library bump in week nine quietly moved a config key. The code still compiles. The tests still pass because they were written against the same stale fixtures the code was. The next time your primary path 504s, your "graceful degradation" will throw a NullPointerException into a user's face, and the postmortem will note — for the third time this year — that the fallback was never re-tested after the upstream contract changed.

This is the quiet failure mode of resilience engineering in AI systems. The fallback path is the part of your application that exists specifically to be ignored. Production traffic flows around it for ninety-nine days out of a hundred. CI never exercises it against the current upstream contract because no test was ever wired to do so. The team that owns it forgets it exists between incidents. Then on day one hundred, when the primary model provider has a regional outage and you finally need it, the bit-rotted path fails in front of a paying customer.
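A small way to keep the path from going unexercised is a test that forces the primary to fail on every CI run. The following is a minimal sketch, not a description of any particular codebase: `answer`, `ProviderError`, `TEMPLATED_MESSAGE`, and the fake providers are all hypothetical, and a real version would need to build its inputs from the current upstream context object rather than from saved fixtures.

```python
from typing import Callable

Provider = Callable[[str], str]

# Hypothetical templated reply used when every provider is down.
TEMPLATED_MESSAGE = "We can't answer right now. Please try again shortly."


class ProviderError(Exception):
    """Hypothetical error raised when a model provider fails."""


def answer(prompt: str, primary: Provider, secondary: Provider) -> str:
    """Try the primary, then the cheaper provider, then a templated message."""
    for provider in (primary, secondary):
        try:
            return provider(prompt)
        except (ProviderError, TimeoutError):
            continue  # fall through to the next option
    return TEMPLATED_MESSAGE


def always_times_out(prompt: str) -> str:
    # Injected fault: the provider never answers in time.
    raise TimeoutError("injected timeout")


def cheap_provider(prompt: str) -> str:
    return f"cheap:{prompt}"


def test_falls_back_to_secondary() -> None:
    assert answer("hi", always_times_out, cheap_provider) == "cheap:hi"


def test_templated_message_when_both_down() -> None:
    assert answer("hi", always_times_out, always_times_out) == TEMPLATED_MESSAGE
```

The test functions follow the naming convention that pytest collects, so they would run with the rest of the suite on every commit instead of waiting for an outage.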

Chaos Engineering for AI Agents: Injecting the Failures Your Agents Will Actually Face

· 9 min read
Tian Pan
Software Engineer

Your agent works perfectly in staging. It calls the right tools, reasons through multi-step plans, and returns polished results. Then production happens: the geocoding API times out at step 3 of a 7-step plan, the LLM returns a partial response mid-sentence, and your agent confidently fabricates data to fill the gap. Nobody notices until a customer does.

LLM API calls fail 1–5% of the time in production: rate limits, timeouts, server errors. For a multi-step agent making 10–20 tool calls per task, and assuming each call fails independently at that rate, somewhere between roughly 10% of tasks (1% over 10 calls) and 64% of tasks (5% over 20 calls) will hit at least one failure. The question isn't whether your agent will encounter faults. It's whether you've ever tested what happens when it does.
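To put numbers on that claim, the chance that a task sees at least one failure is 1 - (1 - p)^n for a per-call failure rate p and n calls. The snippet below works this out for the ranges quoted above. It assumes failures are independent and that the same rate applies to every call, which real outages often violate; the function name is illustrative.

```python
def p_at_least_one_failure(per_call_failure_rate: float, calls: int) -> float:
    """Probability that at least one of `calls` independent calls fails."""
    return 1.0 - (1.0 - per_call_failure_rate) ** calls


for p in (0.01, 0.05):
    for n in (10, 20):
        print(f"p={p:.2f}, n={n}: {p_at_least_one_failure(p, n):.1%}")

# p=0.01, n=10: 9.6%
# p=0.01, n=20: 18.2%
# p=0.05, n=10: 40.1%
# p=0.05, n=20: 64.2%
```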