The Integration Test Mirage: Why Mocked Tool Outputs Hide Your Agent's Real Failure Modes
Your agent passes every test. The CI pipeline is green. You ship it.
A week later, a user reports that their bulk-export job silently returned 200 records instead of 14,000. The agent fetched the first page of a paginated API, got a clean response, assumed there was nothing more, and moved on. Your mock had returned its entire 200-item fixture in a single response, so the tests never exercised pagination. The real API never told the agent that this was only page 1 of 70.
This is not a model failure. Given what it saw, the model reasoned correctly. This is a test infrastructure failure, and it's endemic to how teams build and test agentic systems.
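To make the mirage concrete, here is a minimal sketch in which the agent's tool-calling loop is reduced to a plain function. Everything in it is hypothetical: the `Page` type, the `next_cursor` field, and the helpers `flat_mock`, `make_paginated_fake`, `export_first_page_only`, and `export_all` are illustrative names, and the sketch assumes cursor-based pagination because the story doesn't say how the real API signals more pages. Against the flat mock, the buggy exporter and the correct one return identical results, so no test can tell them apart. A fake that honors the pagination contract separates them immediately.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Page:
    items: list[int]
    next_cursor: Optional[str]  # hypothetical field; None means "no more pages"


Fetch = Callable[[Optional[str]], Page]


def flat_mock(cursor: Optional[str] = None) -> Page:
    # Hypothetical mock: returns its whole 200-item fixture at once and never paginates.
    return Page(items=list(range(200)), next_cursor=None)


def make_paginated_fake(total: int, page_size: int) -> Fetch:
    # Hypothetical fake that mimics cursor pagination over `total` records.
    def fetch(cursor: Optional[str] = None) -> Page:
        start = int(cursor) if cursor is not None else 0
        end = min(start + page_size, total)
        next_cursor = str(end) if end < total else None
        return Page(items=list(range(start, end)), next_cursor=next_cursor)

    return fetch


def export_first_page_only(fetch: Fetch) -> list[int]:
    # The failure mode from the story: one call, assume it returned everything.
    return fetch(None).items


def export_all(fetch: Fetch) -> list[int]:
    # Follows next_cursor until the API reports there are no more pages.
    items: list[int] = []
    cursor: Optional[str] = None
    while True:
        page = fetch(cursor)
        items.extend(page.items)
        cursor = page.next_cursor
        if cursor is None:
            return items


if __name__ == "__main__":
    # Against the flat mock, both exporters agree, so the test suite stays green.
    print(len(export_first_page_only(flat_mock)), len(export_all(flat_mock)))  # 200 200

    # Against a paginated fake shaped like the real API, the bug surfaces.
    fake = make_paginated_fake(total=14_000, page_size=200)
    print(len(export_first_page_only(fake)), len(export_all(fake)))  # 200 14000
```

The paginated fake costs only a few more lines than the flat fixture, but it is the only one of the two that can distinguish an agent that stops after page 1 from one that drains all 70 pages.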
