The Testing Pyramid Inverts for AI: Why Unit Tests Are the Wrong Investment for LLM Features
Your team ships a new LLM feature. The unit tests pass. CI is green. You deploy. Then users start reporting that the AI "just doesn't work right" — answers are weirdly formatted, the agent picks the wrong tool, context gets lost halfway through a multi-step task. You look at the test suite and it's still green. Every test passes. The feature is broken.
This is not bad luck. It is what happens when you apply a deterministic testing philosophy to a probabilistic system. The classic testing pyramid — wide base of unit tests, smaller middle layer of integration tests, narrow top of end-to-end tests — rests on one assumption so fundamental that nobody writes it down: the code does the same thing every time. LLMs violate this assumption at every level. The testing strategy built on top of it needs to be rebuilt from scratch.
