Multi-Modal Agents in Production: What Text-Only Evals Never Catch
Most teams building AI agents discover the same thing three months into production: their eval suite—carefully designed around text inputs and JSON outputs—tells them nothing useful about what happens when the agent encounters a blurry invoice, a scanned contract, or a screenshot of a UI it has never seen. The text-only eval passes. The user files a ticket.
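To make the gap concrete, here is a minimal sketch of the eval case most suites are missing. Everything in it is hypothetical: `run_agent` stands in for whatever harness a team already has, the fixture path is invented, and the message structure is schematic rather than any particular provider's schema. The point is the second test, which exercises a failure surface the first one never touches.

```python
import base64
from pathlib import Path


def run_agent(messages: list[dict]) -> dict:
    """Hypothetical entry point: send messages to the agent under test
    and return its parsed JSON output. Wire this to your own harness."""
    raise NotImplementedError


# The text-only case: clean input, deterministic output. This is the
# eval that keeps passing.
def test_invoice_total_from_text():
    result = run_agent([
        {"role": "user", "content": "Invoice #4821. Total due: $1,250.00"}
    ])
    assert result["total"] == "1250.00"


# The multi-modal case: the same task, but from a blurry scan. This is
# the input the user actually sends, and the one the text suite never sees.
def test_invoice_total_from_blurry_scan():
    image_b64 = base64.b64encode(
        Path("fixtures/blurry_invoice_4821.png").read_bytes()
    ).decode("ascii")
    result = run_agent([
        {"role": "user", "content": [
            {"type": "text", "text": "Extract the total due from this invoice."},
            {"type": "image", "data": image_b64},
        ]}
    ])
    assert result["total"] == "1250.00"
```

Both tests assert the same output; only the input channel differs. That symmetry is what makes the gap easy to overlook and cheap to close once you see it.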
Multi-modal inputs aren't just another input type to wire up. They introduce a distinct category of failure that requires different architecture decisions, different cost models, and different eval strategies. Teams that treat vision as a drop-in addition to a working text agent consistently underestimate the effort involved.
