Multi-Modal Agents in Production: What Text-Only Evals Never Catch

· 10 min read
Tian Pan
Software Engineer

Most teams building AI agents discover the same thing three months into production: their eval suite—carefully designed around text inputs and JSON outputs—tells them nothing useful about what happens when the agent encounters a blurry invoice, a scanned contract, or a screenshot of a UI it has never seen. The text-only eval passes. The user files a ticket.

Multi-modal inputs aren't just another modality to wire up. They introduce a distinct category of failure that requires different architecture decisions, different cost models, and different eval strategies. Teams that treat vision as a drop-in addition to a working text agent consistently underestimate the effort involved.

The Hidden Cost Multiplier Nobody Warns You About

The first shock is the token bill. Most engineers understand that vision inputs cost more—what they don't understand is the scaling function.

Vision tokens scale roughly quadratically with image resolution. Double the image dimensions and you quadruple the token count. An agent that happily processes a 1,000-token text document can suddenly be consuming 8,000–15,000 tokens per page when a PDF is passed as images. At scale, this isn't a rounding error; it's the difference between a viable unit economics model and one that breaks the product.
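The scaling function is easy to make concrete with a back-of-the-envelope estimator. The tile size and tokens-per-tile values below are illustrative placeholders, not any specific provider's numbers; what matters is the shape of the curve:

```python
import math

def estimate_vision_tokens(width_px: int, height_px: int,
                           tile_px: int = 512, tokens_per_tile: int = 170) -> int:
    """Rough vision-token estimate under a tile-based tokenization model.

    tile_px and tokens_per_tile are illustrative assumptions; substitute
    your model's documented values before using this for cost planning.
    """
    tiles_x = math.ceil(width_px / tile_px)
    tiles_y = math.ceil(height_px / tile_px)
    return tiles_x * tiles_y * tokens_per_tile

# Doubling both dimensions quadruples the token count:
# estimate_vision_tokens(2000, 2000) == 4 * estimate_vision_tokens(1000, 1000)
```

Run this over a sample of your real documents at their native resolutions and the cost surprise usually becomes visible before any API bill does.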

The 2025 research on token efficiency makes the problem concrete: adaptive compression techniques can reduce vision token counts by 60–90% while maintaining task performance, but implementing them requires deliberately choosing target resolution at ingestion time rather than defaulting to whatever the user uploads. Most production systems don't have that gate. They accept the file, render it, and let the model consume whatever comes out.

The cost asymmetry also affects architectural decisions that aren't obvious at design time. When your agent loop processes a sequence of steps—retrieve, read, reason, act—and each "read" step involves vision, context window consumption becomes a ceiling on task complexity. Agents fail not because the model can't reason, but because they exhaust the context before getting to the part where reasoning would have helped.

Practical guidance: establish a resolution policy before you write a line of agent code. Define the maximum resolution you'll pass for each document type. Enforce it at ingestion. Monitor token-per-document distributions in production. These three things alone catch most cost surprises.
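A minimal sketch of such an ingestion gate. The per-document-type caps and type names below are invented for illustration; the point is that the policy lives in one table and is enforced before the model ever sees pixels:

```python
# Hypothetical policy: maximum long-edge resolution (px) per document type.
RESOLUTION_POLICY = {
    "invoice": 1536,
    "contract": 2048,
    "screenshot": 1280,
}

def target_dims(doc_type: str, width: int, height: int) -> tuple[int, int]:
    """Return the dimensions to render at, enforcing the per-type cap.

    Unknown types fall back to a conservative default cap.
    """
    cap = RESOLUTION_POLICY.get(doc_type, 1024)
    long_edge = max(width, height)
    if long_edge <= cap:
        return width, height          # already within policy, pass through
    scale = cap / long_edge
    return round(width * scale), round(height * scale)
```

The actual resampling can then be done by whatever imaging library you already use; the gate's job is only to decide the target size deterministically.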

OCR vs. Vision Models: The Tradeoff Nobody Gets Right the First Time

The natural instinct when adding document processing to an agent is to pick one: either use traditional OCR to extract text and feed that text to the language model, or use a vision-language model directly and let it "read" the document visually. Both extremes have failure modes.

Pure OCR is fast and cheap. For clean, standardized documents—government forms, well-formatted invoices, printouts of structured data—OCR accuracy above 97% is achievable. The pipeline is deterministic and repeatable, which matters for debugging. The problem is that OCR depends on predictable layout. When documents have variable structure, handwriting, degraded scans, or mixed content (tables embedded in prose, charts with labeled axes), OCR accuracy collapses. Accuracy on handwritten fields drops to the 65–78% range. On noisy scans, below 60%.

Vision-language models handle variable layouts naturally. They achieve 85–95% accuracy on handwritten text where OCR struggles, and they extract meaning from complex tables without requiring layout templates. But they're orders of magnitude slower—milliseconds for OCR versus several seconds per page for a VLM—and at serious scale, the cost differential is substantial. At ten million pages per month, self-hosted OCR runs around $900; cloud VLM APIs run around $14,000 for equivalent throughput.

The pattern that dominates in production is hybrid: OCR for initial extraction, VLM for validation and correction on low-confidence fields. This captures most of OCR's cost and latency advantages while recovering accuracy on the cases that matter. The implementation detail that makes this work is a confidence score gate—OCR outputs below a threshold get re-processed by the VLM, while high-confidence extractions pass through directly.
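The gate itself is a few lines. In this sketch, the `Field` type, the threshold, and the `vlm_reextract` callable are all hypothetical stand-ins for whatever your OCR engine and VLM client actually expose:

```python
from dataclasses import dataclass

@dataclass
class Field:
    name: str
    value: str
    confidence: float  # per-field confidence reported by the OCR engine, 0.0-1.0

def route_fields(fields: list[Field], vlm_reextract, threshold: float = 0.85) -> dict:
    """Hybrid routing: high-confidence OCR fields pass through directly;
    low-confidence fields are re-extracted by the (slower, costlier) VLM.

    vlm_reextract is a hypothetical callable: field name -> corrected value.
    """
    out = {}
    for f in fields:
        if f.confidence >= threshold:
            out[f.name] = f.value
        else:
            out[f.name] = vlm_reextract(f.name)
    return out
```

The threshold is a tuning knob: raise it and more fields take the expensive path; lower it and more OCR errors leak through. Measuring the fraction of fields routed to the VLM per document type is usually the first dashboard worth building.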

The decision matrix simplifies to three questions: How variable is the document layout? How frequent is handwriting or degraded input? What is the acceptable per-document cost? If layout is predictable and documents are clean, OCR with VLM spot-checking is almost always the right answer. If layout is unpredictable or quality is variable, budget for VLM from the start.

Grounding Failures: When the Model Sees Things That Aren't There

Hallucination in text models is a known, studied problem. Grounding failures in vision models are less well-understood in production contexts, and the failure modes are different enough that text-model intuitions don't transfer cleanly.

Object hallucination is the most common class: the model generates a confident description of an object that doesn't appear in the image. The mechanism is well-documented—VLMs over-rely on language priors, producing descriptions that are statistically plausible rather than visually grounded. In a document processing context, this manifests as the model confidently reporting field values that don't exist in the source document.

Spatial reasoning failures are more subtle and harder to detect without purpose-built evals. On controlled benchmarks for spatial tasks—identifying which object is to the left of which, inferring relative scale, understanding occlusion—current VLMs achieve 50–55% accuracy versus 85%+ for humans. This isn't a small gap. For agents that need to interpret UI screenshots, understand form layouts, or extract values from tables based on their position, spatial reasoning errors cause systematic failures that look like random noise until you have evals that probe them specifically.

The mitigation strategies that work in practice are:

  • Visual Contrastive Decoding: contrast model outputs against predictions from a distorted version of the same image. Outputs that only appear when the image is intact are grounded; outputs that appear in both cases are likely driven by language priors.
  • Multi-agent correction: route uncertain outputs to a second model pass with different sampling temperature, then compare. Disagreements flag low-confidence extractions for human review.
  • Structured output validation: when extracting structured data from documents, validate extracted values against expected ranges, formats, and cross-field constraints. The model may hallucinate a field value; it's harder to hallucinate one that passes business-rule validation.
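Of the three, structured output validation is the cheapest to prototype. A sketch for a hypothetical invoice schema—field names, formats, and tolerances here are assumptions, not a standard:

```python
import re
from datetime import date

def validate_invoice(extraction: dict) -> list[str]:
    """Business-rule validation for a hypothetical invoice extraction.

    Returns a list of violations; an empty list means the extraction passed.
    """
    errors = []

    # Format constraint: total must look like a money amount.
    if not re.fullmatch(r"\d+\.\d{2}", extraction.get("total", "")):
        errors.append("total: not a money amount")

    # Range constraint: invoice date must parse and not be in the future.
    try:
        d = date.fromisoformat(extraction.get("invoice_date", ""))
        if d > date.today():
            errors.append("invoice_date: in the future")
    except ValueError:
        errors.append("invoice_date: not ISO format")

    # Cross-field constraint: line items must sum to the stated total.
    items = extraction.get("line_items", [])
    if items and re.fullmatch(r"\d+\.\d{2}", extraction.get("total", "")):
        if abs(sum(items) - float(extraction["total"])) > 0.01:
            errors.append("line_items: do not sum to total")

    return errors
```

A hallucinated total that happens to look like a number still has to reconcile against the line items; that cross-field check is what makes plausible-but-wrong extractions expensive for the model to sneak past.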

The Eval Gap That Text-Only Pipelines Don't Cover

The most common mistake teams make when launching multi-modal agents is testing the vision component in isolation and assuming the agent integration is correct by implication. It isn't.

Best-in-class multimodal models achieve around 27% accuracy on complex, realistic web-based agentic tasks in benchmark settings—tasks that require interpreting visual context, selecting actions, and completing multi-step goals. This isn't a critique of the models; it's a calibration point. Your production task is harder than a benchmark, your documents are messier than benchmark inputs, and your users will probe edge cases no benchmark anticipated.

The eval patterns that catch multi-modal regressions before users do share a few properties:

Round-trip extraction tests: extract structured data from a document, then re-render the extracted data and compare it to the original. Significant divergence indicates the extraction step is unreliable, even when the extracted data looks plausible in isolation.
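A minimal version of a round-trip check, assuming extractions arrive as flat field dictionaries (the normalization rules are illustrative; real pipelines need domain-specific canonicalization for dates, currencies, and names):

```python
def normalize(fields: dict) -> dict:
    """Canonicalize extracted fields so cosmetic differences don't count."""
    out = {}
    for k, v in fields.items():
        s = str(v).strip().lower()
        s = s.replace(",", "")  # "1,000.00" and "1000.00" should compare equal
        out[k] = s
    return out

def round_trip_diverges(original: dict, reextracted: dict) -> set[str]:
    """Field names where re-extraction from the re-rendered document disagrees."""
    a, b = normalize(original), normalize(reextracted)
    return {k for k in a.keys() | b.keys() if a.get(k) != b.get(k)}
```

A nonzero divergence set on a field is a direct signal that the extraction for that field is unstable, independent of whether either value "looks right."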

Adversarial input suites: maintain a suite of documents that previously caused failures—blurry scans, documents with unusual fonts, forms with overlapping text, screenshots at various resolutions. Run every model update against this suite before promoting to production.

Confidence-calibration checks: measure whether the model's stated confidence correlates with actual accuracy. Models that are confidently wrong on a specific input class (common in spatial reasoning tasks) need routing logic that bypasses their self-assessment.
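The calibration check reduces to bucketing (confidence, correct) pairs and comparing mean confidence against observed accuracy per bucket. A sketch; bucket count and reporting format are arbitrary choices:

```python
def calibration_buckets(samples: list[tuple[float, bool]], n_buckets: int = 5):
    """Group (confidence, correct) pairs into buckets and report, per bucket:
    (mean stated confidence, observed accuracy, sample count).

    A large gap between the first two numbers in any bucket flags
    miscalibration for that confidence range.
    """
    buckets = [[] for _ in range(n_buckets)]
    for conf, correct in samples:
        i = min(int(conf * n_buckets), n_buckets - 1)  # clamp conf == 1.0
        buckets[i].append((conf, correct))
    report = []
    for b in buckets:
        if not b:
            continue
        mean_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        report.append((round(mean_conf, 2), round(accuracy, 2), len(b)))
    return report
```

A bucket reporting mean confidence 0.9 with accuracy 0.5 is exactly the "confidently wrong input class" that should be routed around rather than trusted.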

Cross-modal consistency tests: for inputs that have both a visual and text representation (a PDF that also has embedded text, a table that also exists in a database), verify that the model's extraction from the visual matches the ground truth from the text. Inconsistencies reveal the specific failure modes of visual grounding.

The structural pathology that catches teams off guard is perceptual fragility: in multimodal pipelines, adding visual inputs sometimes degrades performance on subtasks that the text-only version handled correctly. The visual input introduces positional bias and alignment artifacts that scatter attention. Monitor text-only metrics after adding vision—don't assume the new capability is additive.

Context Window Management Across Vision Steps

In a multi-step agent loop, each vision step compounds the context consumption problem. A five-step task that processes a document at each step can exhaust a 200K-token context window before the agent reaches step four, depending on image resolution and document complexity.

The failure mode is invisible from the outside. The agent doesn't return an error; it silently produces outputs based on truncated context, often without any indication that important information was dropped. The "Lost in the Middle" effect—where models fail to use information that appears in the middle of long contexts—amplifies this: even when context isn't technically exhausted, information from early vision steps may effectively disappear from the model's reasoning.

Three structural approaches manage this:

  • Selective context injection: instead of passing all prior images to each step, extract the structured outputs from completed vision steps and pass those as text. Only include the raw image when the current step requires it.
  • Progressive summarization: after each vision step, have the model produce a compact structured summary of what it learned. Use that summary in subsequent steps instead of the raw image.
  • Step-level context budgeting: set explicit token limits per step and enforce them at the orchestration layer. When a step would exceed its budget, downsample the image or truncate the context before passing it to the model.
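The third approach can be sketched as a loop that downsamples the step's image until the step fits its budget, reusing a rough tile-based token model (the budget, tile size, and tokens-per-tile are illustrative, not any provider's numbers):

```python
import math

def vision_tokens(w: int, h: int) -> int:
    """Illustrative tile-based vision-token estimate (assumed model, not real)."""
    return math.ceil(w / 512) * math.ceil(h / 512) * 170

def enforce_step_budget(text_tokens: int, image_dims: tuple[int, int],
                        budget: int = 20_000):
    """Halve image dimensions until the step's total fits its token budget.

    Because vision tokens scale roughly quadratically with resolution,
    each halving cuts the image's token cost to about a quarter.
    Returns (final dimensions, estimated total tokens for the step).
    """
    w, h = image_dims
    while text_tokens + vision_tokens(w, h) > budget and min(w, h) > 512:
        w, h = w // 2, h // 2
    return (w, h), text_tokens + vision_tokens(w, h)
```

Enforcing this at the orchestration layer, rather than hoping each step stays small, turns the invisible truncation failure into an explicit, loggable downsampling decision.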

Building Production Readiness: What Actually Matters

Integrating vision into an agent pipeline is a multi-month effort for teams that get it right, not a weekend integration task. The architecture decisions that look optional in a demo—resolution policy, OCR/VLM routing, confidence gating, context budgeting—become load-bearing in production.

The teams that ship reliable multi-modal agents share a few practices:

  • They maintain a growing library of real production failures—actual documents from actual users that broke the pipeline—and use them as eval inputs before every release.
  • They instrument confidence scores at every extraction step and use those signals to route borderline cases to human review rather than accepting uncertain model outputs.
  • They run text-only metrics alongside multi-modal metrics and alert on regressions in either direction. Vision inputs can silently degrade text-task performance.
  • They treat real-time monitoring as a first-class requirement, not an afterthought. Most production multi-modal failures manifest as subtle degradation—wrong field values, misread numbers, spatial reasoning errors—not hard crashes. Without per-step monitoring, they're invisible until users report them.

The 27% benchmark accuracy number is a useful anchor. If the best multi-modal agents in the world get roughly one in four complex tasks right on controlled benchmarks, production systems with real-world document variety and adversarial user inputs need defensive architecture that degrades gracefully, surfaces uncertainty honestly, and routes low-confidence outputs to appropriate fallbacks. That's the difference between a demo and a product.
