Multimodal LLM Inputs in Production: Vision, Documents, and the Failure Modes Nobody Warns You About
Adding vision to an LLM application looks deceptively simple. You swap a text model for a multimodal one, pass in an image alongside your prompt, and the demo works brilliantly. Then you push to production and discover that half your invoices get the total wrong, tables in PDFs lose their structure, and low-quality scans produce confident hallucinations. The debugging is harder than anything you faced with text-only systems, because the failures are visual and the LLM will not tell you it cannot see clearly.
This post covers what actually goes wrong when you move multimodal LLM inputs from prototype to production, and the architectural decisions that prevent those failures.
Vision Tokens Are Not Free — And They Scale Quadratically
The first production surprise is cost. Vision tokens are priced 2–5× higher than text tokens, and the number of tokens an image consumes grows with its pixel area, not its longest edge: double the width and height and you roughly quadruple the token cost.
GPT-4o's high-detail mode works as follows: the image is tiled into 512×512 pixel chunks, and each tile costs 170 tokens, plus an 85-token base. A 1024×1024 image produces four tiles — 765 tokens just for one image. Scale that to a document pipeline processing 50,000 invoices per day and the numbers become significant before you have even written a single prompt.
Claude's accounting is similar in spirit: images larger than 1568 pixels on the longest edge are scaled down, and token cost grows with the resulting pixel area. The recommendation from providers to "use the lowest resolution that works for your task" sounds obvious until you realize that "lowest resolution that works" requires empirical testing per document type, not a single global setting.
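The tiling arithmetic above is easy to encode. This is a simplified sketch of the GPT-4o high-detail model as described here (512px tiles at 170 tokens each plus an 85-token base); the real API also rescales large images first, so treat it as an estimate for capacity planning, not billing-exact accounting:

```python
import math

def gpt4o_image_tokens(width: int, height: int,
                       tile_px: int = 512,
                       per_tile: int = 170,
                       base: int = 85) -> int:
    """Estimate high-detail vision tokens: ceil-divide each dimension
    into 512px tiles, charge 170 tokens per tile plus an 85-token base.
    Simplified model -- the provider rescales oversized images first."""
    tiles = math.ceil(width / tile_px) * math.ceil(height / tile_px)
    return tiles * per_tile + base

# A 1024x1024 image: 2x2 = 4 tiles -> 4 * 170 + 85 = 765 tokens
print(gpt4o_image_tokens(1024, 1024))  # 765
```

Multiply the result by your daily document volume and the per-token price and the 50,000-invoices-per-day math becomes concrete before you write a prompt.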
The optimization path involves:
- Downscaling images to provider maximums (1568px for Claude, 2048px for GPT-4o) before sending — most production images from phone cameras arrive at 4032×3024
- Using provider-specific low-detail modes for tasks that only need gross visual features (classification, routing)
- Caching images when the same document is queried multiple times — prompt caching for images reduces repeated costs by 90%
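The downscaling step reduces to a small, dependency-free calculation: cap the longest edge at the provider maximum while preserving aspect ratio, then hand the result to whatever image library you use for the actual resize. A minimal sketch:

```python
def fit_longest_edge(width: int, height: int,
                     max_edge: int = 1568) -> tuple[int, int]:
    """Return dimensions scaled so the longest edge is at most
    max_edge, preserving aspect ratio. 1568 is Claude's documented
    cap; pass 2048 for GPT-4o. Feed the result to your image
    library's resize call before upload."""
    longest = max(width, height)
    if longest <= max_edge:
        return width, height  # already small enough, no resize needed
    scale = max_edge / longest
    return round(width * scale), round(height * scale)

# A typical phone-camera image shrinks substantially:
print(fit_longest_edge(4032, 3024))  # (1568, 1176)
```

Because token cost tracks pixel area, this single step cuts the cost of a raw phone photo by a large constant factor with no accuracy loss for most document tasks.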
What many teams miss: vision encoder latency dominates at high resolutions. When you send a 4K image, the bottleneck is not LLM inference — it is the vision encoder processing time, which increases your time-to-first-token substantially. Latency and cost move together here.
The OCR vs. Native Vision Decision Is Not Binary
The common framing is: "Should I use OCR, or just pass the image to the LLM directly?" That framing leads to the wrong architecture. The correct question is: "What is the failure mode I am trying to prevent, and which approach avoids it?"
Native vision excels when:
- Document templates vary widely (invoices from 200 different vendors, each with a different layout)
- The task requires semantic reasoning alongside extraction ("what is the primary line item?")
- Documents contain handwriting or mixed scripts
- You need to extract information from charts, diagrams, or images embedded within the document
Traditional OCR wins when:
- You need deterministic, reproducible output (financial records, compliance documents)
- The text is clean and well-structured (machine-printed forms)
- You need character-level accuracy above 97%
- Cost is the primary constraint and the document format is stable
The practical trap: teams pick native vision because it handles layout variation, then discover their model hallucinates on the 5% of documents with degraded scan quality, producing confident but wrong extractions. The LLM does not say "I cannot read this clearly." It invents a plausible value.
The emerging production pattern is a two-stage hybrid. A lightweight quality classifier routes documents: clean, high-quality scans go to a fast OCR pipeline, while complex or degraded documents go to the vision LLM. Grab's engineering team building their document processing system (Documint) took this direction — they found that proprietary LLMs alone were too slow, too expensive, and hallucinated on Southeast Asian language documents (Thai, Vietnamese) without specialized training data. Their custom vision LLM, built on Qwen2-VL fine-tuned with synthetic OCR data, achieved 50% speedup over the base model while maintaining accuracy on non-Latin scripts.
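The two-stage hybrid can be sketched as a routing function. The quality score and thresholds here are assumptions you would calibrate on your own corpus (e.g. from a BRISQUE-based classifier), not values from Grab's system:

```python
from dataclasses import dataclass

@dataclass
class Document:
    doc_id: str
    quality_score: float      # hypothetical: 0.0 (unreadable) .. 1.0 (clean)
    has_nonlatin_text: bool   # hypothetical flag from a script detector

def route(doc: Document, ocr_threshold: float = 0.8) -> str:
    """Send clean, Latin-script documents to the cheap OCR pipeline;
    everything degraded or script-complex goes to the vision LLM.
    The threshold is illustrative and must be tuned empirically."""
    if doc.quality_score >= ocr_threshold and not doc.has_nonlatin_text:
        return "ocr_pipeline"
    return "vision_llm"

print(route(Document("inv-001", 0.93, False)))  # ocr_pipeline
```

The point of the classifier is economic: the expensive model only sees the fraction of documents the cheap path cannot handle.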
The lesson: benchmark accuracy scores from providers are measured on clean benchmark datasets. Your documents are not clean benchmark datasets.
PDF Tables Break Most Document Pipelines
Tables in PDFs are one of the most reliable ways to expose the limits of a document processing architecture. This is worth treating as a specific failure category rather than a general "document parsing" problem.
The standard approach of extracting PDF text as a flat string and chunking it breaks tables in two ways. Parsing text from PDFs in reading order produces something like `Column A Column B Column C value1 value2 value3`, stripping the two-dimensional structure that gives the table meaning. Then chunking by token count frequently cuts across table boundaries, sending the LLM half a table without headers.
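One cheap defense, once your PDF library gives you rows (e.g. pdfplumber's `extract_table()`), is to serialize tables as Markdown rather than flattening them, so every chunk that contains the table keeps its row/column structure. A minimal sketch:

```python
def rows_to_markdown(rows: list[list[str]]) -> str:
    """Serialize an extracted table (first row = headers) as a
    Markdown table. Unlike reading-order flattening, this keeps
    the header-to-cell alignment that gives the table meaning."""
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)

rows = [["Item", "Qty", "Total"], ["Widget", "2", "9.00"]]
print(rows_to_markdown(rows))
```

This does not solve chunk boundaries cutting through long tables, but it eliminates the `value1 value2 value3` flattening failure for tables that fit in a chunk.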
Vision-guided chunking approaches treat the PDF page as an image and use a vision model to identify semantic boundaries — table start/end, section headers, figure captions — before deciding where to split. Research on this approach shows 84.4% page-level and paragraph-level retrieval accuracy, significantly better than fixed-size chunking.
The more radical solution is ColPali, introduced in a 2024 paper (published at ICLR 2025) and now production-ready via integrations with Vespa and other vector databases. ColPali treats entire PDF pages as images and indexes them directly using a vision language model — no OCR, no text extraction, no chunking decisions. Retrieval is done via visual similarity. For document collections where layout carries meaning (forms, invoices, technical datasheets), this eliminates an entire class of chunking failure.
The trade-off: ColPali trades preprocessing complexity for query latency. Visual embeddings are larger than text embeddings, which matters at scale. Vespa's implementation handles billions of PDFs by indexing 128-dimensional embeddings per page tile, but the infrastructure cost is higher than a traditional text index.
For most teams, the right starting point is not ColPali — it is structured text extraction with explicit table handling (e.g. pdfplumber, which builds table detection on top of pdfminer), falling back to vision extraction for pages where text extraction produces garbage.
Hallucination on Degraded Inputs Is Unique to Vision
Text-only LLMs hallucinate. But multimodal models hallucinate in a qualitatively different way when given low-quality visual input: they produce confident, specific, plausible-sounding extractions from images they effectively cannot read.
A grainy scan of a receipt will produce a confident dollar amount. A low-resolution photo of a whiteboard will return a transcription with plausible words that were not on the board. The model is not guessing randomly — it is doing something worse: pattern-completing from visual priors toward whatever is most plausible given the context.
This happens because models are trained with next-token objectives that reward confident outputs. The training signal does not distinguish "confidently correct" from "confidently plausible-but-wrong." On clean images, these are nearly the same thing. On degraded images, they diverge badly.
Production mitigations:
- Image quality gating: Run a quality classifier before sending to the vision LLM. BRISQUE (Blind/Referenceless Image Spatial Quality Evaluator) scores correlate with model accuracy on document images. Route below-threshold images to human review.
- Confidence extraction: Ask the model to return a confidence score alongside the extraction. This is imperfect but statistically useful — models are better calibrated at the distribution level than at individual predictions.
- Cross-modal consistency checks: For critical fields, extract using both a vision model and a text-based OCR, then flag mismatches. Agreement between approaches raises confidence; disagreement triggers review.
- Schema enforcement with validation: Use structured output and run type/range validation on extracted fields. A date that parses as `9999-01-01` or a total that is negative should route to verification automatically.
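The schema-enforcement check is the cheapest of these mitigations to implement. A minimal sketch, using an illustrative invoice schema (the field names and date range are assumptions, not a standard):

```python
from datetime import date

def validate_invoice(fields: dict) -> list[str]:
    """Type/range checks on extracted fields. Any returned problem
    is a signal to route the document to human verification rather
    than trusting a confident-looking extraction."""
    problems = []
    try:
        d = date.fromisoformat(fields.get("invoice_date", ""))
        # Illustrative plausibility window: no pre-2000 or future dates.
        if not (date(2000, 1, 1) <= d <= date.today()):
            problems.append(f"invoice_date out of range: {d}")
    except ValueError:
        problems.append("invoice_date is not a valid ISO date")
    total = fields.get("total")
    if not isinstance(total, (int, float)) or total < 0:
        problems.append(f"total is negative or non-numeric: {total!r}")
    return problems

# A hallucinated date and a negative total both get flagged:
print(validate_invoice({"invoice_date": "9999-01-01", "total": -3.5}))
```

Note that `9999-01-01` parses as a perfectly valid date; only the range check catches it, which is why type validation alone is not enough.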
The 73% production failure rate cited in surveys of multimodal RAG deployments is largely attributed to exactly this problem — teams building on benchmark accuracy numbers without building the validation infrastructure that catches the long tail of degraded inputs.
Architecture Patterns That Actually Work in Production
Beyond the specific failure modes, there are structural decisions that distinguish stable production multimodal systems from fragile ones.
Separate the extraction pipeline from the reasoning pipeline. Extraction — turning an image or document into structured data — and reasoning — answering questions about that data — should be distinct stages with their own evaluation and error handling. Conflating them into a single LLM call makes it impossible to debug which stage failed.
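The separation can be made concrete with two distinct stages, each with its own error handling, so a bad answer is always attributable to one stage. The `extract` and `reason` callables below are stand-ins for your model calls, not a specific API:

```python
from typing import Callable

def run_pipeline(image: bytes,
                 extract: Callable[[bytes], dict],
                 reason: Callable[[dict], str]) -> dict:
    """Two-stage pipeline: extraction produces structured data that
    can be logged and validated before any reasoning happens. The
    returned 'stage' field tells you exactly where a failure occurred."""
    try:
        structured = extract(image)
    except Exception as e:
        return {"stage": "extraction", "error": str(e)}
    try:
        answer = reason(structured)
    except Exception as e:
        # Keep the extracted data so the failure is reproducible.
        return {"stage": "reasoning", "error": str(e), "extracted": structured}
    return {"stage": "done", "extracted": structured, "answer": answer}
```

Because the intermediate structured output is persisted, you can replay reasoning against historical extractions when you swap models — something a single conflated LLM call never allows.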
Instrument modality-specific metrics. Text-only LLM pipelines track token counts, latency, and hallucination rates. Multimodal pipelines need additional instrumentation: image resolution distribution (are you sending unnecessarily large images?), OCR confidence histograms, extraction accuracy broken down by document type and image quality. Without these, you are flying blind on the most common failure modes.
Version your preprocessing separately from your prompts. Changing how you preprocess images or documents is just as impactful as changing a prompt. Teams that do not version these changes independently cannot distinguish "the model changed behavior" from "our preprocessing changed the inputs." This matters most when you upgrade a vision model and need to re-evaluate.
Test with adversarial document inputs, not just representative ones. Real-world document sets contain outliers: documents photocopied three times, phone photos taken at an angle, PDFs generated from scanned faxes. Build an adversarial test set explicitly. If your model has never seen a 45-degree rotated receipt, it will encounter one in production.
Where This Is Going
The separation between "document processing" and "vision LLM" is collapsing. Lightweight specialized models like SmolDocling (256M parameters) can parse entire document pages into structured markup with 20× token compression compared to raw vision — sitting in the preprocessing layer rather than replacing the reasoning LLM. ColPali and its variants make visual retrieval practical without OCR infrastructure.
The direction for most production systems is toward composable pipelines: a lightweight classifier routes documents, a specialized extraction model handles the modality-specific work, and a general-purpose LLM handles reasoning over the structured output. This is more complex to build than "pass image to LLM," but it is also far more debuggable, cheaper at scale, and recoverable when any single component fails.
The core insight from teams running these systems at scale: multimodal LLMs are powerful for the tasks text cannot handle, but they require the same engineering discipline that text-only systems needed — evaluation before deployment, modality-aware instrumentation, and explicit failure mode testing. The multimodal models have matured. The operational practices are still catching up.
- https://engineering.grab.com/custom-vision-llm-at-grab
- https://tableflow.com/blog/ocr-vs-llms
- https://www.vellum.ai/blog/document-data-extraction-llms-vs-ocrs
- https://arxiv.org/abs/2407.01449
- https://huggingface.co/blog/manu/colpali
- https://arxiv.org/pdf/2506.16035
- https://www.bentoml.com/blog/multimodal-ai-a-guide-to-open-source-vision-language-models
- https://galileo.ai/blog/survey-of-hallucinations-in-multimodal-models
- https://www.zenml.io/blog/what-1200-production-deployments-reveal-about-llmops-in-2025
- https://arxiv.org/html/2412.00556v2
- https://medium.com/@rajeev_ratan/how-llms-see-images-and-what-it-really-costs-you-d982ab8e67ed
- https://arxiv.org/pdf/2404.18930
