Multimodal LLM Inputs in Production: Vision, Documents, and the Failure Modes Nobody Warns You About
Adding vision to an LLM application looks deceptively simple. You swap a text model for a multimodal one, pass in an image alongside your prompt, and the demo works brilliantly. Then you push to production and discover that half your invoices get the total wrong, tables in PDFs lose their structure, and low-quality scans produce confident hallucinations. The debugging is harder than anything you faced with text-only systems, because the failures are visual and the LLM will not tell you it cannot see clearly.
This post covers what actually goes wrong when you move multimodal LLM inputs from prototype to production, and the architectural decisions that prevent those failures.
Vision Tokens Are Not Free — And They Scale Quadratically
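The first production surprise is cost. Vision tokens can be priced 2–5× higher than text tokens, depending on the provider, and the number of tokens consumed grows with pixel area rather than edge length: doubling an image's width and height roughly quadruples the token count, so cost scales quadratically with resolution, not linearly.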
GPT-4o's high-detail mode works as follows: the image is first scaled to fit within 2048×2048 pixels, then scaled down so that its shortest side is 768 pixels, and the result is tiled into 512×512 pixel chunks. Each tile costs 170 tokens, plus an 85-token base. A 1024×1024 image is resized to 768×768 and produces four tiles: 4 × 170 + 85 = 765 tokens for a single image. Scale that to a document pipeline processing 50,000 invoices per day and the numbers become significant before you have even written a single prompt.
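The tiling arithmetic above can be sketched as a small estimator. This is a rough illustration rather than an official calculator: the function name is made up for this post, the two resize steps (fit within 2048×2048, then shortest side down to 768 pixels) follow OpenAI's published description of high-detail mode, and the choice never to upscale small images is an assumption.

```python
import math


def gpt4o_high_detail_tokens(width: int, height: int) -> int:
    """Estimate high-detail image tokens (hypothetical helper, not an SDK call)."""
    # Step 1: shrink to fit inside a 2048x2048 square, keeping aspect ratio.
    scale = min(1.0, 2048 / max(width, height))
    w, h = round(width * scale), round(height * scale)

    # Step 2: shrink so the shortest side is at most 768 px.
    # Assumption: images already smaller than this are not upscaled.
    scale = min(1.0, 768 / min(w, h))
    w, h = round(w * scale), round(h * scale)

    # Step 3: count 512x512 tiles; 170 tokens per tile plus an 85-token base.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 170 * tiles + 85


if __name__ == "__main__":
    per_image = gpt4o_high_detail_tokens(1024, 1024)
    print(per_image)            # tokens for one 1024x1024 image
    print(per_image * 50_000)   # daily image tokens at 50,000 invoices/day
```

Assuming one 1024×1024 image per invoice, 50,000 invoices a day works out to 765 × 50,000 = 38,250,000 image tokens daily, before any prompt text is counted.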
Claude's image budget follows the same area-driven logic, though not the same tiling scheme: images whose longest edge exceeds 1568 pixels are downscaled before processing, and Anthropic's documentation gives an approximate cost of (width × height) / 750 tokens. The recommendation from providers to "use the lowest resolution that works for your task" sounds obvious until you realize that "lowest resolution that works" requires empirical testing per document type, not a single global setting.
The optimization path involves:
- Downscaling images to provider maximums (1568px for Claude, 2048px for GPT-4o) before sending — most production images from phone cameras arrive at 4032×3024
- Using provider-specific low-detail modes for tasks that only need gross visual features (classification, routing)
- Caching images when the same document is queried multiple times: prompt caching for images can cut the cost of repeated inputs by up to about 90%, depending on the provider's cache pricing
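The downscaling step in the list above comes down to computing target dimensions that respect a provider's longest-edge limit. The sketch below is illustrative only: `fit_longest_edge` and `PROVIDER_MAX_EDGE` are hypothetical names, and the limits are the ones quoted in this post, which you should confirm against current provider documentation.

```python
# Longest-edge limits as quoted in this post (assumed, verify before relying on them).
PROVIDER_MAX_EDGE = {"claude": 1568, "gpt-4o": 2048}


def fit_longest_edge(width: int, height: int, max_edge: int) -> tuple[int, int]:
    """Return dimensions scaled so the longest edge is at most max_edge."""
    longest = max(width, height)
    if longest <= max_edge:
        return width, height  # already small enough; never upscale
    scale = max_edge / longest
    # Round to whole pixels and keep each side at least 1 px.
    return max(1, round(width * scale)), max(1, round(height * scale))


if __name__ == "__main__":
    # A typical phone photo arriving at 4032x3024.
    for provider, limit in PROVIDER_MAX_EDGE.items():
        print(provider, fit_longest_edge(4032, 3024, limit))
```

The function only computes the target size. The actual resampling would be done with an image library of your choice before the request is sent, so the provider never sees the full-resolution original.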
What many teams miss: vision encoder latency dominates at high resolutions. When you send a 4K image, the bottleneck is not LLM inference — it is the vision encoder processing time, which increases your time-to-first-token substantially. Latency and cost move together here.
The OCR vs. Native Vision Decision Is Not Binary
The common framing is: "Should I use OCR, or just pass the image to the LLM directly?" That framing leads to the wrong architecture. The correct question is: "What is the failure mode I am trying to prevent, and which approach avoids it?"
Native vision excels when:
- Document templates vary widely (invoices from 200 different vendors, each with a different layout)
- The task requires semantic reasoning alongside extraction ("what is the primary line item?")
- Documents contain handwriting or mixed scripts
- You need to extract information from charts, diagrams, or images embedded within the document
Traditional OCR wins when:
- You need deterministic, reproducible output (financial records, compliance documents)
- The text is clean and well-structured (machine-printed forms)
- You need character-level accuracy above 97%
- Cost is the primary constraint and the document format is stable
The practical trap: teams pick native vision because it handles layout variation, then discover their model hallucinates on the 5% of documents with degraded scan quality, producing confident but wrong extractions. The LLM does not say "I cannot read this clearly." It invents a plausible value.
The emerging production pattern is a two-stage hybrid. A lightweight quality classifier routes documents: clean, high-quality scans go to a fast OCR pipeline, while complex or degraded documents go to the vision LLM. Grab's engineering team took this direction when building their document processing system (Documint). They found that proprietary LLMs alone were too slow, too expensive, and prone to hallucination on Southeast Asian language documents (Thai, Vietnamese) without specialized training data. Their custom vision LLM, built on Qwen2-VL and fine-tuned with synthetic OCR data, achieved a 50% speedup over the base model while maintaining accuracy on non-Latin scripts.
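A minimal sketch of the routing stage might look like the following. Everything here is hypothetical: the signal names, the 0.8 threshold, and the assumption that an upstream classifier produces a scan-quality score between 0 and 1. It is not Grab's implementation, only an illustration of the two-stage idea.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    OCR = "ocr"
    VISION_LLM = "vision_llm"


@dataclass
class DocumentSignals:
    scan_quality: float    # 0.0 (unreadable) to 1.0 (clean), from an assumed classifier
    known_template: bool   # layout matches a format the OCR pipeline already handles
    has_handwriting: bool


def route_document(doc: DocumentSignals, quality_threshold: float = 0.8) -> Route:
    """Send clean, stable-format documents to OCR and everything else to the vision LLM."""
    is_clean = doc.scan_quality >= quality_threshold
    if is_clean and doc.known_template and not doc.has_handwriting:
        return Route.OCR
    # Degraded scans, unfamiliar layouts, and handwriting take the slower path.
    return Route.VISION_LLM


if __name__ == "__main__":
    print(route_document(DocumentSignals(0.95, True, False)))
    print(route_document(DocumentSignals(0.40, True, False)))
```

The threshold is the part that needs empirical tuning per document type. Routing a degraded scan to the vision LLM does not remove the hallucination risk described above, so low-quality documents may still warrant a human review queue.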
The lesson: benchmark accuracy scores from providers are measured on clean benchmark datasets. Your documents are not clean benchmark datasets.
PDF Tables Break Most Document Pipelines
- https://engineering.grab.com/custom-vision-llm-at-grab
- https://tableflow.com/blog/ocr-vs-llms
- https://www.vellum.ai/blog/document-data-extraction-llms-vs-ocrs
- https://arxiv.org/abs/2407.01449
- https://huggingface.co/blog/manu/colpali
- https://arxiv.org/pdf/2506.16035
- https://www.bentoml.com/blog/multimodal-ai-a-guide-to-open-source-vision-language-models
- https://galileo.ai/blog/survey-of-hallucinations-in-multimodal-models
- https://www.zenml.io/blog/what-1200-production-deployments-reveal-about-llmops-in-2025
- https://arxiv.org/html/2412.00556v2
- https://medium.com/@rajeev_ratan/how-llms-see-images-and-what-it-really-costs-you-d982ab8e67ed
- https://arxiv.org/pdf/2404.18930
