Vision Inputs in Production AI Pipelines: The Preprocessing Decisions Nobody Documents

· 10 min read
Tian Pan
Software Engineer

Your vision model benchmarks 90%+ on your eval suite. Then real users upload photos of physical documents, screenshots from low-DPI monitors, and scanned PDFs that have been round-tripped through three fax machines. Accuracy craters. The model "works" — it returns coherent responses — but the responses are wrong in ways that are hard to catch without knowing the ground truth. You file it under "model limitations" and move on.

The model probably isn't the problem. The input pipeline is.

Most teams building with vision LLMs spend enormous effort on prompt engineering and model selection, and nearly zero effort on the preprocessing that happens before the image ever reaches the model. That asymmetry is where production quality goes to die. The preprocessing decisions nobody documents are also the ones responsible for the biggest silent accuracy drops in production multimodal systems.

Resolution: The Silent Token Budget Trade-Off

Every vision model provider has a resolution budget, and they all handle it differently. Claude downscales images that exceed 1.15 megapixels, or 1,568 pixels on their long edge. GPT-4o fits the image into a 2048×2048 bounding box, scales the shortest side to 768px, then tiles into 512×512 chunks. Gemini has its own tiling pipeline. The same 2048×2048 image costs approximately 1,590 tokens on Claude, 1,105 tokens on GPT-4o (high-detail mode), and roughly 280 tokens on Gemini — hiding a 5× cost spread inside an implementation detail.

The cost spread matters less than what happens to the pixels. When a provider silently downscales your input, you have no visibility into what was discarded. For a photo of a business card, downscaling might eliminate the phone number. For a medical scan, it might smooth out the finding you needed the model to detect. Claude's API documentation notes that images under 200px on any edge "may degrade output quality" — but the failure mode is not an error, it's a plausible-sounding but incorrect response.

The MME-RealWorld benchmark (ICLR 2025) used 13,366 real-world images averaging 2,000×1,500px and found that even the best-performing models topped out below 60% accuracy on high-resolution real-world scenarios. The same models score near-saturation on traditional benchmarks. The gap is not primarily model capability — it's the mismatch between the distribution of clean benchmark images and the distribution of images users actually produce.

What to do: Pre-resize images yourself before passing them to the API. Set an explicit target megapixel ceiling, preserve aspect ratio with letterboxing (padding), and log the original and output dimensions. When you control the resize, you know what was discarded. When the provider does it, you don't.
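
The dimension math for a letterboxed resize is simple enough to own outright. Here is a minimal sketch for a square target; the function name, the target size, and the no-upscaling policy are illustrative choices, and the actual pixel resampling would be done by whatever imaging library you already use (Pillow, OpenCV):

```python
def letterbox_plan(width, height, target=1024):
    """Compute resize dimensions and padding offsets for letterboxing an
    image into a target x target canvas, preserving aspect ratio.
    Never upscales: small images are padded, not blown up."""
    scale = min(target / width, target / height, 1.0)
    new_w, new_h = round(width * scale), round(height * scale)
    # Center the resized image on the padded canvas.
    pad_x = (target - new_w) // 2
    pad_y = (target - new_h) // 2
    return {"resize": (new_w, new_h),
            "paste_at": (pad_x, pad_y),
            "canvas": (target, target)}
```

Because the plan is computed before any pixels move, logging the original dimensions next to the plan gives you exactly the record of what was discarded that the provider's silent downscale denies you.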

Compression Artifacts: Not the Problem You Think, Plus the One You're Missing

JPEG compression is the first thing teams investigate when they see visual quality issues, and it's usually not the main culprit. Vision Transformers maintain above 98% classification accuracy at JPEG quality factor 85, and the degradation at quality factor 80 is at most 1.22 percentage points on standard classification benchmarks. JPEG compression, within normal web quality settings, is a second-order concern.

The real problem is blur, and specifically motion blur from mobile camera uploads. Deep neural networks are substantially more sensitive to blur and noise than to JPEG compression artifacts or contrast changes — a result that generalizes from image classifiers to vision language models. When a user photographs a physical document with a smartphone in motion, the resulting blur is more damaging than any compression the upload pipeline might apply.

The second compression problem is different in character: compression cliffs in OCR workloads. OCR systems that achieve 97% character-level accuracy at 10× compression ratios can drop to 60% accuracy at 20× compression. This is not a gradual degradation — it's a threshold effect. A pipeline that performs fine on web screenshots and scanned PDFs can fail catastrophically on images captured from printed output that was itself generated from a lossy source.

What to do: Log a blur metric (Laplacian variance is a cheap and effective proxy) on every image your pipeline processes. Set a threshold below which you either reject the input, request a retake, or downgrade your confidence in the model output. Blur is measurable and cheap to compute; not measuring it means you're flying blind on the single biggest input quality degrader.
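
In production you would typically compute this with OpenCV as the variance of cv2.Laplacian over the grayscale image; the dependency-free sketch below shows the same idea on a plain 2D list of pixel intensities, and the threshold value is illustrative, to be calibrated on your own image corpus:

```python
def laplacian_variance(gray):
    """Variance of the 4-neighbor Laplacian over a grayscale image
    (a list of rows of pixel intensities). Low variance indicates that
    the image has few sharp edges, i.e. it is likely blurry."""
    h, w = len(gray), len(gray[0])
    vals = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lap = (gray[y - 1][x] + gray[y + 1][x]
                   + gray[y][x - 1] + gray[y][x + 1]
                   - 4 * gray[y][x])
            vals.append(lap)
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)

BLUR_THRESHOLD = 100.0  # illustrative; calibrate against known-good images

def is_too_blurry(gray):
    return laplacian_variance(gray) < BLUR_THRESHOLD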

Aspect Ratio Handling: The Mismatch That Accumulates Silently

The standard advice is to letterbox (pad) rather than stretch images to the target dimensions. The reasoning is correct — stretching introduces systematic shape distortion that affects all tasks requiring proportional reasoning. But the real-world failure mode is subtler than stretch-vs-letterbox.

The problem is consistency. If your training or fine-tuning data was preprocessed with one aspect-ratio strategy, and your inference pipeline uses a different one, the model sees systematically out-of-distribution inputs at serve time. This is easy to introduce accidentally: the fine-tuning data was preprocessed by a data science team using a library that defaults to stretching; the inference pipeline was written by an engineering team that independently chose to letterbox. Both choices are reasonable in isolation. The combination is a silent regression.

GPT-4o's tiling pipeline introduces a related problem: unusual aspect ratios can split semantically important regions across tile boundaries. A wide screenshot of a data table might get cut horizontally through the middle of the table header. The model receives two tiles that each look coherent but lack the structural context that made the original image interpretable.

What to do: Document your aspect-ratio handling decision once and enforce it consistently across training, evaluation, and inference. Treat it as an API contract: aspect_ratio_strategy: letterbox | pad_color: white | target: 1024x1024. If you're using a provider's tiling pipeline and you can't control tile boundaries, be aware that wide panoramic images and tall document scans are the highest-risk aspect ratios.
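
One way to make that contract enforceable rather than merely documented is a single config object that both the training and inference codebases import; the class and field names below are illustrative, but the pattern of failing loudly on drift is the point:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PreprocessContract:
    """Single source of truth shared by training and inference pipelines."""
    aspect_ratio_strategy: str = "letterbox"  # never mix with "stretch"
    pad_color: str = "white"
    target: tuple = (1024, 1024)

    def assert_matches(self, other):
        # Fail loudly if the two pipelines have drifted apart.
        if self != other:
            raise ValueError(f"preprocessing mismatch: {self} vs {other}; "
                             "training and inference must share one contract")

TRAINING_CONTRACT = PreprocessContract()
INFERENCE_CONTRACT = PreprocessContract()
TRAINING_CONTRACT.assert_matches(INFERENCE_CONTRACT)
```

The frozen dataclass makes the contract hashable and immutable, so it can be logged per request and compared across deployments without ambiguity.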

OCR Pre-Processing: Decoupling Text Failure from Visual Failure

The largest category of production vision failures is text extraction from document images. The failure mode is particularly expensive because the model response usually looks correct — grammatically coherent, formatted appropriately — while containing wrong numbers or misread characters that only surface downstream.

Reliable OCR requires a minimum effective resolution of 300 DPI. Mobile camera uploads of physical documents frequently come in at 72–150 effective DPI. Below this threshold, character-level recognition becomes unreliable regardless of the model's capability. The failure is not that the model lacks OCR capability — it's that the input doesn't contain enough information to recover the text.
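
Effective DPI is just pixel width divided by the physical width of the document, which means the viability check can run before any OCR does. A sketch, where the assumption that the document fills the frame and the 8.5-inch US-letter default are both simplifications you would adjust per document type:

```python
MIN_OCR_DPI = 300  # commonly cited floor for reliable character recognition

def effective_dpi(pixel_width, physical_width_inches=8.5):
    """Approximate effective DPI of a document photo, assuming the page
    fills the frame and physical_width_inches matches the real page
    (8.5 inches = US letter width)."""
    return pixel_width / physical_width_inches

def ocr_viable(pixel_width, physical_width_inches=8.5):
    return effective_dpi(pixel_width, physical_width_inches) >= MIN_OCR_DPI
```

A 1200px-wide photo of a letter-size page works out to roughly 141 effective DPI, well below the floor, which is exactly the 72–150 DPI range mobile uploads land in.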

The PreP-OCR pipeline research establishes that deskewing, denoising, and contrast enhancement are prerequisite steps before passing document images to vision language models. These are not optional quality improvements; they're the difference between processable and unprocessable input. A document scanned at a 3-degree angle with uneven lighting can be correctly read after preprocessing and fail entirely without it.

The practical engineering pattern here is to decouple text recognition failure from structural understanding failure. Run a traditional OCR pass (Tesseract or a commercial service) on document images, and pass both the raw image and the extracted text to the vision model. This gives you two independent failure signals: if the OCR text is garbled, the input quality is the problem. If the OCR text is clean but the model response is wrong, the model reasoning is the problem. Without this separation, you're debugging a black box.
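
The triage logic that the two signals enable can be as small as one function; the confidence threshold and field names here are illustrative and would be calibrated against whichever OCR engine you run:

```python
def triage_failure(ocr_confidence, ocr_text_matches_model):
    """Classify a bad document-QA result using the two independent
    signals: OCR confidence and agreement between the OCR text and
    the vision model's reading."""
    if ocr_confidence < 0.60:
        return "input_quality"    # garbled OCR: the image is the problem
    if not ocr_text_matches_model:
        return "model_reasoning"  # clean OCR, divergent model: model problem
    return "consistent"           # both agree; verify against ground truth
```

Logged per request, this one label turns "the vision feature is flaky" into two separately actionable queues.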

The Eval-Deployment Gap: Why Your Test Suite Lies

The pattern behind all of the above failures is the same: evaluation data is cleaner than production data, and models learn to exploit that cleanliness.

Your eval suite was built with developer-generated screenshots, well-lit product images, and PDFs that were never printed and re-scanned. These images are sharper, better-framed, and more consistently formatted than anything a real user will upload. When the model scores 90%+ on this eval, it's demonstrating that it can handle inputs that look like your test cases. It is not demonstrating that it can handle inputs that look like a 62-year-old user photographing a utility bill under fluorescent lighting with a cracked phone screen.

The research on vision-language model robustness to image artifacts is consistent on this point: models can detect strong, obvious artifacts at low rates (detection rates around 11–19% in controlled studies), and weak artifacts — the kind that would look fine to a human reviewer — reduce accuracy by 3–10 percentage points without triggering any observable failure signal. The model doesn't say "I can't read this." It reads it wrong.

A deeper problem is that vision-language models frequently default to text-trained priors rather than visual evidence, particularly on images that resemble training data distribution. This means a model can appear to reason visually on clean eval screenshots — giving correct answers by pattern-matching to memorized content — while failing on production images that require actual visual grounding. The eval looks impressive. The deployment doesn't.

What to do: Build a shadow set of production images. Capture a sample of real user-uploaded images (with appropriate consent and privacy handling), run them through your model, and manually review a subset. Calibrate your actual production accuracy against your eval accuracy. If the gap is larger than 10 percentage points, your preprocessing pipeline needs work before your model does.

A Minimal Normalization Pipeline

The normalization pipeline that separates model failures from input failures is not complicated. It is, however, systematically skipped.

The components are:

  • Validation gate: reject images below a minimum edge length (200px is the Claude-documented floor), check for corrupt or unsupported formats, and return a structured error rather than silently passing bad input.
  • Resolution normalization: resize to your target megapixel budget, preserve aspect ratio with letterboxing, and log the original and resized dimensions per request.
  • Blur detection: compute Laplacian variance, log it, and flag requests below your threshold for downstream analysis.
  • Format standardization: convert WebP and GIF to PNG or JPEG before passing to the API; provider pipeline behavior for these formats is less well-documented and can vary.
  • OCR pre-pass for text-heavy inputs: for document processing, run a traditional OCR pass and pass both image and extracted text to the model; log OCR confidence scores alongside model outputs.
  • Observability logging: log input dimensions, blur score, OCR confidence, token cost, and model latency per request. Without this, you cannot distinguish a model regression from an input quality regression.
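
The validation gate at the top of that list is representative of how small each component is. A sketch, where the field names in the structured error are illustrative and the format whitelist would follow your provider's documented support:

```python
MIN_EDGE_PX = 200  # the Claude-documented floor discussed above
SUPPORTED_FORMATS = {"JPEG", "PNG", "WEBP", "GIF"}

def validation_gate(width, height, fmt):
    """First pipeline stage: return a structured error for bad input
    instead of silently forwarding it to the model API."""
    errors = []
    if min(width, height) < MIN_EDGE_PX:
        errors.append(f"shortest edge {min(width, height)}px is below "
                      f"the {MIN_EDGE_PX}px floor")
    if fmt.upper() not in SUPPORTED_FORMATS:
        errors.append(f"unsupported format: {fmt}")
    return {"ok": not errors, "errors": errors}
```

Every rejection this gate produces is a model call you did not pay for and a silent wrong answer you did not serve.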

None of these require custom models or significant engineering investment. Most can be implemented in a weekend. The common failure mode is not that they're hard to build — it's that teams don't build them until an incident makes the cost of not having them visible.

Conclusion

The conversation around multimodal AI engineering tends to focus on model capability: which vision model has the highest benchmark score, which architecture handles resolution best, which provider's tiling algorithm is most efficient. These are real considerations, but they're second-order compared to the preprocessing discipline upstream of the model call.

A team that builds a rigorous normalization pipeline — validating inputs, controlling resize behavior, detecting blur, decoupling OCR failures from model failures, logging quality metrics per request — will outperform a team with a better model but no pipeline hygiene. Not because their model is better, but because they know when their model is failing and why.

Your eval suite on clean screenshots is telling you how your model performs on clean screenshots. That's useful information. It is not a prediction of production accuracy. Build the pipeline, instrument it, and then you'll know what your model actually does.
