Skip to main content

Multimodal Pipelines in Production: What Breaks When You Go Beyond Text

· 11 min read
Tian Pan
Software Engineer

Most LLM engineering wisdom — caching prompts, tuning temperature, budgeting tokens — assumes text goes in and text comes out. Add an image, a PDF, or an audio clip and almost none of that wisdom transfers. The preprocessing is different. The failure modes are different. The cost model is different. And the eval suite you built for your text pipeline won't catch the new things that break.

About 50% of enterprise knowledge lives in non-text formats: PDFs, slides, scanned forms, product images. Teams that reach that data discover that going multimodal isn't just adding a modality — it's adding an entirely new engineering surface.

The Benchmark Gap Is Larger Than You Think

Before talking about what breaks in production, it helps to understand how optimistic your benchmarks are. A model that reports 97% OCR accuracy on a clean benchmark dataset routinely hits 75–80% on financial documents in production. On complex financial reports, edit distance errors are nearly double what benchmark numbers suggest.

The cause is distribution mismatch. Benchmark datasets are curated, clean, and representative of problems the model was trained on. Your production data is none of those things. It includes multi-column layouts, scanned documents with skew and noise, tables with merged cells, compressed images that passed through a CDN. The gap between benchmark and production isn't a model quality problem — it's a data quality problem your eval suite hasn't been designed to surface.

One practical calibration: if a model achieves X% on a standard benchmark for your task, assume 10–20 percentage points lower in production until you have evidence otherwise.

Image Inputs: Token Economics and Silent Failures

The first surprise when wiring images into a pipeline is that the cost model is nothing like text. Image cost is proportional to pixel area. A 1000×1000 image costs roughly 1,334 tokens. A 200×200 image costs about 54. The 25x size difference produces a 25x token cost difference — but doesn't necessarily produce better results.

Several hard limits cause silent failures that are difficult to debug:

  • Images larger than the API provider's maximum dimension are rejected outright, not resized
  • Sending more than 20 images in a single request imposes stricter per-image size limits
  • Payloads exceeding request size limits (commonly 32 MB) get rejected before any model inference runs

None of these failures look like model errors. They look like network errors, timeout errors, or empty responses depending on how the client handles them.

A less obvious performance issue: images wider than about 1,568 pixels get auto-downsampled server-side before inference. You pay the full token count of the original image and absorb the TTFT penalty of the resize, but the model sees the smaller version. Pre-resizing images to a sensible maximum before sending them costs nothing and removes both the latency and the billing inefficiency.

JPEG compression is a related trap. Compression artifacts that are invisible to a human reader — JPEG quality 60, a ratio above 10x — can collapse OCR accuracy from above 90% to below 60%. CDNs and upload pipelines frequently apply this compression automatically. If your images pass through any intermediate storage or delivery layer, verify that compression settings aren't silently degrading the inputs your pipeline depends on.

PDF Extraction: The Dominant Source of Production Outages

PDF is the highest-volume multimodal input in most enterprise settings, and it causes the most production failures. The core problem is that "PDF" describes a file format, not a content type. A single document pipeline has to handle:

  • Searchable PDFs with clean text layers
  • Scanned documents with OCR overlays of varying quality
  • Multi-column academic papers where linear text extraction jumbles the reading order
  • Dense financial reports with tables, footnotes, and sidebars that interleave with the main text

No single extraction approach handles all of these correctly. Sending every PDF through the same pipeline is the most common architectural mistake.

Table extraction deserves particular attention because tables appear frequently in the high-value documents (financial reports, contracts, research papers) that teams most want to process. Common failure modes in production:

  • Column misalignment: text from separate columns merges, producing incoherent context
  • Cell relationship loss: merged cells and spanning headers aren't preserved; what was a structured table becomes unordered text
  • Reading order disruption: footnotes and sidebars appear mid-paragraph
  • Formula corruption: mathematical notation becomes garbled tokens

About 30% of production PDF extraction failures trace to table misalignment specifically. The fix is rarely a better extractor — it's detecting document type before choosing the extraction strategy. A lightweight classifier that separates simple text PDFs, complex-layout PDFs, and scanned images takes milliseconds and routes each to the appropriate tool.

A practical insight from teams that have solved this: preserving whitespace and tab spacing in the extracted text often works better than reconstructing formal HTML table markup. The model can comprehend spatial layout when it's represented in the text; elaborate DOM structures add complexity without proportional gain.

Edge case distribution compounds the routing problem. Typically 15–20% of a production document corpus causes 60–80% of extraction failures. Those edge cases tend to be structurally distinct from the common case — unusual layouts, non-standard fonts, degraded scans. Building an accurate eval for them requires real production samples, not synthetic or benchmark data. About 500–1,000 targeted annotations on edge case examples resolve the majority of production failures faster than any generic improvement to the extraction stack.

Audio Pipelines: The 300ms Budget

Voice pipelines have a latency constraint that doesn't exist for text or images: conversational responses need to begin within about one second of the user stopping speaking, or the experience feels broken.

The sequential pipeline architecture — transcribe the audio, send text to the LLM, synthesize the response — introduces 2–4 seconds of delay. That's disqualifying for interactive use. The fix is streaming at every stage: begin transcription while audio is still arriving, begin LLM generation as soon as the first words are transcribed, begin TTS synthesis as the first LLM tokens arrive.

Even with full streaming, the budget is tight. A rough breakdown:

  • Audio transport: under 50ms
  • Speech-to-text first partial result: 100–200ms
  • LLM time-to-first-token: 200–400ms
  • TTS time-to-first-audio: 100–300ms

Miss any stage and the accumulated latency exceeds the threshold. Monitoring each stage separately matters — a regression in STT latency is invisible in end-to-end metrics until it consistently pushes the total over the threshold.

Two preprocessing steps that are easy to skip but have high ROI: voice activity detection (VAD) strips silence before the audio reaches billing, reducing costs 30–40% for call recordings with natural pauses. Model-based turn detection, rather than VAD-only turn detection, handles the common case where speakers pause mid-thought without triggering a premature LLM response.

Cross-Modal Hallucination: A Failure Mode That Doesn't Exist in Text Pipelines

Text-only LLMs hallucinate by generating plausible but unsupported text. Multimodal models have an additional failure mode: generating descriptions of images based on what the model expects to see rather than what's actually there.

The mechanism is language prior dominance. When visual evidence is ambiguous — low resolution, heavy compression, low contrast — the model falls back to generating statistically likely descriptions from its training distribution. A medical image with unusual presentation might receive a description that fits common pathology even when the actual image shows something different. A product image with heavy JPEG artifacts might have its color or text description filled in from learned associations.

Four distinct hallucination types appear in multimodal models that don't occur in text pipelines:

  • Object hallucination: describing elements not present in the image
  • Attribute hallucination: correct object, wrong properties (color, size, count)
  • Relational hallucination: spatial relationships inverted or fabricated
  • Fabricated descriptions: invented content with no visual grounding

A practical diagnostic is the blank-drop test: measure model accuracy with the image present vs. with a blank image substituted. A model with low accuracy degradation on blank drop is relying on language priors rather than visual evidence. Better images won't fix this; it requires model-level intervention.

Security: Every Modality Expands the Injection Surface

Text-only pipelines have prompt injection as a security concern. Multimodal pipelines have it across every modality.

Adversarial instructions embedded in images — invisible to human reviewers but interpreted by the model — have been demonstrated against every major vision-language model. The attack succeeds even under stealth constraints that limit how conspicuous the injected content is.

Audio injection is a newer and more operationally concerning variant. A sub-second adversarial audio burst prepended to legitimate input causes the transcription model to treat the legitimate content as secondary or end-of-input. Success rates above 85% have been demonstrated against current production models, including in physical-world conditions where the audio is played through a speaker in a room rather than injected directly into the audio stream.

The mitigations worth implementing in production:

  • Re-encode images before processing — this strips adversarial pixel-level patterns
  • Run a dual-model architecture for sensitive operations: one model extracts content from the multimodal input, a separate model acts on the extracted text without access to the original input
  • Treat every user-supplied image or audio file as untrusted input, not as a passive data source

Serving Architecture: Modality Interference

When text-only and image-text requests share the same inference workers, they interfere. The visual encoder preprocessing that happens before LLM inference introduces variable overhead that predictably causes latency spikes for co-located text requests. This interference pattern doesn't appear in benchmarks that run each modality type in isolation.

The recommended pattern for production is separate worker pools per modality type with modality-aware autoscaling. The scaling signals are different: text workloads scale on request count and tokens per second; image workloads scale on pixel throughput and encoder compute. Mixing these produces autoscaling rules that respond correctly to neither.

Production multimodal workloads also exhibit heavy-tailed distributions — a small number of very large documents or very long audio files dominate compute consumption. A cost attribution system that tracks per-request compute by modality type rather than just tokens makes this visible and allows capacity planning to account for it.

Building an Eval Suite That Actually Catches Multimodal Failures

Standard text evals miss entire categories of multimodal failure. An eval suite for a multimodal pipeline needs three layers:

Offline testing with production samples. Benchmark datasets don't represent your data. Curate a test set from real production inputs, specifically including documents and images from the edge case population that causes the majority of failures. Domain-specific metrics (line-item reconciliation accuracy for financial documents, attribute extraction precision for product catalogs) tell you more than generic accuracy scores.

Online monitoring with targeted sampling. Running a judge on every request is prohibitively expensive for multimodal inputs. Adaptive sampling — selecting requests based on modality type, confidence signals, or output anomalies — gives coverage at manageable cost. Visual reliance score (comparing performance on correct vs. mismatched image-question pairs) detects language prior dominance without requiring labeled ground truth.

Regression control from production failures. Every production failure that reaches a user should become a fixed test case. This is more important for multimodal pipelines than text pipelines because failure modes are harder to anticipate and the distribution of inputs is wider. Treat the eval suite as production infrastructure, not a launch checkpoint.

The Pattern That Holds

Teams that have stabilized multimodal production pipelines converge on a few common decisions: they classify input complexity before routing to extraction strategies, they run input validation (resolution checks, format validation, size limits) before any LLM call, they separate serving infrastructure by modality type, and they build evals from production data rather than benchmarks.

The underlying principle is that multimodal inputs are not interchangeable. Text tokens have predictable cost, predictable quality, and predictable failure modes. Image tokens, audio tokens, and PDF pages have none of these properties without explicit preprocessing and routing logic in front of them. The teams that treat multimodal as a drop-in extension of their text pipeline pay for that assumption in production.

References:Let's stay in touch and Follow me for more thoughts and updates