Multimodal LLMs in Production: The Cost Math Nobody Runs Upfront

· 11 min read
Tian Pan
Software Engineer

Most teams add multimodal capabilities to an existing LLM pipeline without running the cost math first. They prototype with a few test images, it works, they ship — and then the first billing cycle arrives. The number is somewhere between embarrassing and catastrophic, depending on volume.

The problem isn't that multimodal AI is expensive in principle. It's that each modality has a distinct token arithmetic that compounds in ways that text-only intuition doesn't prepare you for. A single configuration parameter — video frame rate, image resolution mode, whether you're re-sending a system prompt every turn — can silently multiply your inference bill by 10x or more before you've noticed anything is wrong.

The Image Token Arithmetic

Images are not free context. They are converted to tokens before the model sees them, and the token count depends on image dimensions, not file size.

GPT-4o's formula is the most transparent: 85 + (number_of_tiles × 170) tokens, where tiles are 512×512 pixel sections of the image after scaling. A 512×512 image costs 255 tokens. A 1024×1024 image costs 765 tokens. The scaling step matters: OpenAI first fits the image within a 2048×2048 square, then scales its shortest side down to 768px. Without that downscaling, a 4000×6000 document scan would be 96 tiles — over 16,000 tokens, roughly $0.041 at current input pricing; with it, the same scan is capped near 1,105 tokens, and the cost driver shifts from the single image to how many of them you send per request.
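The formula above can be turned into a pre-flight estimator. This is a sketch of OpenAI's documented high-detail scaling rules (fit within 2048×2048, shortest side to 768px, then 85 + 170 per 512px tile); treat the outputs as estimates, not billing-exact figures.

```python
import math

def gpt4o_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate GPT-4o image tokens from pixel dimensions.

    High-detail: fit within 2048x2048, scale shortest side down to
    768px, then charge 85 + 170 per 512px tile. Low-detail: flat 85.
    """
    if detail == "low":
        return 85
    # Fit within a 2048 x 2048 square (downscale only, never upscale).
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Scale so the shortest side is at most 768px.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles

print(gpt4o_image_tokens(512, 512))    # 255
print(gpt4o_image_tokens(1024, 1024))  # 765
print(gpt4o_image_tokens(4000, 6000))  # 1105
```

Running this over a sample of your real images, multiplied by daily volume, is the ten-minute exercise that replaces the billing surprise.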

Claude uses a different formula — approximately (width × height) / 750 tokens — and auto-scales images if the long edge exceeds 1568px or if the image would exceed ~1,600 tokens. That means most "large" images are capped, but the cap is still expensive: a typical image at maximum dimensions costs around 1,333–1,600 tokens. At Haiku pricing that's ~$0.0016 per image; at Opus it's ~$0.008.
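A comparable estimator for Claude, assuming the documented caps (long edge at 1568px and roughly 1.15 megapixels, i.e. ~1,600 tokens). The exact resampling the API applies may differ slightly, so this is an approximation:

```python
import math

def claude_image_tokens(width: int, height: int) -> int:
    """Rough Claude image token estimate: (w * h) / 750 after the API
    downscales anything over 1568px on the long edge or ~1.15 MP."""
    # Downscale if the long edge exceeds 1568px.
    scale = min(1.0, 1568 / max(width, height))
    # Downscale further if the area would exceed ~1.15 megapixels.
    area = (width * scale) * (height * scale)
    scale *= min(1.0, math.sqrt(1_150_000 / area))
    return math.ceil((width * scale) * (height * scale) / 750)

print(claude_image_tokens(1000, 1000))  # 1334
```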

Gemini's formula isn't publicly documented in the same way — providers disclose it differently — but empirically lands around 560 tokens for a standard image.

None of these numbers feel alarming until you do the volume math. 10,000 image queries per day at GPT-4o high-detail rates (averaging 765 tokens per image) = 7.65 million input tokens per day, roughly 230 million per month — approximately $574/month at $2.50 per million input tokens, just for image input, before output, before text context, before system prompts. That's assuming one image per query. Teams processing invoices, product photos, or document batches often send 3–10 images per request, and volume at launch is rarely where volume stays.

The single most impactful optimization available for images costs nothing to implement: switch simple classification or presence-detection tasks to low-detail mode (85 tokens, fixed, regardless of image dimensions). That's a 9x token reduction for any task where fine visual detail isn't necessary. Teams discover this optimization after their first unexpected bill, rarely before.

The other lever is resolution gating. Resizing images to a maximum of 512 pixels on the long edge before submission reduces the tile count, often to a single tile. JPEG compression at quality 85 reduces payload size for transmission without meaningful accuracy loss for most AI tasks. Neither requires model changes.
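A minimal sketch of the gating step. The dimension math is dependency-free; the commented lines show how the same cap is typically applied with Pillow (assumed available in your pipeline):

```python
def gated_size(width: int, height: int, max_edge: int = 512) -> tuple:
    """Dimensions after capping the long edge at max_edge (downscale only)."""
    scale = min(1.0, max_edge / max(width, height))
    return round(width * scale), round(height * scale)

# With Pillow, the same gate plus JPEG re-encode before submission:
#   from PIL import Image
#   img = Image.open(path)
#   img.thumbnail((512, 512))            # caps the long edge, keeps aspect
#   img.save(out_path, "JPEG", quality=85)

print(gated_size(4000, 6000))  # (341, 512)
```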

Audio: Cheap Until It Isn't

Audio billing looks simple on the surface: most providers charge per minute of audio processed. Whisper and GPT-4o Transcribe: $0.006/minute. Deepgram: roughly $0.0077/minute. The math feels tractable even at scale.

The trap is the realtime audio API. When you move from batch transcription to a conversational voice interface — where the LLM is responding to audio turn-by-turn — the billing model changes completely. Realtime APIs charge in audio tokens: approximately $40 per million input audio tokens and $80 per million output audio tokens. That translates to roughly $0.06/minute for audio input and $0.24/minute for audio output in a live conversation.

That's already 10x more expensive than batch transcription. Then add the system prompt. In a realtime session, most implementations re-send the full system prompt on every turn. A 1,000-word system prompt re-sent every conversation turn adds approximately $1.63/minute to cost — more than the audio itself. A full voice agent with a reasonably sized system prompt can easily run $2+/minute in inference cost alone, before platform, telephony, or TTS costs.
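The per-minute figures above can be folded into a small cost model. Every default below is an assumption to replace with your own measurements — the turn rate, prompt size, and text-token rate in particular are illustrative, not quoted prices:

```python
def realtime_minute_cost(prompt_tokens: int = 1_330,   # ~1,000-word prompt
                         turns_per_minute: int = 6,    # assumed pace
                         text_in_per_mtok: float = 5.00,  # assumed $/M text tokens
                         audio_in_per_min: float = 0.06,
                         audio_out_per_min: float = 0.24) -> float:
    """Rough $/minute for a realtime voice session that re-sends the
    full system prompt on every turn. All defaults are assumptions."""
    prompt_cost = prompt_tokens * turns_per_minute * text_in_per_mtok / 1e6
    return prompt_cost + audio_in_per_min + audio_out_per_min
```

Plugging in your real turn rate and prompt size — and adding the growing conversation history that most realtime sessions re-bill each turn — is what surfaces the $2+/minute scenarios before they reach production.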

The mitigation for system prompt cost is aggressive context caching and truncating conversation history. The mitigation for total voice pipeline cost is usually architectural: decide early whether you need realtime audio understanding (LLM hears audio natively) or whether batch STT → text → LLM is sufficient. The latter is dramatically cheaper for most use cases. A complete batch voice pipeline (STT + LLM + TTS) runs approximately $0.07–0.22/minute total. Realtime LLM audio runs $0.30–$2+/minute depending on prompt size.

Audio quality also directly affects both cost and accuracy in ways teams underestimate. Word error rate roughly doubles as signal-to-noise ratio drops from 15 dB to 5 dB. In real environments — call centers, shop floors, outdoor settings — SNR routinely falls into the 5–10 dB range. Models achieving 95% accuracy on clean audio datasets collapse to below 70% accuracy in high-noise environments.

The counterintuitive finding from speech recognition research: preprocessing pipelines designed to remove noise often make things worse. Spectral subtraction improves measured SNR by ~8 dB but can increase WER by 15% because it strips speech harmonics along with the noise. The better approach is end-to-end models trained on noisy data — which process raw audio without the preprocessing step that introduces errors while trying to help.

Video: Where the Multipliers Get Dangerous

Video is where multimodal cost arithmetic becomes genuinely dangerous, because it contains a default configuration parameter most practitioners never touch: the frame sampling rate.

Gemini's video understanding API defaults to 1 frame per second. A 60-second video at default settings = 60 frames. Each frame is approximately 258 visual tokens. Add 32 audio tokens per second. Total: 17,400 tokens for one minute of video. That's still manageable.

Now consider what happens if someone overrides the default to 24 FPS for higher accuracy on fast-motion content. Sixty seconds at 24 FPS = 1,440 frames ≈ 371,500 visual tokens, about 373,400 in total with audio. The cost difference: 21.5x, from a single parameter change. At, say, $2.50 per million input tokens and 100 videos per day, that's the difference between roughly $4/day and $93/day — on the order of $32k/year from one setting, and proportionally worse at higher volumes or longer videos.
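The arithmetic above reduces to one function, using the per-frame and per-second figures from the text:

```python
def gemini_video_tokens(duration_s: int, fps: float = 1.0,
                        tokens_per_frame: int = 258,
                        audio_tokens_per_s: int = 32) -> int:
    """Token count for video input at a given frame sampling rate."""
    frames = int(duration_s * fps)
    return frames * tokens_per_frame + duration_s * audio_tokens_per_s

print(gemini_video_tokens(60, fps=1))    # 17400
print(gemini_video_tokens(60, fps=24))   # 373440
print(gemini_video_tokens(60, fps=0.1))  # 3468
```

Note the third line: dropping to 0.1 FPS for static content is a 5x saving in the other direction.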

The correct frame rate depends entirely on content type. For lecture recordings, product demos, or anything with a static camera and slow-moving content, 0.1 FPS often captures all the semantically unique frames. For surgical videos or sports analysis, 1 FPS is genuinely insufficient. The cost-optimal architecture applies content-aware frame selection before submission: extract at base rate, apply perceptual hashing to deduplicate near-identical frames, apply scene detection to identify unique segments, then submit only keyframes. This approach achieves 13–45% token reduction depending on video type, with combined deduplication plus scene detection cutting costs by up to 83% on content-dense demo videos.
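The deduplication step can be sketched with a simple average hash. This assumes frames have already been decoded and downsampled to small grayscale grids (e.g., 8×8 via Pillow); production pipelines typically use a library such as imagehash plus a scene detector, but the mechanism is the same:

```python
def ahash(frame):
    """Average hash of a small grayscale frame (2D list of ints):
    one bit per pixel, set if the pixel is brighter than the mean."""
    pixels = [p for row in frame for p in row]
    mean = sum(pixels) / len(pixels)
    return tuple(p > mean for p in pixels)

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return sum(x != y for x, y in zip(a, b))

def select_keyframes(frames, threshold=2):
    """Indices of frames that differ perceptually from the last kept one."""
    kept, last = [], None
    for i, frame in enumerate(frames):
        h = ahash(frame)
        if last is None or hamming(h, last) > threshold:
            kept.append(i)
            last = h
    return kept
```

On a lecture recording, consecutive frames hash identically for long stretches, which is exactly where the large token reductions come from.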

Context caching adds another major lever for video. Gemini's implicit caching (active by default since mid-2025) gives a 90% discount on cached tokens for repeated analysis of the same video. A 5-minute video costs approximately $0.18 to analyze the first time; subsequent queries against the same cached video cost ~$0.018. For workflows analyzing the same video repeatedly — QA, medical review, content moderation — this changes the economics entirely.

Failure Modes That Don't Show Up in Benchmarks

Beyond cost, each modality has degradation patterns that standard benchmarks don't capture well.

For images: the "resolution curse" is a known but underappreciated failure mode. Vision encoders in most models downsample to a fixed internal resolution — often 384×384 pixels — before processing. A high-resolution source image with fine detail (handwriting, small text, crowded labels) can lose that detail entirely in the encoding step. This means sending a higher-resolution image doesn't always produce better results; it can produce identical or worse results at significantly higher token cost. The actual resolution at which accuracy plateaus varies by model and task and is rarely documented.

Multi-image requests degrade not because models handle multiple images poorly in principle, but because they push total token counts into the range where context attention quality drops. Research on context rot shows observable performance degradation beginning around 2,500 tokens, with widespread degradation above 5,000–10,000 tokens. A batch of ten high-detail images at 765 tokens each = 7,650 image tokens before any text is added. This isn't a theoretical concern; production teams processing document packages with multiple pages routinely hit this ceiling.
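One mitigation is to split multi-image requests so each call stays under a token budget informed by the degradation thresholds above. A minimal sketch, with the 5,000-token budget as an illustrative cutoff:

```python
def split_image_batch(n_images: int, tokens_per_image: int = 765,
                      budget: int = 5_000) -> list:
    """Chunk sizes so each request's image tokens stay under budget."""
    per_request = max(1, budget // tokens_per_image)
    return [min(per_request, n_images - i)
            for i in range(0, n_images, per_request)]

print(split_image_batch(10))  # [6, 4]
```

The trade-off is that cross-image reasoning (comparing page 1 to page 9) no longer happens in one call, so this fits extraction workloads better than comparison workloads.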

For audio: the key failure mode is over-engineering the preprocessing stack. Every denoising, normalization, or VAD filter in a preprocessing chain is an opportunity to strip information the model uses. The preprocessing steps that look like improvements on a clean benchmark dataset often hurt accuracy on the long tail of real-world inputs. The practical recommendation: invest in models trained on noisy data rather than preprocessing pipelines, and measure WER across SNR ranges rather than only on clean test sets.

For video: temporal reasoning is still genuinely weak in current models. They fail to accurately interpret cause-and-effect sequences, confuse spatial directions (left vs. right) in moving scenes, and miss frame-to-frame changes when the change is subtle but semantically important. Frame sampling reduces costs but exacerbates these failures — a 1 FPS sample of a 30 FPS action sequence captures only 3.3% of frames. The models that perform best on video benchmarks are often tested on content types where 1 FPS is adequate; production workloads aren't always so cooperative.

The Production Architecture That Actually Works

Teams that run stable multimodal pipelines at meaningful volume typically converge on the same set of architectural patterns.

Input validation before the LLM call. Image quality gating (blur detection, minimum resolution check, format validation) catches bad inputs before they consume tokens and return garbage. Audio VAD (voice activity detection) strips silence before billing, reducing audio costs by 30–40% for real-world call recordings where pauses are frequent.
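A dependency-free sketch of the image gate. Production systems usually compute variance-of-Laplacian with OpenCV; the threshold here is illustrative and must be tuned per image source:

```python
def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over a grayscale 2D list;
    low values suggest a blurry (low-edge-energy) image."""
    h, w = len(gray), len(gray[0])
    vals = [4 * gray[y][x] - gray[y - 1][x] - gray[y + 1][x]
            - gray[y][x - 1] - gray[y][x + 1]
            for y in range(1, h - 1) for x in range(1, w - 1)]
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)

def passes_gate(gray, min_w=64, min_h=64, blur_threshold=100.0):
    """Reject tiny or blurry images before they consume tokens."""
    if len(gray[0]) < min_w or len(gray) < min_h:
        return False
    return laplacian_variance(gray) >= blur_threshold
```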

Modality-aware model routing. Not every modality is equally well-served by every model. GPT-4o performs well on mixed text-and-image reasoning tasks. Gemini 2.5 leads on native video and audio understanding. Claude Sonnet demonstrates the best resilience across image quality degradation on document extraction benchmarks. Routing by modality and task type — rather than sending everything to one model — exploits relative strengths while controlling cost.

Separate OCR vs. native vision for structured documents. LLM vision processing costs $0.20–$1.00 per document. Dedicated OCR APIs cost $0.01–$0.05 per page. For high-volume structured extraction (invoices, forms, receipts), a dedicated OCR pass followed by LLM post-processing often achieves the same accuracy at a 10–20x lower cost per document. Reserve native LLM vision for tasks requiring layout reasoning, table interpretation, or complex visual context.
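The routing decision is a one-line cost comparison. The defaults below are illustrative midpoints of the ranges above, not quotes from any provider:

```python
def cheaper_pipeline(pages: int,
                     ocr_per_page: float = 0.03,     # dedicated OCR API
                     post_llm_per_doc: float = 0.02,  # LLM post-processing
                     vision_per_doc: float = 0.50) -> str:
    """Pick the cheaper extraction route for a structured document."""
    ocr_route = pages * ocr_per_page + post_llm_per_doc
    return "ocr+llm" if ocr_route < vision_per_doc else "native_vision"

print(cheaper_pipeline(3))  # ocr+llm
```

In practice the router also checks layout complexity, since the OCR route loses to native vision on documents that need spatial reasoning regardless of price.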

Cascade fallback with health tracking. Multimodal API calls fail at higher rates than text-only calls, and degradation is often silent — the model returns an answer, just a worse one. Production pipelines benefit from explicit health state tracking (healthy → degraded at 2%+ error rate → failed at 5%+ error rate → recovering) with fallback chains: primary model → cheaper alternative → cached response → graceful error. Semantic caching eliminates provider dependency for queries that repeat.
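The health states above map to a small rolling-window tracker. The window size and promotion rule (clean window restores "healthy") are assumptions; the 2%/5% thresholds are the ones from the text:

```python
from collections import deque

class ModelHealth:
    """Rolling error rate mapped to states:
    healthy -> degraded (>=2%) -> failed (>=5%) -> recovering."""

    def __init__(self, window: int = 200):
        self.results = deque(maxlen=window)
        self.state = "healthy"

    def record(self, ok: bool) -> str:
        self.results.append(ok)
        rate = self.results.count(False) / len(self.results)
        if rate >= 0.05:
            self.state = "failed"
        elif rate >= 0.02:
            self.state = "degraded"
        elif self.state in ("failed", "degraded"):
            self.state = "recovering"  # error rate back under 2%
        elif self.state == "recovering" and rate == 0:
            self.state = "healthy"     # assumed promotion rule
        return self.state
```

The fallback chain then keys off `state`: route to the primary model while healthy, to the cheaper alternative while degraded or recovering, and to cache or graceful error while failed.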

Explicit cost attribution per modality. Observability stacks that track tokens per query, broken down by modality, catch the configuration drift that turns a profitable workload unprofitable. A single engineer switching video sampling from 0.1 FPS to 1 FPS, or from low-detail to high-detail images, in an unmonitored pipeline can generate a 10x bill increase that takes weeks to discover.
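The attribution layer can start as a per-modality token tally with a drift alarm. The 5x alert factor is illustrative; the baselines come from the pre-launch arithmetic this article argues for:

```python
from collections import defaultdict

class CostTracker:
    """Accumulate tokens per modality and flag drift against a
    baseline tokens-per-query figure."""

    def __init__(self, baselines):
        self.baselines = baselines  # modality -> expected tokens/query
        self.tokens = defaultdict(int)
        self.queries = defaultdict(int)

    def record(self, modality: str, tokens: int) -> None:
        self.tokens[modality] += tokens
        self.queries[modality] += 1

    def drifted(self, modality: str, factor: float = 5.0) -> bool:
        avg = self.tokens[modality] / self.queries[modality]
        return avg > factor * self.baselines[modality]
```

A single 24 FPS video in a 1 FPS pipeline trips the alarm immediately, instead of surfacing weeks later on an invoice.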

Running the Math Before You Ship

The patterns for controlling multimodal costs exist. None of them are secret. What's missing in most teams' planning is the habit of running the arithmetic before the first production deployment rather than after the first billing surprise.

For images: calculate (daily image queries × average token count per image × input token price). Then calculate the same with low-detail mode and compare. The gap usually exceeds 5x and takes about ten minutes to compute.

For audio: distinguish batch transcription workloads from realtime conversational workloads and price them separately. The two don't belong in the same cost model.

For video: identify your actual content type and frame change rate before accepting the default FPS setting. Test perceptual hashing deduplication on a sample of your real content to get an empirical frame reduction estimate.

The teams that avoid the billing shock aren't doing anything exotic. They ran the math in a spreadsheet before writing the integration code.
