Multimodal Eval Drift: Why Your Image and Audio Paths Regress While Text Stays Green
The dashboard says quality is up two points this release. The text-eval suite ran clean. Your model provider shipped a new checkpoint that beats the prior one on every public benchmark you track. You roll forward. A week later the support team flags a quiet but persistent uptick in tickets about uploaded screenshots — users say the model is "reading the wrong numbers from the chart" or "missing a row in the table." Audio transcription complaints follow a few days later, mostly from non-American English speakers. None of it shows up in your eval pipeline. The release looks healthy. It isn't.
This is multimodal eval drift, and almost every team that bolted vision and audio onto a text-first stack is shipping it. The eval discipline that worked for text — gold sets, LLM-as-judge, drift dashboards, an aggregate score that gates the release — extends to multimodal in name only. The failure rates per modality are not commensurable, the rubrics that catch text errors don't catch image errors, and the labeling pipeline that produced your text gold set is calibrated to a workload that ships every six months, not to a multimodal regression that arrives with every checkpoint update.
The right mental model is that multimodality is not a flag on the same model — it is a different product surface with a different failure distribution, and the eval discipline that ignored that distinction is shipping silent regressions every model release.
Why Text Evals Are Blind to Vision and Audio Regressions
Text evaluation has converged on a fairly stable set of techniques: a curated gold set with reference outputs, a judge prompt that scores responses on relevance, factuality, and tone, and a dashboard that tracks aggregate quality across releases. That stack works because text failures cluster around facts you can verify and behaviors you can describe in a rubric.
The vision path doesn't share that property. A model that hallucinates fewer text facts can simultaneously misread chart numbers more often. The two error rates are weakly correlated — improvements in language modeling don't automatically transfer to visual grounding, and provider-side post-training emphasis can shift between releases without disclosure. OCR, chart, and document benchmarks (CC-OCR, ChartQA, DocVQA) regularly show checkpoint-to-checkpoint swings on table reading and numeric extraction that don't track a model's gains on text tasks. If your eval rolls those into a single quality number, you can't see it move.
Image regressions also tend to express themselves as confidence shifts rather than factuality shifts. The model still produces a confident answer. The answer is wrong because the model misidentified the cell, miscounted the bars, or merged two adjacent columns. Your text rubric — "is the answer factually correct" — handles that case poorly because there's no easy ground truth without visual annotation. LLM-as-judge over text-only outputs makes this worse: the judge sees the same wrong answer your model produced, has no access to the image, and rates it as plausible.
Audio is its own surface. Whisper-class transcription degrades along axes that text inputs never have: accent (multiple peer-reviewed studies have measured uneven word-error rate across regional accents and demographic groups), codec (VoIP and WebRTC compression chains add distortion that academic benchmarks underrepresent), and ambient noise. An ASR model upgrade can improve clean-audio benchmarks while regressing on the actual distribution your users send — phone calls, voice memos, meeting recordings — because the public benchmarks are clean, and your traffic is not. Downstream, the LLM that consumes the transcript inherits every transcription error as a false fact, and the resulting wrong answer looks like a generation failure.
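A minimal sketch of what accent- and codec-stratified scoring looks like, assuming the open-source `jiwer` package for word-error rate and a hypothetical eval set where every sample carries `accent` and `codec` tags:

```python
from collections import defaultdict
import jiwer  # pip install jiwer

def stratified_wer(samples, strata_key):
    """Compute word-error rate per stratum (e.g. accent or codec).

    `samples` is a list of dicts with keys: `reference` (human transcript),
    `hypothesis` (ASR output), and the tag named by `strata_key`.
    """
    buckets = defaultdict(lambda: {"refs": [], "hyps": []})
    for s in samples:
        buckets[s[strata_key]]["refs"].append(s["reference"])
        buckets[s[strata_key]]["hyps"].append(s["hypothesis"])
    return {
        stratum: jiwer.wer(b["refs"], b["hyps"])
        for stratum, b in buckets.items()
    }

# A release comparison looks at strata, not the pooled number: pooled WER can
# improve while one accent or codec stratum quietly regresses.
# per_accent_old = stratified_wer(old_checkpoint_outputs, "accent")
# per_accent_new = stratified_wer(new_checkpoint_outputs, "accent")
```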
The Failure Mode Most Teams Hit First
The pattern repeats across teams. You launch the text product. Quality discipline matures around text — eval coverage, regression gates, an LLM judge with a written rubric. Then the multimodal feature ships as the second feature on the same stack. The product team adds image upload to the chat interface in a sprint. The eval team adds twenty image cases to the existing gold set, the LLM judge prompt gets a one-line addition telling it to "evaluate vision responses too," and the release pipeline keeps reporting a single aggregate quality score.
Three months later a model upgrade comes out. The text suite passes. The aggregate quality is up. Image-related support tickets quietly climb 30%. The eval team has no signal because:
- The image cases are a rounding error in the gold set. Twenty of two thousand is one percent. A 50% regression on those twenty looks like a 0.5% drop in aggregate, well inside noise; the arithmetic is sketched after this list.
- The LLM judge can't grade image outputs without seeing the image. Most judge harnesses pass the model's response to a text-only judge. The judge has no access to the original image and rates the response on its prose, not its grounding.
- The rubric items don't fit. "Was the chart axis labeled correctly?" is not in the text rubric. "Was the table column aligned?" isn't either. The judge defaults to scoring fluency.
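The dilution in the first bullet is worth making concrete. A worked example with made-up pass rates:

```python
# 2,000-case gold set: 1,980 text cases, 20 image cases.
text_cases, image_cases = 1980, 20

# Hypothetical pass rates before and after the model upgrade.
text_pass_old, text_pass_new = 0.90, 0.90    # text holds steady
image_pass_old, image_pass_new = 1.00, 0.50  # half of the image cases now fail

def aggregate(text_pass, image_pass):
    return (text_cases * text_pass + image_cases * image_pass) / (text_cases + image_cases)

print(aggregate(text_pass_old, image_pass_old))  # 0.901
print(aggregate(text_pass_new, image_pass_new))  # 0.896 -- a 0.5-point dip, inside noise
```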
By the time you trace the support tickets back to a vision regression, the model has been live for two weeks and the rollback path is messy because other product changes shipped on top of it.
The Discipline That Has to Land
The fix is not "add a vision metric to the dashboard." It is a structural change in how you evaluate multimodal.
Separate eval suites per modality, no aggregate quality score across modalities. The failure rates per modality are not commensurable. A 2% text regression and a 2% image regression are different events. They have different blast radii (text traffic is usually higher volume; image traffic often hits higher-stakes use cases like document processing), different rollback decisions (text regressions might be tolerable if they're stylistic; image regressions on a PDF parser usually aren't), and different remediation paths. Aggregating them into one number erases all of that. Your release gate should require non-regression on each modality independently, not non-regression on the average.
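A sketch of what that gate can look like in code, with illustrative modality names, scores, and thresholds; the point is that there is no averaging step anywhere:

```python
# Per-modality scores from the production checkpoint and the candidate.
# Names and numbers are illustrative.
baseline  = {"text": 0.91, "image": 0.84, "audio": 0.79}
candidate = {"text": 0.93, "image": 0.71, "audio": 0.80}

# Maximum tolerated drop per modality -- tuned separately, never shared.
allowed_drop = {"text": 0.01, "image": 0.02, "audio": 0.02}

def release_gate(baseline, candidate, allowed_drop):
    """Return the modalities that regressed past their threshold; empty means pass."""
    return {
        m: (baseline[m], candidate[m])
        for m in baseline
        if candidate[m] < baseline[m] - allowed_drop[m]
    }

blocked = release_gate(baseline, candidate, allowed_drop)
if blocked:
    raise SystemExit(f"Roll-forward blocked, regressed modalities: {blocked}")
```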
Modality-specific rubrics that name the actual failure modes. For images: chart misreads (wrong number, swapped axes), OCR drift on rotated or low-DPI scans, table-structure errors (merged columns, dropped rows, hallucinated header cells), spatial-reasoning gaps on diagrams, and identification confidence on edge-case visuals (handwriting, watermarks, multi-column layouts). For audio: transcription confidence on noisy inputs, accent-stratified word-error rate, codec-stratified word-error rate, and downstream task accuracy on transcripts (which catches the case where transcription is "good enough" but wrong on the load-bearing words). Whisper's own evaluation history is instructive — performance varies markedly across English accents and degrades under codec compression in ways the public benchmarks underrepresent.
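One way to keep the rubric modality-specific is to store it as data the judge harness reads, rather than prose buried in a prompt. A sketch using the failure modes named above; the structure and names are illustrative, not a standard:

```python
# Rubric items per modality, used both to build the judge prompt and to
# attribute each failure to a named failure mode rather than a generic score.
RUBRICS = {
    "image": [
        "chart_misread",        # wrong number read off a chart, swapped axes
        "ocr_drift",            # rotated or low-DPI scans transcribed wrong
        "table_structure",      # merged columns, dropped rows, invented header cells
        "spatial_reasoning",    # relationships and flow in diagrams
        "edge_case_visuals",    # handwriting, watermarks, multi-column layouts
    ],
    "audio": [
        "noisy_input_confidence",
        "accent_stratified_wer",
        "codec_stratified_wer",
        "downstream_task_accuracy",  # transcript is fluent but wrong on the load-bearing words
    ],
}
```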
Multimodal LLM-as-judge, with the image or audio passed to the judge. Text-only judging of multimodal outputs is fundamentally unable to detect grounding failures. Recent multimodal-judge frameworks (MLLM-as-a-Judge benchmarks, Patronus Judge-Image, modality-aware extensions in lmms-eval) show meaningful — but imperfect — alignment with human scoring on visual tasks; the headline number from the original MLLM-Judge work was 0.557 similarity to human scoring on GPT-4V, which is useful but tells you the judge itself needs validation against a human-graded slice. Don't trust a multimodal judge you haven't calibrated.
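A sketch of what passing the image to the judge looks like, assuming an OpenAI-style multimodal chat API; the model name, prompt, and PASS/FAIL format are placeholders, and the judge's verdicts still need calibration against a human-graded slice:

```python
import base64
from openai import OpenAI

client = OpenAI()

def judge_image_response(image_path: str, question: str, model_answer: str) -> str:
    """Ask a multimodal judge to grade an answer with the image in its context."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder: any multimodal judge model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": (f"Question: {question}\nModel answer: {model_answer}\n"
                          "Grade the answer against the attached image: is every "
                          "number, cell, and label actually grounded in it? "
                          "Reply PASS or FAIL with a one-line reason.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```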
A labeling pipeline that produces multimodal gold data at the cadence of model upgrades. This is the part teams under-budget. Multimodal labeling costs roughly N× what unified text labeling does — every image needs a human to look at it; every audio sample needs someone to listen. If your text gold set refreshes annually but your model provider ships a checkpoint every six weeks, your multimodal eval set is stale by the second release. The cheapest multimodal eval set to maintain is the one you never refresh, and it is also the one that silently loses its discriminating power as the model improves on the cases you originally chose.
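One cheap guard is to make staleness visible in the pipeline itself, so a green result from an unrefreshed suite reads as "unknown" rather than "pass". A sketch, with the refresh policy and dates purely illustrative:

```python
from datetime import date

# Assumed policy: a modality's gold set must have been refreshed since the
# last provider checkpoint the product adopted.
last_checkpoint_adopted = date(2025, 5, 1)  # illustrative dates throughout
gold_set_last_refresh = {
    "text":  date(2025, 4, 20),
    "image": date(2024, 11, 3),
    "audio": date(2024, 11, 3),
}

stale = {m: d for m, d in gold_set_last_refresh.items() if d < last_checkpoint_adopted}
if stale:
    # Downgrade these suites' results from "pass" to "unknown" in the report.
    print(f"Gold sets older than the current checkpoint: {stale}")
```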
A release gate that refuses to roll forward on per-modality regression. This is the operational change. Stop shipping when text passes and the aggregate looks fine. Require explicit non-regression per modality, with the threshold tuned to the variance of that modality's eval set. Image evals have higher per-sample variance than text — a single mislabeled chart can swing your score more than a paragraph of text can — so the threshold needs to be set against the right variance.
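One way to tune those thresholds, sketched below: bootstrap the per-sample scores to estimate how much each modality's mean moves by chance alone, and only block drops larger than that noise floor. The multiplier is a policy choice, not a statistical constant:

```python
import random
import statistics

def noise_floor(per_sample_scores, n_boot=2000, multiplier=2.0, seed=0):
    """Bootstrap the eval-set mean to estimate how far it drifts by chance alone."""
    rng = random.Random(seed)
    n = len(per_sample_scores)
    means = [
        statistics.fmean(rng.choices(per_sample_scores, k=n))
        for _ in range(n_boot)
    ]
    return multiplier * statistics.stdev(means)

# A 20-sample image suite produces a much wider noise floor than a
# 2,000-sample text suite, so its gate threshold must be wider too --
# or, better, the suite must grow until the threshold is actually useful.
# allowed_drop["image"] = noise_floor(image_baseline_per_sample_scores)
```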
Why the Cost Frame Always Sinks the Project
The reason teams under-invest here is that per-modality eval discipline costs roughly N times the unified eval. You need separate gold sets per modality, separate rubrics, separate judge prompts (or a multimodal judge that costs more to run), and separate labeling pipelines. The team that argues for it is asking for budget that the team that argues against it can credibly call premature optimization. "We'll do it after the first regression."
That's almost always the wrong call, because the first regression is the event that makes the case impossible to argue against — and at that point you've already eaten the cost of the regression in support tickets, customer trust, and the engineering hours spent triaging "the model is being weird about images." The pre-regression budget conversation never wins; the post-regression conversation is too late to prevent the harm.
The pragmatic compromise: treat per-modality discipline as an investment that scales with modality maturity. Day-one multimodal launch can ship with a text-style eval and a small modality-specific spot-check suite. The discipline must mature before you take a model upgrade in production — not before launch. The trigger is "we are about to roll forward to a new checkpoint," not "we are about to ship the multimodal feature."
This works because the regression risk is concentrated at upgrade boundaries, not at steady state. Your existing text discipline catches most steady-state issues. The eval discipline you need to build before the first upgrade is specifically the discipline that distinguishes between the old and new model on the modality dimensions your product cares about.
The Architectural Realization
Step back from the eval mechanics. The deeper failure is that multimodal got treated as a feature flag on the existing product surface — same model API, same dashboard, same release gate — when it is actually a different product surface with a different failure distribution. The eval discipline that ignored that distinction is shipping silent regressions every model release.
Three things follow from taking that seriously:
Multimodal capability is not a single number. It's a vector — vision, audio, video, document, chart, table — and your gold set, rubric, and release gate all have to be vector-valued. Compressing it into one quality score destroys the signal that tells you what to roll back.
Provider model upgrades are eval events, not silent infrastructure changes. The provider's release notes won't tell you what regressed on your specific traffic distribution. Your eval pipeline is the only place that finds out. If it can't see per-modality regressions, the upgrade silently changes your product behavior and you find out from users.
The labeling pipeline is critical infrastructure, not a one-time setup. Multimodal gold sets decay. The model improves on the cases you originally chose, distribution shifts as your traffic mix changes, and new edge cases appear (a new chart type, a new document layout, a new audio source). The team that treats labeling as a project rather than an ongoing pipeline ends up with an eval set that confidently reports green on a model that has long since stopped being measured by it.
The tactical work — the rubric per modality, the per-modality gate, the multimodal judge calibration — is the visible part. The harder shift is recognizing that your product now has multiple failure surfaces, each with its own failure rate, each requiring its own evidence, and that the unified quality score that worked for text-only is now a liability. Multimodal eval drift is the cost of not making that shift in time.
- https://github.com/EvolvingLMMs-Lab/lmms-eval
- https://proceedings.neurips.cc/paper_files/paper/2024/file/fe2fc7dc60b55ccd8886220b40fb1f74-Paper-Datasets_and_Benchmarks_Track.pdf
- https://arxiv.org/abs/2306.13394
- https://arxiv.org/html/2411.15296v2
- https://arxiv.org/html/2412.02210v3
- https://openaccess.thecvf.com/content/ICCV2025/papers/Zverev_VGGSounder_Audio-Visual_Evaluations_for_Foundation_Models_ICCV_2025_paper.pdf
- https://pubs.aip.org/asa/jel/article/4/2/025206/3267247/Evaluating-OpenAI-s-Whisper-ASR-Performance
- https://mllm-judge.github.io/
- https://www.patronus.ai/llm-testing/llm-as-a-judge
- https://huggingface.co/openai/whisper-large-v3
- https://github.com/opendatalab/mineru
