Multimodal Eval Drift: Why Your Image and Audio Paths Regress While Text Stays Green
The dashboard says quality is up two points this release. The text-eval suite ran clean. Your model provider shipped a new checkpoint that beats the prior one on every public benchmark you track. You roll forward. A week later the support team flags a quiet but persistent uptick in tickets about uploaded screenshots — users say the model is "reading the wrong numbers from the chart" or "missing a row in the table." Audio transcription complaints follow a few days later, mostly from non-American English speakers. None of it shows up in your eval pipeline. The release looks healthy. It isn't.
This is multimodal eval drift, and almost every team that bolted vision and audio onto a text-first stack is shipping it. The eval discipline that worked for text — gold sets, LLM-as-judge, drift dashboards, an aggregate score that gates the release — extends to multimodal in name only. The failure rates per modality are not commensurable, the rubrics that catch text errors don't catch image errors, and the labeling pipeline that produced your text gold set is calibrated to a workload that ships every six months, not to a multimodal regression that arrives with every checkpoint update.
The right mental model is that multimodality is not a flag on the same model. It is a different product surface with a different failure distribution, and an eval discipline that ignores that distinction ships silent regressions with every model release.
Why Text Evals Are Blind to Vision and Audio Regressions
Text evaluation has converged on a fairly stable set of techniques: a curated gold set with reference outputs, a judge prompt that scores responses on relevance, factuality, and tone, and a dashboard that tracks aggregate quality across releases. That stack works because text failures cluster around facts you can verify and behaviors you can describe in a rubric.
The vision path doesn't share that property. A model that hallucinates fewer text facts can simultaneously misread chart numbers more often. The two error rates are weakly correlated: improvements in language modeling don't automatically transfer to visual grounding, and provider-side post-training emphasis can shift between releases without disclosure. OCR and chart benchmarks (CC-OCR, ChartQA, DocVQA) score these skills separately from general language ability, and the same checkpoint can lead on summarization tasks while regressing on table reading or numeric extraction from charts. If your eval rolls those into a single quality number, you can't see the vision score move.
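To make the masking effect concrete, here is a minimal sketch that scores the same results twice: once as a single aggregate and once per modality. The counts, the modality names, and the helper functions are hypothetical and chosen only for illustration; they are not measurements from any real checkpoint.

```python
from collections import defaultdict

def score_by_modality(records):
    """Return (aggregate, per_modality) pass rates from (modality, passed) pairs."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for modality, passed in records:
        totals[modality] += 1
        passes[modality] += int(passed)
    aggregate = sum(passes.values()) / sum(totals.values())
    per_modality = {m: passes[m] / totals[m] for m in totals}
    return aggregate, per_modality

def make_records(counts):
    """Expand {modality: (n_passed, n_total)} into individual (modality, passed) records."""
    return [
        (modality, i < n_passed)
        for modality, (n_passed, n_total) in counts.items()
        for i in range(n_total)
    ]

# Hypothetical results for two checkpoints on the same gold set (illustrative only).
old = make_records({"text": (1600, 1900), "chart": (80, 100)})
new = make_records({"text": (1660, 1900), "chart": (60, 100)})

old_agg, old_slices = score_by_modality(old)
new_agg, new_slices = score_by_modality(new)

print(f"aggregate: {old_agg:.3f} -> {new_agg:.3f}")
for modality in old_slices:
    print(f"{modality}: {old_slices[modality]:.3f} -> {new_slices[modality]:.3f}")
```

In this made-up example the aggregate rises by two points while the chart slice falls by twenty, which is the shape of regression a single quality number cannot show.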
Image regressions also tend to show up as grounding errors rather than as the factual errors a text rubric is built to catch. The model still produces a confident, fluent answer. The answer is wrong because the model misidentified the cell, miscounted the bars, or merged two adjacent columns. Your text rubric ("is the answer factually correct") handles that case poorly because there is no ground truth to check against without visual annotation. LLM-as-judge over text-only outputs makes this worse: the judge sees the same wrong answer your model produced, has no access to the image, and rates it as plausible.
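One way to give the vision path a ground truth is to annotate the image once and score the model's extraction against that annotation instead of asking a judge to rate the prose. The sketch below assumes a hypothetical annotation format, a dict keyed by (row label, column label), and a hypothetical scoring function; neither comes from a specific library.

```python
def parse_number(text):
    """Parse a cell such as '1,204' or '12.5%' into a float, or None if not numeric."""
    cleaned = text.strip().replace(",", "").rstrip("%")
    try:
        return float(cleaned)
    except ValueError:
        return None

def cell_matches(expected, actual, rel_tol=0.0):
    """Compare numerically when both cells parse as numbers, else case-insensitively."""
    exp_num, act_num = parse_number(expected), parse_number(actual)
    if exp_num is not None and act_num is not None:
        return abs(exp_num - act_num) <= rel_tol * abs(exp_num)
    return expected.strip().lower() == actual.strip().lower()

def grounded_cell_accuracy(annotated, extracted):
    """Fraction of annotated cells the model reproduced; missing cells count as errors."""
    if not annotated:
        raise ValueError("annotation must contain at least one cell")
    correct = sum(
        1
        for key, expected in annotated.items()
        if key in extracted and cell_matches(expected, extracted[key])
    )
    return correct / len(annotated)

# Hypothetical annotation of a screenshot and a model extraction that put a
# Cost value in the Revenue column and dropped the Q2 Cost cell entirely.
annotated = {
    ("Q1", "Revenue"): "1,204",
    ("Q1", "Cost"): "310",
    ("Q2", "Revenue"): "1,388",
    ("Q2", "Cost"): "295",
}
extracted = {
    ("Q1", "Revenue"): "1204",
    ("Q1", "Cost"): "310",
    ("Q2", "Revenue"): "295",
}

accuracy = grounded_cell_accuracy(annotated, extracted)
print(f"grounded cell accuracy: {accuracy:.2f}")
```

A text-only judge would see three well-formed numbers here. The cell-level comparison sees that half of the annotated table is wrong or missing.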
Audio is its own surface. Whisper-class transcription degrades along axes that text inputs never have: accent (multiple peer-reviewed studies have measured uneven word-error rate across regional accents and demographic groups), codec (VoIP and WebRTC compression chains add distortion that academic benchmarks underrepresent), and ambient noise. An ASR model upgrade can improve clean-audio benchmarks while regressing on the actual distribution your users send — phone calls, voice memos, meeting recordings — because the public benchmarks are clean, and your traffic is not. Downstream, the LLM that consumes the transcript inherits every transcription error as a false fact, and the resulting wrong answer looks like a generation failure.
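The audio equivalent of a per-modality score is word error rate broken out by recording condition. The following is a minimal, dependency-free sketch: the slice names and sample transcripts are hypothetical, and a production harness would also normalize punctuation and number formatting before comparing, which this example does not attempt.

```python
from collections import defaultdict

def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance over words (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]

def wer_by_slice(samples):
    """Corpus-level WER per slice: total word errors / total reference words."""
    errors = defaultdict(int)
    ref_lengths = defaultdict(int)
    for slice_name, reference, hypothesis in samples:
        ref_words = reference.lower().split()
        hyp_words = hypothesis.lower().split()
        errors[slice_name] += word_edit_distance(ref_words, hyp_words)
        ref_lengths[slice_name] += len(ref_words)
    return {
        name: errors[name] / ref_lengths[name]
        for name in errors
        if ref_lengths[name] > 0
    }

# Hypothetical (slice, reference transcript, model transcript) triples.
samples = [
    ("studio", "send the invoice to fifteen clients", "send the invoice to fifteen clients"),
    ("voip_call", "send the invoice to fifteen clients", "send the invoice to fifty clients"),
    ("voip_call", "cancel my order", "cancel order"),
]

wer = wer_by_slice(samples)
for name, value in sorted(wer.items()):
    print(f"{name}: WER {value:.3f}")
```

Slicing by accent group, codec chain, or noise level works the same way: the slice label is whatever metadata you can attach to the recording. The "fifteen" to "fifty" substitution is also the kind of error the downstream LLM will treat as a fact.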
The Failure Mode Most Teams Hit First
The pattern repeats across teams. You launch the text product. Quality discipline matures around text: eval coverage, regression gates, an LLM judge with a written rubric. Then the multimodal feature ships as the second feature on the same stack. The product team adds image upload to the chat interface in a sprint. The eval team adds twenty image cases to the existing gold set, the LLM judge prompt gets a one-line addition ("evaluate vision responses too"), and the release pipeline keeps reporting a single aggregate quality score.
Three months later a model upgrade comes out. The text suite passes. The aggregate quality is up. Image-related support tickets quietly climb 30%. The eval team has no signal because:
- The image cases are a rounding error in the gold set. Twenty of two thousand is one percent. A 50-point regression on those twenty (ten cases flipping from pass to fail) looks like a 0.5-point drop in aggregate, well inside noise.
- The LLM judge can't grade image outputs without seeing the image. Most judge harnesses pass the model's response to a text-only judge. The judge has no access to the original image and rates the response on its prose, not its grounding.
- The rubric items don't fit. "Was the chart axis labeled correctly?" is not in the text rubric. "Was the table column aligned?" isn't either. The judge defaults to scoring fluency.
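The dilution arithmetic in the first bullet can be checked directly. The sketch below reproduces it and compares the result with a rough binomial standard error. The 85% baseline pass rate is an assumption made only for illustration (the scenario above does not state one), and treating cases as independent is a simplification.

```python
import math

def aggregate_drop(slice_size, total_size, slice_drop):
    """Aggregate pass-rate drop (as a fraction) caused by a drop confined to one slice."""
    return (slice_size / total_size) * slice_drop

def pass_rate_std_error(pass_rate, n_cases):
    """Binomial standard error of a pass rate measured on n_cases independent cases."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_cases)

total_cases, image_cases = 2000, 20
image_drop = 0.50          # half of the image cases flip from pass to fail
assumed_pass_rate = 0.85   # ASSUMPTION for illustration; not given in the scenario

drop = aggregate_drop(image_cases, total_cases, image_drop)
aggregate_noise = pass_rate_std_error(assumed_pass_rate, total_cases)
image_noise = pass_rate_std_error(assumed_pass_rate, image_cases)

print(f"aggregate drop from image regression: {drop:.2%}")
print(f"std error of aggregate score:         {aggregate_noise:.2%}")
print(f"std error of 20-case image slice:     {image_noise:.2%}")
```

Under these assumptions the aggregate drop is smaller than one standard error of the aggregate score, while the same regression is several standard errors wide when the image slice is reported on its own. The slice's own standard error is also large, so twenty cases can only resolve a regression of this size, not a subtle one.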
By the time you trace the support tickets back to a vision regression, the model has been live for two weeks and the rollback path is messy because other product changes shipped on top of it.
The Discipline That Has to Land
The fix is not "add a vision metric to the dashboard." It is a structural change in how you evaluate multimodal.
References
- https://github.com/EvolvingLMMs-Lab/lmms-eval
- https://proceedings.neurips.cc/paper_files/paper/2024/file/fe2fc7dc60b55ccd8886220b40fb1f74-Paper-Datasets_and_Benchmarks_Track.pdf
- https://arxiv.org/abs/2306.13394
- https://arxiv.org/html/2411.15296v2
- https://arxiv.org/html/2412.02210v3
- https://openaccess.thecvf.com/content/ICCV2025/papers/Zverev_VGGSounder_Audio-Visual_Evaluations_for_Foundation_Models_ICCV_2025_paper.pdf
- https://pubs.aip.org/asa/jel/article/4/2/025206/3267247/Evaluating-OpenAI-s-Whisper-ASR-Performance
- https://mllm-judge.github.io/
- https://www.patronus.ai/llm-testing/llm-as-a-judge
- https://huggingface.co/openai/whisper-large-v3
- https://github.com/opendatalab/mineru
