Multimodal Channel Disagreement: When One Model Contradicts Itself Across Vision and Text
The image is a photograph of a red octagonal stop sign. Someone has stuck a small sticker reading "YIELD" over the word in the middle. You ask the multimodal model: "What does this sign say?" The model answers: "The sign instructs drivers to yield to oncoming traffic at the intersection." Confident, fluent, and faithful to neither channel in full: the sticker's word beat the red octagon that still says stop, and the answer reads as if the two channels had never disagreed about what was true.
This failure mode does not have a settled name yet. Researchers studying multimodal hallucination call it "semantic hallucination," or "cross-modal bias," or "modality dominance," depending on which subfield is writing the paper. Practitioners shipping document AI, screenshot agents, and defect inspection systems run into it every week and describe it in their incident retros as "the model just made something up." It is not made up. It is the predictable output of an architecture that fuses two channels in its final layers without any primitive for representing the case where the channels say different things.
The interesting part is not that disagreement happens. The interesting part is that the model's output looks the same whether the channels agreed or disagreed: same fluent prose, same confident tone, same calibration scores on the surface. The signal that something went wrong is buried in attention patterns the application layer never sees, and the eval suite that approved the deployment was built from clean image-text pairs where the channels were aligned by construction. The team is shipping a system whose worst answers are systematically invisible to the test that approved it.
The architecture is a fusion problem, not a single input/output box
The "multimodal" abstraction in most application code is a black box that takes pixels and tokens and returns tokens. The internal reality is that vision tokens and text tokens travel through partially separate processing paths and are reconciled by attention layers near the end of the network. Different architectures place the fusion at different depths — early-fusion designs project image patches into the same embedding space as text early; late-fusion designs run two encoders and merge their decisions; hybrid designs interleave them — but every design has a place where two streams that disagree must produce one answer.
When the streams agree, fusion is invisible. When they disagree, fusion is the bug. The model has to produce something, and what it produces is a weighted blend that smooths the conflict away. There is no token in the output for "the visual channel and the textual channel reported different things and I picked one." There is just an answer, and the answer reads as if no conflict existed.
A 2026 study on radiology vision-language models gave this failure mode a more rigorous handle: when image-embedded text is OCR-readable, the OCR pathway can dominate the pixel pathway and override visual evidence, even under stealth conditions that evade human inspection. The team that approved the deployment was looking at the model's output. The mechanism that produced the output was a modality competition the application layer had no API for.
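A cheap way to check whether your own model shows this dominance pattern is to stamp contradictory text onto an image and see whether the answer flips. The sketch below is a probe harness in that spirit, not the study's method; `query_vlm` is a placeholder for whatever inference call you actually use, and the file name and prompts are hypothetical.

```python
from PIL import Image, ImageDraw

def query_vlm(image: Image.Image, prompt: str) -> str:
    """Placeholder: call your actual vision-language model here."""
    raise NotImplementedError

def overlay_text(image: Image.Image, text: str) -> Image.Image:
    """Stamp contradictory text onto a copy of the image."""
    stamped = image.copy()
    ImageDraw.Draw(stamped).text((10, 10), text, fill="black")
    return stamped

def dominance_probe(image_path: str, prompt: str, contradiction: str) -> bool:
    """True if image-embedded text changes the answer, i.e. the OCR pathway
    overrode whatever the pixels alone would have produced."""
    clean = Image.open(image_path).convert("RGB")
    answer_clean = query_vlm(clean, prompt)
    answer_stamped = query_vlm(overlay_text(clean, contradiction), prompt)
    return answer_clean.strip().lower() != answer_stamped.strip().lower()

# Example: does stamping "NO ACUTE FINDINGS" onto the image flip the reading?
# flipped = dominance_probe("xray_0042.png", "Is a fracture visible?", "NO ACUTE FINDINGS")
```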
Where it surfaces in production
The cleanest examples come from systems that route around the failure as if it were a vision bug or a text bug, never naming it as a fusion bug.
Document AI. The printed value in a financial table reads $1,240,000. The bar in the chart adjacent to that table is sized to roughly $1.6M. The model summarizes the document and reports a single number — usually one of the two, occasionally a smoothed compromise — without flagging that the table and the chart disagree about what is true. Recent benchmarks confirm what document teams already suspected: multimodal LLMs perform measurably better when the evidence sits in a table than when the same evidence sits in a chart, and small models show weak correlation between the two formats, meaning skill on one does not transfer to the other. The user gets a fluent summary that silently commits to whichever channel the model preferred.
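One mitigation is to compare the two channels explicitly before any summarization happens. A minimal sketch, assuming you already have separate extractors for the printed table value and the bar's implied value (both upstream steps are assumed, not shown):

```python
def channels_agree(table_value: float, chart_estimate: float,
                   rel_tolerance: float = 0.05) -> bool:
    """Charts are read approximately, so compare with a relative tolerance
    (5% here, an arbitrary assumption) rather than exact equality."""
    if table_value == 0:
        return abs(chart_estimate) <= rel_tolerance
    return abs(table_value - chart_estimate) / abs(table_value) <= rel_tolerance

table_value = 1_240_000.0     # printed in the table (from your table extractor)
chart_estimate = 1_600_000.0  # implied by the bar height (from your chart extractor)

if not channels_agree(table_value, chart_estimate):
    # Surface the conflict instead of letting the summarizer silently pick a winner.
    print(f"CHANNEL DISAGREEMENT: table says {table_value:,.0f}, "
          f"chart implies {chart_estimate:,.0f}")
```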
Screenshot agents. The button on the page renders the label "Submit." The DOM accessibility tree, exposed to the model alongside the screenshot, names the button "Cancel." (This happens more often than it should — stale aria-labels, A/B-test JavaScript that swaps text without touching the accessibility name, third-party widgets with inconsistent props.) The agent clicks based on whichever channel won the fusion. The downstream trace shows "navigation succeeded, action performed" because the click landed on a real element. The user finds out their cart was emptied. The compounding-state-error pattern documented in screenshot-driven agents — once the agent enters the wrong UI state, every subsequent perception is built on a false assumption — turns a single fusion miss into a multi-step incident.
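A guard that names the fusion bug instead of routing around it is to cross-check the two channels at the action boundary. A sketch, assuming a hypothetical `element` object that exposes an accessible name and a click method, and an `ocr` placeholder for whatever OCR engine the agent already runs; the similarity threshold is an arbitrary assumption.

```python
from difflib import SequenceMatcher

def ocr(screenshot_crop) -> str:
    """Placeholder: run the agent's existing OCR engine on the element's crop."""
    raise NotImplementedError

def channels_match(accessible_name: str, rendered_text: str,
                   threshold: float = 0.8) -> bool:
    """Fuzzy-compare the DOM's name for the element against the pixels' label."""
    a = accessible_name.strip().lower()
    b = rendered_text.strip().lower()
    return SequenceMatcher(None, a, b).ratio() >= threshold

def guarded_click(element, screenshot_crop):
    """Refuse to act when the two channels name the target differently."""
    rendered = ocr(screenshot_crop)
    if not channels_match(element.accessible_name, rendered):
        raise RuntimeError(
            f"Channel disagreement: DOM says {element.accessible_name!r}, "
            f"pixels say {rendered!r}; escalating instead of clicking.")
    element.click()
```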
Defect inspection. A photographed part shows visible cracking. A QA stamp in the corner of the photograph reads "PASSED." The vision-language pipeline reports the part as conforming, with high confidence. This is the modality-dominance failure in its purest commercial form: the OCR pathway carrying institutional authority (a stamp) overrides the pixel pathway carrying physical reality (a defect). The team running the eval suite did not include adversarial cases because their training distribution did not include them either; the stamps in the dataset agreed with the parts.
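One way to catch this dominance inside an inspection pipeline is a two-pass check: classify the photograph once as-is and once with every OCR-detected text region masked out. If the verdict flips, the stamp was doing the deciding. The sketch below assumes hypothetical `detect_text_boxes` and `classify_part` wrappers around your own OCR and inspection-model calls.

```python
from PIL import Image, ImageDraw

def detect_text_boxes(image: Image.Image) -> list[tuple[int, int, int, int]]:
    """Placeholder: return (x0, y0, x1, y1) boxes from your OCR engine."""
    raise NotImplementedError

def classify_part(image: Image.Image) -> str:
    """Placeholder: return 'PASS' or 'FAIL' from your inspection model."""
    raise NotImplementedError

def verdict_with_and_without_text(image_path: str) -> tuple[str, str]:
    original = Image.open(image_path).convert("RGB")
    masked = original.copy()
    draw = ImageDraw.Draw(masked)
    for box in detect_text_boxes(original):
        draw.rectangle(box, fill="gray")  # blank out stamps, labels, annotations
    return classify_part(original), classify_part(masked)

# with_text, without_text = verdict_with_and_without_text("part_0117.jpg")
# If the two verdicts differ, the stamp, not the crack, decided the outcome;
# that case should go to a human, not into the conformance report.
```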
User-uploaded content. A screenshot of a chart is uploaded with a caption claiming the chart shows revenue growth. The chart shows a decline. The model summarizing the upload produces a description that splits the difference — "the chart depicts revenue trends over time" — without acknowledging that the caption contradicts the image. In moderation pipelines this is the failure mode behind "the model didn't catch it"; it is not that the model couldn't see, but that it was given two stories and produced a third.
Why your eval set is blind to it
The standard multimodal eval is built from image-text pairs scraped or curated from sources where the text describes the image accurately. COCO captions, VQA datasets, document-screenshot pairs from web crawls — these are aligned distributions by construction. The failure mode requires misaligned distributions, where the channels carry different claims, and that distribution is rare enough in the training corpus that the model has no calibrated behavior for it. Even worse: the model learned that channels agree, so when they disagree, the model resolves the conflict using whatever priors it has, which are usually wrong for the deployed domain.
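Which means the eval has to manufacture the misaligned distribution itself. A simple construction is to take an existing aligned image-caption set and rotate the captions so every pair contradicts by design, then measure how often the model says anything about the mismatch. The sketch below is illustrative: `query_vlm` is a placeholder for your inference call, and the keyword markers are a crude stand-in for a real judge.

```python
CONFLICT_MARKERS = ("contradict", "does not match", "inconsistent", "mismatch")

def query_vlm(image, prompt: str) -> str:
    """Placeholder: call your actual vision-language model here."""
    raise NotImplementedError

def make_misaligned(aligned_pairs):
    """aligned_pairs: list of (image, caption). Rotate the captions by one so
    every image is paired with a caption written about a different image."""
    images = [img for img, _ in aligned_pairs]
    captions = [cap for _, cap in aligned_pairs]
    rotated = captions[1:] + captions[:1]
    return list(zip(images, rotated))

def conflict_flag_rate(misaligned_pairs) -> float:
    """Fraction of contradictory pairs where the model says anything at all
    about the caption and the image not agreeing."""
    flagged = 0
    for image, caption in misaligned_pairs:
        answer = query_vlm(
            image, f"The caption says: {caption!r}. Summarize what is shown.")
        if any(marker in answer.lower() for marker in CONFLICT_MARKERS):
            flagged += 1
    return flagged / max(len(misaligned_pairs), 1)
```

Rotation is the crudest possible construction; a real suite would verify that each swapped caption actually contradicts its new image and would score the model's answer with something stronger than a keyword list.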
- https://arxiv.org/html/2506.05551
- https://www.medrxiv.org/content/10.64898/2026.02.22.26346828v1
- https://arxiv.org/html/2511.10075
- https://arxiv.org/html/2411.17040v1
- https://arxiv.org/html/2511.21889v1
- https://medium.com/@ThinkingLoop/when-one-pixel-breaks-the-agent-f5dfdf573731
- https://arxiv.org/html/2604.14799v1
- https://www.arxiv.org/pdf/2511.19806
- https://arxiv.org/abs/2411.12591
- https://arxiv.org/html/2505.15865v1
