13 posts tagged with "multimodal"

Multimodal Channel Disagreement: When One Model Contradicts Itself Across Vision and Text

· 11 min read
Tian Pan
Software Engineer

The image is a photograph of a red octagonal stop sign. Someone has stuck a small sticker reading "YIELD" over the word in the middle. You ask the multimodal model: "What does this sign say?" The model answers: "The sign instructs drivers to yield to oncoming traffic at the intersection." Confident, fluent, and loyal to neither the visual evidence nor the textual evidence. It is a hybrid that splits the difference between channels that disagreed about what was true.

This failure mode does not have a settled name yet. Researchers studying multimodal hallucination call it "semantic hallucination," or "cross-modal bias," or "modality dominance," depending on which subfield is writing the paper. Practitioners shipping document AI, screenshot agents, and defect inspection systems run into it every week and describe it in their incident retros as "the model just made something up." It is not made up. It is the predictable output of an architecture that fuses two channels in its final layers without any primitive for representing the case where the channels say different things.
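To make "a primitive for representing disagreement" concrete, here is a minimal sketch of what such a primitive could look like at the pipeline level, outside the model. It assumes you can obtain two separate readings of the same image (for example, a shape/color classifier and an OCR pass). The names `ChannelReading`, `FusedAnswer`, and `fuse` are hypothetical and do not come from any library.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ChannelReading:
    channel: str       # e.g. "visual" or "ocr_text" (hypothetical labels)
    value: str         # what this channel believes the sign says
    confidence: float  # channel-local confidence in [0, 1]


@dataclass
class FusedAnswer:
    value: Optional[str]  # None when the channels disagree
    conflict: bool
    readings: Tuple[ChannelReading, ChannelReading]


def fuse(visual: ChannelReading, text: ChannelReading) -> FusedAnswer:
    """Return an agreed value, or an explicit conflict instead of a blend."""
    if visual.value.strip().lower() == text.value.strip().lower():
        return FusedAnswer(visual.value, False, (visual, text))
    # Disagreement is surfaced to the caller rather than averaged away.
    return FusedAnswer(None, True, (visual, text))


# The stop-sign example: shape and color say STOP, the sticker says YIELD.
result = fuse(
    ChannelReading("visual", "STOP", 0.90),
    ChannelReading("ocr_text", "YIELD", 0.95),
)
print(result.conflict, result.value)
```

The point of the design is that the conflict case has its own return shape, so downstream code has to decide what to do with it rather than receiving a fluent hybrid.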

Prompt Injection in Multimodal Inputs: The Visual Attack Surface Your Text-Only Defense Misses

· 11 min read
Tian Pan
Software Engineer

When teams harden their AI pipelines against prompt injection, they usually focus on text: sanitizing user input strings, scanning outputs for exfiltrated data, filtering known jailbreak patterns. That work matters, but it addresses roughly half the attack surface of a modern AI system. The other half lives inside images, PDFs, audio clips, and charts — formats that bypass every text-scanning rule you've written, because the model processes them through entirely different pathways than it processes text.

Steganographic injection attacks against vision-language models achieve success rates around 24% across production models including GPT-4V, Claude, and LLaVA. That number isn't a lab artifact. It measures real attack payloads, hidden in ordinary-looking images, causing production models to deviate from their intended behavior. Your text injection scanner doesn't see any of it.

Adding a Modality Is a Privacy-Classification Event, Not a Feature Flag

· 11 min read
Tian Pan
Software Engineer

A product manager pings the AI team on a Tuesday: "Customers want to paste screenshots into the support agent. Should be a small lift, right? The model already takes images." The eng lead checks the SDK, confirms the vision endpoint accepts JPEGs and PNGs, ships the change behind a feature flag, and rolls it to ten percent. Two weeks later, the legal team forwards a regulator letter asking why a user's bank statement, an image of their driver's license, and a screenshot containing another customer's order ID all appeared in the agent's training-eligible logs. Nobody on the AI team flagged the modality change, because nobody thought a modality change was a change. The privacy review that approved the text agent never re-ran for the image variant — and the image variant turned out to live under entirely different consent, retention, and residency rules.

This is not a story about a careless engineer. It is a story about a category error built into how most teams ship AI features. Text input is a known data class with a stable threat model: the user types, the user sees what they typed, the engineering team has years of habit around what to log and what to drop. Images are a different data class with a different threat model — they smuggle in metadata the user cannot see, capture surrounding content the user did not intend to share, and create storage and processing footprints with their own residency and contract terms. Treating "now with vision" as a UX iteration, when it is actually a privacy-classification event, is how teams discover at the regulator's request that their PII inventory understated their actual exposure by an order of magnitude.

The Avatar in the Conference Call: Engineering Real-Time Talking-Head AI for Video Meetings

· 12 min read
Tian Pan
Software Engineer

A voice agent with a face is not a voice agent with a face. It is a synchronous-video-AI system, and the difference shows up the first time a human watches the lips drift three frames behind the audio and decides — without being able to articulate why — that the thing on the screen is fake. The voice-only teams that built a 300ms speech pipeline and then bolted a rendering model onto the end of it have just inherited a real-time multimodal problem they did not price into the roadmap.

The threshold is not generous. Within roughly 45ms of audio-video offset, viewers report perfect sync. Past about 45ms with audio leading or 125ms with audio lagging, the brain flags the mismatch as wrong even when the viewer cannot point to the cause (late audio is tolerated better than early audio). Inside a conversational loop where the avatar must also listen, think, speak, and render, all while a network sits between you and the user, there is no slack to absorb a sloppy seam between the audio output and the rendered face.
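As a rough illustration of how tight that window is, a sync monitor might classify measured offsets along these lines. The limits used here (45 ms with audio early, 125 ms with audio late) are the detectability figures usually attributed to ITU-R BT.1359 and should be treated as assumptions to tune. The function and constant names are hypothetical.

```python
# Assumed detectability limits in milliseconds; tune for your own pipeline.
AUDIO_LEAD_LIMIT_MS = 45.0   # audio arrives before the matching video frame
AUDIO_LAG_LIMIT_MS = 125.0   # audio arrives after the matching video frame


def classify_av_offset(offset_ms: float) -> str:
    """Classify an audio-video offset.

    offset_ms > 0 means audio leads video; offset_ms < 0 means audio lags.
    """
    if offset_ms > AUDIO_LEAD_LIMIT_MS:
        return "audio_leads_too_far"
    if offset_ms < -AUDIO_LAG_LIMIT_MS:
        return "audio_lags_too_far"
    return "in_sync"


for sample in (0.0, 60.0, -100.0, -130.0):
    print(sample, classify_av_offset(sample))
```

Treating the two directions separately matters because the tolerance is asymmetric: a buffer that delays audio slightly is cheaper, perceptually, than one that lets audio run ahead of the face.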

Multimodal Eval Drift: Why Your Image and Audio Paths Regress While Text Stays Green

· 11 min read
Tian Pan
Software Engineer

The dashboard says quality is up two points this release. The text-eval suite ran clean. Your model provider shipped a new checkpoint that beats the prior one on every public benchmark you track. You roll forward. A week later the support team flags a quiet but persistent uptick in tickets about uploaded screenshots: users say the model is "reading the wrong numbers from the chart" or "missing a row in the table." Audio transcription complaints follow a few days later, mostly from speakers with non-American English accents. None of it shows up in your eval pipeline. The release looks healthy. It isn't.

This is multimodal eval drift, and almost every team that bolted vision and audio onto a text-first stack is shipping it. The eval discipline that worked for text — gold sets, LLM-as-judge, drift dashboards, an aggregate score that gates the release — extends to multimodal in name only. The failure rates per modality are not commensurable, the rubrics that catch text errors don't catch image errors, and the labeling pipeline that produced your text gold set is calibrated to a workload that ships every six months, not to a multimodal regression that arrives with every checkpoint update.

The right mental model is that multimodality is not a flag on the same model — it is a different product surface with a different failure distribution, and the eval discipline that ignored that distinction is shipping silent regressions every model release.
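A small sketch of why an aggregate gate hides this, and what a per-modality gate could look like instead. The scores are made up for illustration, the equal-weight mean is an assumption (real dashboards are often traffic-weighted), and `regressed_modalities` is a hypothetical helper.

```python
from statistics import mean
from typing import Dict, List


def regressed_modalities(
    prev: Dict[str, float], curr: Dict[str, float], max_drop: float = 0.01
) -> List[str]:
    """Return modalities whose score fell by more than max_drop."""
    return [m for m in prev if prev[m] - curr.get(m, 0.0) > max_drop]


# Illustrative scores only: text improves, image and audio regress.
prev_scores = {"text": 0.90, "image": 0.80, "audio": 0.85}
curr_scores = {"text": 0.99, "image": 0.77, "audio": 0.81}

aggregate_went_up = mean(curr_scores.values()) > mean(prev_scores.values())
blocked_by = regressed_modalities(prev_scores, curr_scores)

print("aggregate improved:", aggregate_went_up)
print("modalities that should block the release:", blocked_by)
```

With these numbers the aggregate rises while two of the three modalities regress, which is the shape of the incident described above: a green dashboard and a broken image path.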

Multi-Modal Agents in Production: What Text-Only Evals Never Catch

· 10 min read
Tian Pan
Software Engineer

Most teams building AI agents discover the same thing three months into production: their eval suite—carefully designed around text inputs and JSON outputs—tells them nothing useful about what happens when the agent encounters a blurry invoice, a scanned contract, or a screenshot of a UI it has never seen. The text-only eval passes. The user files a ticket.

Multi-modal inputs aren't just another modality to wire up. They introduce a distinct category of failure that requires different architecture decisions, different cost models, and different eval strategies. Teams that treat vision as a drop-in addition to a working text agent consistently underestimate the effort involved.

Multimodal AI in Production: The Gap Between Benchmarks and Reality

· 10 min read
Tian Pan
Software Engineer

Most teams adopting multimodal AI make the same mistake: they evaluate models on curated benchmark datasets and assume production performance will track. It doesn't. The gap between a vision model acing MMMU and that same model reliably extracting structured data from your invoices at scale is wide enough to sink a product launch. Vision encoders add latency that benchmark leaderboards don't measure. Spatial reasoning fails on the chart types your users actually send. Audio models that score well on clean speech disintegrate under real-world noise. And the task categories where multimodal genuinely outperforms text-only are narrower than vendors suggest.

This post is a field guide to that gap — where it shows up, why it exists, and which deployment patterns hold up under production load.

The Transcript Layer Lie: Why Your Multimodal Pipeline Hallucinates Downstream

· 9 min read
Tian Pan
Software Engineer

Your ASR system returned "the patient takes metaformin twice daily." The correct word was metformin. The transcript looked clean — no [INAUDIBLE] markers, no error flags. Confidence was 0.73 on that word. Your pipeline discarded that number and handed clean text to the LLM. The LLM, treating it as ground truth, reasoned about a medication that doesn't exist.

This is the transcript layer lie: the implicit assumption that intermediate text representations — whether produced by speech recognition, OCR, or vision models parsing a document — are reliable enough to pass downstream without qualification. They aren't. But almost every production pipeline treats them as if they are.
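One possible alternative to discarding the number is to carry per-word confidence into the text the LLM sees. The sketch below assumes the upstream ASR or OCR stage exposes a per-word score. The `Word` type, the `annotate_transcript` helper, the marker format, and the 0.8 threshold are all hypothetical choices, not a specific vendor's API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    confidence: float  # assumed per-word score in [0, 1] from the upstream stage


def annotate_transcript(words: List[Word], threshold: float = 0.8) -> str:
    """Render a transcript that marks low-confidence words instead of hiding them."""
    parts = []
    for w in words:
        if w.confidence < threshold:
            parts.append(f"[uncertain: {w.text} ({w.confidence:.2f})]")
        else:
            parts.append(w.text)
    return " ".join(parts)


words = [
    Word("the", 0.99), Word("patient", 0.98), Word("takes", 0.97),
    Word("metaformin", 0.73), Word("twice", 0.96), Word("daily", 0.97),
]
annotated = annotate_transcript(words)
print(annotated)
```

The downstream prompt can then instruct the model to treat marked spans as unverified, or the pipeline can route them to a lookup or a human before any reasoning happens.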

Multimodal Pipelines in Production: What Breaks When You Go Beyond Text

· 11 min read
Tian Pan
Software Engineer

Most LLM engineering wisdom — caching prompts, tuning temperature, budgeting tokens — assumes text goes in and text comes out. Add an image, a PDF, or an audio clip and almost none of that wisdom transfers. The preprocessing is different. The failure modes are different. The cost model is different. And the eval suite you built for your text pipeline won't catch the new things that break.

About 50% of enterprise knowledge lives in non-text formats: PDFs, slides, scanned forms, product images. Teams that reach that data discover that going multimodal isn't just adding a modality — it's adding an entirely new engineering surface.

Multimodal RAG in Production: When You Need to Search Images, Audio, and Text Together

· 12 min read
Tian Pan
Software Engineer

Most teams add multimodal RAG to their roadmap after realizing that a meaningful chunk of their corpus — product screenshots, recorded demos, architecture diagrams, support call recordings — is invisible to their text-only retrieval system. What surprises them in production is not the embedding model selection or the vector database choice. It's the gap between modalities: the same semantic concept encoded as an image and as a sentence lands in completely different regions of the vector space, and the search engine has no idea they're related.

This post covers the technical mechanics of multimodal embedding alignment, the cross-modal reranking strategies that actually work at scale, the cost and latency profile relative to text-only RAG, and the failure modes that are specific to multimodal retrieval.

Vision Inputs in Production AI Pipelines: The Preprocessing Decisions Nobody Documents

· 10 min read
Tian Pan
Software Engineer

Your vision model benchmarks 90%+ on your eval suite. Then real users upload photos of physical documents, screenshots from low-DPI monitors, and scanned PDFs that have been round-tripped through three fax machines. Accuracy craters. The model "works" — it returns coherent responses — but the responses are wrong in ways that are hard to catch without knowing the ground truth. You file it under "model limitations" and move on.

The model probably isn't the problem. The input pipeline is.

Most teams building with vision LLMs spend enormous effort on prompt engineering and model selection, and nearly zero effort on the preprocessing that happens before the image ever reaches the model. That asymmetry is where production quality goes to die. The preprocessing decisions nobody documents are also the ones responsible for the biggest silent accuracy drops in production multimodal systems.

Multimodal LLMs in Production: The Cost Math Nobody Runs Upfront

· 11 min read
Tian Pan
Software Engineer

Most teams add multimodal capabilities to an existing LLM pipeline without running the cost math first. They prototype with a few test images, it works, they ship — and then the first billing cycle arrives. The number is somewhere between embarrassing and catastrophic, depending on volume.

The problem isn't that multimodal AI is expensive in principle. It's that each modality has a distinct token arithmetic that compounds in ways that text-only intuition doesn't prepare you for. A single configuration parameter — video frame rate, image resolution mode, whether you're re-sending a system prompt every turn — can silently multiply your inference bill by 10x or more before you've noticed anything is wrong.
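A back-of-the-envelope sketch of that arithmetic for video. The tokens-per-frame figure and the price are placeholder assumptions, not any provider's published numbers; substitute the values from your own provider's documentation. Both function names are hypothetical.

```python
def video_input_tokens(duration_s: float, fps: float, tokens_per_frame: int) -> int:
    """Estimate input tokens for a clip sampled at a given frame rate."""
    frames = int(duration_s * fps)
    return frames * tokens_per_frame


def cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Convert a token count to dollars at a flat per-million-token price."""
    return tokens / 1_000_000 * usd_per_million_tokens


TOKENS_PER_FRAME = 250   # assumed placeholder
PRICE_PER_MILLION = 2.0  # assumed placeholder, USD

low = video_input_tokens(duration_s=60, fps=1, tokens_per_frame=TOKENS_PER_FRAME)
high = video_input_tokens(duration_s=60, fps=10, tokens_per_frame=TOKENS_PER_FRAME)

print(low, cost_usd(low, PRICE_PER_MILLION))
print(high, cost_usd(high, PRICE_PER_MILLION))
print("multiplier:", high / low)
```

Under these assumptions, changing one parameter (frame rate, from 1 to 10 frames per second) multiplies input tokens for the same one-minute clip by ten, before any other cost driver is considered.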