
8 posts tagged with "multimodal"


Multi-Modal Agents in Production: What Text-Only Evals Never Catch

· 10 min read
Tian Pan
Software Engineer

Most teams building AI agents discover the same thing three months into production: their eval suite—carefully designed around text inputs and JSON outputs—tells them nothing useful about what happens when the agent encounters a blurry invoice, a scanned contract, or a screenshot of a UI it has never seen. The text-only eval passes. The user files a ticket.

Multi-modal inputs aren't just another modality to wire up. They introduce a distinct category of failure that requires different architecture decisions, different cost models, and different eval strategies. Teams that treat vision as a drop-in addition to a working text agent consistently underestimate the effort involved.

Multimodal AI in Production: The Gap Between Benchmarks and Reality

· 10 min read
Tian Pan
Software Engineer

Most teams adopting multimodal AI make the same mistake: they evaluate models on curated benchmark datasets and assume production performance will track. It doesn't. The gap between a vision model acing MMMU and that same model reliably extracting structured data from your invoices at scale is wide enough to sink a product launch. Vision encoders add latency that benchmark leaderboards don't measure. Spatial reasoning fails on the chart types your users actually send. Audio models that score well on clean speech disintegrate under real-world noise. And the task categories where multimodal genuinely outperforms text-only are narrower than vendors suggest.

This post is a field guide to that gap — where it shows up, why it exists, and which deployment patterns hold up under production load.

The Transcript Layer Lie: Why Your Multimodal Pipeline Hallucinates Downstream

· 9 min read
Tian Pan
Software Engineer

Your ASR system returned "the patient takes metaformin twice daily." The correct word was metformin. The transcript looked clean — no [INAUDIBLE] markers, no error flags. Confidence was 0.73 on that word. Your pipeline discarded that number and handed clean text to the LLM. The LLM, treating it as ground truth, reasoned about a medication that doesn't exist.

This is the transcript layer lie: the implicit assumption that intermediate text representations — whether produced by speech recognition, OCR, or vision models parsing a document — are reliable enough to pass downstream without qualification. They aren't. But almost every production pipeline treats them as if they are.
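
A minimal sketch of the alternative, assuming a generic ASR result with word-level scores (the `Word` structure and the 0.85 threshold below are illustrative, not any vendor's schema): keep the confidence and surface it to the LLM instead of throwing it away.

```python
from dataclasses import dataclass

# Confidence-aware transcript rendering. The Word structure and the
# threshold are illustrative assumptions, not a specific ASR schema.

@dataclass
class Word:
    text: str
    confidence: float  # word-level score from the ASR system, 0.0-1.0

def annotate_transcript(words: list[Word], threshold: float = 0.85) -> str:
    """Render a transcript that flags low-confidence words instead of
    silently presenting them as ground truth."""
    parts = []
    for w in words:
        if w.confidence < threshold:
            # Surface uncertainty so the downstream LLM can hedge or ask.
            parts.append(f"[{w.text}?|conf={w.confidence:.2f}]")
        else:
            parts.append(w.text)
    return " ".join(parts)

words = [
    Word("the", 0.99), Word("patient", 0.98), Word("takes", 0.97),
    Word("metaformin", 0.73),  # the misrecognition from the example above
    Word("twice", 0.96), Word("daily", 0.97),
]
print(annotate_transcript(words))
# the patient takes [metaformin?|conf=0.73] twice daily
```

The downstream model can then hedge, ask for clarification, or route the span to review instead of reasoning about a medication that doesn't exist.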

Multimodal Pipelines in Production: What Breaks When You Go Beyond Text

· 11 min read
Tian Pan
Software Engineer

Most LLM engineering wisdom — caching prompts, tuning temperature, budgeting tokens — assumes text goes in and text comes out. Add an image, a PDF, or an audio clip and almost none of that wisdom transfers. The preprocessing is different. The failure modes are different. The cost model is different. And the eval suite you built for your text pipeline won't catch the new things that break.

By most industry estimates, roughly half of enterprise knowledge lives in non-text formats: PDFs, slides, scanned forms, product images. Teams that try to tap that data discover that going multimodal isn't just adding a modality — it's adding an entirely new engineering surface.

Multimodal RAG in Production: When You Need to Search Images, Audio, and Text Together

· 12 min read
Tian Pan
Software Engineer

Most teams add multimodal RAG to their roadmap after realizing that a meaningful chunk of their corpus — product screenshots, recorded demos, architecture diagrams, support call recordings — is invisible to their text-only retrieval system. What surprises them in production is not the embedding model selection or the vector database choice. It's the gap between modalities: the same semantic concept encoded as an image and as a sentence lands in completely different regions of the vector space, and the search engine has no idea they're related.

This post covers the technical mechanics of multimodal embedding alignment, the cross-modal reranking strategies that actually work at scale, the cost and latency profile relative to text-only RAG, and the failure modes that are specific to multimodal retrieval.
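
Two cheap diagnostics make that gap visible before it sinks retrieval quality. This is a sketch under stated assumptions: `text_embs` and `image_embs` are L2-normalized vectors from whichever dual encoder you deploy, with row j of each array describing the same underlying item.

```python
import numpy as np

def modality_gap(text_embs: np.ndarray, image_embs: np.ndarray) -> float:
    """Distance between the centroids of the text and image embedding
    clouds. Large values mean paired concepts land in different regions
    of the space, and raw nearest-neighbor search across modalities
    will miss them."""
    t = text_embs.mean(axis=0)
    i = image_embs.mean(axis=0)
    return float(np.linalg.norm(t / np.linalg.norm(t) - i / np.linalg.norm(i)))

def paired_recall_at_1(text_embs: np.ndarray, image_embs: np.ndarray) -> float:
    """Fraction of images whose nearest text neighbor is their own
    caption: the number that actually predicts retrieval quality."""
    sims = image_embs @ text_embs.T  # cosine, since rows are normalized
    return float((sims.argmax(axis=1) == np.arange(len(sims))).mean())
```

If paired recall across modalities sits far below recall within a single modality, the vector database isn't the problem; the alignment is, which is where cross-modal reranking earns its keep.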

Vision Inputs in Production AI Pipelines: The Preprocessing Decisions Nobody Documents

· 10 min read
Tian Pan
Software Engineer

Your vision model benchmarks 90%+ on your eval suite. Then real users upload photos of physical documents, screenshots from low-DPI monitors, and scanned PDFs that have been round-tripped through three fax machines. Accuracy craters. The model "works" — it returns coherent responses — but the responses are wrong in ways that are hard to catch without knowing the ground truth. You file it under "model limitations" and move on.

The model probably isn't the problem. The input pipeline is.

Most teams building with vision LLMs spend enormous effort on prompt engineering and model selection, and nearly zero effort on the preprocessing that happens before the image ever reaches the model. That asymmetry is where production quality goes to die. The preprocessing decisions nobody documents are also the ones responsible for the biggest silent accuracy drops in production multimodal systems.
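
As one concrete example of the decisions in question, here is a minimal normalization pass using Pillow. The edge thresholds are illustrative assumptions to tune against your own model's resizing behavior, not recommendations:

```python
from PIL import Image, ImageOps

MIN_EDGE = 768    # assumed floor: below this, small text often becomes unreadable
MAX_EDGE = 2048   # assumed ceiling: most APIs downsample anyway; resize on
                  # your side so you control the interpolation quality

def normalize_for_vision_model(path: str) -> Image.Image:
    img = Image.open(path)
    img = ImageOps.exif_transpose(img)       # honor camera orientation
    if img.mode not in ("RGB", "L"):
        img = img.convert("RGB")             # strip alpha channels and palettes
    short_edge, long_edge = min(img.size), max(img.size)
    if long_edge > MAX_EDGE:
        scale = MAX_EDGE / long_edge
        img = img.resize((round(img.width * scale), round(img.height * scale)),
                         Image.Resampling.LANCZOS)
    elif short_edge < MIN_EDGE:
        scale = MIN_EDGE / short_edge
        img = img.resize((round(img.width * scale), round(img.height * scale)),
                         Image.Resampling.LANCZOS)
    return img
```

Orientation, color mode, and controlled resampling are exactly the kind of steps that never make it into the design doc.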

Multimodal LLMs in Production: The Cost Math Nobody Runs Upfront

· 11 min read
Tian Pan
Software Engineer

Most teams add multimodal capabilities to an existing LLM pipeline without running the cost math first. They prototype with a few test images, it works, they ship — and then the first billing cycle arrives. The number is somewhere between embarrassing and catastrophic, depending on volume.

The problem isn't that multimodal AI is expensive in principle. It's that each modality has a distinct token arithmetic that compounds in ways that text-only intuition doesn't prepare you for. A single configuration parameter — video frame rate, image resolution mode, whether you're re-sending a system prompt every turn — can silently multiply your inference bill by 10x or more before you've noticed anything is wrong.
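
Here is a back-of-envelope version of that math. Every constant is an assumption to replace with your provider's documented per-image token counts and prices:

```python
# Illustrative token math for a video-input workload. All constants are
# assumptions; substitute your provider's documented numbers.

TOKENS_PER_FRAME_LOW = 85        # assumed: one low-res image input
TOKENS_PER_FRAME_HIGH = 1_100    # assumed: one high-res / tiled image input
PRICE_PER_1M_INPUT_TOKENS = 3.0  # assumed USD

def video_input_tokens(seconds: float, fps: float, high_res: bool) -> int:
    """Frames sampled from the clip, each billed as an image input."""
    per_frame = TOKENS_PER_FRAME_HIGH if high_res else TOKENS_PER_FRAME_LOW
    return int(seconds * fps * per_frame)

def monthly_cost_usd(requests_per_day: int, tokens_per_request: int) -> float:
    return requests_per_day * 30 * tokens_per_request * PRICE_PER_1M_INPUT_TOKENS / 1e6

# Same 60-second clip, two configuration changes apart:
base = video_input_tokens(60, fps=0.5, high_res=False)   #   2,550 tokens
worst = video_input_tokens(60, fps=2.0, high_res=True)   # 132,000 tokens

print(monthly_cost_usd(10_000, base))    # ~2,295 USD/month
print(monthly_cost_usd(10_000, worst))   # ~118,800 USD/month
```

Same clip, same model, two configuration changes apart, and roughly 50x the bill in this toy example: that is the multiplier hiding in plain sight.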

Multimodal LLM Inputs in Production: Vision, Documents, and the Failure Modes Nobody Warns You About

· 9 min read
Tian Pan
Software Engineer

Adding vision to an LLM application looks deceptively simple. You swap a text model for a multimodal one, pass in an image alongside your prompt, and the demo works brilliantly. Then you push to production and discover that half your invoices get the total wrong, tables in PDFs lose their structure, and low-quality scans produce confident hallucinations. The debugging is harder than anything you faced with text-only systems, because the failures are visual and the LLM will not tell you it cannot see clearly.

This post covers what actually goes wrong when you move multimodal LLM inputs from prototype to production, and the architectural decisions that prevent those failures.
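
One of those architectural decisions, sketched here with an illustrative schema (the field names are assumptions, not a prescribed format): cross-check any extracted number that arithmetic can verify, rather than trusting the vision model's output.

```python
from decimal import Decimal

# Arithmetic cross-checks on structured extraction output. Field names
# are illustrative; adapt them to your own extraction schema.

def validate_invoice(extraction: dict, tolerance: Decimal = Decimal("0.01")) -> list[str]:
    """Return a list of consistency failures instead of trusting the model."""
    errors = []
    line_sum = sum(Decimal(str(item["amount"]))
                   for item in extraction.get("line_items", []))
    subtotal = Decimal(str(extraction["subtotal"]))
    tax = Decimal(str(extraction.get("tax", "0")))
    total = Decimal(str(extraction["total"]))

    if abs(line_sum - subtotal) > tolerance:
        errors.append(f"line items sum to {line_sum}, subtotal says {subtotal}")
    if abs(subtotal + tax - total) > tolerance:
        errors.append(f"subtotal {subtotal} + tax {tax} != total {total}")
    return errors

bad = {
    "line_items": [{"amount": "120.00"}, {"amount": "35.50"}],
    "subtotal": "155.50",
    "tax": "12.44",
    "total": "176.94",   # hallucinated: should be 167.94
}
print(validate_invoice(bad))
# ['subtotal 155.50 + tax 12.44 != total 176.94']
```

A failed check can route the document to a higher-resolution re-read or a human queue; a passed check is not proof, but it catches the wrong-total class of failure almost for free.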