
3 posts tagged with "vision-models"


Adding a Modality Is a Privacy-Classification Event, Not a Feature Flag

11 min read
Tian Pan
Software Engineer

A product manager pings the AI team on a Tuesday: "Customers want to paste screenshots into the support agent. Should be a small lift, right? The model already takes images." The eng lead checks the SDK, confirms the vision endpoint accepts JPEGs and PNGs, ships the change behind a feature flag, and rolls it to ten percent. Two weeks later, the legal team forwards a regulator letter asking why a user's bank statement, an image of their driver's license, and a screenshot containing another customer's order ID all appeared in the agent's training-eligible logs. Nobody on the AI team flagged the modality change, because nobody thought a modality change was a change. The privacy review that approved the text agent never re-ran for the image variant — and the image variant turned out to live under entirely different consent, retention, and residency rules.

This is not a story about a careless engineer. It is a story about a category error built into how most teams ship AI features. Text input is a known data class with a stable threat model: the user types, the user sees what they typed, the engineering team has years of habit around what to log and what to drop. Images are a different data class with a different threat model — they smuggle in metadata the user cannot see, capture surrounding content the user did not intend to share, and create storage and processing footprints with their own residency and contract terms. Treating "now with vision" as a UX iteration, when it is actually a privacy-classification event, is how teams discover at the regulator's request that their PII inventory understated their actual exposure by an order of magnitude.
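The metadata point deserves emphasis, because it is the part users cannot audit themselves: a photo or screenshot can carry GPS coordinates, device identifiers, and capture timestamps in its EXIF block. Below is a minimal sketch of the kind of guard that belongs in front of any image-logging path, using Pillow; the function names and the JPEG re-encode are illustrative choices, not a prescribed implementation.

```python
from io import BytesIO

from PIL import Image
from PIL.ExifTags import TAGS

def audit_exif(image_bytes: bytes) -> dict:
    """Surface the EXIF tags hiding in an upload: GPS, device IDs, timestamps."""
    exif = Image.open(BytesIO(image_bytes)).getexif()
    return {TAGS.get(tag_id, tag_id): value for tag_id, value in exif.items()}

def strip_metadata(image_bytes: bytes) -> bytes:
    """Re-encode pixel data only, explicitly writing an empty EXIF block."""
    img = Image.open(BytesIO(image_bytes)).convert("RGB")
    out = BytesIO()
    # Pillow only writes EXIF when asked; passing exif=b"" makes the drop explicit.
    img.save(out, format="JPEG", quality=95, exif=b"")
    return out.getvalue()
```

Stripping at the ingestion boundary is the cheap half of the fix; the consent, retention, and residency questions above still need the full review.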

Multimodal AI in Production: The Gap Between Benchmarks and Reality

10 min read
Tian Pan
Software Engineer

Most teams adopting multimodal AI make the same mistake: they evaluate models on curated benchmark datasets and assume production performance will track. It doesn't. The gap between a vision model acing MMMU and that same model reliably extracting structured data from your invoices at scale is wide enough to sink a product launch. Vision encoders add latency that benchmark leaderboards don't measure. Spatial reasoning fails on the chart types your users actually send. Audio models that score well on clean speech disintegrate under real-world noise. And the task categories where multimodal genuinely outperforms text-only are narrower than vendors suggest.

This post is a field guide to that gap — where it shows up, why it exists, and which deployment patterns hold up under production load.
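Before trusting a leaderboard number, it is worth timing the gap yourself: send the same prompt with and without an image to the endpoint you actually plan to ship on, and read the tail latency rather than the mean. A rough harness under stated assumptions (an OpenAI-compatible chat API; the model name and image path are placeholders):

```python
import base64
import time

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY; any OpenAI-compatible endpoint works

def p50_p95(messages, model="gpt-4o-mini", n=20):
    """Wall-clock p50/p95 latency in seconds over n identical requests."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        client.chat.completions.create(model=model, messages=messages, max_tokens=64)
        samples.append(time.perf_counter() - start)
    samples.sort()
    return samples[len(samples) // 2], samples[int(len(samples) * 0.95) - 1]

prompt = "Summarize the total amount due."
with open("sample_invoice.png", "rb") as f:  # placeholder: a representative user image
    image_b64 = base64.b64encode(f.read()).decode()

text_only = [{"role": "user", "content": prompt}]
with_image = [{"role": "user", "content": [
    {"type": "text", "text": prompt},
    {"type": "image_url",
     "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
]}]

print("text-only  p50/p95:", p50_p95(text_only))
print("with-image p50/p95:", p50_p95(with_image))
```

Twenty samples is enough to spot a gross encoder cost, not to characterize it; run the harness from the region your production traffic will actually originate in.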

Why Vision Models Ace Benchmarks but Fail on Your Enterprise PDFs

9 min read
Tian Pan
Software Engineer

A benchmark result of 97% accuracy on a document understanding dataset looks compelling until you run it against your company's actual invoice archive and realize it's quietly garbling 30% of the line items. The model doesn't throw an error. It doesn't return low confidence. It just produces output that looks plausible and is wrong.

This is the defining failure mode of production document AI: silent corruption. Unlike a crash or an exception, silent corruption propagates. The garbled table cell flows into the downstream aggregation, the aggregation feeds a report, the report drives a decision. By the time you notice, tracing the root cause is archaeology.

The gap between benchmark performance and production performance in document AI is real, persistent, and poorly understood by teams evaluating these models. Understanding why it exists — and how to defend against it — is the engineering problem this post addresses.
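A preview of the defense: since the model will not signal failure, validation has to come from arithmetic the document already contains. Line items carry identities (quantity times unit price equals the line amount; line amounts sum to the stated total), and a garbled cell usually violates one of them. A minimal sketch with hypothetical field names, assuming the extractor emits JSON-like records:

```python
from decimal import Decimal

TOLERANCE = Decimal("0.01")

def validate_extraction(extracted: dict) -> list[str]:
    """Cross-check an extracted invoice; return discrepancies instead of trusting it.

    Field names (line_items, quantity, unit_price, amount, invoice_total) are
    hypothetical; adapt them to whatever schema your extractor emits.
    """
    errors = []
    computed_total = Decimal("0")
    for i, item in enumerate(extracted.get("line_items", [])):
        qty = Decimal(str(item["quantity"]))
        unit = Decimal(str(item["unit_price"]))
        amount = Decimal(str(item["amount"]))
        # Internal identity: quantity * unit price should equal the line amount.
        if abs(qty * unit - amount) > TOLERANCE:
            errors.append(f"line {i}: {qty} x {unit} != {amount}")
        computed_total += amount
    stated = Decimal(str(extracted["invoice_total"]))
    # Cross-field identity: line amounts should sum to the stated total.
    if abs(computed_total - stated) > TOLERANCE:
        errors.append(f"line items sum to {computed_total}, document says {stated}")
    return errors

if __name__ == "__main__":
    sample = {
        "line_items": [
            {"quantity": 2, "unit_price": "19.99", "amount": "39.98"},
            # A plausibly garbled cell: the total still reconciles, the line does not.
            {"quantity": 1, "unit_price": "5.00", "amount": "50.00"},
        ],
        "invoice_total": "89.98",
    }
    for issue in validate_extraction(sample):
        print("FLAG:", issue)
```

Note the sample's failure mode: the garbled line still reconciles against the stated total, which is why the per-line identity check earns its keep. Anything flagged goes to human review instead of the downstream aggregation.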