Skip to main content

678 posts tagged with "ai-engineering"

View all tags

Multi-Tenant AI Systems: Isolation, Customization, and Cost Attribution at Scale

· 10 min read
Tian Pan
Software Engineer

Most teams building SaaS products on top of LLMs discover the multi-tenancy problem the hard way: they ship fast using a single shared prompt config, then watch in horror as one customer's system prompt leaks into another's response, one enterprise client burns through everyone's rate limit, or the monthly AI bill arrives with no way to determine which customer caused 40% of the spend. The failure mode isn't theoretical—a 2025 paper at NDSS demonstrated that prefix caching in vLLM, SGLang, LightLLM, and DeepSpeed could be exploited to reconstruct another tenant's prompt with 99% accuracy using nothing more than timing signals and crafted requests.

Building multi-tenant AI infrastructure is not the same as multi-tenanting a traditional database. The shared components—inference servers, KV caches, embedding pipelines, retrieval indexes—each present distinct isolation challenges. This post covers the four problems you actually have to solve: isolation, customization, cost attribution, and per-tenant quality tracking.

Multi-Modal Agents in Production: What Text-Only Evals Never Catch

· 10 min read
Tian Pan
Software Engineer

Most teams building AI agents discover the same thing three months into production: their eval suite—carefully designed around text inputs and JSON outputs—tells them nothing useful about what happens when the agent encounters a blurry invoice, a scanned contract, or a screenshot of a UI it has never seen. The text-only eval passes. The user files a ticket.

Multi-modal inputs aren't just another modality to wire up. They introduce a distinct category of failure that requires different architecture decisions, different cost models, and different eval strategies. Teams that treat vision as a drop-in addition to a working text agent consistently underestimate the effort involved.

Multimodal AI in Production: The Gap Between Benchmarks and Reality

· 10 min read
Tian Pan
Software Engineer

Most teams adopting multimodal AI make the same mistake: they evaluate models on curated benchmark datasets and assume production performance will track. It doesn't. The gap between a vision model acing MMMU and that same model reliably extracting structured data from your invoices at scale is wide enough to sink a product launch. Vision encoders add latency that benchmark leaderboards don't measure. Spatial reasoning fails on the chart types your users actually send. Audio models that score well on clean speech disintegrate under real-world noise. And the task categories where multimodal genuinely outperforms text-only are narrower than vendors suggest.

This post is a field guide to that gap — where it shows up, why it exists, and which deployment patterns hold up under production load.

The 90% Reliability Wall: Why AI Features Plateau and What to Do About It

· 9 min read
Tian Pan
Software Engineer

Your AI feature ships at 92% accuracy. The team celebrates. Three months later, progress has flatlined — the error rate stopped falling despite more data, more compute, and two model upgrades. Sound familiar?

This is the 90% reliability wall, and it is not a coincidence. It emerges from three converging forces: the exponential cost of marginal accuracy gains, the difference between errors you can eliminate and errors that are structurally unavoidable, and the compound amplification of failure in production environments that benchmarks never capture. Teams that do not understand which force they are fighting will waste quarters trying to solve problems that are not solvable.

On-Call for Stochastic Systems: Why Your AI Runbook Needs a Rewrite

· 10 min read
Tian Pan
Software Engineer

You get paged at 2 AM. Latency is up, error rates are spiking. You SSH in, pull logs, and—nothing. No stack trace pointing to a bad deploy. No null pointer exception on line 247. Just a stream of model outputs that are subtly, unpredictably wrong in ways that only become obvious when you read 50 of them in a row.

This is what incidents look like in LLM-powered systems. And the traditional alert-triage-fix loop was not built for it.

The standard on-call playbook assumes three things: failures are deterministic (same input, same bad output), root cause is locatable (some code changed, some resource exhausted), and rollback is straightforward (revert the deploy, done). None of these hold for stochastic AI systems. The same prompt produces different outputs. Root cause is usually a probability distribution, not a line of code. And you cannot "rollback" a model that a third-party provider updated silently overnight.

The Orchestration Framework Trap: When LangChain Makes You Slower to Ship

· 8 min read
Tian Pan
Software Engineer

At some point in 2024, a pattern started appearing in engineering postmortems across AI teams: "We rewrote it without LangChain and shipping became significantly faster." The teams in these postmortems hadn't made a technical mistake in adopting the framework — they'd made a timing mistake. LangChain was the right tool for the prototype and the wrong tool for month seven.

The same story played out enough times that it has a name now: the orchestration framework trap. You adopt a framework that genuinely accelerates early work, and the productivity gain masks a growing structural debt. By the time the debt is visible, you're deep in internals that were never meant to be touched.

The Over-Tooled Agent Problem: Why More Tools Make Your LLM Dumber

· 9 min read
Tian Pan
Software Engineer

When a team at Writer instrumented their RAG-MCP benchmark, they found that baseline tool selection accuracy — with no special handling — was 13.62% when the agent had access to a large set of tools. Not 80%. Not 60%. Thirteen percent. The same agent, with retrieval-augmented tool selection exposing only the most relevant subset, reached 43%. The tools didn't change. The model didn't change. Only the number of tool definitions visible at reasoning time changed.

This is the over-tooled agent problem, and it's quietly wrecking production AI systems at scale.

The Privacy Architecture of Embeddings: What Your Vector Store Knows About Your Users

· 10 min read
Tian Pan
Software Engineer

Most engineers treat embeddings as safely abstract — a bag of floating-point numbers that can't be reverse-engineered. That assumption is wrong, and the gap between perception and reality is where user data gets exposed.

Recent research achieved over 92% accuracy reconstructing exact token sequences — including full names, health diagnoses, and email addresses — from text embeddings alone, without access to the original encoder model. These aren't theoretical attacks. Transferable inversion techniques work in black-box scenarios where an attacker builds a surrogate model that mimics your embedding API. The attack surface exists whether you're using a proprietary model or an open-source one.

This post covers the three layers of embedding privacy risk: what inversion attacks can actually do, where access control silently breaks down in retrieval pipelines, and the architectural patterns — per-user namespacing, retrieval-time permission filtering, audit logging, and deletion-safe design — that give your users appropriate control over what gets retrieved on their behalf.

The Prompt Governance Problem: Managing Business Logic That Lives Outside Your Codebase

· 9 min read
Tian Pan
Software Engineer

A junior PM edits a customer-facing prompt during a product sprint to "make it sound friendlier." Two weeks later, a backend engineer tweaks the same prompt to fix a formatting quirk. An ML engineer, unaware of either change, adds chain-of-thought instructions in a separate system message that now conflicts with the PM's edit. None of these changes have a ticket. None have a reviewer. None have a rollback plan.

This is how most teams manage prompts. And at five prompts, it's annoying. At fifty, it's a liability.

Prompt Injection Is a Supply Chain Problem, Not an Input Validation Problem

· 9 min read
Tian Pan
Software Engineer

Five carefully crafted documents hidden among a million clean ones can achieve a 90% attack success rate against a production RAG system. Not through zero-days or cryptographic breaks — through plain text that instructs the model to behave differently than its operators intended. If your defense strategy is "sanitize inputs before they reach the LLM," you have already lost.

The framing matters. Teams that treat prompt injection as an input validation problem build perimeter defenses: regex filters, LLM-based classifiers, output scanners. These are useful but insufficient. The real problem is that modern AI systems are compositions of components — retrievers, knowledge bases, tool executors, external APIs — and each component is an ingestion point with its own attack surface. That is the definition of a supply chain vulnerability.

Prompt Localization Debt: The Silent Quality Tiers Hiding in Your Multilingual AI Product

· 9 min read
Tian Pan
Software Engineer

Your AI feature shipped with a 91% task success rate. You ran evals, iterated on your prompt, and tuned it until it hit your quality bar. Then you launched globally — and three months later a user in Tokyo files a support ticket that your AI "doesn't really understand" their input. Your Japanese users have been silently working around a feature that performs 15–20 percentage points worse than what your English users experience. Nobody on your team noticed because nobody was measuring it.

This is prompt localization debt: the accumulating gap between how well your AI performs in the language you built it for and every other language your users speak. It doesn't announce itself in dashboards. It doesn't cause outages. It just quietly creates second-class users.

Red-Teaming Consumer LLM Features: Finding Injection Surfaces Before Your Users Do

· 9 min read
Tian Pan
Software Engineer

A dealership deployed a ChatGPT-powered chatbot. Within days, a user instructed it to agree with anything they said, then offered $1 for a 2024 SUV. The chatbot accepted. The dealer pulled it offline. This wasn't a sophisticated attack — it was a three-sentence prompt from someone who wanted to see what would happen.

At consumer scale, that curiosity is your biggest security threat. Internal LLM agents operate inside controlled environments with curated inputs and trusted data. Consumer-facing LLM features operate in adversarial conditions by default: millions of users, many actively probing for weaknesses, and a stochastic model that has no concept of "this user seems hostile." The security posture these two environments require is fundamentally different, and teams that treat consumer features like internal tooling find out the hard way.