Skip to main content

12 posts tagged with "multi-tenancy"

View all tags

Per-Tenant Prompt Compilation: When Your System Prompt Becomes a Build Artifact

· 10 min read
Tian Pan
Software Engineer

The day a multi-tenant SaaS team adds the third if tenant_industry == "healthcare" branch to its system prompt is the day it accidentally hires itself a compiler engineer. Nobody filed the headcount req. Nobody scoped the work. The team thinks it is shipping a feature; it is actually shipping a build system, and the build system is held together with f-strings.

Every team that scales an AI feature into a customer base with even mild heterogeneity hits the same wall. Tenant A is in healthcare and needs HIPAA-aware response framing. Tenant B is in legal and needs strict citation discipline. Tenant C is an enterprise that bought a custom safety rubric in the master agreement. Tenant D is on the free tier and gets the default. The first instinct is to handle the variance with runtime conditionals, and the conditionals nest until the prompt becomes unreadable to anyone who didn't write it. The second instinct — and the one most teams arrive at after the wall — is prompt compilation: the canonical "prompt" is no longer a string but a source artifact, and what reaches the model is a compiled output.

Tenancy Leaks Through Few-Shot Examples: When Your Prompt Library Becomes a Cross-Customer Data Store

· 11 min read
Tian Pan
Software Engineer

Open the production system prompt of a maturing AI product, scroll past the role description, and you will almost always find a section labeled # Examples or ## Few-shot demonstrations. The examples are excellent — they are concrete, they are domain-specific, they pattern-match exactly the failure modes the eval set was struggling with last quarter. They are also, on closer inspection, real customer data. A real ticket ID from a real account. A phrasing pattern lifted verbatim from a support thread. An internal product code that one tenant uses and the rest of the customer base has never heard of.

The team that put them there is not careless. The examples got into the prompt the way good examples always get into prompts: someone mined production traces for cases the model handled poorly, picked the cleanest worked example, pasted it into the system message, watched the eval scores climb, and shipped. That pipeline — production trace to system prompt — is the most reliable prompt-improvement loop in modern LLM engineering. It is also a structural cross-tenant data leak that the team built without noticing, and the system prompt has quietly become a multi-tenant data store the data-processing agreement never priced.

Prompt Cache as Covert Channel: TTFT Probing Leaks Cross-Tenant Prompts

· 11 min read
Tian Pan
Software Engineer

Prompt caching is the optimization that pays for itself the moment you turn it on. A long system prompt is hashed once, the KV state lives in GPU memory, and every subsequent request that reuses the prefix skips the prefill cost. Providers report 80% latency reduction and 90% input-cost reduction on cached requests, and at scale the math is irresistible: a single shared prefix amortized across millions of calls turns a line item into a rounding error.

The mechanism that makes the savings work is a shared resource whose hit-or-miss state is observable as latency. That observability is the side channel. A cache hit and a cache miss are distinguishable from outside the network, the difference is large and deterministic, and the optimization that earned its place on the cost dashboard has a second job nobody scoped: it leaks information about what other tenants on the same provider are doing right now.

The Privacy Boundary No One Tests: Why 'Stateless' Tools Are the AI-Era IDOR

· 10 min read
Tian Pan
Software Engineer

A tool labeled "stateless" is a promise the runtime cannot keep. Behind the function signature sits a Redis cache, a vector index, an embedding store, a rate-limit table, a memoization layer, an LRU on the hot path — any one of which is a shared substrate where one user's data can land on another user's response. The function is stateless. The system is not. And in 2026, this is the most common privacy bug I see in agentic systems, because almost no one tests for it.

The shape of the bug is depressingly familiar to anyone who has worked on classic web apps. Insecure Direct Object Reference — IDOR — was the bread and butter of bug bounty for a decade: a request handler that accepts a record ID and returns the record without checking whether the caller is allowed to see it. The AI-era version is the same bug with a worse blast radius: a tool call that accepts a query and returns data without checking whether the caller's tenant owns that data. The query is in natural language. The cache key is a hash. The retrieval is approximate. None of those things absolve you of authorization, but each of them makes the bug harder to spot in code review.

Prompt Cache Thrashing: When Your Largest Tenant's Launch Triples Everyone's Bill

· 10 min read
Tian Pan
Software Engineer

The bill arrives on the first of the month and it is three times what your spreadsheet said it would be. Nobody pushed a system prompt change. The dashboard says request volume is flat. p95 latency looks normal. The token-per-correct-task ratio is unchanged. And yet you owe the inference vendor an extra forty thousand dollars, and the only signal in the observability stack that even hints at why is a metric most teams never alarm on: cache hit rate, which dropped from 71% to 18% somewhere in the second week of the billing cycle, on a Tuesday, at 9:47 AM Pacific, which is when your largest tenant's customer-success team kicked off a coordinated onboarding push for two hundred new users.

Welcome to prompt cache thrashing — the multi-tenant failure mode that the SaaS playbook was supposed to have eliminated a decade ago, reintroduced through the back door by your inference provider's shared prefix cache. The provider's cache is shared across your organization's traffic. Your tenants share that cache with each other whether you want them to or not, and a single tenant whose prefix shape shifts overnight can evict the prefixes everyone else's unit economics depended on. The bill spikes for tenants who did nothing differently. Finance pages engineering. Engineering points at the dashboard, which shows nothing wrong, because the dashboard isn't measuring the thing that broke.

Per-Tenant Inference Isolation: When Shared Cache, Fine-Tunes, and Embeddings Leak Across Customers

· 12 min read
Tian Pan
Software Engineer

Multi-tenant SaaS solved data isolation a decade ago. Row-level security in Postgres, per-tenant encryption keys, S3 bucket policies scoped to tenant prefixes — by 2018 the playbook was so well-rehearsed that an auditor asking "show me how customer A's data cannot reach customer B" had a one-page answer with a citation per layer. AI features quietly reintroduced the question and the answer is no longer one page.

The interesting part is not that AI broke isolation. The interesting part is where it broke isolation: not at the data layer the audit team has been guarding for ten years, but at four new layers nobody put on the diagram. Prompt cache prefixes share KV state across requests in ways that turn time-to-first-token into a side channel. Fine-tunes trained on aggregated customer data memorize tenant-specific phrasing and surface it back to the wrong customer. Embedding indexes get partitioned logically by query filter when the threat model demands physical separation. KV-cache reuse across requests creates timing channels that nobody threat-modeled when "shared inference is fine" was a reasonable shortcut.

This post is about what changed and what the discipline looks like once you take the problem seriously.

Your P99 Is Following a Stranger's Traffic: The Noisy-Neighbor Tax in Hosted LLM Inference

· 10 min read
Tian Pan
Software Engineer

Your dashboards are clean. The deployment from yesterday rolled back cleanly. The model version is pinned. The prompt didn't change. But your TTFT p99 just doubled, your customer success channel is on fire, and the only honest answer you can give is "it's the provider." That answer feels small — like a shrug — and it usually leads to a follow-up question that nobody on your team can answer: prove it.

This is the part of hosted LLM inference that the marketing pages do not discuss. When you call a frontier model API, you are sharing a GPU, a PCIe fabric, a continuous batch, and a KV-cache budget with workloads you cannot see. Your p99 is a function of their bursts. The economics of large-scale inference depend on multiplexing tenants tightly enough that hardware utilization stays north of 60-70%, which means your tail latency is structurally coupled to the largest, jankiest, lumpiest tenant on the same shard. You are not buying capacity; you are buying a slice of a queue that someone else is also standing in.

Multi-Tenant AI Systems: Isolation, Customization, and Cost Attribution at Scale

· 10 min read
Tian Pan
Software Engineer

Most teams building SaaS products on top of LLMs discover the multi-tenancy problem the hard way: they ship fast using a single shared prompt config, then watch in horror as one customer's system prompt leaks into another's response, one enterprise client burns through everyone's rate limit, or the monthly AI bill arrives with no way to determine which customer caused 40% of the spend. The failure mode isn't theoretical—a 2025 paper at NDSS demonstrated that prefix caching in vLLM, SGLang, LightLLM, and DeepSpeed could be exploited to reconstruct another tenant's prompt with 99% accuracy using nothing more than timing signals and crafted requests.

Building multi-tenant AI infrastructure is not the same as multi-tenanting a traditional database. The shared components—inference servers, KV caches, embedding pipelines, retrieval indexes—each present distinct isolation challenges. This post covers the four problems you actually have to solve: isolation, customization, cost attribution, and per-tenant quality tracking.

The Multi-Tenant LLM Problem: Noisy Neighbors, Isolation, and Fairness at Scale

· 12 min read
Tian Pan
Software Engineer

Your SaaS product launches with ten design customers. Everything works beautifully. Then you onboard a hundred tenants, and one of them — a power user running 200K-token context windows on a complex research workflow — causes every other customer's latency to spike. Support tickets start arriving. You look at your dashboards and see nothing obviously wrong: your model is healthy, your API returns 200s, and your p50 latency looks fine. Your p95 has silently tripled.

This is the noisy neighbor problem, and it hits LLM infrastructure harder than almost any other shared system. Here's why it's harder to solve than it is in databases — and the patterns that actually work.

Vector Store Access Control: The Row-Level Security Problem Most RAG Teams Skip

· 11 min read
Tian Pan
Software Engineer

Most teams building multi-tenant RAG systems get authentication right and authorization wrong. They validate that users are who they claim to be, then retrieve documents from a shared vector index and filter the results before sending them to the LLM. That filter—the post-retrieval kind—is security theater. By the time you remove unauthorized documents from the list, they're already in the model's context window.

The real problem runs deeper than a misplaced filter. Most RAG systems treat document authorization as an ingest-time concern ("can this user upload this document?") but fail entirely to enforce it at query time ("can this user see documents matching this query?"). The gap between those two checkpoints is where silent data leakage lives—and it's where most production incidents originate.

The Noisy Neighbor Problem in Shared LLM Infrastructure: Tenancy Models for AI Features

· 12 min read
Tian Pan
Software Engineer

The pager goes off at 2:47 AM. The customer-facing chat assistant is returning 429s for half of paying users. Engineers scramble through dashboards, looking for the bug they shipped that afternoon. They find nothing — the code is fine. The actual culprit is a batch summarization job a different team launched that evening, sharing the same provider API key, which has eaten the account's per-minute token budget for the next four hours. Nobody owns the shared key. Nobody owns the limit.

This is the noisy-neighbor problem, and it has a particular cruelty in LLM systems that classic API quota incidents do not. A REST endpoint that hits its rate ceiling fails fast and gets retried; an LLM token-per-minute bucket is consumed asymmetrically by request content, so a single feature emitting 8K-token completions can starve a feature making cheap 200-token classification calls without ever appearing in request-count graphs. The traffic isn't noisy in the dimension you're measuring.

Most teams discover this the way the team above did: an unrelated team's job collides with a paying user's session, and the only thing both have in common is a string in an environment variable.

Cross-Tenant Data Leakage in Shared LLM Infrastructure: The Isolation Failures Nobody Tests For

· 11 min read
Tian Pan
Software Engineer

Most multi-tenant LLM products have a security gap that their engineers haven't tested for. Not a theoretical gap — a practical one, with documented attack vectors and real confirmed incidents. The gap is this: each layer of the modern AI stack introduces its own isolation primitive, and each one can fail silently in ways that let one customer's data reach another customer's context.

This isn't about prompt injection or jailbreaking. It's about the infrastructure itself — prompt caches, vector indexes, memory stores, and fine-tuning pipelines — and the organizational fiction of "isolation" that most teams ship without validating.