Multi-Tenant AI Systems: Isolation, Customization, and Cost Attribution at Scale

· 10 min read
Tian Pan
Software Engineer

Most teams building SaaS products on top of LLMs discover the multi-tenancy problem the hard way: they ship fast using a single shared prompt config, then watch in horror as one customer's system prompt leaks into another's response, one enterprise client burns through everyone's rate limit, or the monthly AI bill arrives with no way to determine which customer caused 40% of the spend. The failure mode isn't theoretical—a 2025 paper at NDSS demonstrated that prefix caching in vLLM, SGLang, LightLLM, and DeepSpeed could be exploited to reconstruct another tenant's prompt with 99% accuracy using nothing more than timing signals and crafted requests.

Building multi-tenant AI infrastructure is not the same as multi-tenanting a traditional database. The shared components—inference servers, KV caches, embedding pipelines, retrieval indexes—each present distinct isolation challenges. This post covers the four problems you actually have to solve: isolation, customization, cost attribution, and per-tenant quality tracking.

The Isolation Problem Is Worse Than You Think

In traditional web applications, tenant isolation is mostly a database schema decision. In AI systems, isolation must be enforced at every layer of the request pipeline: authentication, tenant resolution, context compilation, inference, and caching.

The correct order matters. Authentication must happen before tenant resolution; tenant resolution before session derivation; session loading before context compilation. Skip or reorder any step and you risk mixing contexts. This sounds obvious until you're debugging an agent framework where middleware runs in a different order than you expected, and one enterprise customer occasionally receives another's session history.
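The ordering invariant can be made executable rather than left as tribal knowledge. A minimal sketch, with illustrative stage names (not from any particular framework): each stage asserts that its prerequisites ran, so a reordered middleware chain fails loudly instead of silently mixing tenant contexts.

```python
from dataclasses import dataclass, field

# Illustrative stage names; the invariant is the order, not the naming.
PIPELINE_ORDER = [
    "authenticate",
    "resolve_tenant",
    "derive_session",
    "compile_context",
    "run_inference",
]

@dataclass
class RequestContext:
    completed: list = field(default_factory=list)

    def enter(self, stage: str) -> None:
        # Every stage before this one in PIPELINE_ORDER must have run.
        prereqs = PIPELINE_ORDER[:PIPELINE_ORDER.index(stage)]
        missing = [s for s in prereqs if s not in self.completed]
        if missing:
            raise RuntimeError(f"{stage} ran before {missing}")
        self.completed.append(stage)

ctx = RequestContext()
for stage in PIPELINE_ORDER:
    ctx.enter(stage)                  # in-order: passes

bad = RequestContext()
try:
    bad.enter("compile_context")      # skipped auth and tenant resolution
except RuntimeError as exc:
    print(exc)
```

A check like this costs a few microseconds per request and turns a cross-tenant data leak into a 500 error you see in staging.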

The nastiest isolation failure mode is in prefix caching. Modern inference servers cache key-value attention states for common prompt prefixes to avoid recomputing them on every request. When two tenants happen to share a prompt prefix—a common system prompt template, for instance—their KV cache blocks can coexist in the same GPU memory pool. Without proper isolation, an adversary can reconstruct the victim tenant's prompt by issuing requests that probe cache hits through timing.

The fix, added to vLLM in 2025, is cache_salt: a per-request parameter that salts the block hash so two requests with identical text produce different cache keys. This costs you cross-tenant cache reuse, but that trade-off is almost always correct. The alternative is accepting a data breach vector by default.
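In practice, the salt is derived per tenant so that requests within a tenant still share cache while cross-tenant reuse is impossible. A sketch, assuming vLLM's OpenAI-compatible server and its cache_salt request field (check your version's docs for exact support); the keyed-hash derivation is this sketch's choice, not vLLM's:

```python
import hashlib

def tenant_cache_salt(tenant_id: str, deployment_secret: str) -> str:
    # A keyed hash rather than the raw tenant ID, so tenant identifiers
    # never appear directly in cache keys or logs.
    return hashlib.sha256(
        f"{deployment_secret}:{tenant_id}".encode()
    ).hexdigest()[:32]

# Request body for a vLLM OpenAI-compatible endpoint; cache_salt scopes
# prefix-cache reuse to this tenant only.
request_body = {
    "model": "shared-base-model",
    "messages": [{"role": "user", "content": "..."}],
    "cache_salt": tenant_cache_salt("tenant-a", "deploy-secret"),
}
```

Because the salt is deterministic per tenant, two requests from the same tenant with the same prefix still hit the cache; two tenants with identical prompts never do.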

Data isolation for RAG systems is simpler but still requires deliberate design. The standard pattern is one vector database namespace (or separate collection) per tenant, with the tenant ID enforced at query time, not just at insert time. If your retrieval layer ever accepts a tenant ID as a user-supplied parameter without re-validating it against the authenticated session, you have a horizontal privilege escalation vulnerability.
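The re-validation step is a one-line guard. A minimal sketch, with an assumed session shape and namespace convention:

```python
def authorized_namespace(session: dict, requested_namespace: str) -> str:
    """Re-derive the vector DB namespace from the authenticated session
    and reject any client-supplied namespace that doesn't match.
    Session shape and the 'tenant-<id>' convention are assumptions."""
    expected = f"tenant-{session['tenant_id']}"
    if requested_namespace != expected:
        raise PermissionError("namespace does not match authenticated tenant")
    return expected

# ok: the client asked for their own namespace
ns = authorized_namespace({"tenant_id": "acme"}, "tenant-acme")

# horizontal privilege escalation attempt: raises PermissionError
# authorized_namespace({"tenant_id": "acme"}, "tenant-globex")
```

The returned value, never the raw request parameter, is what gets passed to the vector store query.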

Customization Without Rebuilding Per Customer

The appeal of shared infrastructure is cost. The appeal of per-tenant customization is product differentiation. These goals pull in opposite directions, but the industry has converged on a few patterns that let you have both.

System prompt and guardrail configuration is the lowest-friction form of customization. Each tenant gets a configuration record that specifies their system prompt template, enabled guardrails, and any overrides. At request time, the gateway merges the tenant config with the incoming request. Frameworks like LiteLLM and NVIDIA NeMo Guardrails both support this pattern natively. The configuration store is cheap; the base model is shared.
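The merge itself is simple dictionary layering: deployment defaults, overridden by the tenant record, applied to the incoming request. A sketch with illustrative field names (not LiteLLM's or NeMo Guardrails' actual schema):

```python
# Per-tenant configuration records; field names are assumptions.
TENANT_CONFIGS = {
    "acme": {
        "system_prompt": "You are Acme's support assistant.",
        "guardrails": ["pii_redaction", "topic_filter"],
        "max_output_tokens": 1024,
    },
}

DEFAULTS = {
    "system_prompt": "You are a helpful assistant.",
    "guardrails": ["pii_redaction"],
    "max_output_tokens": 512,
}

def compile_request(tenant_id: str, user_messages: list) -> dict:
    # Tenant record overrides defaults field-by-field.
    cfg = {**DEFAULTS, **TENANT_CONFIGS.get(tenant_id, {})}
    return {
        "messages": [{"role": "system", "content": cfg["system_prompt"]}]
                    + user_messages,
        "guardrails": cfg["guardrails"],
        "max_tokens": cfg["max_output_tokens"],
    }
```

Unknown tenants fall through to the defaults, which is usually the behavior you want for a new signup before their config record exists.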

Model routing is the next tier. Some tenants pay for GPT-4 class models; others use a smaller, cheaper model. Some need low-latency streaming; others can tolerate batch processing. A per-tenant routing table maps tenant tier to model choice. This is operationally straightforward but introduces pricing complexity: your cost attribution needs to track not just token counts but token counts per model per tenant.
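A per-tenant routing table can carry the rate alongside the model choice, so the attribution problem is visible at the point of routing. Model names and rates below are illustrative:

```python
# Tier -> model routing with the per-token rate attached, so cost
# attribution has everything it needs at dispatch time.
ROUTING = {
    "enterprise": {"model": "gpt-4-class-model", "usd_per_1k_tokens": 0.030},
    "standard":   {"model": "small-fast-model", "usd_per_1k_tokens": 0.002},
}

def route(tenant_tier: str) -> dict:
    # Unknown tiers fall back to the cheapest shared option.
    return ROUTING.get(tenant_tier, ROUTING["standard"])
```

Keeping the rate in the routing table means a pricing change is one config edit, not a hunt through billing code.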

Per-tenant fine-tuning via LoRA adapters is where it gets interesting. Rather than training a separate model for each customer, you train a small adapter (typically tens of megabytes) on customer-specific data and load it on top of a shared base model at inference time. Frameworks like LoRAX can serve thousands of LoRA adapters on a single GPU through dynamic adapter loading and Punica CUDA kernels that batch adapter computations efficiently. AWS SageMaker supports the same pattern with adapter-per-request routing. The result: per-tenant model behavior with near-shared-model infrastructure costs.
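From the gateway's perspective, per-tenant adapters reduce to one extra field on the inference request. A sketch of a LoRAX-style generate payload; the adapter_id parameter is how LoRAX selects an adapter over the shared base model, while the tenant-to-adapter registry here is this sketch's assumption:

```python
def lorax_request(tenant_id: str, prompt: str, adapter_registry: dict) -> dict:
    """Build a LoRAX-style /generate payload. Tenants with a registered
    adapter get their fine-tune; everyone else gets the base model."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": 256}}
    adapter = adapter_registry.get(tenant_id)
    if adapter is not None:
        payload["parameters"]["adapter_id"] = adapter
    return payload

registry = {"acme": "acme/support-lora-v3"}  # hypothetical adapter name
custom = lorax_request("acme", "Summarize this ticket:", registry)
shared = lorax_request("globex", "Summarize this ticket:", registry)
```

The fallback path matters operationally: an adapter that fails to load should degrade to the base model, not fail the request.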

Rate limiting deserves its own mention because teams consistently get the enforcement point wrong. Checking token budgets after an LLM call does nothing—the tokens are already spent. Budget enforcement must happen at the entry point before dispatching to inference, with hard limits that reject or queue requests when a tenant is at ceiling. Checking only at the outer edge of an agent isn't enough either; if the agent makes multiple LLM calls per user turn, you need per-call accounting deeper in the execution path.
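The reserve-then-settle pattern captures the enforcement point: reserve an estimated token count before dispatch, reject if the budget can't cover it, then reconcile against provider-reported usage afterward. A single-process sketch (production systems keep the counter in a shared store like Redis so every gateway replica sees the same budget):

```python
class TokenBudget:
    """Per-tenant token budget enforced BEFORE dispatching to inference."""

    def __init__(self, limit_tokens: int):
        self.limit = limit_tokens
        self.used = 0

    def reserve(self, estimated_tokens: int) -> None:
        # Hard rejection happens here, before any tokens are spent.
        if self.used + estimated_tokens > self.limit:
            raise RuntimeError("tenant budget exceeded; rejecting pre-inference")
        self.used += estimated_tokens

    def settle(self, estimated_tokens: int, actual_tokens: int) -> None:
        # Reconcile the estimate against provider-reported usage.
        self.used += actual_tokens - estimated_tokens
```

For agents, each internal LLM call does its own reserve/settle cycle, which is the per-call accounting the paragraph above describes.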

Cost Attribution: The Billing Problem Nobody Plans For

Token counting is easy. Per-tenant cost attribution at scale is not. The gap between the two is where most teams lose months of engineering time.

The minimal viable approach: tag every inference request with tenant_id, feature_id, and model_version before it leaves your gateway, then stream the usage metadata (input tokens, output tokens, model) to a cost tracking store. The aggregation queries—monthly spend by tenant, cost per feature, model distribution across customer tiers—are trivial once the metadata is there. They're impossible to reconstruct retroactively if you forgot to tag.
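The tagging itself is a few lines emitted from the gateway per inference call. Field names below are illustrative; the non-negotiable part is that the event is produced at request time, not reconstructed later:

```python
import json
import time

def usage_record(tenant_id: str, feature_id: str, model_version: str,
                 input_tokens: int, output_tokens: int) -> str:
    """One event per inference call, streamed to the cost tracking store
    (as JSON here; the schema is an assumption of this sketch)."""
    return json.dumps({
        "ts": time.time(),
        "tenant_id": tenant_id,
        "feature_id": feature_id,
        "model_version": model_version,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
    })

event = usage_record("acme", "doc-summarizer", "small-fast-model-2025-06",
                     input_tokens=1850, output_tokens=240)
```

Monthly-spend-by-tenant then becomes a GROUP BY over this stream, which is exactly the "trivial once the metadata is there" property.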

Caching complicates attribution in two ways. First, shared prefix caches mean some tenants benefit from cache hits they didn't pay to warm. Whether to credit the tenant who warmed the cache or amortize the savings across all beneficiaries is an accounting decision, not a technical one—but you need to decide before your billing system is live. Second, tenant-specific caches (separate namespaces per tenant in an external KV cache like LMCache) eliminate the ambiguity but cost more memory.

Model routing multiplies the complexity: a single request from tenant A might fan out to an embedding model, a reranker, and a generation model, each with different per-token rates. Cost attribution requires tracking every hop in that chain, not just the final generation call. AI gateways like Cloudflare AI Gateway and OpenRouter provide request-level telemetry across providers, which helps when your model stack spans multiple vendors.
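Attributing a fanned-out request means summing over every hop, each at its own rate. A sketch with illustrative per-1K-token rates:

```python
# USD per 1K tokens per hop kind; values are illustrative, not real pricing.
RATES = {"embedding": 0.0001, "reranker": 0.0005, "generation": 0.002}

def request_cost(hops: list) -> float:
    """Total cost of one logical request across all model hops.
    Each hop is {'kind': ..., 'tokens': ...}."""
    return sum(RATES[h["kind"]] * h["tokens"] / 1000 for h in hops)

cost = request_cost([
    {"kind": "embedding",  "tokens": 2000},
    {"kind": "reranker",   "tokens": 5000},
    {"kind": "generation", "tokens": 1500},
])
```

If you only track the final generation hop here, you undercount this request by roughly half, which is how per-tenant margins quietly go wrong.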

Budget controls are the operational safety net. The practical pattern: set a per-tenant soft limit at 80% of budget with an alert to the account team, and a hard limit at 100% that either queues non-urgent requests or rejects them with a clear error. Teams that skip the hard limit discover during month-end close that one misbehaving agent ran overnight and tripled a customer's bill.
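The soft/hard threshold logic is small enough to show in full. Thresholds match the pattern above; the action names are this sketch's convention:

```python
def budget_action(spend_usd: float, budget_usd: float) -> str:
    """80% soft limit (alert the account team, keep serving);
    100% hard limit (reject, or queue non-urgent work)."""
    if spend_usd >= budget_usd:
        return "reject"
    if spend_usd >= 0.8 * budget_usd:
        return "alert"
    return "allow"
```

The hard-limit branch is the one that catches the overnight runaway agent; the soft-limit branch is what keeps that conversation with the customer from being a surprise.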

Per-Tenant Quality Tracking

This is the least mature area, and the most commonly neglected until something breaks in production.

In a single-tenant system, your eval suite covers the whole product. In a multi-tenant system, a model update or prompt change that improves average quality can still degrade quality for specific tenants—particularly those with non-standard language, domain-specific terminology, or highly tuned system prompts. Aggregate metrics hide per-tenant regressions.

The minimum investment: shadow-run eval sets per tenant on staging before any prompt or model change ships. The eval sets don't need to be large—a few hundred examples per tenant covering their actual use patterns is sufficient for catching regressions. The infrastructure cost is modest; the operational overhead of maintaining per-tenant eval datasets is real but necessary once you have enterprise customers.
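The regression check reduces to comparing per-tenant pass rates between the shipped configuration and the candidate. A sketch, assuming eval scores on a 0-1 scale and a fixed tolerance:

```python
def regressed_tenants(baseline: dict, candidate: dict,
                      tolerance: float = 0.02) -> list:
    """Tenants whose eval score dropped more than `tolerance` between the
    shipped config (baseline) and the candidate, regardless of whether
    the aggregate improved. Dicts map tenant -> pass rate."""
    return sorted(
        t for t in baseline
        if candidate.get(t, 0.0) < baseline[t] - tolerance
    )

# Aggregate improves (0.855 -> 0.895) but 'acme' regresses:
flagged = regressed_tenants(
    {"acme": 0.91, "globex": 0.80},
    {"acme": 0.84, "globex": 0.95},
)
```

Gating the deploy on `flagged` being empty is the per-tenant equivalent of a failing CI check, and it is precisely the signal aggregate metrics hide.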

Latency tracking should also be per-tenant, not just per-endpoint. Different tenant configurations have different latency profiles: a tenant with a long system prompt plus RAG retrieval plus guardrails has a fundamentally different latency floor than one making direct completion calls. Alerting on average latency across all tenants will miss a tenant-specific degradation until they escalate a support ticket.

The other signal worth instrumenting is behavioral: retry rate, session abandonment, and edit rate per tenant. These are leading indicators of quality problems that surface before explicit feedback does. If tenant A's retry rate climbs 15% after a model change, something in their specific configuration doesn't work as well with the new model, even if no other signal has fired yet.
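Instrumenting that signal is a small comparison over per-tenant rates before and after a change. A sketch, with the input shape (tenant to retry rate) assumed:

```python
def retry_rate_alerts(before: dict, after: dict,
                      rel_increase: float = 0.15) -> list:
    """Tenants whose retry rate climbed more than `rel_increase`
    (relative, i.e. 0.15 = +15%) after a model or prompt change."""
    return sorted(
        t for t, rate in after.items()
        if t in before and before[t] > 0
        and (rate - before[t]) / before[t] > rel_increase
    )

# tenant 'a' goes 0.10 -> 0.13 (+30%, flagged); 'b' goes 0.10 -> 0.105 (not)
alerts = retry_rate_alerts({"a": 0.10, "b": 0.10}, {"a": 0.13, "b": 0.105})
```

The same shape works for session abandonment and edit rate; the threshold is the only tenant-facing tuning knob.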

The Noisy Neighbor Problem in Inference

Unlike a web server, where a compute-heavy tenant merely slows request processing, in AI infrastructure a runaway tenant can exhaust GPU memory and KV cache capacity and back up the scheduling queue for everyone else. LLM inference workloads are particularly spiky: a single tenant running a document processing batch job can consume all available decode capacity and push other tenants' requests into multi-second queue waits.

Defense strategies, in order of implementation cost:

Rate limiting at the gateway is the minimum. Per-tenant request rate limits and concurrent request limits prevent one customer from monopolizing the scheduling queue. Set limits per tenant tier, not just globally.

Resource quotas per tenant require infrastructure-level support: container resource limits, connection pool quotas, and if you're using vLLM's disaggregated serving (prefill and decode on separate servers), per-tenant scheduling priority. Disaggregated serving helps because prefill-heavy requests—long prompts, document processing—are isolated from decode-heavy requests and don't starve time-to-first-token for interactive tenants.

Dedicated compute for premium tenants is the exit hatch. Azure OpenAI's Provisioned Throughput Units, Together AI's inference endpoints, and AWS Bedrock's provisioned models all follow the same model: a tenant pays for reserved capacity that's physically isolated from shared pools. This eliminates noisy neighbor problems but eliminates shared infrastructure cost savings too. It's the right choice for enterprise SLAs, not for the long tail.

Putting It Together

The architecture that emerges from these constraints looks like this: a gateway that enforces authentication, tenant resolution, rate limiting, and budget checks before any request reaches inference; a per-tenant configuration store that drives model routing, system prompt injection, and guardrail selection; a shared base model with optional per-tenant LoRA adapters loaded dynamically at serving time; prefix caching with cache salting to preserve throughput without cross-tenant cache collisions; and a telemetry pipeline that tags every operation with tenant metadata for cost attribution and quality tracking.

None of this is novel infrastructure. The components exist in open-source (vLLM, LiteLLM, LoRAX, LMCache) and in managed services (Azure PTUs, AWS SageMaker multi-adapter serving). The engineering work is integration and operations, not invention.

What's actually hard is the organizational discipline: ensuring every new feature that touches inference goes through the same tenant-aware gateway, that cost attribution tagging never gets skipped in a time-crunch deploy, and that per-tenant eval sets are updated when a customer's use patterns change. Multi-tenant AI systems fail not because the infrastructure primitives are missing, but because teams add them piecemeal in response to incidents rather than building them into the foundation from the start.
