
Per-Tenant Prompt Compilation: When Your System Prompt Becomes a Build Artifact

· 10 min read
Tian Pan
Software Engineer

The day a multi-tenant SaaS team adds the third if tenant_industry == "healthcare" branch to its system prompt is the day it accidentally hires itself a compiler engineer. Nobody filed the headcount req. Nobody scoped the work. The team thinks it is shipping a feature; it is actually shipping a build system, and the build system is held together with f-strings.

Every team that scales an AI feature into a customer base with even mild heterogeneity hits the same wall. Tenant A is in healthcare and needs HIPAA-aware response framing. Tenant B is in legal and needs strict citation discipline. Tenant C is an enterprise that bought a custom safety rubric in the master agreement. Tenant D is on the free tier and gets the default. The first instinct is to handle the variance with runtime conditionals, and the conditionals nest until the prompt becomes unreadable to anyone who didn't write it. The second instinct — and the one most teams arrive at after the wall — is prompt compilation: the canonical "prompt" is no longer a string but a source artifact, and what reaches the model is a compiled output.

The shift from string to source artifact is small in code and enormous in operational consequence. You have built a compiler. You now own everything a compiler owns: source-level diffs that don't map cleanly to compiled diffs, a build cache that has to invalidate correctly when an upstream template changes, a debugging story where a customer reports a bad output and you have to reconstruct the exact compiled prompt that produced it. Most teams price the compilation feature and forget to price the compiler.

The wall: conditionals don't scale past three tenants

The first per-tenant variation is harmless. The second is fine. The third starts to bend the prompt around itself, and by the fifth the system prompt is a wall of nested conditionals that nobody can audit. The pattern shows up identically across every team I've watched grow into it: someone adds a tenant-specific paragraph behind an if, someone else adds a flag for a feature toggle, the safety team adds an industry-specific disclaimer, and the prompt grows a personality disorder.
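To make the failure mode concrete, here is a minimal sketch of the anti-pattern as it typically looks around the third or fourth tenant. Every field name, flag name, and prompt line is illustrative, not taken from any real codebase:

```python
def build_system_prompt(tenant: dict, flags: dict) -> str:
    # Each branch below was added by a different person for a different
    # tenant; nobody owns (or can audit) the combined behavior.
    prompt = "You are a helpful assistant.\n"
    if tenant["industry"] == "healthcare":
        prompt += "Frame responses with HIPAA awareness.\n"
        if flags.get("strict_phi"):
            prompt += "Never echo back patient identifiers.\n"
    elif tenant["industry"] == "legal":
        prompt += "Cite a source for every factual claim.\n"
    if tenant["tier"] == "enterprise":
        prompt += "Apply the custom safety rubric from the master agreement.\n"
    if flags.get("ab_test_concise_tone"):
        prompt += "Use a concise, formal tone.\n"
    return prompt
```

Four independent conditions here already yield up to sixteen distinct runtime prompts, which is the combinatorial eval problem in miniature.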

The reason this approach breaks down is not aesthetic. It is operational. Each conditional is a branch in the prompt's runtime behavior, and the eval set has to cover every branch combination to give you confidence the changes don't regress. When the prompt has eight independent conditions, your eval coverage problem is combinatorial, not linear. Most teams quietly stop evaluating the rare branches and discover six months later that the healthcare tenant has been receiving the legal tenant's citation discipline because nobody hit that branch in pre-production.

Compilation is the structural answer. Instead of one prompt with branches, you have templated sections, tenant-specific overlays, capability-flagged includes, and A/B-test branches that get composed at config-deploy time into a tenant-specific final prompt. The compiled artifact is a flat string per tenant. The eval surface becomes the matrix of compiled outputs, not the cross-product of source-level branches.
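A minimal sketch of that compile step, assuming a registry of named base sections, industry overlays, and flag-gated includes (all section names and config shapes here are assumptions for illustration):

```python
# Illustrative source artifacts; in practice these live in versioned files.
BASE_SECTIONS = {
    "identity": "You are a helpful assistant for this product.",
    "safety": "Decline requests for harmful content.",
}

INDUSTRY_OVERLAYS = {
    "healthcare": {"safety": "Decline harmful content and frame responses "
                             "with HIPAA awareness."},
    "legal": {"citations": "Cite a source for every factual claim."},
}

FLAG_INCLUDES = {
    "custom_rubric": "Apply the tenant's custom safety rubric.",
}

def compile_prompt(tenant_config: dict) -> str:
    """Compose base sections, overlays, and flagged includes into one
    flat per-tenant string at config-deploy time."""
    sections = dict(BASE_SECTIONS)
    sections.update(INDUSTRY_OVERLAYS.get(tenant_config.get("industry"), {}))
    for flag, text in FLAG_INCLUDES.items():
        if tenant_config.get("flags", {}).get(flag):
            sections[flag] = text
    # The compiled artifact has no branches left in it.
    return "\n\n".join(sections[k] for k in sorted(sections))
```

The design point is that composition happens once per config deploy, not once per request: what runs is always a flat string you can diff, cache, and eval directly.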

What you actually built: a compiler with all the obligations

The compilation pipeline solves the conditional explosion. It also inherits the full ergonomic burden of a real compiler, and most teams don't notice the inheritance until something breaks.

Source-to-binary debugging. A customer reports that the assistant said something wrong. The engineer pulls the prompt source and can't reproduce the failure because the source isn't what ran. The runtime ran a compiled artifact derived from the source plus the tenant's config plus the feature flags active at the time plus the compilation cache state. Reproducing the failure means reconstructing all of that. If you don't store the compiled artifact with the trace, you are debugging a Rust binary by reading the source files and guessing at the LLVM passes.
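One way to keep that reconstruction cheap is to content-address the compiled artifact and stamp its hash on every trace. A sketch, with the store structure and field names as assumptions:

```python
import hashlib
import time

def record_trace(store: dict, tenant_id: str, compiled_prompt: str,
                 request: str, response: str) -> str:
    """Attach the compiled artifact's hash to the trace so a reported
    failure can be replayed against exactly the prompt that ran."""
    artifact_hash = hashlib.sha256(compiled_prompt.encode()).hexdigest()
    # Store the full artifact once, keyed by content hash.
    store.setdefault("artifacts", {})[artifact_hash] = compiled_prompt
    store.setdefault("traces", []).append({
        "tenant_id": tenant_id,
        "artifact_hash": artifact_hash,
        "request": request,
        "response": response,
        "ts": time.time(),
    })
    return artifact_hash
```

Because identical compiled prompts hash identically, the artifact store stays small even at high request volume.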

Build cache invalidation. When a shared base template changes, every tenant whose compiled prompt depends on that template needs to be recompiled and re-evaluated. If the cache key is wrong, you ship a stale prompt to one tenant while every other tenant gets the new one — and you discover this in a customer escalation, not in CI. Cache invalidation is famously one of the two hard problems in computer science, and you just inherited it for prompts.
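The safest cache key is a hash over every compilation input, so that bumping any shared template version invalidates every dependent tenant's cached artifact. A sketch under that assumption:

```python
import hashlib
import json

def cache_key(template_versions: dict, tenant_config: dict) -> str:
    """Derive the compilation cache key from all inputs. Canonical JSON
    (sort_keys) makes the key stable across dict ordering."""
    payload = json.dumps(
        {"templates": template_versions, "config": tenant_config},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()
```

If any input is missing from the key — a feature flag, an include's version — that is exactly the input whose change ships a stale prompt to one tenant while the rest get the new one.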

Per-tenant evaluation surface. The eval set has to grade compiled outputs, not source templates, because the source is meaningless without the compilation context. This means the eval pipeline needs a per-tenant fixture: for each tenant you support, run the eval set against that tenant's compiled prompt, with that tenant's expected behaviors. A team supporting forty tenants now has forty eval suites that all have to stay green, and a change to a shared template ripples into forty re-validations.
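The shape of that pipeline is a matrix loop: for each tenant, compile, then grade that tenant's cases against the compiled output. A sketch with `compile_fn` and `grade_fn` as assumed hooks into your own compiler and grader:

```python
def run_eval_matrix(tenants, eval_cases, compile_fn, grade_fn):
    """Grade compiled outputs per tenant. The eval surface is the
    matrix (tenants x that tenant's cases), not source-level branches."""
    results = {}
    for tenant in tenants:
        compiled = compile_fn(tenant)
        results[tenant["id"]] = [
            grade_fn(compiled, case) for case in eval_cases[tenant["id"]]
        ]
    return results
```

A shared-template change then means rerunning the whole matrix, which is expensive but honest: it is the same forty re-validations, made explicit instead of skipped.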

Versioning is now two-dimensional. A request's exact behavior depends on the template version and the tenant-config version that were composed together. If your trace only records one of those, you can answer half of the "why did this happen" questions. Most teams instrument for code version because that's the muscle memory from regular software, and they discover the gap when legal asks them to reproduce a specific tenant's behavior from three weeks ago.
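If both coordinates are recorded on the trace, reproducing three-week-old behavior reduces to a deterministic recompile. A sketch, assuming versioned stores for templates and tenant configs (the store layout and stamp fields are illustrative):

```python
def reproduce_artifact(template_store: dict, config_store: dict,
                       stamp: dict) -> str:
    """Recompile the exact artifact from a trace's two version
    coordinates: template version and tenant-config version."""
    templates = template_store[stamp["template_version"]]
    config = config_store[stamp["tenant_config_version"]]
    # Deterministic composition: same two inputs, same output.
    return "\n\n".join(templates[name] for name in config["sections"])
```

Recording only one coordinate leaves the other as a guess, which is precisely the half of the "why did this happen" questions you can no longer answer.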

The unpriced cost: how the bill actually arrives

The team that treats per-tenant variance as "just config" doesn't get the bill on day one. The bill arrives at month six, and it arrives in three forms.
