
System Prompts as Code, Config, or Data: The Architecture Decision That Cascades Into Everything

Tian Pan · Software Engineer · 12 min read

A team I talked to last quarter shipped a customer-support agent with the system prompt living in a Postgres row, one row per tenant. The pitch was sensible: enterprise customers had asked for tone customization, and "make the prompt editable" was the cheapest way to deliver it. Six months later, three things had happened. The eval suite had ballooned from 200 cases to 11,000 because every tenant's prompt now needed its own regression set. The prompt-update workflow had quietly become a write path with no review, because product owners had been given direct access to the table. And a single broken UTF-8 character in a Korean-language tenant prompt had taken that tenant's chatbot offline for two days before anyone noticed, because the deploy pipeline had no idea the prompt had changed.

None of these outcomes were forced by the requirements. They were forced by an architecture decision that nobody made deliberately: where does the system prompt live? In the code? In a config file? In a database row? The team picked "database" because it was the fastest path to a feature, and the consequences cascaded into every adjacent system over the following months.

This is the most under-considered architecture decision in early-stage AI products. It looks like deployment trivia — "where do we put this string?" — but the answer determines who can change the prompt, what review process applies, how rollbacks work, what your eval matrix looks like, and whether your platform team or your product team owns reliability when something breaks. The decision is also expensive to reverse. A six-month migration from "in-database" back to "in-code" is not unusual. The teams that get this right pick deliberately, before MVP, with a framework. The teams that get it wrong pick by accident and pay the bill for years.

The three storage classes and what each one is actually for

Before you can pick, you have to see all three options clearly. They are not interchangeable.

Prompt as code means the prompt string lives in your application repository, gets reviewed in pull requests, ships in your deploy artifact, and is bound to a specific application version. A change requires a CI run, a code review, and a deploy. The prompt and the code that consumes it share a release cycle. Rollback is git revert. The cost of a change is high — minutes to hours — and the audit trail is automatic because Git already records who changed what and why.
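In its simplest form, this is just a constant in the repo next to its consumer. A minimal sketch, where the module name, prompt text, and message shape are all illustrative:

```python
# prompts.py -- lives in the application repo, reviewed in PRs,
# shipped in the same artifact as the code that consumes it.

SUPPORT_AGENT_PROMPT = """\
You are a customer-support assistant for Acme.
Answer from the knowledge base; escalate billing disputes to a human.
Respond in JSON: {"reply": str, "escalate": bool}
"""

def build_messages(user_input: str) -> list[dict]:
    # The prompt and this consumer share a release cycle: a PR that changes
    # the JSON contract above can change the parser in the same diff.
    return [
        {"role": "system", "content": SUPPORT_AGENT_PROMPT},
        {"role": "user", "content": user_input},
    ]
```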

Prompt as config means the prompt lives in a config artifact that ships independently of code: a JSON or YAML file in a config service, a remote-config entry in LaunchDarkly or Statsig, a row in a "prompts" table that the application reloads on a short TTL, or a CMS entry that product owns. The application reads the prompt at runtime — sometimes per-request, sometimes cached. Changes do not require a deploy. The cost of a change is low — seconds to minutes — and the review process depends entirely on what tooling you wrap around the config store. Without explicit discipline, there is no review.
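A minimal sketch of the runtime-read pattern, assuming a generic config client with a get(name, label=...) method; the client, key name, and TTL are illustrative, not any particular vendor's API:

```python
import time

_cache = {"value": None, "fetched_at": 0.0}
TTL_SECONDS = 60  # how stale a cached prompt is allowed to be

def get_system_prompt(config_client) -> str:
    # Read the prompt from the config store at runtime, cached on a short
    # TTL, so a change in the store reaches production without a deploy.
    now = time.time()
    if _cache["value"] is None or now - _cache["fetched_at"] > TTL_SECONDS:
        _cache["value"] = config_client.get("support_agent_prompt", label="production")
        _cache["fetched_at"] = now
    return _cache["value"]
```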

Prompt as data means the prompt is per-row, per-tenant, per-user, or per-conversation, stored alongside other tenant or user state in your operational database. The prompt is not a single global string with versions; it is a multi-valued field where each tenant has their own. Changes happen on the customer's clock, not on yours. The eval problem multiplies by the number of distinct prompt instances you carry. The audit trail depends on whatever auditing your data layer already does, which is usually less than you would want.
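The equivalent read in the prompt-as-data world is a tenant-keyed lookup. A sketch, assuming a DB-API-style cursor and a hypothetical tenants table:

```python
def get_tenant_prompt(db, tenant_id: str) -> str:
    # One prompt per tenant, changed on the customer's clock, not yours.
    # Every distinct row returned here is another prompt instance your
    # eval suite has to cover.
    row = db.execute(
        "SELECT system_prompt FROM tenants WHERE id = %s", (tenant_id,)
    ).fetchone()
    return row[0]
```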

These map to different mental models. Code is for invariants. Config is for parameters that vary across environments and time. Data is for things that vary across the entities your product serves. A system prompt can plausibly be any of the three, and that ambiguity is exactly why teams pick wrong.

The hidden costs of "prompt as code"

Putting the prompt in the repo is the right default for most pre-product-market-fit AI products, and it is also the option that engineering teams pick reflexively without examining whether their product actually wants it. The benefits are real: every change is reviewed, every change is auditable, every change rides the same CI pipeline as the code that depends on it, and rollback is the same operation you already do for code regressions. The shared release cycle means you can refactor the prompt and the parser that consumes its structured output in the same PR, with a single eval run validating the pair.

The costs show up later, in two predictable shapes. The first is iteration latency. When your prompt-engineering team needs to test five small wording variations and your deploy pipeline takes 25 minutes per change, you have just made the inner loop of prompt iteration glacial. Teams in this situation usually invent a side channel — a feature flag that overrides the in-code prompt, a "prompt scratchpad" environment that bypasses CI — and the side channel quietly becomes the production path while the in-code prompt becomes a stale fallback that nobody updates.

The second cost is the wall between engineering and the people who actually understand what the prompt should say. Domain experts and product owners cannot ship a prompt change without finding an engineer to PR it. This is fine when prompt changes are weekly. It is corrosive when they are daily. The right move when this wall starts to bind is to deliberately move the prompt to config — not to invent shadow tooling that pretends the prompt is still in code.

Why "prompt as config" looks free and isn't

Moving the prompt into a config service is the most popular escape hatch when the in-code workflow gets too slow. Hosted prompt-management tools like Langfuse and PromptLayer exist precisely to make this transition smooth: they give you Git-style versioning, labels for staging vs. production, and an API your application reads. The pitch is "code-like discipline without the deploy cycle," and at small scale, that pitch is real.

The trap is that the discipline is opt-in, and most teams never actually opt in. Versioning is automatic, but promoting a label from staging to production still needs a deploy gate that runs your eval suite first. Prompt changes still need a code-review-equivalent workflow. Rollback still needs to be tested. None of these come free with the tool. They are processes you have to build, and you have to build them on top of a system that is now decoupled from your application's release pipeline.
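What that deploy gate might look like, sketched against the same hypothetical config client. run_eval_suite, the label mechanics, and the pass-rate threshold are all assumptions, not any real tool's API:

```python
def promote_to_production(config_client, version: str) -> None:
    # The gate the tooling does not give you for free: run the eval suite
    # against the staged prompt before moving the production label.
    staged = config_client.get("support_agent_prompt", version=version)
    results = run_eval_suite(staged)   # hypothetical eval harness
    if results.pass_rate < 0.98:       # threshold is illustrative
        raise RuntimeError(f"eval gate failed: {results.pass_rate:.1%}")
    config_client.set_label("support_agent_prompt", version, label="production")
```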

The other failure mode is more subtle. When prompts ship independently of code, the contract between the prompt and the parser that consumes its output silently drifts. A prompt engineer tweaks the system prompt to start producing JSON with a new field; the parser deployed in production does not know about that field; nothing breaks loudly, but the new field is silently ignored and a downstream feature regresses. The detection lag on this kind of drift is days to weeks, because the symptom is "users complain that feature X feels less useful" rather than "exception in logs."

The mitigation is contract tests that run on every prompt-config change and validate that the prompt's output still matches the parser's expectations. Most teams do not build these until after they have been bitten. The two-line summary: config-stored prompts are a good architecture once you have the discipline; until then, they are a fast path to silent regressions.
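A contract test for this failure mode can be small. A sketch, assuming the parser reads two JSON keys and that fetch_candidate_prompt, load_contract_cases, and call_model are your own (hypothetical) harness helpers:

```python
import json

REQUIRED_KEYS = {"reply", "escalate"}  # the fields the deployed parser actually reads

def test_prompt_output_matches_parser_contract():
    # Runs on every prompt-config change, before the new version is promoted.
    prompt = fetch_candidate_prompt()   # hypothetical: the version about to ship
    for case in load_contract_cases():  # hypothetical: small fixed input set
        raw = call_model(system=prompt, user=case.input)  # hypothetical model call
        payload = json.loads(raw)       # fails loudly if output is no longer JSON
        missing = REQUIRED_KEYS - payload.keys()
        assert not missing, f"prompt drifted from parser contract: missing {missing}"
```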

The eval-matrix explosion and why "prompt as data" is rarely what you want

Per-tenant prompts are seductive because the customer ask is real. Enterprise buyers want their assistant to sound like their brand, refuse to discuss competitors, escalate using their internal terminology. The cheapest interpretation of "make the prompt editable" is a textarea in an admin panel that writes to a database row. Six months later, you have N tenants, N prompts, N eval suites that nobody runs, and N latent incidents waiting for the right input to trigger them.

The architectural mistake is treating prompt customization as a string-replacement problem when it is actually a configuration-API problem. A per-tenant prompt-as-data system gives every tenant the full power of the system prompt — meaning they can also break refusal calibration, override safety instructions, or paste in something that contradicts the tool descriptions you ship with the agent. You have just shipped a public-facing prompt-injection vector with the customer as the attacker, and the customer's eval responsibilities are now yours.

The pattern that works instead is structured customization: a small set of bounded fields (tone, brand name, refusal sensitivity, domain glossary) that get interpolated into a master prompt that you control. The master prompt lives in code or config; the per-tenant fields live in data. The eval matrix is now over the bounded customization fields, not over arbitrary tenant prompts. This is harder to sell to enterprise buyers because it sounds less flexible, but it is the only version of multi-tenant prompts that survives contact with security review and regulated industries.
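One way to sketch the shape of this, with illustrative field names and a deliberately toy master template:

```python
from dataclasses import dataclass, field
from string import Template

# Bounded per-tenant fields: this is what lives in the database.
@dataclass
class TenantCustomization:
    brand_name: str
    tone: str = "professional"           # validated against an allowlist
    refusal_sensitivity: str = "standard"
    glossary: dict[str, str] = field(default_factory=dict)

# The master prompt lives in code or config; you control it.
MASTER_PROMPT = Template("""\
You are a support assistant for $brand_name. Use a $tone tone.
Refusal policy: $refusal_sensitivity.
Preferred terminology: $glossary
""")

def render_prompt(c: TenantCustomization) -> str:
    # Tenants fill in fields; they never touch the surrounding instructions.
    return MASTER_PROMPT.substitute(
        brand_name=c.brand_name,
        tone=c.tone,
        refusal_sensitivity=c.refusal_sensitivity,
        glossary="; ".join(f"{k} -> {v}" for k, v in c.glossary.items()),
    )
```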

There is one legitimate case for true prompt-as-data: developer-platform products where the prompt itself is the product. If you are selling a "build your own agent" tool, your customer's prompt is data by definition — it is what they bought. In that case, the eval-matrix explosion is the customer's problem, not yours, and you should architect the platform to make that explicit.

A decision framework you should apply before MVP

Pick the storage class that matches your answers to four questions, in this order.

First, how often will the prompt change relative to your deploy cadence? If less often, prompt-as-code. If more often by a factor of 10 or more, prompt-as-config. The 10x threshold matters because below it, the friction of code review is not the bottleneck; above it, it dominates everything else.

Second, who needs to be able to change the prompt without an engineer? If only engineers, code. If product owners or domain experts, config with a review workflow built around the config store. If end customers or tenants, you do not want raw prompts — you want bounded customization fields, with the master prompt staying in code or config.

Third, what happens if a prompt change ships at 3am with no review? If the answer is "nothing, we'll catch it tomorrow," config is fine. If the answer is "regulated industry, we need an audit trail and a compliance officer's signoff," config is fine but only with explicit review tooling — and you should be honest that you will need to build that tooling, because it does not come free with prompt-management SaaS. If the answer involves the words "kill switch," consider whether prompt-as-code is just simpler.

Fourth, what is the eval scope you can afford to maintain? A single global prompt has one eval suite. A handful of A/B variants has a small multiple. Per-environment config (staging vs. production) doubles the matrix at most. Per-tenant prompts multiply by N, where N is the number of tenants and grows monotonically. If you cannot commit to running N eval suites, do not architect for N prompts.
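Compressed into code, the framework might look like the following sketch, where the ordering mirrors the four questions and the thresholds and return strings are illustrative rather than prescriptive:

```python
def recommend_prompt_storage(
    prompt_changes_per_deploy: float,  # Q1: change frequency relative to deploys
    changed_by: str,                   # Q2: "engineers" | "product" | "customers"
    unreviewed_change_is_acceptable: bool,  # Q3: the 3am test
    affordable_eval_suites: int,       # Q4: eval scope you can actually maintain
    distinct_prompt_instances: int,
) -> str:
    if changed_by == "customers":
        return "bounded customization fields in data; master prompt in code or config"
    if distinct_prompt_instances > affordable_eval_suites:
        return "do not architect for this many prompts"
    if prompt_changes_per_deploy < 10 and changed_by == "engineers":
        return "prompt as code"
    if unreviewed_change_is_acceptable:
        return "prompt as config"
    return "prompt as config, plus the review tooling you will have to build"
```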

The migration cost when you realize you picked wrong

The unpleasant fact is that all three migrations are expensive in their own way, and the cost is asymmetric.

Migrating from code to config is the cheapest of the three. You stand up the config store, mirror the in-code prompt into it, ship a release where the application reads from config with the in-code value as fallback, and gradually cut over. The hard part is not the migration; it is building the review and eval discipline around the new config store, which most teams underinvest in for six months.
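A sketch of that cutover step, with a hypothetical flag and config client; the point is the fallback ordering, not the specific names:

```python
IN_CODE_PROMPT = "You are a customer-support assistant for Acme..."  # last reviewed in-repo value

def get_prompt_during_cutover(config_client, use_config_store: bool) -> str:
    # Phase 1: mirror the in-code prompt into the config store, byte-identical.
    # Phase 2: flip use_config_store per environment; fall back on any failure.
    # Phase 3: once stable, the in-code value becomes the documented fallback.
    if not use_config_store:
        return IN_CODE_PROMPT
    try:
        return config_client.get("support_agent_prompt") or IN_CODE_PROMPT
    except Exception:
        return IN_CODE_PROMPT
```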

Migrating from config to code is rare but happens — usually when a team realizes their config-stored prompts have drifted untested for months and they want to re-impose CI gates. The migration is mechanical; the real cost is renegotiating with the product owners who got used to fast iteration and now have to file PRs again.

Migrating from data to code-or-config is the painful one. You have to look at every tenant prompt, classify it (legitimate customization vs. accidental complexity vs. injection vector), design a structured-customization API that captures the legitimate cases, write a migration that maps the unstructured prompts onto the structured fields, and tell the customers whose prompts do not map that you are taking away functionality they were using. This is a multi-quarter project, and it is the migration teams put off for years because every quarter the cost goes up.

The takeaway is not that one of these storage classes is correct in the abstract. It is that the choice should be made on purpose, with the four questions above answered explicitly, before the first version of the product ships. The teams that get this right are not the ones who pick the most flexible option; they are the ones who pick the option that matches the constraints of the product they are actually building, and who revisit the choice when those constraints change.
