
Why Your Prompt Library Should Be a Monorepo, Not a Cookbook

Tian Pan · Software Engineer · 11 min read

A team I worked with recently had three different "summarize this contract" prompts. One lived in a Notion page that the legal-tech squad copy-pasted into their service. One lived in a prompts/ folder in the customer-success backend, slightly modified to handle their tone preferences. One lived inline in a Python file inside the data team's notebook, hardcoded between two f-string interpolations. When OpenAI deprecated the model they all ran on, the migration plan involved Slack archaeology — each owner had to be tracked down, each variant had to be re-evaluated, and two of the three subtly broke in production for a week before anyone noticed.

This is what a prompt cookbook looks like at scale. Cookbooks make sense for ten prompts and one team. They become unmanageable somewhere around a hundred prompts and four teams. By the time you're running an AI organization, your prompts/ folder of .md files behaves exactly like vendored copy-paste code from 2008: every consumer has its own snapshot, drift is invisible, and breaking changes ripple outward in unpredictable ways.

The fix isn't a fancier folder structure. It's recognizing that prompts are code — and applying the engineering disciplines we already developed for code: semantic versioning, dependency graphs, atomic cross-package refactors, and build-time validation. In other words, your prompt library should be a monorepo, not a cookbook.

The Cookbook Failure Mode

Most teams start by writing prompts directly into application code. A method calls client.messages.create(...) with a few hundred tokens of instructions wrapped in a textwrap.dedent("""...""") string. This works until the second team needs the same capability. Then the prompt gets copy-pasted, edits diverge, and within months the same instruction has forked into four versions that nobody can keep straight.
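
To make the anti-pattern concrete, here's roughly what that starting point looks like, assuming the Anthropic Python SDK; the model pin and prompt text are illustrative:

```python
import textwrap

import anthropic

client = anthropic.Anthropic()

def summarize_contract(contract_text: str) -> str:
    # The prompt lives inline, invisible to every other team that needs it.
    template = textwrap.dedent("""\
        You are a contracts analyst. Summarize the key obligations,
        deadlines, and termination clauses in the contract below.
    """)
    response = client.messages.create(
        model="claude-sonnet-4-5",  # illustrative pin
        max_tokens=1024,
        messages=[{"role": "user", "content": template + "\n" + contract_text}],
    )
    return response.content[0].text
```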

The next stage of evolution is a prompts/ folder. This feels like progress — at least the prompt has a home. But a folder of markdown files is a cookbook, not a library. There's no notion of which version of the prompt the production service is using. There's no way to ask "who consumes this prompt, and what would break if I changed it?" There's no automated check that the prompt still works after an edit. The folder is just a shared scratchpad with worse ergonomics than the inline string it replaced.

Three failure modes show up reliably once a cookbook crosses about fifty prompts:

  • Silent drift. Someone edits a prompt for one consumer's edge case, and a different consumer that depended on the original behavior degrades silently. By the time someone notices, the commit history has thirty more changes layered on top.
  • Phantom dependencies. A prompt gets deleted because nobody on the platform team knows which service still imports it. A week later, a customer-facing flow starts returning empty strings because the file it loaded at startup is gone.
  • Migration paralysis. When the underlying model deprecates, you can't move forward because nobody knows what production actually depends on. Every prompt has to be hand-traced to its callers and individually re-evaluated against unknown success criteria.

Each of these is a classic vendored-code problem. We solved them in the source-code world by building dependency graphs, version pins, and atomic refactor tooling. Prompts deserve the same treatment.

What "Monorepo Discipline" Actually Means for Prompts

Calling something a monorepo doesn't mean putting all your code in one Git repository. The Linux kernel has been one repo for decades and that's not what people mean by "monorepo discipline." The discipline is a set of properties that emerge when you have one shared history, one coherent dependency graph, and tooling that understands both.

Translated to prompts, the properties look like this:

  • Pinned versions. Every consumer references a specific, immutable version of a prompt. When you edit, you cut a new version. The old version still exists. Nothing in production silently rebuilds against unreviewed changes.
  • Semantic boundaries. Each prompt has one purpose, one owner, and one set of expected inputs and outputs. Multi-purpose prompts that serve four different use cases get split into siblings rather than accumulating conditional branches.
  • Reverse-dependency search. From any prompt file, you can answer "who calls this?" in seconds. From any service, you can answer "which prompt versions does this depend on?" without grep archaeology.
  • Atomic cross-consumer refactors. Renaming a prompt, changing its output schema, or migrating it to a new model is one PR that updates the prompt and every consumer simultaneously. No staged migrations, no compatibility shims, no week-long Slack threads.
  • Build-time validation. Every PR that modifies a prompt runs the consumer's eval suite against the new version before merge. A regression below the threshold blocks the merge automatically.
  • Downstream impact analysis. When you modify a prompt, CI tells you which consumers are affected and runs their evals — not just the prompt's own tests.

None of these properties is exotic. They're table stakes for shared library code. The only novelty is applying them to text artifacts that historically lived in markdown files.
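
As a sketch of what machine-readable pins and reverse-dependency edges can look like (the manifest shape and names here are assumptions, not any particular tool's schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptRef:
    prompt_id: str  # stable identifier, e.g. "legal/summarize_contract"
    version: str    # pinned semantic version, immutable once published

# Each consumer declares its pins in a manifest. The dependency graph is
# derived from these declarations, not from grepping open(...) calls.
PINS: dict[str, list[PromptRef]] = {
    "customer-success/backend": [PromptRef("legal/summarize_contract", "2.1.0")],
    "data/notebooks": [PromptRef("legal/summarize_contract", "2.1.0")],
}

def consumers_of(prompt_id: str) -> list[str]:
    """Reverse-dependency search: who calls this prompt?"""
    return [svc for svc, refs in PINS.items()
            if any(ref.prompt_id == prompt_id for ref in refs)]
```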

Semantic Versioning for Prompts

The most concrete starting point is treating prompt edits the way you treat library releases. Semantic versioning maps cleanly to prompts:

  • Patch (Z): Typo fixes, formatting tweaks, clarifications that don't change behavior on the eval set. Consumers can upgrade automatically.
  • Minor (Y): New capabilities or instructions that maintain backward compatibility. Existing consumer outputs stay valid; new behavior is additive. Consumers can upgrade after a quick eval pass.
  • Major (X): Output schema changes, model swaps, structural rewrites, or anything that breaks existing parsing logic downstream. Consumers must explicitly opt in and re-test.

This isn't bureaucracy for its own sake. It maps onto a real distinction your CI can enforce: a patch release is allowed if the eval suite passes within tolerance; a minor release requires the suite to pass and adds new test cases; a major release requires the consumer to update their pinned version and rerun their own evals.
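
A sketch of how a CI job might encode that rule; the three flags are assumptions about what your pipeline can already observe:

```python
def release_allowed(bump: str, evals_pass: bool, new_cases_added: bool,
                    consumers_repinned: bool) -> bool:
    """Map semver bump types to the merge gates described above."""
    if bump == "patch":
        return evals_pass                      # suite passes within tolerance
    if bump == "minor":
        return evals_pass and new_cases_added  # suite passes and grows
    if bump == "major":
        return consumers_repinned              # consumers opt in and re-test
    raise ValueError(f"unknown bump type: {bump!r}")
```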

A subtle but important corollary: the version is not just the prompt text. Two prompts with identical text but different model targets, different temperatures, or different tool definitions are different versions. The execution context matters because the same string against claude-opus-4-7 and claude-haiku-4-5 produces measurably different behavior. Pin the whole context, not just the words.
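
One way to enforce that is to derive the version identity from the full execution context, so two configurations can never collide under one version. A minimal sketch, with the fingerprint fields as assumptions:

```python
import hashlib
import json

def version_fingerprint(text: str, model: str, temperature: float,
                        tools: list[dict]) -> str:
    """Identity covers the whole execution context, not just the words:
    same text + different model = different version."""
    payload = json.dumps(
        {"text": text, "model": model, "temperature": temperature, "tools": tools},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:12]
```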

Dependency Graphs and Atomic Refactors

The single most underrated benefit of monorepo discipline is the ability to make a change everywhere in one commit. If a shared utility's signature changes, the utility and every caller update in the same PR. There's no period during which some callers are migrated and others aren't, no sequenced merges, no compatibility dance.

Prompts benefit from this even more than code does, because prompt changes are harder to verify than type-checked function signatures. Imagine you decide that summarize_contract.v3.md should now return JSON with a new risk_level field instead of a free-text summary. In a cookbook world, you edit the file, four consumers' parsers break in production over the next two days, and you spend the rest of the week fixing them serially. In a monorepo world, the PR that introduces v3 also updates every consumer's parser to handle the new schema, runs every consumer's eval suite, and only merges if every consumer passes.

This requires real tooling. A flat prompts/ folder doesn't know that customer_success/onboarding_email.py imports prompts/welcome_message.md. You need either explicit imports (treat each prompt as a module with a stable identifier and have consumers reference it by ID + version) or a build-time resolver that traces references. Both work; the key is that the dependency edges are machine-readable rather than buried in open(...) calls.

A useful test of whether you have this property: can you delete a prompt and have CI fail at any consumer that still references it? If yes, you have a dependency graph. If no, you have a folder.
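
The delete test is simple enough to sketch as a CI gate; the pin and registry shapes here are the same assumptions as above:

```python
import sys

def check_refs(pins: dict[str, list[tuple[str, str]]],
               published: set[tuple[str, str]]) -> int:
    """Fail the build if any consumer pins a prompt version that no longer exists."""
    dangling = [(svc, ref) for svc, refs in pins.items()
                for ref in refs if ref not in published]
    for svc, (pid, ver) in dangling:
        print(f"{svc} references missing {pid}@{ver}", file=sys.stderr)
    return 1 if dangling else 0

if __name__ == "__main__":
    pins = {"customer-success/backend": [("legal/summarize_contract", "2.1.0")]}
    published = {("legal/summarize_contract", "3.0.0")}  # 2.1.0 was deleted
    sys.exit(check_refs(pins, published))
```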

Build-Time Validation: Evals as the Compiler

Type checkers catch a class of programming errors at build time that runtime tests would miss. Prompts have an analogous validation tier — eval suites that run on every PR — but most teams treat evals as something you do occasionally, in a notebook, after a major change.

This is upside-down. Evals should run on every prompt-touching PR, the same way tests run on every code-touching PR. The prevailing 2025 pattern is concrete: a developer opens a PR with a modified prompt, GitHub Actions kicks off a workflow that runs the new version against a golden dataset, and the workflow blocks the merge if scores fall below the production baseline. Tools like Promptfoo and LLM-as-a-judge frameworks have made this cheap enough that there's no excuse for skipping it.
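
A minimal version of that gate, runnable as a CI step; the phrase-overlap scorer is a deliberately dumb stand-in for a real harness such as Promptfoo or an LLM judge, and the model pin is illustrative:

```python
import json
import sys

import anthropic

BASELINE = 0.85  # production baseline this prompt must not regress below
client = anthropic.Anthropic()

def generate(prompt: str, case_input: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-5",  # illustrative pin
        max_tokens=1024,
        messages=[{"role": "user", "content": f"{prompt}\n\n{case_input}"}],
    )
    return response.content[0].text

def score_case(output: str, case: dict) -> float:
    # Stand-in scorer: fraction of expected phrases present in the output.
    expected = case["expected_phrases"]
    return sum(phrase in output for phrase in expected) / len(expected)

def main(prompt_path: str, golden_path: str) -> int:
    prompt = open(prompt_path).read()
    cases = [json.loads(line) for line in open(golden_path)]
    scores = [score_case(generate(prompt, c["input"]), c) for c in cases]
    mean = sum(scores) / len(scores)
    print(f"mean score {mean:.3f} over {len(cases)} cases (baseline {BASELINE})")
    return 0 if mean >= BASELINE else 1  # non-zero exit blocks the merge

if __name__ == "__main__":
    sys.exit(main(sys.argv[1], sys.argv[2]))
```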

A few details that separate teams who do this well from teams who do it badly:

  • Version the eval set with the prompt. A prompt that passed yesterday's eval set isn't the same as a prompt that passed today's. If the eval set drifts independently of the prompt, you'll quietly migrate to easier tests over time.
  • Set explicit thresholds, not vibes. "Looks fine" is not a merge gate. "Factuality ≥ 0.85, no individual case below 0.6" is.
  • Run consumer evals, not just prompt evals. A prompt change that passes its own eval but breaks downstream parsing is still a regression.
  • Track score history. A 2% drop on every PR is invisible in isolation and catastrophic over six months. Your eval results need a time series, not a single pass/fail bit (a sketch follows this list).
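
Here's what a minimal time series could look like, assuming eval runs append to a shared history file; the window size and tolerance are illustrative knobs:

```python
import json
import time

HISTORY = "eval_history.jsonl"  # one record per CI run, per prompt
DRIFT_WINDOW = 20               # compare against the trailing N runs
DRIFT_TOLERANCE = 0.03          # cumulative drop no single PR would reveal

def record_and_check(prompt_id: str, version: str, mean_score: float) -> bool:
    """Append this run to the history and flag slow cumulative drift
    that per-PR pass/fail gates can't see."""
    with open(HISTORY, "a") as f:
        f.write(json.dumps({"ts": time.time(), "prompt": prompt_id,
                            "version": version, "score": mean_score}) + "\n")
    runs = [json.loads(line) for line in open(HISTORY)]
    window = [r["score"] for r in runs if r["prompt"] == prompt_id][-DRIFT_WINDOW:]
    return mean_score >= max(window) - DRIFT_TOLERANCE  # False = drifting down
```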

The compiler analogy holds in another way: a type checker's value isn't that it catches every bug; it's that it makes a class of bugs structurally impossible. Eval gates work the same way. They make "we shipped a prompt change without checking" structurally impossible.

Migrating From a Cookbook

You don't need to rewrite everything at once. The migration that tends to work has a specific shape:

  1. Inventory. Grep for every prompt-shaped string in your codebase. Most teams discover 2–3× more prompts than they thought existed. Many are duplicates.
  2. De-duplicate aggressively. Three prompts that do the same thing should become one prompt with three consumers. This is the first place real monorepo benefits show up, so do the de-duplication up front rather than deferring it.
  3. Pin and version. Move each surviving prompt into a registry with explicit IDs. Update consumers to reference the ID + version, not the file path. Even a flat directory plus a version.lock file is a step up (sketched after this list).
  4. Wire up reverse-dependency search. Build (or buy) tooling that, given a prompt ID, can list every consumer. This is non-negotiable for the next steps.
  5. Add eval gates per consumer. Each consumer brings its own golden dataset and threshold. The prompt repository runs them automatically when its prompts change.
  6. Make atomic refactors the default. Stop accepting PRs that change a prompt without updating consumers. Stop accepting PRs that change a consumer's parsing logic without updating the prompt's eval set.
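
As a sketch of step 3's minimum viable form, here's a consumer-side loader built on a flat directory plus a lock file; the layout and lock format are assumptions:

```python
import hashlib
import json
from pathlib import Path

REGISTRY = Path("prompts")  # flat layout: prompts/<id>/<version>.md
LOCK = json.loads(Path("version.lock").read_text())
# e.g. {"legal/summarize_contract": {"version": "2.1.0", "sha256": "9f2a..."}}

def load_prompt(prompt_id: str) -> str:
    """Resolve a prompt by ID + pinned version and verify it hasn't been
    edited in place (published versions are supposed to be immutable)."""
    pin = LOCK[prompt_id]
    text = (REGISTRY / prompt_id / f"{pin['version']}.md").read_text()
    if hashlib.sha256(text.encode()).hexdigest() != pin["sha256"]:
        raise RuntimeError(f"{prompt_id}@{pin['version']} hash mismatch")
    return text
```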

The endpoint isn't a particular tool — MLflow's prompt registry, LangSmith, Braintrust, Promptfoo, or a homegrown system all work. The endpoint is the discipline: prompts are first-class modules with versions, owners, dependency graphs, and CI gates. The folder of markdown files was a placeholder. The monorepo is the real thing.

What Changes Once You Have It

The point of all this isn't tooling for tooling's sake. It's that a properly versioned prompt library lets your AI organization move at engineering speed instead of archaeology speed.

Model migrations stop being multi-week incidents. When the next deprecation hits, you grep your registry, see exactly which prompts and consumers are affected, run the eval suite against the new model, and ship the migration as a single PR with green CI. The work that used to take three teams two sprints takes one engineer one afternoon.

Cross-team reuse becomes real. A great extraction prompt that legal built can actually be picked up by HR without legal needing to support a fork. The version pin protects them both — legal can iterate on its consumer's behalf without breaking HR's, and HR can opt into improvements on its own schedule.

And the failure mode I opened with — three subtly different versions of the same prompt drifting silently — stops happening at all. Not because anyone is paying more attention, but because the dependency graph and the CI gates make it impossible to ship the divergence in the first place. That's the point of engineering discipline: the system catches the mistakes you would have made.

A folder of markdown files works for a hobby project. The moment your prompts are production infrastructure, treat them like production infrastructure.
