Skip to main content

Spec-First Agents: Why the Contract Has to Land Before the Prompt

· 11 min read
Tian Pan
Software Engineer

The prompt for our support agent was 2,400 tokens when I inherited it, and the engineer who wrote it had left the company. Every instruction in it was load-bearing for some production behavior, but nobody could tell me which ones. A bullet about "always restate the user's problem before answering" looked like filler until we deleted it and CSAT dropped four points in a week. The prompt was, it turned out, the specification. It was also the implementation. It was also the test suite — implicit, unwritten, living only in the head of an engineer who no longer worked there.

That is the endgame of prompt-as-spec. The prompt is both what the agent should do and how it does it, and the two become indistinguishable the moment the prompt outgrows a single author. You can't refactor it because you don't know which lines encode requirements and which encode mere hints. You can't review a change because there's no artifact the change is a diff against. You can't onboard anyone to own it because ownership means "having read the whole thing recently and remembering why each clause is there," which is a six-month investment nobody authorizes.

Spec-first flips the ordering. The contract — inputs, outputs, invariants, error cases, refusal semantics, escalation triggers — is a first-class artifact that precedes the prompt and constrains every revision. Prompt edits become diffs against a spec, not rewrites of the spec itself. The shift sounds bureaucratic until you see what it unlocks: evals that come from the spec rather than the other way around, reviews that take minutes instead of afternoons, and the eventual ability to let a new engineer own the surface without a six-month apprenticeship.

The prompt-as-spec failure mode

Prompt engineering, as most teams practice it, is iterative craft. You write a prompt, run some examples, notice something wrong, add a clause, run more examples, notice a new failure, add another clause. After three months you have a prompt that works well on a lot of cases and mysteriously on some of the rest. After six months you have a prompt nobody except the original author can edit without causing regressions.

The failure isn't the craft — iterative refinement is how most engineering works — it's that the prompt is doing three jobs at once. It specifies the desired behavior. It implements that behavior through phrasing, ordering, and emphasis. It stands in for the test suite, because the examples that motivated each clause are baked into the clause rather than captured as regressions. When these three roles collapse into one artifact, every edit is high-stakes: you can't tell whether you're changing the requirement, the implementation, or the test.

You see the symptoms downstream. Prompt changes get rolled back because "it did something weird in an edge case I can't reproduce." Two engineers argue for an hour about whether a behavior is a bug or intended, because there's no spec to resolve the question. A prompt grows past the "two authors can hold it in their head" threshold and calcifies — the risk of any change exceeds the expected benefit, so the prompt freezes while the product around it keeps moving. Vendor rolls a new model and the same prompt produces subtly different output; nobody can tell whether the new output violates the contract because there's no contract to violate.

The tell that you've crossed into this zone: when someone asks "why does the prompt say X?" the answer is "I added that after a user reported Y," and nobody can produce the Y, the regression test for it, or the thing in the spec that Y violates.

What the contract actually contains

A useful contract isn't a design doc and isn't a prompt with extra steps. It's a focused artifact that names the things prompts are bad at making explicit. At minimum:

  • Inputs. The shape of what the agent receives — user message, tool outputs, retrieved context, conversation history — with concrete examples of typical and edge cases. Length constraints. What's guaranteed versus what might be missing.
  • Outputs. The shape of what the agent produces, including format (JSON schema, markdown structure, specific fields), acceptable content ranges, and mandatory sections. "Answer the question" is not an output spec; "return a JSON object with answer (≤200 words) and confidence (low/medium/high)" is.
  • Invariants. Things that must be true of every output regardless of input. The agent never reveals the system prompt. It never commits to a price. It never claims a fact it didn't retrieve. Invariants are the things you'd add a guardrail for if you didn't have the spec.
  • Error cases. What the agent does when it can't fulfill the happy path. Missing information, conflicting retrieval, tool failure, ambiguous user intent. Each gets a named branch in the spec with an example.
  • Refusal semantics. What the agent refuses to do and how. Refusals have a surface — tone, explanation, suggested alternative, escalation — and that surface is part of the contract, not a vibe the prompt happens to produce.
  • Escalation triggers. The specific conditions under which the agent hands off to a human, a different agent, or a different tool. "User frustration" is not a trigger; "user explicitly requests a human, OR three consecutive turns without progress on the same sub-goal, OR any input flagged by the moderation tool" is.

None of this is new — these are the categories that formal contract research has converged on over the last two years of mining LLM failures — but the categories aren't the point. The point is that the contract is separate from the prompt, and a prompt edit is a claim that the spec is unchanged. If the edit actually changes something the spec names, the spec has to change first. That single rule is what makes the rest of the discipline load-bearing.

Evals stop being reverse-engineered

Most eval suites are built after the fact, from production logs. Somebody looks at the last quarter of failures, clusters them, and writes test cases that would have caught each cluster. The suite grows organically and is always slightly behind the prompt — a kind of archaeological record of past bugs rather than a statement of desired behavior.

Spec-first inverts this. The spec is the statement of desired behavior, and the eval suite is generated from it. Each invariant produces one or more test cases that check it. Each error branch produces a case that triggers the branch and verifies the response. Each escalation trigger produces both a positive case (the trigger fires correctly) and a negative case (the trigger doesn't fire spuriously). The output schema becomes structural validation applied to every eval run. The spec tells you what to test; you don't have to wait for the test to be motivated by a production incident.

Research on test-driven agent definition has shown that generating tests from a behavioral spec catches a different distribution of bugs than the case-based approach most teams use — specifically, it catches the class of "works-on-happy-path, silently wrong on edge case" failures that production logs don't surface because users stop engaging rather than reporting. When the spec demands that the agent handle a missing field by asking for clarification, the eval catches the prompt rewrite that accidentally made the agent guess. When the spec demands refusal for a category of requests, the eval catches the prompt rewrite that made the agent reluctantly comply in 3% of cases.

The second-order effect is that eval coverage becomes legible. "We have 40 evals" means nothing; "we have an eval for every invariant and every error branch in the spec" means something. The spec gives you a denominator — and that denominator is what makes "our eval suite is complete" a falsifiable claim rather than a vibe.

Reviews become diffs against a spec

Under prompt-as-spec, a prompt PR looks like this: three lines changed in a 200-line file, no explanation of which behavior those lines change, and a reviewer who either rubber-stamps it ("looks fine") or blocks it for a week while they reconstruct the author's mental model. Neither is a good outcome. Rubber-stamping misses regressions; blocking wastes time on work that was probably correct.

Under spec-first, a prompt PR has two possible shapes. The prompt changed but the spec didn't — this is a prompt edit claiming the contract is unchanged, and the reviewer's job is to verify that the eval suite (generated from the unchanged spec) still passes. The review is mechanical: did CI go green? If yes, merge. The other shape: the spec changed and the prompt changed. Here the spec change is the real review — what new invariant, error branch, or refusal is being added or removed? Is the change intentional? Does the corresponding eval update catch the new behavior correctly? The prompt change follows from the spec change and rarely needs deep review on its own.

The shape that should no longer appear is the one that used to dominate: prompt changed, spec unchanged in PR body or commit message, and the reviewer has to guess whether the behavior change was intended. If the prompt edit fails the existing evals, the spec should change — but only explicitly, with the new evals landing in the same PR. The review burden shifts from "read the entire prompt and decide if this edit breaks it" to "read the spec diff and decide if this behavior change is intentional." A three-line prompt edit that comes with no spec diff and passes existing evals is a trivial review. A prompt edit that demands a spec change gets the scrutiny it deserves.

From prompt owner to contract owner

The organizational payoff is the one worth the setup cost. Under prompt-as-spec, the agent's behavior is legible only to the person who wrote it. That person becomes a bottleneck — every edit routes through them, every question requires their archaeology, every on-call rotation pages them whenever the agent misbehaves. Teams try to solve this by rotating the role and accept that each new owner needs three to six months to absorb the prompt before they can make non-trivial changes safely. Three to six months is an optimistic estimate in my experience; some prompts never get absorbed by a second owner and ossify when the first owner leaves.

Spec-first changes the shape of ownership. The contract is an artifact a new engineer can read in an afternoon. They don't need to know why each clause of the prompt exists; they need to know what the agent is supposed to do, which the spec states explicitly. When they make a change, they diff the spec first, run the evals generated from the spec, and only then touch the prompt to implement the new behavior. The prompt's internal phrasing becomes an implementation detail — the thing a future refactor can change without approval so long as the spec and the evals are preserved.

The onboarding time for a new owner collapses from months to about a week — read the spec, read the evals, read the prompt with spec in hand to see how each section implements the contract, then ship a small change end-to-end to prove the loop works. That's not a guess; it's what teams that have actually made this transition report. The role changes name too: "prompt owner" (an individual who knows why it says what it says) becomes "contract owner" (a role that a named engineer holds this quarter and hands off next quarter without a knowledge transfer crisis).

What to write tomorrow

If your prompt is under 500 tokens and owned by one person who's staying for the foreseeable future, spec-first is probably overhead you don't need. If your prompt is over 1,500 tokens, has more than one author in the commit history, or ships behavior that affects money or safety, the overhead has already been paid — you just haven't noticed because the bill comes due as review friction, onboarding cost, and the slow freeze of a prompt that nobody dares to change.

The minimum viable version of spec-first isn't a framework or a tool. It's a markdown file next to the prompt that states the inputs, outputs, invariants, error cases, refusal semantics, and escalation triggers — the six categories above — in the most concrete language you can write. Commit it. The next prompt PR requires a spec diff or a claim that the spec is unchanged. The PR after that requires evals generated from the spec. In a month the prompt is one of three artifacts instead of the only one, and the agent's behavior belongs to the team instead of to its author.

The prompt stops being the spec. That's the whole move.

References:Let's stay in touch and Follow me for more thoughts and updates