
Spec-First Agents: Why the Contract Has to Land Before the Prompt

· 11 min read
Tian Pan
Software Engineer

The prompt for our support agent was 2,400 tokens when I inherited it, and the engineer who wrote it had left the company. Every instruction in it was load-bearing for some production behavior, but nobody could tell me which ones. A bullet about "always restate the user's problem before answering" looked like filler until we deleted it and CSAT dropped four points in a week. The prompt was, it turned out, the specification. It was also the implementation. It was also the test suite — implicit, unwritten, living only in the head of an engineer who no longer worked there.

That is the endgame of prompt-as-spec. The prompt is both what the agent should do and how it does it, and the two become indistinguishable the moment the prompt outgrows a single author. You can't refactor it because you don't know which lines encode requirements and which encode mere hints. You can't review a change because there's no artifact the change is a diff against. You can't onboard anyone to own it because ownership means "having read the whole thing recently and remembering why each clause is there," which is a six-month investment nobody authorizes.

Spec-first flips the ordering. The contract — inputs, outputs, invariants, error cases, refusal semantics, escalation triggers — is a first-class artifact that precedes the prompt and constrains every revision. Prompt edits become diffs against a spec, not rewrites of the spec itself. The shift sounds bureaucratic until you see what it unlocks: evals that come from the spec rather than the other way around, reviews that take minutes instead of afternoons, and the eventual ability to let a new engineer own the surface without a six-month apprenticeship.

The prompt-as-spec failure mode

Prompt engineering, as most teams practice it, is iterative craft. You write a prompt, run some examples, notice something wrong, add a clause, run more examples, notice a new failure, add another clause. After three months you have a prompt that works well on a lot of cases and mysteriously on some of the rest. After six months you have a prompt nobody except the original author can edit without causing regressions.

The failure isn't the craft — iterative refinement is how most engineering works — it's that the prompt is doing three jobs at once. It specifies the desired behavior. It implements that behavior through phrasing, ordering, and emphasis. It stands in for the test suite, because the examples that motivated each clause are baked into the clause rather than captured as regressions. When these three roles collapse into one artifact, every edit is high-stakes: you can't tell whether you're changing the requirement, the implementation, or the test.

You see the symptoms downstream. Prompt changes get rolled back because "it did something weird in an edge case I can't reproduce." Two engineers argue for an hour about whether a behavior is a bug or intended, because there's no spec to resolve the question. A prompt grows past the "two authors can hold it in their head" threshold and calcifies — the risk of any change exceeds the expected benefit, so the prompt freezes while the product around it keeps moving. The vendor rolls out a new model and the same prompt produces subtly different output; nobody can tell whether the new output violates the contract because there's no contract to violate.

The tell that you've crossed into this zone: when someone asks "why does the prompt say X?" the answer is "I added that after a user reported Y," and nobody can produce the Y, the regression test for it, or the thing in the spec that Y violates.

What the contract actually contains

A useful contract isn't a design doc and isn't a prompt with extra steps. It's a focused artifact that names the things prompts are bad at making explicit. At minimum:

  • Inputs. The shape of what the agent receives — user message, tool outputs, retrieved context, conversation history — with concrete examples of typical and edge cases. Length constraints. What's guaranteed versus what might be missing.
  • Outputs. The shape of what the agent produces, including format (JSON schema, markdown structure, specific fields), acceptable content ranges, and mandatory sections. "Answer the question" is not an output spec; "return a JSON object with answer (≤200 words) and confidence (low/medium/high)" is.
  • Invariants. Things that must be true of every output regardless of input. The agent never reveals the system prompt. It never commits to a price. It never claims a fact it didn't retrieve. Invariants are the things you'd add a guardrail for if you didn't have the spec.
  • Error cases. What the agent does when it can't fulfill the happy path. Missing information, conflicting retrieval, tool failure, ambiguous user intent. Each gets a named branch in the spec with an example.
  • Refusal semantics. What the agent refuses to do and how. Refusals have a surface — tone, explanation, suggested alternative, escalation — and that surface is part of the contract, not a vibe the prompt happens to produce.
  • Escalation triggers. The specific conditions under which the agent hands off to a human, a different agent, or a different tool. "User frustration" is not a trigger; "user explicitly requests a human, OR three consecutive turns without progress on the same sub-goal, OR any input flagged by the moderation tool" is.
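One way to keep these categories honest is to encode them as data rather than prose, so they can be checked mechanically. Here's a minimal sketch in Python — the class names, the three-turn threshold, and the 200-word limit are illustrative values pulled from the examples above, not a real framework:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationTriggers:
    # Concrete, checkable conditions -- not "user frustration."
    explicit_human_request: bool
    stalled_turns: int  # consecutive turns without progress on the same sub-goal
    moderation_flagged: bool

    def should_escalate(self) -> bool:
        return (
            self.explicit_human_request
            or self.stalled_turns >= 3
            or self.moderation_flagged
        )


@dataclass(frozen=True)
class OutputSpec:
    # "Answer the question" is not a spec; a word cap and an
    # enumerated confidence field are.
    max_answer_words: int = 200
    confidence_levels: tuple = ("low", "medium", "high")

    def validate(self, output: dict) -> list[str]:
        """Return a list of violations; empty means conformant."""
        violations = []
        if len(output.get("answer", "").split()) > self.max_answer_words:
            violations.append(f"answer exceeds {self.max_answer_words} words")
        if output.get("confidence") not in self.confidence_levels:
            violations.append("confidence not in allowed levels")
        return violations
```

The payoff is that "did this edit change the contract?" becomes a diff on these objects, not an argument about phrasing.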

None of this is new — these are the categories that formal contract research has converged on over the last two years of mining LLM failures — but the categories aren't the point. The point is that the contract is separate from the prompt, and a prompt edit is a claim that the spec is unchanged. If the edit actually changes something the spec names, the spec has to change first. That single rule is what makes the rest of the discipline load-bearing.
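Once invariants live in the spec, each one can become a mechanical check run against every recorded output, which is what "evals that come from the spec" means in practice. A minimal sketch — the predicate bodies, the regex, and the `context` shape are assumptions for illustration, not a real eval harness:

```python
import re

# Each invariant from the spec becomes a named predicate over one output.
# These three mirror the examples above: no system-prompt leak, no price
# commitments, no claims beyond what was retrieved.
INVARIANTS = {
    "never_reveals_system_prompt": lambda out, ctx:
        "SYSTEM PROMPT" not in out.upper(),
    "never_commits_to_price": lambda out, ctx:
        not re.search(r"we (?:will|can) (?:refund|charge) \$\d", out, re.I),
    "only_retrieved_claims": lambda out, ctx:
        all(fact in ctx["retrieved"] for fact in ctx.get("claimed_facts", [])),
}


def check_invariants(output: str, context: dict) -> list[str]:
    """Return names of violated invariants; empty means this output
    conforms to the invariant slice of the contract."""
    return [name for name, holds in INVARIANTS.items()
            if not holds(output, context)]
```

Run against a corpus of logged outputs, a failure here points at a named clause in the spec rather than at a vibe, which is what makes the regression reproducible.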
