The Prompt Graph Inside Your Agent: Cross-Prompt Regression Chains Nobody Mapped

Tian Pan · Software Engineer · 11 min read

A senior engineer ships a four-word edit to the planner prompt: "if uncertain, ask first." The planner's own eval set, which grades whether plans are reasonable, moves up by half a point. They merge. Two weeks later, the verifier's eval shows a three-point pass-rate regression and nobody can repro it. The root cause: the planner now asks more clarifying questions; the executor receives shorter task descriptions on the second turn; the verifier's rubric was implicitly tuned against the executor's previous, longer outputs; and an edit nobody flagged as risky has shifted three downstream distributions at once.

This is what happens when you treat the prompts inside an agent as a flat folder of files instead of as a graph with edges. The prompts have owners. The edges between them have nobody.

Agentic systems accumulate prompts the same way services accumulate endpoints. A modest agent has a planner, an executor, a verifier, a summarizer, maybe a router on top and a memory-write prompt at the bottom. Each of those is a string that gets versioned, reviewed, and graded against its own eval set. What is not versioned, reviewed, or graded is the dependency graph — the fact that the planner's output is the executor's input, and the executor's output is the verifier's input, and the verifier's rubric was calibrated against an executor that no longer exists because somebody else changed the executor prompt last sprint.

Prompts Form a Graph; Your Repo Pretends They're a Folder

Walk into most AI codebases and the prompts live in prompts/. Sometimes there's a subdirectory per agent stage. Sometimes there's a versions.yaml next to each one. Almost never is there an artifact that names the edges: which prompt's output is consumed by which prompt's input, what shape that handoff takes, and which eval grades that specific edge rather than the endpoint behavior.

This omission is structural, not lazy. Version control is built around files, and prompts fit that model. The eval suite is built around input-output pairs at the agent boundary, and prompts fit that model too. The graph between prompts is invisible to both tools because nothing in the standard tooling stack requires it to be visible. A new hire reads the prompt directory the way you'd read a folder of microservices with no architecture diagram: each file looks self-contained, the interfaces are implicit, and the only way to learn the topology is to break it.

The closest analogue from systems engineering is the difference between a service and a service mesh. A microservice in isolation is a binary that takes requests and returns responses. A microservice in a mesh is a node with versioned interfaces, contract tests against its consumers, and a deployment process that knows which downstream services to canary when this one ships. The team that runs services without a mesh accumulates incidents from cross-service drift. The team that runs prompts without a graph accumulates regressions from cross-prompt drift. The dynamics are identical; only the noun changes.

The Failure Mode Has Three Hops and Nobody Sees It

The canonical incident: a planner edit lands clean. Its own eval — call it planner_quality.yaml — runs against a fixed set of user queries and grades whether the plan is sensible. The eval moves up by a half-point. The PR is merged on Monday.

What the eval did not measure is that the new planner emits plans with a different distribution of step granularity. Where the old planner produced four medium-grained steps per task, the new one produces six fine-grained ones. The executor consumes these steps verbatim. Six fine-grained steps means six tool calls instead of four, which means more context accumulation, which means that by the time the executor finishes, its response is longer and slightly more discursive than before.

The verifier reads the executor's response. Its rubric — written eight weeks ago by an engineer who has since rotated to another project — was implicitly tuned to expect the older, terser response style. The rubric scores discursive responses lower because the engineer who wrote it found discursive responses harder to grade and unconsciously penalized them. The verifier's pass rate drops by three points. The on-call sees the regression in a dashboard on Wednesday morning. The first instinct is to look at the verifier itself, then the executor, then the model version. The planner is the last suspect because the planner's own eval is green.

This is not a hypothetical. Every team running a multi-prompt agent for longer than a quarter has shipped one of these. The pattern is consistent: the symptom appears two hops away from the cause, the cause looks innocent in isolation, and the artifact that would have connected them — a per-edge regression eval — does not exist.
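
What would a per-edge regression eval even look like? Here is a minimal sketch in Python, assuming the planner emits each plan as a list of step strings and a harness can run both prompt versions over the same fixed query set; the function name and threshold are illustrative, and a real check would track more than step count:

```python
# edge_eval_planner_to_executor.py -- illustrative sketch, not a real tool.
# Assumes a harness has already produced plans (lists of step strings) from
# the baseline and candidate planner prompts on the same fixed queries.
import statistics

def step_granularity_drift(baseline_plans, candidate_plans, max_shift=1.0):
    """Fail the planner->executor edge when mean step count shifts too far."""
    base_mean = statistics.mean(len(plan) for plan in baseline_plans)
    cand_mean = statistics.mean(len(plan) for plan in candidate_plans)
    shift = abs(cand_mean - base_mean)
    return {
        "baseline_mean_steps": base_mean,
        "candidate_mean_steps": cand_mean,
        "shift": shift,
        "pass": shift <= max_shift,
    }
```

In the incident above, the jump from four medium-grained steps to six fine-grained ones trips this check at merge time, which is exactly the signal planner_quality.yaml never looks at.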

What a Prompt-Graph Artifact Actually Contains

The discipline that fixes this starts with a single artifact: a prompt-graph.yaml (or its equivalent) that lives in the repo, is reviewed in every prompt PR, and names every edge between prompts. For each prompt, it lists (see the sketch after this list):

  • The prompt's inputs and where they come from. Not just "user query" but "planner output, executor previous turn, retrieval result from the docs index."
  • The prompt's consumers — every other prompt and every code path that reads this prompt's output.
  • The eval surface that grades each outgoing edge. Not just "does the planner produce good plans" but "does the planner produce plans that the executor can act on without re-asking for clarification."
  • A schema or contract for the handoff. The shape the downstream prompt expects, with at least one example.
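
A minimal sketch of what such a file could look like. The stage names, file paths, schema locations, and eval IDs are all hypothetical, not a prescribed format:

```yaml
# prompt-graph.yaml -- illustrative sketch; stage names, paths, and eval IDs
# are hypothetical, not a prescribed schema.
prompts:
  planner:
    file: prompts/planner.md
    inputs:
      - user_query
      - memory_read.output
    edges:
      - to: executor                         # consumer of the planner's output
        contract: schemas/plan.json          # shape the executor expects
        example: fixtures/plan_example.json  # at least one concrete handoff
        evals:
          - planner_to_executor_actionability  # can the executor act without re-asking?
  executor:
    file: prompts/executor.md
    inputs:
      - planner.output
      - tool_results
    edges:
      - to: verifier
        contract: schemas/executor_response.json
        evals:
          - executor_to_verifier_style_drift   # guards rubric/style coupling
  verifier:
    file: prompts/verifier.md
    inputs:
      - executor.output
    edges: []
```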

This is not exotic. It is a service mesh manifest with the noun changed. The interfaces are versioned. The consumers are enumerated. The contract is explicit. The eval surface for each edge is named.
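
One way to make the contract more than documentation is to validate each handoff against the schema the manifest declares, in CI or at replay time. A minimal sketch using the pyyaml and jsonschema packages, with paths and keys matching the illustrative manifest above:

```python
# validate_edge.py -- illustrative; assumes the sketch manifest above.
import json

import yaml                      # pip install pyyaml
from jsonschema import validate  # pip install jsonschema

def check_handoff(graph_path, producer, consumer, payload):
    """Validate one producer->consumer handoff against its declared contract."""
    with open(graph_path) as f:
        graph = yaml.safe_load(f)
    edge = next(
        e for e in graph["prompts"][producer]["edges"] if e["to"] == consumer
    )
    with open(edge["contract"]) as f:
        schema = json.load(f)
    validate(instance=payload, schema=schema)  # raises ValidationError on drift

# e.g. check_handoff("prompt-graph.yaml", "planner", "executor", plan_payload)
```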

What this artifact buys you, in order of value: a PR author knows which downstream evals to run before merging; a reviewer knows what to ask for; a postmortem author has a topology to point at; a new hire has a map; and the team has a forcing function to stop treating "small prompt tweaks" as zero-blast-radius edits.
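
The first of those is directly mechanizable: a pre-merge hook can walk the consumer edges transitively from the changed prompt and print every eval it reaches. A sketch against the same illustrative manifest; the file name and traversal are assumptions, not a real tool:

```python
# affected_evals.py -- illustrative pre-merge hook for the sketch manifest.
import yaml

def downstream_evals(graph, changed_prompt):
    """Walk consumer edges transitively; collect every eval on a reachable edge."""
    evals, seen, stack = set(), set(), [changed_prompt]
    while stack:
        node = stack.pop()
        if node in seen or node not in graph["prompts"]:
            continue
        seen.add(node)
        for edge in graph["prompts"][node].get("edges", []):
            evals.update(edge.get("evals", []))
            stack.append(edge["to"])
    return sorted(evals)

if __name__ == "__main__":
    with open("prompt-graph.yaml") as f:
        graph = yaml.safe_load(f)
    # A planner edit surfaces the verifier's edge eval too, two hops away.
    print(downstream_evals(graph, "planner"))
```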
