
Plan-and-Execute Is Marketing, Not Contract: Plan Adherence as a First-Class SLI

9 min read
Tian Pan
Software Engineer

The agent printed a five-step plan. Step three said "fetch the user's billing history from the invoices service." The trace shows step three actually called the orders service, joined a stale customer table, and produced a number that looked right. The output passed the eval. The post-mortem found the regression six weeks later, when finance noticed the dashboard had quietly diverged from source-of-truth by 4%.

Nobody wrote a bug. The planner wrote a contract the executor never signed.

This is the failure mode that plan-and-execute architectures bury under their own elegance. The pattern was sold as a way to give agents long-horizon coherence: a strong model drafts a plan, weaker models execute steps, the plan acts as a scaffold. In practice the plan is a marketing artifact — a plausible-looking story emitted at t=0, then promptly invalidated by every interesting thing that happens at t>0. The trace shows the plan. The trace shows the actions. Almost nobody is measuring the distance between them.

The Plan Is a Forecast, Not a Contract

The default mental model is wrong. Most teams treat the planner's output the way they treat a function signature: a binding declaration of intent, with the executor as the implementation. But the planner has no enforcement authority. It emits text. The executor — usually a separate agent loop, often a different model — is free to interpret, skip, reorder, substitute tools, or invent steps that were never planned. Nothing in the LangGraph or deepagents reference architectures requires that the action set be a subset of the plan set. The replanner exists precisely because deviation is expected.
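
To make the lack of enforcement concrete, here is a plan-and-execute loop reduced to its control flow. This is a toy sketch, not LangGraph's or deepagents' API; every name in it is illustrative. The point is what's missing: no line ever compares the chosen action against the plan.

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    task: str
    plan: list[str] = field(default_factory=list)     # planner output: just text
    actions: list[str] = field(default_factory=list)  # what actually executed

def planner(state: AgentState) -> AgentState:
    # Stand-in for the strong model. It emits ordered steps -- as prose.
    state.plan = [
        "call invoices.fetch(user_id)",
        "aggregate invoices by month",
        "format the report",
    ]
    return state

def pick_tool(step: str) -> str:
    # Stand-in for the executor LLM. Here it decides orders.fetch is
    # "close enough" to invoices.fetch: substitution drift in one line.
    return step.replace("invoices.fetch", "orders.fetch")

def executor(state: AgentState) -> AgentState:
    # Note what is absent: no assertion that the chosen tool appears
    # anywhere in state.plan. The plan is input, never a constraint.
    for step in state.plan:
        state.actions.append(pick_tool(step))
    return state

state = executor(planner(AgentState(task="monthly billing report")))
print(state.plan)     # the contract the stakeholder reads
print(state.actions)  # the forecast error nobody measures
```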

What a stakeholder sees in the trace is the plan, sitting at the top of the log, looking authoritative. What actually executed sits below, often spread across dozens of tool calls with parameters that don't textually match the plan's prose. The visual hierarchy lies. The plan reads like a contract. It is a forecast.

This matters because the auditability story for plan-and-execute agents was built on the assumption that the plan tells you what happened. In compliance reviews, in post-incident analysis, in eval triage — engineers and reviewers anchor on the plan and only spot-check the actions. If the plan looks reasonable and the final answer looks reasonable, the middle gets skipped. The middle is where the regressions live.

Why Step Correctness Misses the Failure

The standard agent eval rubric measures step-level correctness: did each tool call return the right kind of result, did each reasoning step stay grounded, did the final answer satisfy the task. These are necessary metrics. They are not sufficient.

A plan-and-execute agent can have 100% step-level correctness and still betray its plan. Consider three real-world failure shapes:

  • Substitution drift: the plan says "use the v2 search API." The executor uses v1 because v1 is cheaper and the result looks similar. Each step is technically correct. The plan is dead.
  • Step elision: the plan says "verify the customer's consent record before sending." The executor decides the verification is implicit in the user's request and skips the step. The output is correct for this case. The compliance story is broken for the next thousand cases.
  • Stealth re-planning: the plan says "if A returns empty, ask the user." The executor instead silently calls a fallback tool and returns a guessed answer. The user-facing output is plausible. The plan claimed escalation; the trace shows none.

Step correctness asks "was each action defensible?" Plan adherence asks "was the agent doing what it told you it was doing?" Those are different questions, and only the second one protects you against the slow drift between stated strategy and actual behavior.
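
The cheapest version of that second question is mechanical. Before paying for an LLM judge, a set comparison over normalized tool intents catches the grossest divergence. A minimal sketch; the normalization is deliberately naive, and a real system would parse structured tool-call records instead of strings:

```python
def normalize(call: str) -> str:
    """Reduce a tool call to a comparable intent: 'search.v1(q=...)' -> 'search.v1'.
    Deliberately naive; a real version parses structured tool-call records."""
    return call.split("(")[0].strip().lower()

def divergence_report(planned: list[str], executed: list[str]) -> dict:
    p = {normalize(s) for s in planned}
    e = {normalize(s) for s in executed}
    return {
        "substituted_or_invented": sorted(e - p),  # ran, but never planned
        "elided": sorted(p - e),                   # planned, but never ran
        "adherence": len(p & e) / len(p) if p else 1.0,
    }

planned = ["search.v2(query)", "verify_consent(user)", "send(email)"]
executed = ["search.v1(query)", "send(email)", "fallback_guess()"]
print(divergence_report(planned, executed))
# {'substituted_or_invented': ['fallback_guess', 'search.v1'],
#  'elided': ['search.v2', 'verify_consent'],
#  'adherence': 0.3333333333333333}
```

All three failure shapes above surface in that report: substitution drift and stealth re-planning land in substituted_or_invented, step elision lands in elided. What the set comparison misses is ordering and conditional steps like "if A returns empty, ask the user"; that gap is where an LLM judge earns its cost.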

The deepeval PlanAdherenceMetric is one early library implementation of this idea — extract the task and stated plan from the trace, then have an LLM judge score how faithfully the actual execution steps mapped onto the plan. The implementation details vary. The shift in framing matters more than the specific metric.
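
If you want to hand-roll the judge instead of adopting the library, the shape of the call is short. Here is a sketch using the OpenAI client; the prompt wording, rubric, and model choice are assumptions for illustration, not deepeval's internals:

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are auditing an agent trace.
Task: {task}
Stated plan: {plan}
Executed steps: {actions}

Score from 0.0 to 1.0 how faithfully the executed steps follow the stated
plan. Penalize substituted tools, skipped steps, and unplanned actions even
if the final answer looks correct.
Reply as JSON: {{"score": <float>, "violations": [<string>, ...]}}"""

def judge_adherence(task: str, plan: list[str], actions: list[str]) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable judge model
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            task=task, plan=plan, actions=actions)}],
    )
    return json.loads(resp.choices[0].message.content)
```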

The Plan-vs-Action Divergence Metric

Treat divergence as a measurable quantity, not a binary judgment. A practical metric needs four ingredients:

  1. Plan extraction. The planner's output must be structured enough to compare against. If your planner emits free-form prose, you're doing string-match QA on hallucination. Force it into a typed schema: ordered steps with explicit tool intents, expected inputs, and explicit success conditions. This is non-negotiable infrastructure.
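
A minimal version of that schema, sketched with pydantic (the field names are illustrative, not a standard):

```python
from pydantic import BaseModel, Field

class PlanStep(BaseModel):
    index: int
    tool: str                   # explicit tool intent, e.g. "invoices.fetch"
    inputs: dict[str, str] = Field(default_factory=dict)  # expected arguments
    success_condition: str      # checkable postcondition, e.g. "rows > 0"

class Plan(BaseModel):
    task: str
    steps: list[PlanStep]

# Constrain the planner to emit this structure (most providers support
# JSON-schema-constrained decoding), then validate instead of string-matching:
raw = """{"task": "monthly billing report",
          "steps": [{"index": 1, "tool": "invoices.fetch",
                     "inputs": {"user_id": "u_123"},
                     "success_condition": "rows > 0"}]}"""
plan = Plan.model_validate_json(raw)
```

Once the plan is typed, divergence stops being a prose comparison and becomes a join between each step's tool field and the trace's tool-call records.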