
Plan-and-Execute Is Marketing, Not Contract: Plan Adherence as a First-Class SLI

9 min read
Tian Pan
Software Engineer

The agent printed a five-step plan. Step three said "fetch the user's billing history from the invoices service." The trace shows step three actually called the orders service, joined a stale customer table, and produced a number that looked right. The output passed the eval. The post-mortem found the regression six weeks later, when finance noticed the dashboard had quietly diverged from source-of-truth by 4%.

Nobody wrote a bug. The planner wrote a contract the executor never signed.

This is the failure mode that plan-and-execute architectures bury under their own elegance. The pattern was sold as a way to give agents long-horizon coherence: a strong model drafts a plan, weaker models execute steps, the plan acts as a scaffold. In practice the plan is a marketing artifact — a plausible-looking story emitted at t=0, then promptly invalidated by every interesting thing that happens at t>0. The trace shows the plan. The trace shows the actions. Almost nobody is measuring the distance between them.

The Plan Is a Forecast, Not a Contract

The default mental model is wrong. Most teams treat the planner's output the way they treat a function signature: a binding declaration of intent, with the executor as the implementation. But the planner has no enforcement authority. It emits text. The executor — usually a separate agent loop, often a different model — is free to interpret, skip, reorder, substitute tools, or invent steps that were never planned. Nothing in the LangGraph or deepagents reference architectures requires that the action set be a subset of the plan set. The replanner exists precisely because deviation is expected.

What a stakeholder sees in the trace is the plan, sitting at the top of the log, looking authoritative. What actually executed sits below, often spread across dozens of tool calls with parameters that don't textually match the plan's prose. The visual hierarchy lies. The plan reads like a contract. It is a forecast.

This matters because the auditability story for plan-and-execute agents was built on the assumption that the plan tells you what happened. In compliance reviews, in post-incident analysis, in eval triage — engineers and reviewers anchor on the plan and only spot-check the actions. If the plan looks reasonable and the final answer looks reasonable, the middle gets skipped. The middle is where the regressions live.

Why Step Correctness Misses the Failure

The standard agent eval rubric measures step-level correctness: did each tool call return the right kind of result, did each reasoning step stay grounded, did the final answer satisfy the task. These are necessary metrics. They are not sufficient.

A plan-and-execute agent can have 100% step-level correctness and still betray its plan. Consider three real-world failure shapes:

  • Substitution drift: the plan says "use the v2 search API." The executor uses v1 because v1 is cheaper and the result looks similar. Each step is technically correct. The plan is dead.
  • Step elision: the plan says "verify the customer's consent record before sending." The executor decides the verification is implicit in the user's request and skips the step. The output is correct for this case. The compliance story is broken for the next thousand cases.
  • Stealth re-planning: the plan says "if A returns empty, ask the user." The executor instead silently calls a fallback tool and returns a guessed answer. The user-facing output is plausible. The plan claimed escalation; the trace shows none.

Step correctness asks "was each action defensible?" Plan adherence asks "was the agent doing what it told you it was doing?" Those are different questions, and only the second one protects you against the slow drift between stated strategy and actual behavior.

The deepeval PlanAdherenceMetric is one early library implementation of this idea — extract the task and stated plan from the trace, then have an LLM judge score how faithfully the actual execution steps mapped onto the plan. The implementation details vary. The shift in framing matters more than the specific metric.

The Plan-vs-Action Divergence Metric

Treat divergence as a measurable quantity, not a binary judgment. A practical metric needs four ingredients (a minimal code sketch follows the list):

  1. Plan extraction. The planner's output must be structured enough to compare against. If your planner emits free-form prose, you're doing string-match QA on hallucination. Force it into a typed schema: ordered steps with explicit tool intents, expected inputs, and explicit success conditions. This is non-negotiable infrastructure.
  2. Action attribution. Every executed tool call must carry a back-reference to the plan step it claims to satisfy. Either the executor emits it (best, hardest) or a post-hoc judge assigns it. Without attribution, you can't compute coverage.
  3. A divergence score. Two components: unmet plan steps (planned-but-never-attempted) and unplanned actions (executed-but-never-planned). Both are normalized over plan length. Track them separately — they fail for different reasons.
  4. A deviation classifier. Not every divergence is a defect. Some are legitimate replanning in response to environment change. The classifier sorts divergences into authorized (replan event with stated reason), tolerated (minor parameter substitution within an allowlist), and unauthorized (everything else).
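
To make that concrete, here is a minimal sketch in Python. Every name in it is illustrative: the `PlanStep` and `Action` shapes, the allowlist, and the assumption that your executor can be made to emit `plan_step_id` back-references are choices you will adapt to your own stack.

```python
from dataclasses import dataclass
from enum import Enum


class Deviation(Enum):
    CONFORMANT = "conformant"      # action matches its plan step
    AUTHORIZED = "authorized"      # replan event with a stated reason
    TOLERATED = "tolerated"        # substitution within an allowlist
    UNAUTHORIZED = "unauthorized"  # everything else


@dataclass
class PlanStep:
    step_id: str
    tool: str              # explicit tool intent, not prose
    expected_inputs: dict
    success_condition: str


@dataclass
class Action:
    tool: str
    inputs: dict
    plan_step_id: str | None = None   # back-reference the executor emits
    replan_reason: str | None = None  # set when a replan produced this action


def classify(action: Action, plan_by_id: dict[str, PlanStep],
             allowlist: dict[str, set[str]]) -> Deviation:
    step = plan_by_id.get(action.plan_step_id or "")
    if step is None:  # unplanned action
        return Deviation.AUTHORIZED if action.replan_reason else Deviation.UNAUTHORIZED
    if action.tool == step.tool:
        return Deviation.CONFORMANT
    if action.tool in allowlist.get(step.tool, set()):
        return Deviation.TOLERATED
    return Deviation.UNAUTHORIZED


def divergence(plan: list[PlanStep], actions: list[Action],
               allowlist: dict[str, set[str]]) -> dict[str, float]:
    plan_by_id = {s.step_id: s for s in plan}
    attempted = {a.plan_step_id for a in actions if a.plan_step_id in plan_by_id}
    labels = [classify(a, plan_by_id, allowlist) for a in actions]
    return {
        # planned-but-never-attempted, normalized over plan length
        "unmet_rate": 1 - len(attempted) / len(plan),
        # executed-but-never-planned, normalized over plan length
        "unplanned_rate": sum(a.plan_step_id is None for a in actions) / len(plan),
        # fraction of executed actions deviating without authorization
        "unauthorized_rate": labels.count(Deviation.UNAUTHORIZED) / len(actions),
    }
```

One design note: the sketch normalizes `unplanned_rate` over plan length so the two components share a denominator; normalizing over action count is equally defensible, as long as you pick one and keep it stable.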

Ship divergence as an SLI with the same rigor as latency or error rate. Set a target. Alert when it breaches. The temptation will be to treat divergence as a quality signal you grade once a month in an offline eval set. Resist that. Drift in this metric is your earliest signal that the planner and executor have stopped agreeing on what the agent is, and offline evals lag prod by weeks.

A useful starting threshold: unauthorized divergence under 2% of executed actions, plan coverage above 90%. Tune from there. The absolute numbers matter less than catching the day they move.
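
Mechanically, shipping that as an SLI can be as small as a gate over the `divergence()` output above. The threshold constants and the alerting hook here are placeholders for whatever your stack uses.

```python
# Starting thresholds from above; tune per task distribution.
UNAUTHORIZED_BUDGET = 0.02  # unauthorized divergence under 2% of executed actions
COVERAGE_FLOOR = 0.90       # plan coverage above 90%


def adherence_breaches(metrics: dict[str, float]) -> list[str]:
    """Return SLI breaches for a window of traces; non-empty means alert."""
    breaches = []
    if metrics["unauthorized_rate"] > UNAUTHORIZED_BUDGET:
        breaches.append(f"unauthorized divergence at {metrics['unauthorized_rate']:.1%}")
    coverage = 1 - metrics["unmet_rate"]
    if coverage < COVERAGE_FLOOR:
        breaches.append(f"plan coverage at {coverage:.1%}")
    return breaches
```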

Plan-as-Contract Enforcement

Measurement without enforcement turns into a dashboard nobody reads. The architecture must let the plan bind execution, not just suggest it.

Explicit deviation authorization. When the executor needs to deviate, force it through a single chokepoint that produces an audit record: what step it intended to skip or substitute, why, and what the new action will be. This is the same discipline as catching exceptions instead of swallowing them. Cheap to implement (one well-named function the executor must call), expensive to retrofit. Build it on day one or you will not build it.
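
A sketch of that chokepoint, reusing the illustrative `Action` shape from earlier; the function name and record fields are assumptions, not any framework's API.

```python
import logging

deviation_log = logging.getLogger("agent.deviations")


def authorize_deviation(task_id: str, skipped_step_id: str,
                        reason: str, substitute: Action) -> Action:
    """The one function the executor must call before deviating from the plan.

    Emits the audit record (which step is being skipped or substituted, why,
    and what will run instead) and stamps the substitute action so the
    divergence classifier counts it as authorized.
    """
    deviation_log.info(
        "task=%s skipped_step=%s reason=%s substitute=%s",
        task_id, skipped_step_id, reason, substitute.tool,
    )
    substitute.replan_reason = reason
    return substitute
```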

Bounded re-planning budgets. A common production failure is the agent that spirals: an action fails, the replanner fires, the new plan also fails, replanner fires again, and forty steps later the bill is enormous and the user got nothing. Cap replanning attempts per task. Three is a defensible default; the right number depends on your task distribution. When the cap is hit, hand off to a refusal path that returns a clear failure to the user instead of letting the agent try to look productive forever. This is an availability decision dressed up as a quality decision.
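
A sketch of the budget and refusal path, with hypothetical `planner` and `executor` interfaces standing in for your agent loop.

```python
from dataclasses import dataclass

MAX_REPLANS = 3  # defensible default; tune to your task distribution


@dataclass
class Refusal:
    task_id: str
    reason: str


def run_task(task, planner, executor):
    # planner.plan/.replan and executor.execute are assumed interfaces,
    # not any particular framework's API.
    plan = planner.plan(task)
    for attempt in range(MAX_REPLANS + 1):
        result = executor.execute(task, plan)
        if result.ok:
            return result
        if attempt < MAX_REPLANS:
            plan = planner.replan(task, plan, result.failure)
    # Budget exhausted: fail clearly instead of looking productive forever.
    return Refusal(task_id=task.id, reason="replan budget exhausted")
```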

Plan immutability windows. Once the planner has emitted a plan and the executor has begun, treat the plan as read-only for some bounded prefix. Replanning before that boundary is suspicious — it usually means the planner emitted a plan it didn't actually believe, or the executor is dodging steps it doesn't want to run. Either is worth investigating.
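
A minimal gate, assuming the executor tracks how many steps have run; the window size here is arbitrary, and it reuses the `deviation_log` from the authorization sketch.

```python
IMMUTABLE_PREFIX = 2  # plan is read-only until this many steps have run


def replan_permitted(steps_executed: int, reason: str) -> bool:
    """Reject, and surface, replans requested inside the immutability window."""
    if steps_executed < IMMUTABLE_PREFIX:
        deviation_log.warning(
            "replan inside immutability window after %d steps: %s",
            steps_executed, reason,
        )
        return False
    return True
```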

Re-plan diff visibility. When a replan happens, log the diff against the previous plan, not just the new plan in isolation. The diff is the actual signal. A new plan that drops the consent-check step and adds a fallback path is doing something completely different from one that swaps the order of two retrievals. Without the diff, every replan looks the same in the trace.
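
A sketch of the diff over the illustrative `PlanStep` shape from earlier; a production version would also diff inputs and step ordering.

```python
def plan_diff(old: list[PlanStep], new: list[PlanStep]) -> dict[str, list[str]]:
    """Log-friendly diff of two plans by step id and tool intent."""
    old_tools = {s.step_id: s.tool for s in old}
    new_tools = {s.step_id: s.tool for s in new}
    return {
        "dropped": [f"{sid}:{t}" for sid, t in old_tools.items()
                    if sid not in new_tools],
        "added": [f"{sid}:{t}" for sid, t in new_tools.items()
                  if sid not in old_tools],
        "retooled": [f"{sid}:{old_tools[sid]}->{t}"
                     for sid, t in new_tools.items()
                     if sid in old_tools and old_tools[sid] != t],
    }
```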

The Organizational Hazard

The deepest failure mode of plan-and-execute architectures is not technical. It is that plans are persuasive to humans in a way that traces are not.

A product manager reads the plan and sees intent. A compliance reviewer reads the plan and sees a control point. An executive reads the plan in a demo and signs off on the agent for production. Each of them is, in effect, validating the planner — which is the easy part of the system. None of them are validating the executor, which is where the work actually happens and where the failures actually originate.

The organizational fix is to make plan adherence a top-line number in the same dashboards where success rate lives. Not a separate eval doc. Not a tab in an observability tool nobody opens. The same dashboard. When success rate is 98% and adherence is 71%, the team needs to feel that as a contradiction worth resolving, not as two unrelated facts.

The audit-trail paradox in agentic AI is well-known by now: logs are not proof. A plan in a log is even less proof. Without an enforced contract between planner and executor, you have a system that is auditable in theory and unfalsifiable in practice — every action can be retroactively rationalized as "consistent with the plan if you read it generously."

Generous reading is what bugs depend on.

Treat Adherence Like an SLI

The shift required is small in code and large in posture. Stop treating the plan as a description of what the agent did. Start treating it as a hypothesis the executor either confirms or refutes, with metrics, alerts, and budgets attached to the gap. The plan-vs-action divergence metric belongs in the same tier as latency p99 and error rate. Bounded replanning budgets belong in the same tier as request timeouts. Deviation authorization belongs in the same tier as audit logging.

If you are building plan-and-execute agents in 2026 and you cannot answer the question "what percentage of our planned steps actually executed last week?" — the plan is doing brand work, not engineering work. Fix the contract, not the prose.
