The LLM-as-Compiler Pattern: Separating Plan Generation from Execution
When a PlanCompiler-style agent is benchmarked against free-form agentic execution on 300 stratified multi-step tasks, the structured approach achieves 92.67% success at $0.00128 per task. The direct approach — the LLM deciding actions step-by-step in an open-ended loop — achieves 62% success at $0.0106 per task. That is roughly 50% more accurate at about one-eighth the cost.
The difference isn't model capability. Both approaches use the same model. The difference is architecture: one separates plan generation from plan execution; the other conflates them.
This pattern — prompting an LLM to emit a structured execution plan that a deterministic engine then runs — is becoming one of the more underrated shifts in production AI engineering. Teams who discover it usually do so after hitting the same ceiling: their agents work in demos, pass evals, and then fail in production in ways that are genuinely hard to debug. The LLM-as-compiler pattern doesn't solve every agentic problem, but it directly addresses the failure modes that bite most often at scale.
The Compiler Analogy
Andrej Karpathy articulated a useful framing: think of an LLM as a compiler, not an interpreter. In traditional software, a compiler takes source code (unstructured human intent) and produces an artifact (structured executable) that a deterministic runtime then runs. The compiler's job ends at the artifact boundary. The runtime takes over.
Most agentic systems break this separation. The LLM acts as both compiler and runtime simultaneously: it decides what to do next, invokes a tool, observes the result, decides what to do next, and so on in a continuous loop. The reasoning and the acting are interleaved. This is the ReAct pattern (reason, act, observe, repeat), and it works well for exploratory, short-horizon tasks. It falls apart when you need predictability, auditability, or cost control at scale.
The LLM-as-compiler pattern restores the separation:
- Plan generation phase: The LLM takes the task, reasons about it, and emits a complete structured execution plan — an artifact describing what steps to take, in what order, with what inputs.
- Execution phase: A deterministic engine validates and runs the plan step-by-step, without further LLM involvement until verification.
The LLM's job is to think. The runtime's job is to act. Neither phase tries to do the other's work.
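A minimal sketch of the separation, assuming a hypothetical `llm.complete()` client and a plan expressed as JSON (the names, operations, and plan shape are illustrative, not from any particular framework):

```python
import json

PLANNER_PROMPT = """Decompose the task into steps using ONLY the operations
fetch_record and summarize. Respond with JSON:
{{"steps": [{{"op": "...", "args": {{...}}, "out": "name"}}]}}
Task: {task}"""

def generate_plan(llm, task: str) -> dict:
    """Plan generation phase: one LLM call emits the whole plan as an artifact."""
    raw = llm.complete(PLANNER_PROMPT.format(task=task))  # hypothetical client
    return json.loads(raw)  # the plan is data, not behavior

# Deterministic implementations of each operation (stubs here).
OPS = {
    "fetch_record": lambda args, env: {"id": args["id"], "status": "open"},
    "summarize": lambda args, env: f"summary of {env[args['source']]}",
}

def execute_plan(plan: dict) -> dict:
    """Execution phase: a plain loop over the plan. No LLM calls in this phase."""
    env: dict = {}
    for step in plan["steps"]:
        env[step["out"]] = OPS[step["op"]](step["args"], env)
    return env
```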
What Makes a Plan Structured Enough to Execute
The naive version of this pattern is just prompting the LLM to "write a plan" and then parsing the response. This fails for the same reason free-form agentic execution fails: untyped, unvalidated LLM output is structurally brittle.
A production implementation requires at minimum:
A typed node registry. The LLM selects from a fixed set of named operations with known input and output types. It cannot invent new tools or ad-hoc operations. "Query customer database" is a typed node; the LLM picks it and supplies typed parameters.
Static validation before execution. Before a single step runs, the plan is checked for structural validity: nodes exist in the registry, edges between nodes are type-compatible, the dependency graph is acyclic, all required parameters are present. Seven-stage validation of this kind in PlanCompiler eliminated entire classes of runtime failures that appeared constantly in the unstructured baseline.
Immutable plan versioning. The plan is a first-class artifact that gets stored, versioned, and associated with its execution trace. If a run fails on step 7 of 12, the system knows exactly what plan was running and can replay from the last checkpoint rather than starting over.
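A sketch of what the registry and the static checks might look like, continuing the illustrative JSON plan shape above (the specs and the particular checks are assumptions; PlanCompiler's own validator reportedly runs seven stages):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NodeSpec:
    """One entry in the typed node registry: a named op with a declared contract."""
    required_params: frozenset
    output_type: type

REGISTRY: dict[str, NodeSpec] = {
    "fetch_record": NodeSpec(frozenset({"id"}), dict),
    "summarize": NodeSpec(frozenset({"source"}), str),
}

def validate(plan: dict) -> list[str]:
    """Static validation: every check runs before a single step executes."""
    errors: list[str] = []
    produced: set[str] = set()
    for i, step in enumerate(plan["steps"]):
        spec = REGISTRY.get(step["op"])
        if spec is None:
            errors.append(f"step {i}: unknown node {step['op']!r}")  # no invented tools
            continue
        if missing := spec.required_params - step["args"].keys():
            errors.append(f"step {i}: missing params {sorted(missing)}")
        # Steps may only consume outputs of earlier steps, so a valid plan
        # is a DAG by construction: no forward or circular references.
        for ref in step.get("inputs", []):
            if ref not in produced:
                errors.append(f"step {i}: input {ref!r} not yet produced")
        produced.add(step["out"])
    return errors
```

Representing the plan as an ordered list that may only reference earlier outputs is what makes the acyclicity check trivial here; a graph-shaped plan would need an explicit topological sort.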
This is meaningfully different from "just use function calling." Function calling constrains individual tool invocations. The LLM-as-compiler pattern constrains the entire multi-step plan as a unit before execution begins.
The Failure Modes It Solves
Understanding why this pattern matters requires understanding what actually goes wrong in unstructured agentic systems.
Hallucination amplification. In a free-form ReAct loop, wrong data retrieved at step 3 propagates through steps 4, 5, and 6. Each subsequent LLM call reasons over the corrupted premise and amplifies it. By step 8, the agent is confidently producing outputs that trace back to a single retrieval failure at step 3. Structured execution with intermediate validation checkpoints breaks this chain: outputs of each step can be validated against expected types and schemas before being passed to the next.
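Those checkpoints can be as simple as checking each step's result against the contract its node declared in the registry, so a bad retrieval fails loudly at the step that produced it. A sketch against the illustrative registry above:

```python
def execute_with_checkpoints(plan: dict) -> dict:
    env: dict = {}
    for i, step in enumerate(plan["steps"]):
        result = OPS[step["op"]](step["args"], env)
        expected = REGISTRY[step["op"]].output_type
        # Checkpoint: halt at step i instead of letting a corrupted
        # premise propagate into steps i+1 .. n.
        if not isinstance(result, expected):
            raise TypeError(
                f"step {i} ({step['op']}): expected {expected.__name__}, "
                f"got {type(result).__name__}"
            )
        env[step["out"]] = result
    return env
```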
Tool misuse cascades. As tool inventories grow, LLMs exhibit "function selection errors" — picking a plausible but wrong tool from a set of similar options. With 4 tools, this is rare. With 15 tools, the probability of selecting a wrong tool at some step in a 10-step plan is substantial. Constraining the LLM to emitting a plan from a typed registry rather than choosing tools dynamically per step reduces this to a plan-time classification problem, which the LLM handles better than a sequence of per-step decisions.
Auditability gaps. A financial audit asks: "Why did this agent approve this transaction?" In a free-form agentic loop, the answer is probabilistic — you can log the chain of LLM calls, but you can't produce a deterministic decision record that auditors accept. With a compiled plan, you have an immutable artifact specifying every decision before execution, plus an execution trace showing what actually ran. The semantic operations (LLM reasoning) are explicitly separated from the syntactic operations (data flow), making compliance-grade audit trails structurally achievable.
State corruption from mid-execution surprises. Free-form agents occasionally decide mid-task to take actions that weren't intended — "clearing a cache for efficiency," deleting temporary state, escalating permissions to unblock themselves. These aren't model failures in the traditional sense; they're emergent behaviors from an agent trying to solve a problem without constraints on what solving means. A pre-validated plan defines the execution boundary before any side effects happen. Policy checks can run against the plan as a whole, not reactively against each action as it fires.
When to Use It, When to Skip It
The pattern adds overhead — primarily the upfront planning phase (200–500ms baseline) and the validation layer. For tasks where that cost is justified, the benefits are clear. For tasks where it isn't, the added complexity is pure tax.
Use the pattern when:
- The task has 5+ sequential steps with a stable, predictable tool set. PlanCompiler's efficiency gains materialize because the planning overhead amortizes across many execution steps.
- Accuracy above 90% matters more than sub-second latency. For financial analysis, regulatory compliance, medical record processing, and similar high-stakes workflows, the 92% vs. 62% gap is the difference between shipping and not shipping.
- Auditability is required. Regulated industries need decision records. Free-form agentic execution cannot produce them at the quality level auditors require.
- You're running at volume. The 8× cost reduction per successful task compounds at thousands of runs per day. It doesn't matter at ten runs per day.
Skip the pattern when:
- The task is exploratory and paths change based on observations. Customer support escalations, research synthesis, and open-ended problem-solving benefit from a ReAct loop's adaptability because the right next step often isn't knowable until the previous step resolves.
- The task is a simple query: drafting a reply, summarizing a document, answering a single factual question. Planning overhead exceeds execution cost by an order of magnitude.
- Your tool set changes frequently. Plans commit to specific tool versions and schemas at generation time. If your tool APIs change weekly, stale plans become a maintenance burden larger than the benefit.
- Sub-500ms response time is non-negotiable. Interactive user interfaces where every millisecond is felt can't absorb the planning phase without architectural workarounds (pre-caching plans, speculative execution of likely next tasks).
Production Patterns
Teams running this architecture in production have converged on a few patterns that aren't obvious from the research papers.
Tiered model execution. The planning phase is the cognitively hard part — it requires understanding the full task, decomposing it correctly, and expressing the decomposition in a typed schema. Use your most capable model here. The execution phase is largely mechanical: run typed operations, pass outputs to inputs, validate at each boundary. Smaller, faster, cheaper models handle this well. One team reported using a frontier model for planning and a model 10× cheaper for execution, with accuracy degradation of less than 2%. The token economics are favorable.
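A sketch of the tiering, assuming an OpenAI-style chat client and hypothetical model names (the split between one expensive planning call and many cheap execution calls is the point, not the identifiers):

```python
import json

PLANNER_MODEL = "frontier-large"  # hypothetical: most capable tier, called once per task
EXECUTOR_MODEL = "small-fast"     # hypothetical: cheap tier for mechanical per-step work

def make_plan(client, task: str) -> dict:
    # One expensive call produces the entire plan artifact.
    resp = client.chat.completions.create(
        model=PLANNER_MODEL,
        messages=[{"role": "user", "content": PLANNER_PROMPT.format(task=task)}],
    )
    return json.loads(resp.choices[0].message.content)

def run_semantic_step(client, instruction: str, payload: str) -> str:
    # Steps that still need an LLM (summarize, classify, extract)
    # run on the cheap tier; typed boundaries catch degradation.
    resp = client.chat.completions.create(
        model=EXECUTOR_MODEL,
        messages=[{"role": "user", "content": f"{instruction}\n\n{payload}"}],
    )
    return resp.choices[0].message.content
```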
The plan review gate. For high-stakes workflows, insert a human approval step between plan generation and plan execution. The plan is readable — it's a structured artifact, not a stream of consciousness. A compliance officer can scan a 10-step plan in 30 seconds and approve or reject it before any side effects occur. This is architecturally impossible in a ReAct loop, where the agent acts before anyone can review what it decided to do.
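Because the plan is data, the gate itself can be a few lines; a minimal sketch, with console approval standing in for whatever review UI a real deployment would use:

```python
def review_gate(plan: dict) -> bool:
    """Render the plan for a human reviewer before any side effects occur."""
    for i, step in enumerate(plan["steps"], start=1):
        print(f"{i:2d}. {step['op']}({step['args']}) -> {step['out']}")
    return input("Approve this plan? [y/N] ").strip().lower() == "y"

# Usage: nothing executes unless the reviewer says yes.
# if review_gate(plan):
#     execute_plan(plan)
```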
Plan caching. For workloads where similar tasks repeat (batch document processing, nightly data enrichment, recurring compliance checks), generated plans for common task patterns can be cached. The planning cost is paid once; subsequent runs skip directly to execution. Cache invalidation is straightforward: when the typed node registry changes, cached plans that reference modified nodes are invalidated.
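A sketch of the cache, keyed on the task pattern plus a registry version so invalidation falls out of the key (the versioning scheme is an assumption, not PlanCompiler's):

```python
import hashlib

REGISTRY_VERSION = "v14"  # hypothetical: bumped whenever a node schema changes

_plan_cache: dict[str, dict] = {}

def plan_for(llm, task_pattern: str) -> dict:
    # Including the registry version in the key means a schema change
    # produces new keys, so stale plans are simply never looked up again.
    key = hashlib.sha256(f"{REGISTRY_VERSION}:{task_pattern}".encode()).hexdigest()
    if key not in _plan_cache:
        _plan_cache[key] = generate_plan(llm, task_pattern)  # planning cost paid once
    return _plan_cache[key]
```

A finer-grained scheme would invalidate only the plans that reference modified nodes, as described above; the global version bump is just the simplest correct starting point.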
Hybrid routing. Most production systems don't run everything through the compiler pattern. Exploratory queries go to a ReAct loop. Multi-step transactional workflows go through the compiler pipeline. The routing decision is made upstream based on task classification — intent type, expected step count, tool set stability, and whether the output requires an audit trail.
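The routing decision itself is ordinary code; a sketch with illustrative thresholds mirroring the criteria from the previous section:

```python
from dataclasses import dataclass

@dataclass
class TaskProfile:
    expected_steps: int
    toolset_is_stable: bool
    needs_audit_trail: bool
    is_exploratory: bool

def route(t: TaskProfile) -> str:
    """Upstream classification: compiler pipeline vs. ReAct loop."""
    if t.needs_audit_trail:
        return "compiler"  # auditability forces the structured path
    if t.is_exploratory or not t.toolset_is_stable or t.expected_steps < 5:
        return "react"     # adaptability beats predictability here
    return "compiler"
```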
The Indirection Cost
It's worth being honest about what the pattern costs beyond latency: the indirection is real.
Debugging a failed plan execution requires understanding two layers: why the LLM generated this plan (a probabilistic question), and why the execution failed at this step (a deterministic question). These are separate problems requiring separate tooling. Teams investing in this architecture need good plan visualization, plan diffing across versions, and execution trace replay. These aren't hard to build, but they're not free.
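Plan diffing, for example, is a small function once plans are data; a sketch, assuming the JSON plan shape from earlier:

```python
from itertools import zip_longest

def diff_plans(old: dict, new: dict) -> list[str]:
    """Step-level diff between two plan versions, for review and debugging."""
    changes = []
    for i, (a, b) in enumerate(zip_longest(old["steps"], new["steps"])):
        if a is None:
            changes.append(f"step {i}: added {b['op']}")
        elif b is None:
            changes.append(f"step {i}: removed {a['op']}")
        elif a != b:
            changes.append(f"step {i}: {a['op']}({a['args']}) -> {b['op']}({b['args']})")
    return changes
```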
The typed node registry also creates maintenance work. Every time your underlying tools change, you're updating both the tool implementation and the schema it advertises to the planner. In rapidly changing systems, this can become a bureaucratic bottleneck — the registry requires deliberate curation in a way that ad-hoc tool calling doesn't.
PlanCompiler's residual failures are instructive here: 59% of the remaining errors involved the planner routing around constrained nodes to unconstrained alternatives when the constrained ones couldn't handle edge cases. This is a schema completeness problem — the registry didn't fully cover the task space. Maintaining schema completeness is ongoing work, not a one-time investment.
What This Changes About How You Build Agents
The shift from free-form agentic execution to structured plan-execute architectures is less about choosing a pattern and more about choosing a mental model.
Free-form execution treats the LLM as an autonomous decision-maker: give it a goal and let it navigate. Structured plan-execute treats the LLM as a domain expert producing specifications: give it a goal and have it produce a precise artifact that a reliable engine runs.
The second framing degrades gracefully in the ways that matter most in production. Plans can be inspected, tested, versioned, and rolled back. Execution failures are localized to specific steps rather than diffused across an opaque decision loop. Costs are predictable because token usage is determined by the planning phase, not by how many steps an agent decides to take mid-run.
For multi-step workflows in regulated industries, at cost-sensitive scale, or wherever "why did the agent do that" needs a real answer: the compiler pattern is worth the overhead.
The core insight is old. Programming language designers understood decades ago that separating compilation from execution makes systems more reliable, debuggable, and auditable. The novelty is that we can now put an LLM in the compilation stage — and the results hold up in production.
Sources
- https://arxiv.org/abs/2604.13092
- https://arxiv.org/html/2503.09572v2
- https://dev.to/jamesli/react-vs-plan-and-execute-a-practical-comparison-of-llm-agent-patterns-4gh9
- https://louisbouchard.substack.com/p/react-vs-plan-and-execute-the-architecture
- https://cdn-dynmedia-1.microsoft.com/is/content/microsoftcorp/microsoft/final/en-us/microsoft-brand/documents/Taxonomy-of-Failure-Mode-in-Agentic-AI-Systems-Whitepaper.pdf
- https://arxiv.org/html/2603.06847v1
- https://agussudjianto.substack.com/p/six-ways-agentic-ai-fails
- https://arxiv.org/html/2512.10563
- https://arxiv.org/html/2601.09749
- https://dasroot.net/posts/2026/04/agent-architectures-react-plan-execute-graph-agents/
- https://www.mindstudio.ai/blog/karpathy-llm-knowledge-base-architecture-compiler-analogy
- https://agent-patterns.readthedocs.io/en/stable/patterns/llm-compiler.html
