The Agent Plan That Branched on a Fact Your Context Pruner Already Dropped
A long-running agent generates a plan at step 3. The plan reads something like: "if the order returned by get_order in step 1 has status shipped, send the customer a tracking email; otherwise open a refund ticket." The agent confidently picks the email branch. The customer never received a tracking number, because the order was actually in pending. You go to the trace expecting to find a hallucination. What you find is worse: the step-1 tool result is no longer in context. The pruner evicted it between step 2 and step 3 — it ranked low on recency and there was a 12KB transcript to make room for. The plan still ran. The branch was still chosen. The decision now points at evidence that does not exist.
This is not a model failure in the usual sense. The model produced a syntactically valid plan, executed it in order, and made a branch decision. The branch was made against a fact that used to be in context and is not anymore. The chain of thought encoded the condition (if status == "shipped"); the actual status got dropped on the way to the step that needed it. The plan looks deterministic, but it has been quietly cut loose from its evidence.
What makes this class of bug particularly bad is that it does not look like a bug. There is no exception. There is no schema violation. There is no hallucination in the obvious sense — the model is not inventing a status; it is recalling a status it saw earlier and acting on it. From the outside, the agent looks confident, fast, and wrong. Inside the trace, the evidence that justified the decision has been garbage-collected.
The Pruner Doesn't Know About the Plan
Context pruning, in most production setups today, is a stateless filter. It runs between agent turns. It looks at the message list, applies some policy — drop tool results older than N turns, keep messages under a token threshold, prefer recency, hold onto the system prompt — and returns a smaller message list. Anthropic's clear_tool_uses_20250919 strategy, for example, drops old tool results after a configurable token threshold and keeps only the N most recent. Other harnesses summarize old turns into a compact note, or selectively retain messages by an embedding-based relevance score.
All of these strategies share a property: they are oblivious to the plan. They do not know that step 3 of the plan reads "branch on the result of step 1." They cannot, because the plan is not a first-class object the pruner can inspect. The plan lives inside the model's reasoning tokens. The pruner lives outside, in the harness. There is no edge between "this plan step depends on this tool result" and "this tool result is a candidate for eviction."
The consequence is that the pruner is applying a stateless decision rule to a stateful planning trace. Anything the plan needs that the pruner does not happen to retain becomes invisible to the next step. The model, with admirable composure, fills in the missing fact from its in-context memory of having seen it earlier — except that memory is now a residue inside the reasoning trace rather than a fact it can verify. The branch fires anyway.
The Symptom Looks Like Hallucination
When this fails in production, the most natural place to start debugging is "the model hallucinated." That diagnosis is wrong, but understandably so. The symptom matches: the model asserted a fact (the order is shipped) that turned out to be false (it was pending), and acted on that assertion. Every external tool worked. The system prompt is unchanged. The model version is unchanged. By process of elimination, it must be the model.
The actual sequence is closer to a use-after-free. The fact was true when the model first observed it (or true enough — perhaps the order really was shipped at step 1 and changed between then and the branch, which is its own problem). The fact got referenced in the plan. The fact got pruned. The reference is dangling. The model, having no signal that the underlying memory is gone, dereferences it.
Production agent debugging tools have started to make this debuggable, but only just. Step-level tracing now captures every tool call, reasoning chain, and pruning event into a single timeline. If you have that, you can replay the pruning decision and see which tool results were dropped between step N and step N+1. What you typically still cannot see is which of those dropped results were needed by a future step, because nobody encoded that dependency.
The Chroma research on context rot — testing 18 frontier models across input lengths — shows that information accuracy follows a U-shaped curve: high at the start and end of context, 30%+ lower in the middle. Pruners that drop the middle are doing the right thing for that curve. But the middle is also where most plan-relevant tool results sit by the time the agent is six or eight steps in. The pruner's policy and the plan's needs disagree about which bytes are valuable, and the pruner wins because it runs first.
Why "Just Keep More Context" Doesn't Fix It
The first reflex is to raise the keep-count or push out the eviction threshold. Keep 20 tool results instead of 5. Use the million-token window your provider just shipped. Stop pruning. This works in the demo and falls over in production for two reasons.
First, context rot is real and cumulative. Chroma's data is unambiguous: every frontier model gets worse as input length increases, and the degradation starts well before the advertised context limit. A 1M-token window still rots at 50K tokens. Holding more context to avoid the eviction problem creates a different problem — the model becomes less reliable at finding the fact it needs even when the fact is right there.
Second, the cost math gets ugly fast. Long-running agents — coding agents are the obvious worst case — accumulate context exponentially as they search, read files, run tools, and backtrack. Coding agents at 35-minute task durations see success rates drop sharply, and doubling task duration quadruples the failure rate. The teams who solved this did not solve it by buying more tokens. They solved it by being more deliberate about which tokens stayed.
So pruning is not optional, and the pruner cannot just be told "keep everything." It has to know which things to keep. And right now, in most architectures, it has no way to know.
Plan Steps Need Dependency Edges to Their Evidence
The fix is structural. Treat the plan as a graph, not as a string. When the model generates a plan, capture which prior tool results, parameters, or facts each step depends on, and store those dependencies as first-class edges next to the plan. The pruner gets a new input: not just the message list, but a dependency manifest that says "tool result tc_42a is referenced by plan step 5, which has not yet executed; do not evict."
This is the same discipline that database query planners use to keep intermediate results alive while a query is executing — you do not garbage-collect a hash table that a later operator is going to probe against. It is the same discipline that distributed schedulers like Kubernetes use to pin pods to nodes they have declared affinity for. Agents have, so far, been written without the analog.
What the dependency manifest looks like in practice:
- A unique ID on every tool call result. Many SDKs already give you one.
- A planning step that, alongside the natural-language description, emits an explicit list of input IDs it will consume.
- A pruner that consults the manifest before evicting anything, treats any ID referenced by an unexecuted plan step as pinned, and only evicts results whose downstream dependencies have all completed.
This sounds heavyweight. In practice it is a small JSON object next to the plan, plus a five-line check in the pruner. The behavioral difference is large: plans no longer execute against vanished evidence, because the things plans depend on cannot vanish until the plan is done with them.
Validators at the Step Boundary
Pinning is necessary but not sufficient. Even with pinning, two things can still happen: a plan can be updated mid-execution to reference a result that has already been evicted, or a step can implicitly depend on context that was never registered as an input. Both of these will keep happening in real systems, because both the plan and the dependency manifest are LLM-generated and will sometimes be wrong.
The second piece is a per-step context validator. Before a plan step runs, the validator checks that every dependency the step claims to need is still present in context. If it is not, the step does not silently proceed. It either pauses to re-fetch the missing artifact, raises a structured error the orchestrator can catch, or triggers a re-planning pass with the surviving context as input.
This is the same pattern as a database transaction reading at a snapshot and noticing the snapshot is no longer available. The transaction does not pretend the data was the same as the last snapshot it remembers. It aborts and starts over, or escalates. The agent analog: a step that requires tc_42a and cannot find it must not improvise. It must either resurface the dependency or admit it cannot continue.
What you give up is a small amount of latency on edge cases — re-fetching a tool result that was pruned, or replanning on the surviving context. What you gain is that the agent stops executing plans whose justification is no longer in the room. Most teams I have seen instrument this discover that the rate of dropped-dependency events is low single digits per session — but the rate of bad outcomes attributable to those events is dramatically higher, because they bypass every other safeguard.
Coupling the Pruner and the Planner
The architectural realization underneath all of this is that context pruning and plan execution are not independent subsystems. They share a contract: the plan is allowed to reference any context that has not been evicted, and the pruner is not allowed to evict anything the plan is still going to reference. Most production agent frameworks ship these subsystems separately, owned by different parts of the harness, and the integration is "they both touch the message list."
When the integration is that loose, you get exactly the failure described at the top: a plan whose execution diverges from its reasoning, because the pruner did its job correctly under its own policy and the planner did its job correctly under its own assumptions, and the assumptions did not match. Nobody owns the edge case where they disagree.
The teams shipping reliable long-running agents are starting to treat this seam as load-bearing. Some build a single context manager that handles both planning state and pruning policy, so the two cannot drift. Some emit explicit plan-context dependency events into their trace format, so observability can flag dropped-dependency cases automatically. Some go further and treat the plan itself as part of the prunable context — with rules — so that if the plan references vanished evidence, the plan itself gets re-derived rather than re-executed against ghosts.
The common thread: stop treating the pruner as a generic context-shrinker and start treating it as a participant in plan execution. The pruner has to know what the plan wants. The plan has to declare what it needs. When the two agree, the agent stops branching on facts that are no longer there.
What to Audit on Monday
If your agent makes multi-step plans and prunes context between steps — almost every long-running agent does both — there is one audit that surfaces this problem quickly. Pick a sample of recent failed sessions where the failure was "the agent made a wrong decision but no error was raised." For each one, reconstruct the message list at the moment the wrong decision was made. Ask: was the evidence the plan implicitly cited still in the context window at that step?
If the answer is "no" more than a handful of times across your sample, you have a coupling problem between the planner and the pruner, and adding more guardrails to either one in isolation will not fix it. The next agent failure that looks like a hallucination is, more often than people want to admit, a context dependency that nobody declared and the pruner therefore ate. Decisions made against evidence that no longer exists will keep happening until the pruner stops being allowed to make that decision alone.
- https://www.morphllm.com/context-rot
- https://www.trychroma.com/research/context-rot
- https://platform.claude.com/cookbook/tool-use-context-engineering-context-engineering-tools
- https://conikeec.substack.com/p/context-editing-looks-like-a-feature
- https://machinelearningmastery.com/building-a-context-pruning-pipeline-for-long-running-agents/
- https://zoms.vercel.app/blog/implementing-dynamic-context-pruning-for-long-context-llm
- https://redis.io/blog/context-rot/
- https://www.braintrust.dev/articles/agent-observability-complete-guide-2026
- https://www.digitalapplied.com/blog/context-engineering-agent-reliability-playbook-2026
