
Six Context Engineering Techniques That Make Manus Work in Production

11 min read
Tian Pan
Software Engineer

The Manus team rebuilt their agent framework four times in less than a year. Not because of model changes — the underlying LLMs improved steadily. They rebuilt because they kept discovering better ways to shape what goes into the context window.

They called this process "Stochastic Graduate Descent": manual architecture searching, prompt fiddling, and empirical guesswork. Honest language for what building production agents actually looks like. After millions of real user sessions, they've settled on six concrete techniques that determine whether a long-horizon agent succeeds or spirals into incoherence.

The unifying insight is simple to state and hard to internalize: "Context engineering is the delicate art and science of filling the context window with just the right information for the next step." A typical Manus task runs ~50 tool calls with a 100:1 input-to-output token ratio. At that scale, what you put in the context — and how you put it there — determines everything.

KV-Cache Hit Rate Is the Real North Star Metric

Most teams building agents track accuracy, latency, and task completion rate. Manus tracks something most teams ignore until they see their billing: KV-cache hit rate.

The economics are stark. With Claude Sonnet, cached input tokens cost $0.30 per million; uncached tokens cost $3.00. That's a 10x cost difference. When your agent sessions involve roughly a hundred input tokens for every output token, cache misses don't just hurt — they make the product economically unviable at scale.

This led Manus to a set of engineering rules that sound trivial until you violate them:

  • Never put a timestamp precise to the second in your system prompt. A single character change invalidates the entire prefix cache.
  • Keep tool definitions stable across requests. If you add, remove, or reorder tools dynamically, every request pays full price.
  • Use append-only context structures. Any in-place modification to prior actions or observations breaks cache for everything after that point.
  • Ensure deterministic JSON key ordering. Many languages serialize maps in random order by default. This is a silent cache killer.

The broader implication: any technique that introduces dynamic variation into the stable prefix of your context is a hidden cost center. Freshened conversation summaries, state-injected system prompts, per-request prompt templating — all of these may be improving output quality while destroying cache efficiency. You have to measure both.
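The rules above can be sketched in code. This is a minimal illustration (not Manus's actual implementation) of a cache-friendly context builder: a byte-stable prefix, deterministic JSON serialization, and append-only history. All names here are hypothetical.

```python
import json

# Stable prefix: identical bytes on every request. No timestamps,
# no per-request templating — any change invalidates the cached prefix.
SYSTEM_PROMPT = "You are an agent. Use the tools below to complete tasks."

# Tool definitions never added, removed, or reordered between requests.
TOOLS = [
    {"name": "browser_open", "description": "Open a URL"},
    {"name": "file_read", "description": "Read a file"},
]

def serialize(obj):
    # Deterministic JSON: sorted keys, fixed separators. Languages that
    # serialize maps in arbitrary order are silent cache killers.
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))

def build_context(history):
    # Append-only: prior turns are never edited in place, so every
    # request shares the longest possible byte-identical prefix.
    parts = [SYSTEM_PROMPT, serialize(TOOLS)]
    parts.extend(serialize(turn) for turn in history)
    return "\n".join(parts)
```

The payoff is a structural guarantee: extending the history only ever appends bytes, so `build_context(history)` is always a literal prefix of `build_context(history + [next_turn])` — exactly what the KV cache needs.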

Mask Tools with Logit Control, Not by Removing Them

A natural way to manage tool availability in a stateful agent is to remove tools from the tool list when they're not applicable. If the agent is in a "writing" state, why show it filesystem tools? Less noise, fewer tokens, cleaner prompts.

This is the wrong approach. Removing tools from the tool list changes the stable prefix and kills your cache. Worse, if previous observations in the context reference a tool that's no longer defined, the model encounters a contradiction — it has evidence of using a tool that doesn't exist, which degrades reasoning quality.

Manus uses logit masking during decoding instead. Tool definitions stay constant in the prompt. At inference time, the decoding process masks out invalid tool tokens based on the current state. The model never sees conflicting signals between its context history and its available options.

They implement this with three function-calling modes. In "auto" mode, the model decides whether to call any function. In "required" mode, the model must make a function call but can choose any tool. In "specified" mode, the context is prefilled with a function name prefix, constraining the model to that tool family. Manus names tool families with consistent prefixes (browser_*, shell_*, file_*) specifically so that constraining a tool group requires only a single prefix mask rather than tracking individual tool states.
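A toy sketch of the prefix-mask idea — my illustration, not Manus's code. Tool definitions stay fixed; state only determines which tool-name logits survive decoding, and consistent naming (`browser_*`, `shell_*`, `file_*`) means a state maps to a prefix tuple rather than a per-tool allowlist:

```python
import math

# Tool definitions are constant in the prompt on every request.
TOOLS = ["browser_open", "browser_click", "shell_exec", "file_read", "file_write"]

# Hypothetical state machine: each state allows one or more name prefixes.
STATE_ALLOWED_PREFIXES = {
    "browsing": ("browser_",),
    "writing":  ("file_",),
    "any":      ("browser_", "shell_", "file_"),
}

def mask_logits(logits, state):
    # Invalid tools are suppressed at decode time (logit -> -inf)
    # instead of being removed from the prompt, so the context the
    # model sees never contradicts its own history.
    prefixes = STATE_ALLOWED_PREFIXES[state]
    return [
        logit if tool.startswith(prefixes) else -math.inf
        for tool, logit in zip(TOOLS, logits)
    ]

logits = [1.0, 0.5, 2.0, 0.2, 0.8]
masked = mask_logits(logits, "writing")
chosen = TOOLS[max(range(len(TOOLS)), key=lambda i: masked[i])]
# shell_exec had the highest raw logit, but only file_* tools are selectable
```

In "specified" mode this degenerates to prefilling the function-name prefix directly; the mask above corresponds to the "required"-with-constraints case.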

The tool count matters too. More tools reduce effectiveness. Tool descriptions consume tokens, and overlapping tool semantics create model confusion about which tool to use when. Manus keeps fewer than 20 atomic core functions. The intuition: a small, well-separated toolkit outperforms a comprehensive one with redundant entries.

Use the File System as Infinite External Context

Long-horizon tasks face a ceiling: tool outputs — web pages, code execution results, document contents — can be enormous. Even if they don't overflow the context window immediately, stuffing everything in degrades reasoning quality. Transformers exhibit "context rot" in very long windows, where information in the middle receives less attention and effectively disappears.

The standard solution is RAG: embed chunks, retrieve relevant ones, inject summaries. Manus rejects this as unnecessarily complex. Their answer: write outputs to the file system and let the agent retrieve them with basic Unix utilities — glob, grep, cat.

"The file system is unlimited in size, persistent by nature, and directly operable by the agent itself." No embedding models, no vector indices, no retrieval pipelines to tune and debug.

The critical constraint is what Manus calls restorable compression. When you compress context, you're only allowed to remove information if the agent can reconstruct it later. Specific rules:

  • Drop web page content after summarizing, but preserve the URL — the agent can re-fetch.
  • Omit document contents from context, but preserve the file path — the agent can re-read.
  • Never discard information with no path back to it.

This is the key distinction from lossy summarization. Summarization makes an irreversible decision about what's important. Restorable compression defers that decision to the agent at the moment it needs the information. The agent that forgot why it wrote a file will almost certainly be worse than the agent that can go look.

The practical upshot: for large tool outputs, default to writing them to disk and giving the agent a path. Keep in context only what the next tool call actually needs.
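As a sketch of restorable compression — hypothetical names, assuming a local workspace directory — large outputs go to disk and the context keeps only a path (the way back) plus a short preview:

```python
import hashlib
import tempfile
from pathlib import Path

WORKDIR = Path(tempfile.mkdtemp())  # stand-in for the agent's workspace

def compress_observation(tool, output, max_inline=200):
    # Small outputs stay inline; large ones are written to disk.
    # The rule: never drop information without a path back to it.
    if len(output) <= max_inline:
        return {"tool": tool, "output": output}
    digest = hashlib.sha1(output.encode()).hexdigest()[:8]
    path = WORKDIR / f"{tool}_{digest}.txt"
    path.write_text(output)
    return {"tool": tool, "path": str(path), "preview": output[:max_inline]}

obs = compress_observation("browser_open", "x" * 10_000)
# Later, if the agent decides it needs the full content after all,
# it can restore it with a single read — the decision was deferred.
restored = Path(obs["path"]).read_text()
```

The same shape applies to web pages (keep the URL, re-fetch) — the storage differs, the invariant doesn't.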

Force Attention Back to the Goal with Task Recitation

The "lost-in-the-middle" failure is well-documented in transformer research: information near the beginning and end of a context window receives more attention than information in the middle. For a 50-step agent task, early goals and instructions can effectively become invisible to the model by step 30.

Manus exploits this attention geometry intentionally. Their agents maintain a todo.md that's rewritten at the end of context at every step, pushing current task objectives into the model's high-attention recency zone. As the context grows, the global plan stays in view by floating to the end.

This is a form of continuous goal recitation. Rather than trusting the model to remember a task objective specified at step 0, the agent keeps re-asserting it as the most recent context entry. The implementation is simple enough that it's easy to dismiss, but the impact on goal coherence over long sessions is substantial.
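The mechanism is small enough to show in full. A minimal sketch (my own, with hypothetical names): the todo list is not stored in history but re-rendered as the final entry on every step, so it always occupies the recency zone.

```python
def compose_context(system_prompt, history, todo_md):
    # todo.md is regenerated as the LAST entry on every step, pushing
    # the global plan into the model's high-attention recency zone.
    # It lives outside the append-only history, so rewriting it does
    # not break the cached prefix of earlier turns.
    return "\n\n".join([system_prompt, *history, f"## todo.md\n{todo_md}"])

history = ["step 1: searched for flights", "step 2: opened booking site"]
todo = "- [x] find flights\n- [ ] book cheapest option\n- [ ] email itinerary"
prompt = compose_context("You are a travel agent.", history, todo)
```

No matter how long `history` grows, the plan is always the most recent thing the model reads before acting.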

The architecture evolved. Early Manus had the main executor agent manage todo.md directly, which wasted roughly one-third of all actions on task-list updates — actions that weren't progressing the task, just maintaining bookkeeping. The fix was a dedicated planner sub-agent that handles all task decomposition and list management, leaving the executor focused entirely on taking actions.

Preserve Errors in Context, Don't Sanitize Them

There's an intuitive but wrong instinct to clean up agent context by removing failed actions. The model tried something that didn't work, the observation was an error — why carry that forward? Replace it with a summary or just drop it.

Manus leaves the wrong turns in. Failed actions, error messages, stack traces — these stay in context.

The reasoning is mechanistic. The model needs to see what it tried and what the response was to update its implicit beliefs about what works. Error recovery — diagnosing a failure, adjusting the approach, trying something different — requires the evidence of failure. If you remove errors from context, you remove the model's ability to reason about why the current situation exists and what the failed path was.
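In code, the rule amounts to one decision in the tool-execution wrapper — a hedged sketch, not Manus's implementation:

```python
def run_tool(context, tool, args):
    # The tempting "clean" version swallows the exception and retries
    # silently. This version records the failure as a first-class
    # observation: the model needs the evidence to update its beliefs
    # and avoid re-walking the same failed path.
    try:
        result = tool(**args)
        observation = {"status": "ok", "result": result}
    except Exception as exc:
        observation = {"status": "error",
                       "error": f"{type(exc).__name__}: {exc}"}
    context.append(observation)  # appended either way, never sanitized
    return observation
```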

The Manus team describes error recovery as "one of the clearest indicators of true agentic behavior" and notes it's consistently underrepresented in academic benchmarks, which tend to test agents on clean, well-specified tasks with no injected failures.

This has a sharp evaluation implication: if your benchmark doesn't include failure conditions and recovery scenarios, your performance numbers may significantly overstate how your agent will actually behave in production. The agents that score well on clean benchmarks and fall apart on messy real tasks often suffer from over-sanitized context pipelines.

Break Rhythmic Patterns with Controlled Observation Variation

Models are "excellent mimics." When an agent processes 50 structurally similar observations — 50 resume evaluations, 50 product pages, 50 API responses — the uniform action-observation rhythm induces pattern overgeneralization. The model starts treating each observation as a template instance rather than reasoning about it independently.

Manus counteracts this by introducing small amounts of structured variation in how observations are serialized: different templates, alternate phrasing, minor differences in ordering or formatting. The goal is to disrupt the rhythm enough that the model maintains independent attention to each observation rather than extrapolating from the pattern.

This advice is in direct tension with the KV-cache advice. The cache is built on stability; this technique requires variation. Manus navigates the conflict by being stable where caching matters — system prompt prefix, tool definitions, structural invariants — and variable where attention matters — the observation content that changes with every tool call anyway.

The takeaway is not "be inconsistent everywhere." It's that you should understand which parts of your context are cache-relevant (they need to be frozen) and which parts the model processes dynamically (some variation there can be beneficial). Most teams treat both the same.

Isolation Is Why Multi-Agent Architectures Actually Work

The popular narrative around multi-agent systems emphasizes specialization: a researcher agent, a coder agent, a reviewer agent, each with their persona and domain expertise. Manus's production experience points to a different benefit as primary: context isolation.

Sub-agents work because they get a clean context window, not because they embody a different role. A sub-agent given focused instructions with minimal prior context outperforms a single agent that carries 50 steps of accumulated state while also reasoning about sub-task details.

For simple tasks, Manus has the planner pass only instructions via function call — the sub-agent gets minimal context and runs with clean isolation. For complex tasks requiring access to prior work, the planner shares trajectory context plus filesystem access, but the sub-agent still maintains its own tool definitions and system instructions. Results come back through a submit results tool with schema enforcement via constrained decoding, ensuring structured handoffs.
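A schematic of that handoff, with invented names throughout and a plain validation check standing in for true constrained decoding:

```python
SUBAGENT_SYSTEM_PROMPT = "You are a focused sub-agent. Complete one task."

def validate_submit_result(raw):
    # Stand-in for schema enforcement via constrained decoding:
    # the result returned to the planner is always structured.
    required = {"status", "summary"}
    missing = required - raw.keys()
    if missing:
        raise ValueError(f"submit_results missing fields: {missing}")
    return raw

def run_subagent(model, instructions, shared_files=None):
    # Clean context window: the sub-agent sees only its focused
    # instructions (plus optional file paths for complex tasks),
    # never the planner's 50 steps of accumulated state.
    context = [SUBAGENT_SYSTEM_PROMPT, instructions]
    if shared_files:
        context.append(f"Relevant files: {', '.join(shared_files)}")
    return validate_submit_result(model(context))
```

The planner calls `run_subagent` with a model chosen per task type; because the sub-agent's context is built from scratch, swapping models requires no context surgery.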

Model routing follows task type rather than agent persona: Claude for coding, Gemini for multimodal, OpenAI models for math-heavy reasoning. The multi-agent architecture makes this routing straightforward without contaminating any agent's context with irrelevant capability signals.

The test Manus applies to their harness design: swap in a stronger model. If agent performance doesn't improve, the harness is the bottleneck. Harness-constrained performance is the failure mode where the engineering around the model is suppressing what the model can actually do. This is a concrete quality test that most teams don't run explicitly but should.

What This Adds Up To

Manus's six techniques aren't independent heuristics. They form a coherent model of what a context window actually is during a long-horizon task.

The context isn't a transcript. It's a carefully managed working memory where what you include, what you compress, what you repeat, and what you preserve determines whether the agent can do its job at step 50 the same way it did at step 1. Treating context as a passive log of what happened is how you build agents that work in demos and degrade in production.

The economic dimension makes this non-optional. At 100:1 input-to-output ratios and 10x price differentials between cached and uncached tokens, context engineering isn't a quality improvement — it's a cost control measure. Teams that treat context as free will price themselves out at scale before they realize the bottleneck isn't the model.

The design philosophy that unifies all of it: "If model progress is the rising tide, we want to be the boat, not the pillar stuck to the seabed." Build the harness to float. Don't let your architecture constrain improvements you're about to get for free.
