Skip to main content

278 posts tagged with "ai-agents"

View all tags

Six Context Engineering Techniques That Make Manus Work in Production

· 11 min read
Tian Pan
Software Engineer

The Manus team rebuilt their agent framework four times in less than a year. Not because of model changes — the underlying LLMs improved steadily. They rebuilt because they kept discovering better ways to shape what goes into the context window.

They called this process "Stochastic Graduate Descent": manual architecture searching, prompt fiddling, and empirical guesswork. Honest language for what building production agents actually looks like. After millions of real user sessions, they've settled on six concrete techniques that determine whether a long-horizon agent succeeds or spirals into incoherence.

The unifying insight is simple to state and hard to internalize: "Context engineering is the delicate art and science of filling the context window with just the right information for the next step." A typical Manus task runs ~50 tool calls with a 100:1 input-to-output token ratio. At that scale, what you put in the context — and how you put it there — determines everything.

The Action Space Problem: Why Giving Your AI Agent More Tools Makes It Worse

· 9 min read
Tian Pan
Software Engineer

There's a counterintuitive failure mode that most teams encounter when scaling AI agents: the more capable you make the agent's toolset, the worse it performs. You add tools to handle more cases. Accuracy drops. You add better tools. It gets slower and starts picking the wrong ones. You add orchestration to manage the tool selection. Now you've rebuilt complexity on top of the original complexity, and the thing barely works.

The instinct to add is wrong. The performance gains in production agents come from removing things.

Four Strategies for Engineering Agent Context That Actually Scales

· 8 min read
Tian Pan
Software Engineer

There's a failure mode in production agents that most engineers discover the hard way: your agent works well on the first few steps, then starts hallucinating halfway through a task, misses details it was explicitly given at the start, or issues a tool call that contradicts instructions it received twenty steps ago. The model didn't change. The task didn't get harder. The context did.

Long-running agents accumulate history the way browser tabs accumulate memory — silently, relentlessly, until something breaks. Every tool response, observation, and intermediate reasoning trace gets appended to the window. The model sees all of it, which means it has to reason through all of it on every subsequent step. As context grows, precision drops, reasoning weakens, and the model misses information it should catch. This is context rot, and it's one of the most common failure modes in production agents.

Context Engineering: Memory, Compaction, and Tool Clearing for Production Agents

· 10 min read
Tian Pan
Software Engineer

Most production AI agent failures don't happen because the model ran out of context. They happen because the model drifted long before it hit the limit. Forrester has named "agent drift" the silent killer of AI-accelerated development — and Forrester research from 2025 shows that nearly 65% of enterprise AI failures trace back to context drift or memory loss during multi-step reasoning, not raw token exhaustion.

The distinction matters. A hard context limit is clean: the API rejects the request, the agent stops, you get an error you can handle. Context rot is insidious: the model keeps running, keeps generating output, but performance quietly degrades. GPT-4's accuracy drops from 98.1% to 64.1% based solely on where in the context window information is positioned. You don't get an error signal — you get subtly wrong answers.

This post covers the three primary tools for managing context in production agents — compaction, tool-result clearing, and external memory — along with the practical strategies for applying them before your agent drifts.

CLAUDE.md and AGENTS.md: The Configuration Layer That Makes AI Coding Agents Actually Follow Your Rules

· 9 min read
Tian Pan
Software Engineer

Your AI coding agent doesn't remember yesterday. Every session starts cold — it doesn't know you use yarn not npm, that you avoid any types, or that the src/generated/ directory is sacred and should never be edited by hand. So it generates code with the wrong package manager, introduces any where you've banned it, and occasionally overwrites generated files you'll spend an hour recovering. You correct it. Tomorrow it makes the same mistake. You correct it again.

This is not a model quality problem. It's a configuration problem — and the fix is a plain Markdown file.

CLAUDE.md, AGENTS.md, and their tool-specific cousins are the briefing documents AI coding agents read before every session. They encode what the agent would otherwise have to rediscover or be corrected on: which commands to run, which patterns to avoid, how your team's workflow is structured, and which directories are off-limits. They're the equivalent of a thorough engineering onboarding document, compressed into a form optimized for machine consumption.

Building AI Agents That Actually Work in Production

· 10 min read
Tian Pan
Software Engineer

Most teams building AI agents make the same mistake: they architect for sophistication before they have evidence that sophistication is needed. A production analysis of 47 agent deployments found that 68% would have achieved equivalent or better outcomes with a well-designed single-agent system. The multi-agent tax — higher latency, compounding failure modes, operational complexity — often eats the gains before they reach users.

This isn't an argument against agents. It's an argument for building them the same way you'd build any serious production system: start with the simplest thing that works, instrument everything, and add complexity only when the simpler version demonstrably fails.

Effective Context Engineering for AI Agents

· 11 min read
Tian Pan
Software Engineer

Nearly 65% of enterprise AI failures in 2025 traced back to context drift or memory loss during multi-step reasoning — not model capability issues. If your agent is making poor decisions or losing coherence across a long task, the most likely cause is not the model. It is what is sitting in the context window.

The term "context engineering" is proliferating fast, but the underlying discipline is concrete: active, deliberate management of what enters and exits the LLM's context window at every inference step in an agent's trajectory. Not a prompt. A dynamic information architecture that the engineer designs and the agent traverses. The context window functions as RAM — finite, expensive, and subject to thrashing if you don't manage it deliberately.

Mastering AI Agent Observability: Why Your Dashboards Are Lying to You

· 9 min read
Tian Pan
Software Engineer

Your agent is returning HTTP 200s. Latency is within SLA. Error rates are flat. Everything on the dashboard looks green — and your users are getting confidently wrong answers.

This is the core observability gap in AI systems: the metrics that traditionally signal system health are almost entirely irrelevant to whether your agent is actually doing its job. An agent can fluently hallucinate, skip required tools, use stale retrieval results, or reason itself into logical contradictions — all while your monitoring shows zero anomalies. The standard playbook for service observability doesn't transfer to agentic systems, and teams that don't understand this gap ship agents they can't trust, debug, or improve.

The 80% Problem: Why AI Coding Agents Stall and How to Break Through

· 10 min read
Tian Pan
Software Engineer

A team ships 98% more pull requests after adopting AI coding agents. Sounds like a success story — until you notice that review times grew 91% and PR sizes ballooned 154%. The code was arriving faster than anyone could verify it.

This is the 80% problem. AI coding agents are remarkably good at generating plausible-looking code. They stall, or quietly fail, when the remaining 20% requires architectural judgment, edge case awareness, or any feedback loop more sophisticated than "did it compile?" The teams winning with coding agents aren't the ones who prompted most aggressively. They're the ones who built better feedback loops, shorter context windows, and more deliberate workflows.

Systematic Debugging for AI Agents: From Guesswork to Root Cause

· 9 min read
Tian Pan
Software Engineer

When an AI agent fails in production, you rarely know exactly when it went wrong. You see the final output — a hallucinated answer, a skipped step, a tool called with the wrong arguments — but the actual failure could have happened three steps earlier. This is the core debugging problem that software engineering hasn't solved yet: agents execute as a sequence of decisions, and by the time you notice something is wrong, the evidence is buried in a long trace of interleaved LLM calls, tool invocations, and state mutations.

Traditional debugging assumes determinism. You can reproduce the bug, set a breakpoint, inspect the state. Agent debugging breaks all three of those assumptions simultaneously. The same input can produce different execution paths. Reproducing a failure requires capturing the exact context, model temperature, and external state at the moment it happened. And "setting a breakpoint" in a live reasoning loop is not something most agent frameworks even support.

Harness Engineering: The Discipline That Determines Whether Your AI Agents Actually Work

· 10 min read
Tian Pan
Software Engineer

Most teams running AI coding agents are optimizing the wrong variable. They obsess over model selection — Claude vs. GPT vs. Gemini — while treating the surrounding scaffolding as incidental plumbing. But benchmark data and production war stories tell a different story: the gap between a model that impresses in a demo and one that ships production code reliably comes almost entirely from the harness around it, not the model itself.

The formula is deceptively simple: Agent = Model + Harness. The harness is everything else — tool schemas, permission models, context lifecycle management, feedback loops, sandboxing, documentation infrastructure, architectural invariants. Get the harness wrong and even a frontier model produces hallucinated file paths, breaks its own conventions twenty turns into a session, and declares a feature done before writing a single test.

Building Governed AI Agents: A Practical Guide to Agentic Scaffolding

· 10 min read
Tian Pan
Software Engineer

Most teams building AI agents spend the first month chasing performance: better prompts, smarter routing, faster retrieval. They spend the next six months chasing the thing they skipped—governance. Agents that can't be audited get shut down by legal. Agents without permission boundaries wreak havoc in staging. Agents without human escalation paths quietly make consequential mistakes at scale.

The uncomfortable truth is that most agent deployments fail not because the model underperforms, but because the scaffolding around it lacks structure. Nearly two-thirds of organizations are experimenting with agents; fewer than one in four have successfully scaled to production. The gap isn't model quality. It's governance.