51 posts tagged with "ai-agents"

Building Governed AI Agents: A Practical Guide to Agentic Scaffolding

· 10 min read
Tian Pan
Software Engineer

Most teams building AI agents spend the first month chasing performance: better prompts, smarter routing, faster retrieval. They spend the next six months chasing the thing they skipped—governance. Agents that can't be audited get shut down by legal. Agents without permission boundaries wreak havoc in staging. Agents without human escalation paths quietly make consequential mistakes at scale.

The uncomfortable truth is that most agent deployments fail not because the model underperforms, but because the scaffolding around it lacks structure. Nearly two-thirds of organizations are experimenting with agents; fewer than one in four have successfully scaled to production. The gap isn't model quality. It's governance.

Context Engineering: The Discipline That Matters More Than Prompting

· 9 min read
Tian Pan
Software Engineer

Most engineers building LLM systems spend the first few weeks obsessing over their prompts. They A/B test phrasing, argue about whether to use XML tags or JSON, and iterate on system prompt wording until the model outputs something that looks right. Then they hit production, add real data, memory, and tool calls — and the model starts misbehaving in ways that no amount of prompt tuning can fix. The problem was never the prompt.

The real bottleneck in production LLM systems is context — what information is present in the model's input, in what order, how much of it there is, and whether it's relevant to the decision the model is about to make. Context engineering is the discipline of designing and managing that input space as a first-class system concern. It subsumes prompt engineering the same way software architecture subsumes variable naming: the smaller skill still matters, but it doesn't drive outcomes at scale.

Your CLAUDE.md Is Probably Too Long (And That's Why It's Not Working)

· 10 min read
Tian Pan
Software Engineer

Here's a pattern that plays out constantly in teams adopting AI coding agents: a developer has Claude disobey a rule, so they add a clearer version to their CLAUDE.md. Claude disobeys a different rule, so they add that one too. After a few weeks, the file is 400 lines long and Claude is ignoring more rules than ever. The solution made the problem worse.

This happens because of a fundamental property of instruction files that most developers never internalize: past a certain size, adding more instructions causes the model to follow fewer of them. Getting instruction files right is less about completeness and more about ruthless selection — knowing what to include, what to cut, and how to architect the rest.

Why Your Existing Observability Stack Won't Save You When AI Agents Break

· 11 min read
Tian Pan
Software Engineer

Your Datadog dashboard shows zero errors. Latency is nominal. All services return HTTP 200. Meanwhile, your AI agent just booked a meeting in the wrong timezone, hallucinated a customer's order history, and burned $4 in tokens doing it.

This is what makes agent observability genuinely hard: the metrics you already have tell you almost nothing about whether agents are actually working.

Traditional distributed tracing was built on a set of assumptions about how software fails. LLM agents violate all of them, and the gap between "my infrastructure is healthy" and "my agent did the right thing" is where most debugging pain lives.

Token Economics for AI Agents: Cutting Costs Without Cutting Corners

· 10 min read
Tian Pan
Software Engineer

A Shopify-scale merchant assistant handling 10 million conversations per day costs $2.1 million per month without optimization — or $450,000 per month with it. That 78% gap isn't from algorithmic breakthroughs; it's from caching, routing, and a few engineering disciplines that most teams skip until the invoice arrives.

AI agents are not chatbots with extra steps. A single user request triggers planning, tool selection, execution, verification, and often retry loops — consuming roughly 5x more tokens than a direct chat interaction. A ReAct loop running 10 cycles can consume 50x tokens compared to a single pass. At frontier model prices, that math becomes a liability fast.

This post covers the mechanics of where agent costs come from and the concrete techniques — with numbers — that actually move the needle.
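The multipliers above compound in a way that's easy to underestimate. Here's a back-of-envelope cost model — all prices and per-chat token counts are illustrative assumptions, calibrated so the unoptimized case lands near the $2.1M figure; the post's optimized path also involves model routing, which this sketch omits:

```python
# Back-of-envelope agent cost model. Prices and token counts are
# illustrative assumptions, not quotes from any provider's price sheet.

def monthly_cost(conversations_per_day: int,
                 tokens_per_chat: int = 1_000,
                 agent_multiplier: float = 5.0,   # agents burn ~5x a direct chat
                 price_per_mtok: float = 1.40,
                 cache_hit_rate: float = 0.0,     # fraction served from prompt cache
                 cached_discount: float = 0.9) -> float:
    """Estimate monthly token spend for an agent workload."""
    tokens_per_request = tokens_per_chat * agent_multiplier
    effective_price = price_per_mtok * (1 - cache_hit_rate * cached_discount)
    daily = conversations_per_day * tokens_per_request / 1e6 * effective_price
    return daily * 30

baseline = monthly_cost(10_000_000)                       # no optimization
optimized = monthly_cost(10_000_000, cache_hit_rate=0.8)  # heavy caching
print(f"baseline ≈ ${baseline:,.0f}/mo, cached ≈ ${optimized:,.0f}/mo")
```

Even this crude model shows why caching dominates: it discounts the multiplier itself, not just one term.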

AlphaEvolve's Architecture: How Evolutionary Search + LLMs Discovered a Better Matrix Algorithm

· 9 min read
Tian Pan
Software Engineer

In 1969, Volker Strassen published an algorithm to multiply 4×4 matrices using fewer scalar multiplications than the naive approach. For 56 years, no one did better. Then an AI coding agent rewrote it — in production, deployed across Google's global infrastructure — not by being smarter than a human mathematician, but by running a loop: generate a variant, evaluate it, keep what works, repeat.

That loop is the point. The LLM is just one piece. The architecture around it is what made AlphaEvolve work, and understanding that architecture tells you something important about where AI-assisted engineering is heading.
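The generate-evaluate-keep loop is simple enough to sketch in a few lines. In this toy version a random mutation stands in for the LLM proposer and the evaluator is a trivial numeric objective rather than a matrix-multiplication scorer — the structure, not the domain, is what carries over:

```python
import random

# Minimal sketch of the generate → evaluate → keep-what-works loop.
# A Gaussian mutation stands in for the LLM proposer; the evaluator is a
# toy objective (get close to a target value), not a real program scorer.

def evaluate(candidate: float, target: float = 3.14) -> float:
    return -abs(candidate - target)  # higher is better

def evolve(generations: int = 200, population_size: int = 8) -> float:
    population = [random.uniform(-10, 10) for _ in range(population_size)]
    for _ in range(generations):
        parent = max(population, key=evaluate)    # keep what works
        variant = parent + random.gauss(0, 0.5)   # generate a variant
        worst = min(population, key=evaluate)
        if evaluate(variant) > evaluate(worst):   # evaluate, then replace
            population[population.index(worst)] = variant
    return max(population, key=evaluate)

random.seed(0)
best = evolve()
```

Swap the mutation for an LLM-generated code diff and the evaluator for an automated benchmark, and you have the skeleton the post describes.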

Evaluating AI Agents: Why Grading Outcomes Alone Will Lie to You

· 10 min read
Tian Pan
Software Engineer

An agent you built scores 82% on final-output evaluations. You ship it. Two weeks later, your support queue fills up with users complaining that the agent is retrieving the wrong data, calling APIs with wrong parameters, and producing confident-sounding responses built on faulty intermediate work. You go back and look at the traces — and realize the agent was routing incorrectly on 40% of queries the whole time. The final-output eval never caught it because, often enough, the agent stumbled into a correct answer anyway.

This is the core trap in agent evaluation: measuring only what comes out the other end tells you nothing about how the agent got there, and "getting there" is where most failures live.
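The gap between the two scores is easy to demonstrate on labeled traces. This toy harness (the traces and "expected route" labels are invented for illustration) shows how outcome accuracy can look healthy while step-level routing accuracy is quietly poor:

```python
# Toy contrast between outcome-only and step-level evaluation.
# Traces and expected-route labels are invented for illustration.

traces = [
    {"route": "orders_api", "expected": "orders_api", "final_ok": True},
    {"route": "search",     "expected": "orders_api", "final_ok": True},   # lucky
    {"route": "orders_api", "expected": "orders_api", "final_ok": True},
    {"route": "search",     "expected": "orders_api", "final_ok": True},   # lucky
    {"route": "orders_api", "expected": "orders_api", "final_ok": False},
]

def outcome_accuracy(ts):
    return sum(t["final_ok"] for t in ts) / len(ts)

def routing_accuracy(ts):
    return sum(t["route"] == t["expected"] for t in ts) / len(ts)

print(outcome_accuracy(traces), routing_accuracy(traces))  # 0.8 vs 0.6
```

An outcome-only eval reports 80% and ships; the step-level check reveals the agent mis-routed 40% of the time and only survived on luck.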

Context Engineering: The Invisible Architecture of Production AI Agents

· 10 min read
Tian Pan
Software Engineer

Most AI agent bugs are not model bugs. The model is doing exactly what it's told—it's what you're putting into the context that's broken. After a certain point in an agent's execution, the problem isn't capability. It's entropy: the slow accumulation of noise, redundancy, and misaligned attention that degrades every output the model produces. Researchers call this context rot, and every major model—GPT-4.1, Claude Opus 4, Gemini 2.5—exhibits it at every tested input length, without exception.

Context engineering is the discipline of managing this problem deliberately. It's broader than prompt engineering, which is mostly about the static system prompt. Context engineering covers everything the model sees at inference time: what you include, what you exclude, what you compress, where you position things, and how you preserve cache state across a long-running task.
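Those decisions — include, exclude, compress, position, cache — can be made concrete in a tiny context assembler. This is a sketch under simplifying assumptions: real systems would summarize old turns with a model, where this stand-in just truncates them:

```python
# Sketch of a context assembler: stable prefix first (cache-friendly),
# compressed older history in the middle, freshest turns verbatim at the end.
# Truncation stands in for real model-based summarization.

def build_context(system_prompt: str, history: list[str],
                  max_summary_chars: int = 500, keep_recent: int = 4) -> str:
    recent = history[-keep_recent:]
    older = history[:-keep_recent]
    parts = [system_prompt]  # never changes → preserves prompt-cache state
    if older:
        summary = "[summary of earlier turns] " + " | ".join(older)[:max_summary_chars]
        parts.append(summary)
    parts.extend(recent)     # the turns the model most needs verbatim
    return "\n".join(parts)

ctx = build_context("You are a helpful agent.", [f"turn {i}" for i in range(10)])
```

The ordering matters as much as the content: keeping the prefix byte-stable is what lets cached attention state survive across turns.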

Building a Multi-Agent Research System: Patterns from Production

· 8 min read
Tian Pan
Software Engineer

When a single-agent system fails at a research task, the instinct is to add more memory, better tools, or a smarter model. But there's a point where the problem isn't capability — it's concurrency. Deep research tasks require pursuing multiple threads simultaneously: validating claims from different angles, scanning sources across domains, cross-referencing findings in real time. A single agent doing this sequentially is like a researcher reading every book one at a time before taking notes. The multi-agent alternative feels obvious in retrospect, but getting it right in production is considerably harder than the architecture diagram suggests.

This post is about how multi-agent research systems actually get built — the architectural choices that work, the failure modes that aren't obvious until you're in production, and the engineering discipline required to keep them useful at scale.
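The core orchestrator-worker shape is small, even if production-hardening it isn't. In this sketch the subagent is a hypothetical stub; the point is the fan-out/merge structure that lets research threads proceed concurrently:

```python
# Orchestrator-worker sketch: fan sub-questions out in parallel, merge
# findings. The `research` stub stands in for a full subagent run.

from concurrent.futures import ThreadPoolExecutor

def research(subquestion: str) -> str:
    # In a real system: a subagent with its own tools and context window.
    return f"findings for: {subquestion}"

def orchestrate(question: str, subquestions: list[str]) -> dict:
    with ThreadPoolExecutor(max_workers=len(subquestions)) as pool:
        findings = list(pool.map(research, subquestions))  # preserves order
    return {"question": question, "findings": findings}

result = orchestrate("Is X viable?", ["market size", "competitors", "unit economics"])
```

Most of the production difficulty the post covers lives outside this skeleton: deciding how to decompose, bounding subagent spend, and reconciling contradictory findings.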

Why Your AI Agent Wastes Most of Its Context Window on Tools

· 10 min read
Tian Pan
Software Engineer

You connect your agent to 50 MCP tools. It can query databases, call APIs, read files, send emails, browse the web. On paper, it has everything it needs. In practice, half your production incidents trace back to tool use—wrong parameters, blown context budgets, cascading retry loops that cost ten times what you expected.

Here's the part most tutorials skip: every tool definition you load is a token tax paid upfront, before the agent processes a single user message. With 50+ tools connected, definitions alone can consume 70,000–130,000 tokens per request. That's not a corner case—it's the default state of any agent connected to multiple MCP servers.
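One standard mitigation is to stop loading every definition on every request and instead select a relevant subset under a token budget. This is a sketch with a naive keyword matcher standing in for real retrieval, and made-up per-tool token costs:

```python
# Sketch of lazy tool loading: select a small, budgeted subset of tool
# definitions per request instead of paying the full upfront token tax.
# Tool costs and the keyword matcher are illustrative stand-ins.

TOOLS = {  # name -> (definition_tokens, trigger keywords)
    "query_database": (2_000, {"sql", "query", "table", "database"}),
    "send_email":     (1_500, {"email", "send", "notify"}),
    "read_file":      (1_200, {"file", "read", "open"}),
    "browse_web":     (2_600, {"url", "web", "search", "browse"}),
}

def select_tools(message: str, budget_tokens: int = 5_000) -> list[str]:
    words = set(message.lower().split())
    ranked = sorted(TOOLS, key=lambda t: -len(TOOLS[t][1] & words))
    chosen, spent = [], 0
    for name in ranked:
        cost = TOOLS[name][0]
        if TOOLS[name][1] & words and spent + cost <= budget_tokens:
            chosen.append(name)
            spent += cost
    return chosen

print(select_tools("run a sql query against the orders table"))
```

With 50 real tools, swapping the keyword match for embedding retrieval over tool descriptions is the usual next step — but the budget discipline is the part that caps the tax.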

Why Your AI Agent Should Write Code Instead of Calling Tools

· 11 min read
Tian Pan
Software Engineer

Most AI agents are expensive because of a subtle architectural mistake: they treat every intermediate result as a message to be fed back into the model. Each tool call becomes a round trip through the LLM's context window, and by the time a moderately complex task completes, you've paid to process the same data five, ten, maybe twenty times. A single 2-hour sales transcript passed between three analysis tools might cost you 50,000 tokens — not for the analysis, just for the routing.

There's a better way. When agents write and execute code rather than calling tools one at a time, intermediate results stay in the execution environment, not the context window. The model sees summaries and filtered outputs, not raw data. The difference isn't incremental — it's been measured at 98–99% token reductions on real workloads.
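The pattern is easiest to see in miniature. In this sketch (the transcript and analysis steps are hypothetical), the heavy intermediate data lives only in the execution environment's variables; the model is shown just the returned summary:

```python
# Sketch of the code-execution pattern: agent-written code keeps bulky
# intermediates in local variables, and only a small summary is returned
# to the model's context. Transcript contents are invented for illustration.

transcript = [
    "rep: our pricing starts at $99" if i % 100 == 0 else "customer: filler"
    for i in range(5_000)  # stands in for a large raw transcript
]

def analyze(lines: list[str]) -> dict:
    # The full line list and the filtered list never enter the prompt.
    mentions = [l for l in lines if "pricing" in l]
    return {
        "total_lines": len(lines),
        "pricing_mentions": len(mentions),
        "sample": mentions[:3],  # only this tiny slice reaches the model
    }

summary = analyze(transcript)  # what the model actually sees
```

The same transcript routed through three tool calls as messages would be re-tokenized on every hop; here it's tokenized zero times.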

Why Multi-Agent AI Architectures Keep Failing (and What to Build Instead)

· 8 min read
Tian Pan
Software Engineer

Most teams that build multi-agent systems hit the same wall: the thing works in demos and falls apart in production. Not because they implemented the coordination protocol wrong. Because the protocol itself is the problem.

Multi-agent AI has an intuitive appeal. Complex tasks should be broken into parallel workstreams. Specialized agents should handle specialized work. The orchestrator ties it together and the whole becomes greater than the sum of its parts. This intuition is wrong — or more precisely, it's premature. The practical failure rates of multi-agent systems in production range from 41% to 86.7% across studied execution traces. That's not a tuning problem. That's a structural one.