311 posts tagged with "ai-agents"

Token Economics for AI Agents: Cutting Costs Without Cutting Corners

· 10 min read
Tian Pan
Software Engineer

A Shopify-scale merchant assistant handling 10 million conversations per day costs $2.1 million per month without optimization — or $450,000 per month with it. That 78% gap isn't from algorithmic breakthroughs; it's from caching, routing, and a few engineering disciplines that most teams skip until the invoice arrives.

AI agents are not chatbots with extra steps. A single user request triggers planning, tool selection, execution, verification, and often retry loops, consuming roughly 5x the tokens of a direct chat interaction. A ReAct loop running 10 cycles can consume 50x the tokens of a single pass. At frontier model prices, that math becomes a liability fast.
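
To make the loop math concrete, here is a minimal sketch. It rests on two illustrative assumptions that are not taken from the post: every cycle resends the full accumulated context, and every cycle appends a fixed number of tokens. The names `base_tokens` and `added_per_cycle` are hypothetical.

```python
def cumulative_input_tokens(base_tokens: int, added_per_cycle: int, cycles: int) -> int:
    """Total input tokens billed across a loop that resends its whole context each cycle."""
    total = 0
    context = base_tokens
    for _ in range(cycles):
        total += context            # the model re-reads everything accumulated so far
        context += added_per_cycle  # thought, tool call, and observation get appended
    return total


# Assumed sizes: a 1,000-token prompt, and 1,000 tokens appended per cycle.
single_pass = cumulative_input_tokens(1_000, 1_000, cycles=1)
ten_cycles = cumulative_input_tokens(1_000, 1_000, cycles=10)
print(ten_cycles / single_pass)  # 55.0
```

Under these assumptions the input cost grows quadratically with the number of cycles, which puts a 10-cycle loop in the same range as the 50x figure.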

This post covers the mechanics of where agent costs come from and the concrete techniques — with numbers — that actually move the needle.

AlphaEvolve's Architecture: How Evolutionary Search + LLMs Discovered a Better Matrix Algorithm

· 9 min read
Tian Pan
Software Engineer

In 1969, Volker Strassen published an algorithm that multiplies 2×2 matrices with 7 scalar multiplications instead of the naive 8; applied recursively, it handles 4×4 matrices in 49. For 56 years, no one did better. Then an AI coding agent, one deployed in production across Google's global infrastructure, beat it, not by being smarter than a human mathematician, but by running a loop: generate a variant, evaluate it, keep what works, repeat.

That loop is the point. The LLM is just one piece. The architecture around it is what made AlphaEvolve work, and understanding that architecture tells you something important about where AI-assisted engineering is heading.
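
The generate, evaluate, keep, repeat loop can be sketched in a few lines. This is a single-lineage toy, not AlphaEvolve's actual implementation: `toy_mutate` is a hypothetical random stand-in for the LLM that proposes variants, and `toy_evaluate` is an assumed scoring function over a list of integers.

```python
import random
from typing import Callable, List, Tuple

Candidate = List[int]


def evolve(
    seed: Candidate,
    mutate: Callable[[Candidate, random.Random], Candidate],
    evaluate: Callable[[Candidate], float],
    generations: int = 500,
    rng_seed: int = 0,
) -> Tuple[Candidate, float]:
    """Generate a variant, evaluate it, keep it only if it scores higher, repeat."""
    rng = random.Random(rng_seed)
    best, best_score = seed, evaluate(seed)
    for _ in range(generations):
        variant = mutate(best, rng)  # in a real system, an LLM proposes the edit
        score = evaluate(variant)    # automated scoring, no human in the loop
        if score > best_score:
            best, best_score = variant, score
    return best, best_score


TARGET = [3, 1, 4, 1, 5]  # arbitrary target for the toy problem


def toy_mutate(candidate: Candidate, rng: random.Random) -> Candidate:
    """Hypothetical stand-in for the LLM: nudge one position by +1 or -1."""
    variant = list(candidate)
    i = rng.randrange(len(variant))
    variant[i] += rng.choice([-1, 1])
    return variant


def toy_evaluate(candidate: Candidate) -> float:
    """Higher is better: negative squared distance to the target."""
    return -float(sum((c - t) ** 2 for c, t in zip(candidate, TARGET)))


best, score = evolve([0, 0, 0, 0, 0], toy_mutate, toy_evaluate)
print(best, score)
```

The loop only works because `evaluate` is cheap and objective; swapping the random mutation for a model changes the quality of proposals, not the structure of the search.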

Evaluating AI Agents: Why Grading Outcomes Alone Will Lie to You

· 10 min read
Tian Pan
Software Engineer

An agent you built scores 82% on final-output evaluations. You ship it. Two weeks later, your support queue fills up with users complaining that the agent is retrieving the wrong data, calling APIs with wrong parameters, and producing confident-sounding responses built on faulty intermediate work. You go back and look at the traces — and realize the agent was routing incorrectly on 40% of queries the whole time. The final-output eval never caught it because, often enough, the agent stumbled into a correct answer anyway.

This is the core trap in agent evaluation: measuring only what comes out the other end tells you nothing about how the agent got there, and "getting there" is where most failures live.
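
A small sketch of the gap between the two measurements, assuming each trace has already been graded step by step. The `Step` and `Trace` types, the step names, and the four sample traces are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    name: str      # e.g. "route", "retrieve"
    correct: bool  # whether a grader judged this intermediate step correct


@dataclass
class Trace:
    steps: List[Step]
    final_correct: bool


def outcome_accuracy(traces: List[Trace]) -> float:
    """Fraction of traces whose final answer was graded correct."""
    return sum(t.final_correct for t in traces) / len(traces)


def step_accuracy(traces: List[Trace], step_name: str) -> float:
    """Fraction of steps with the given name that were graded correct."""
    graded = [s.correct for t in traces for s in t.steps if s.name == step_name]
    return sum(graded) / len(graded)


traces = [
    Trace([Step("route", True), Step("retrieve", True)], final_correct=True),
    # Wrong route, but the agent stumbled into a correct answer anyway.
    Trace([Step("route", False), Step("retrieve", True)], final_correct=True),
    Trace([Step("route", False), Step("retrieve", False)], final_correct=False),
    Trace([Step("route", True), Step("retrieve", True)], final_correct=True),
]

print(outcome_accuracy(traces))        # 0.75
print(step_accuracy(traces, "route"))  # 0.5
```

The outcome metric alone reports 75%, while the routing step is right only half the time.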

Context Engineering: The Invisible Architecture of Production AI Agents

· 10 min read
Tian Pan
Software Engineer

Most AI agent bugs are not model bugs. The model is doing exactly what it's told—it's what you're putting into the context that's broken. After a certain point in an agent's execution, the problem isn't capability. It's entropy: the slow accumulation of noise, redundancy, and misaligned attention that degrades every output the model produces. Researchers call this context rot, and every major model—GPT-4.1, Claude Opus 4, Gemini 2.5—exhibits it, at every input length increment, without exception.

Context engineering is the discipline of managing this problem deliberately. It's broader than prompt engineering, which is mostly about the static system prompt. Context engineering covers everything the model sees at inference time: what you include, what you exclude, what you compress, where you position things, and how you preserve cache state across a long-running task.

Building a Multi-Agent Research System: Patterns from Production

· 8 min read
Tian Pan
Software Engineer

When a single-agent system fails at a research task, the instinct is to add more memory, better tools, or a smarter model. But there's a point where the problem isn't capability — it's concurrency. Deep research tasks require pursuing multiple threads simultaneously: validating claims from different angles, scanning sources across domains, cross-referencing findings in real time. A single agent doing this sequentially is like a researcher reading every book one at a time before taking notes. The multi-agent alternative feels obvious in retrospect, but getting it right in production is considerably harder than the architecture diagram suggests.

This post is about how multi-agent research systems actually get built — the architectural choices that work, the failure modes that aren't obvious until you're in production, and the engineering discipline required to keep them useful at scale.

Why Your AI Agent Wastes Most of Its Context Window on Tools

· 10 min read
Tian Pan
Software Engineer

You connect your agent to 50 MCP tools. It can query databases, call APIs, read files, send emails, browse the web. On paper, it has everything it needs. In practice, half your production incidents trace back to tool use—wrong parameters, blown context budgets, cascading retry loops that cost ten times what you expected.

Here's the part most tutorials skip: every tool definition you load is a token tax paid upfront, before the agent processes a single user message. With 50+ tools connected, definitions alone can consume 70,000–130,000 tokens per request. That's not a corner case—it's the default state of any agent connected to multiple MCP servers.
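
One way to see the upfront tax is to estimate it before each request. The sketch below assumes a crude 4-characters-per-token heuristic (real counts depend on the model's tokenizer) and a hypothetical list of tool definitions stored as plain dictionaries; `select_tools` is an illustrative allow-list filter, not an MCP API.

```python
import json
from typing import Dict, List, Set


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough heuristic only; real counts depend on the model's tokenizer."""
    return int(len(text) / chars_per_token)


def tool_definition_overhead(tools: List[Dict]) -> int:
    """Estimated tokens spent on tool definitions before any user message."""
    return sum(estimate_tokens(json.dumps(tool)) for tool in tools)


def select_tools(tools: List[Dict], allowed: Set[str]) -> List[Dict]:
    """Load only the tools a given task is expected to need."""
    return [t for t in tools if t["name"] in allowed]


# Hypothetical definitions: 50 tools, each with a long description and schema.
all_tools = [
    {"name": f"tool_{i}", "description": "x" * 4_000, "parameters": {"type": "object"}}
    for i in range(50)
]
subset = select_tools(all_tools, {"tool_1", "tool_7", "tool_42"})

print(tool_definition_overhead(all_tools), tool_definition_overhead(subset))
```

If the range quoted above is spread over exactly 50 tools, it works out to roughly 1,400 to 2,600 tokens per definition, so trimming the loaded set is where most of the savings would come from.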

Why Your AI Agent Should Write Code Instead of Calling Tools

· 11 min read
Tian Pan
Software Engineer

Most AI agents are expensive because of a subtle architectural mistake: they treat every intermediate result as a message to be fed back into the model. Each tool call becomes a round trip through the LLM's context window, and by the time a moderately complex task completes, you've paid to process the same data five, ten, maybe twenty times. A single 2-hour sales transcript passed between three analysis tools might cost you 50,000 tokens — not for the analysis, just for the routing.

There's a better way. When agents write and execute code rather than calling tools one at a time, intermediate results stay in the execution environment, not the context window. The model sees summaries and filtered outputs, not raw data. The difference isn't incremental — it's been measured at 98–99% token reductions on real workloads.
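
A minimal sketch of the idea, assuming the agent's generated code runs in a sandbox and only its return value is placed in the context. The transcript text, the keyword list, and `analyze_transcript` are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def analyze_transcript(transcript: str, keywords: List[str]) -> Dict[str, int]:
    """Runs in the execution environment; the raw transcript never reaches the model."""
    words = Counter(w.strip(".,!?").lower() for w in transcript.split())
    return {k: words[k] for k in keywords}


# Stand-in for a long transcript that would otherwise pass through the context window.
raw_transcript = "We discussed pricing. The pricing tier and the renewal date came up. " * 2_000

summary = analyze_transcript(raw_transcript, ["pricing", "renewal"])
context_payload = str(summary)  # the only thing sent back to the model

print(len(raw_transcript), len(context_payload))
print(summary)  # {'pricing': 4000, 'renewal': 2000}
```

The model still gets the answer it needs, but the payload it has to read is a few dozen characters rather than the full transcript.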

Why Multi-Agent AI Architectures Keep Failing (and What to Build Instead)

· 8 min read
Tian Pan
Software Engineer

Most teams that build multi-agent systems hit the same wall: the thing works in demos and falls apart in production. Not because they implemented the coordination protocol wrong. Because the protocol itself is the problem.

Multi-agent AI has an intuitive appeal. Complex tasks should be broken into parallel workstreams. Specialized agents should handle specialized work. The orchestrator ties it together and the whole becomes greater than the sum of its parts. This intuition is wrong — or more precisely, it's premature. The practical failure rates of multi-agent systems in production range from 41% to 86.7% across studied execution traces. That's not a tuning problem. That's a structural one.

12-Factor Agents: A Framework for Building AI Systems That Actually Ship

· 11 min read
Tian Pan
Software Engineer

The teams actually shipping reliable AI agents to production customers are mostly not using agent frameworks. They rolled their own.

That observation, surfaced from conversations with 100+ technical founders, is the uncomfortable starting point for the 12-Factor Agents framework — a manifesto for building LLM-powered software that reaches production instead of languishing at 80% quality forever. The framework is named deliberately after the original 12-Factor App methodology that shaped a generation of web services. The analogy holds: just as the 12-factor app gave teams a principled approach to building deployable web services, 12-factor agents provides the principles for building reliable, observable AI systems.

The 19,000-star GitHub repository documents what the best-performing production teams figured out independently. Here is what they know.

The Lethal Trifecta: Why Your AI Agent Is One Email Away from a Data Breach

· 9 min read
Tian Pan
Software Engineer

In June 2025, a researcher sent a carefully crafted email to a Microsoft 365 Copilot user. No link was clicked. No attachment opened. The email arrived, Copilot read it during a routine summarization task, and within seconds the AI began exfiltrating files from OneDrive, SharePoint, and Teams — silently transmitting contents to an attacker-controlled server by encoding data into image URLs it asked to "render." The victim never knew it happened.

This wasn't a novel zero-day in the traditional sense. There was no buffer overflow, no SQL injection. The vulnerability was architectural: the system combined three capabilities that, individually, seem like obvious product features. Together, they form what's now called the Lethal Trifecta.

Why Long-Running AI Agents Break in Production (And the Infrastructure to Fix It)

· 9 min read
Tian Pan
Software Engineer

Most AI agent demos work beautifully.

They run in under 30 seconds, hit three tools, and return a clean result. Then someone asks the agent to do something that actually matters — cross-reference a codebase, run a multi-stage data pipeline, process a batch of documents — and the whole thing falls apart in a cascade of timeouts, partial state, and duplicate side effects.

The problem is not the model. It is the infrastructure. Agents that run for minutes or hours face a completely different class of systems problems than agents that finish in seconds, and most teams hit this wall at the worst possible time: after they have already shipped something users depend on.
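
Duplicate side effects are the easiest of these failures to illustrate. The sketch below assumes each step has a stable ID and that completed steps are recorded somewhere durable; here an in-memory dictionary stands in for that store, and `CheckpointedRunner` and `send_email` are hypothetical names.

```python
from typing import Callable, Dict, List


class CheckpointedRunner:
    """Records finished steps so a retry resumes instead of repeating side effects."""

    def __init__(self) -> None:
        # step_id -> result; a durable store (database, queue) in a real system
        self.completed: Dict[str, object] = {}

    def run_step(self, step_id: str, action: Callable[[], object]) -> object:
        if step_id in self.completed:
            return self.completed[step_id]  # already done: skip the side effect
        result = action()
        self.completed[step_id] = result    # checkpoint only after success
        return result


sent_emails: List[str] = []


def send_email() -> str:
    sent_emails.append("report")
    return "sent"


runner = CheckpointedRunner()
runner.run_step("send-report", send_email)
runner.run_step("send-report", send_email)  # simulated retry after a timeout

print(sent_emails)  # ['report']
```

A crash between the action and the checkpoint can still cause a repeat, so this pattern is usually paired with idempotency keys on the receiving side.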

Context Engineering: Why What You Feed the LLM Matters More Than How You Ask

· 11 min read
Tian Pan
Software Engineer

Most LLM quality problems aren't prompt problems. They're context problems.

You spend hours crafting the perfect system prompt. You add XML tags, chain-of-thought instructions, and careful persona definitions. You test it on a handful of inputs and it looks great. Then you ship it, and two weeks later you're staring at a ticket where the agent confidently told a user the wrong account balance — because it retrieved the previous user's transaction history. The model understood the instructions perfectly. It just had the wrong inputs.

This is the core distinction between prompt engineering and context engineering. Prompt engineering asks: "How should I phrase this?" Context engineering asks: "What does the model need to know right now, and how do I make sure it gets exactly that?" One is copywriting. The other is systems architecture.
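
The account-balance failure above is a scoping problem, and a sketch of the guard is short. It assumes retrieved records carry a `user_id` field; `Record` and `build_context` are hypothetical names, and the sample balances are made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    user_id: str
    text: str


def build_context(records: List[Record], current_user_id: str, max_items: int = 5) -> str:
    """Keep only the current user's records before anything reaches the prompt."""
    scoped = [r for r in records if r.user_id == current_user_id]
    return "\n".join(r.text for r in scoped[:max_items])


records = [
    Record("user-a", "Balance: $120.00"),
    Record("user-b", "Balance: $9,400.00"),
]

context = build_context(records, current_user_id="user-b")
print(context)  # Balance: $9,400.00
```

No amount of prompt wording fixes this class of bug; the filter has to run before the context is assembled.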