
207 posts tagged with "llm"


Evaluating AI Agents: Why Grading Outcomes Alone Will Lie to You

· 10 min read
Tian Pan
Software Engineer

An agent you built scores 82% on final-output evaluations. You ship it. Two weeks later, your support queue fills up with users complaining that the agent is retrieving the wrong data, calling APIs with wrong parameters, and producing confident-sounding responses built on faulty intermediate work. You go back and look at the traces — and realize the agent was routing incorrectly on 40% of queries the whole time. The final-output eval never caught it because, often enough, the agent stumbled into a correct answer anyway.

This is the core trap in agent evaluation: measuring only what comes out the other end tells you nothing about how the agent got there, and "getting there" is where most failures live.
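The gap between outcome metrics and process metrics is easy to demonstrate on toy traces. A minimal sketch, with invented numbers chosen to mirror the failure mode above (every name here is hypothetical):

```python
# Toy evaluation traces: each tuple is (routed_correctly, final_answer_correct).
# Note the two traces where routing failed but the agent got lucky anyway.
traces = [
    (True, True), (True, True), (True, False),
    (False, True), (False, True), (False, False),
]

# Outcome-only eval: fraction of traces with a correct final answer.
outcome_accuracy = sum(final for _, final in traces) / len(traces)

# Process eval: fraction of traces where the routing step was correct.
routing_accuracy = sum(routed for routed, _ in traces) / len(traces)

print(f"outcome: {outcome_accuracy:.0%}, routing: {routing_accuracy:.0%}")
```

The outcome eval reports 67% while routing silently failed on half the queries; only grading the intermediate step exposes the difference.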

Context Engineering: The Invisible Architecture of Production AI Agents

· 10 min read
Tian Pan
Software Engineer

Most AI agent bugs are not model bugs. The model is doing exactly what it's told—it's what you're putting into the context that's broken. After a certain point in an agent's execution, the problem isn't capability. It's entropy: the slow accumulation of noise, redundancy, and misaligned attention that degrades every output the model produces. Researchers call this context rot, and every major model—GPT-4.1, Claude Opus 4, Gemini 2.5—exhibits it at every tested input length, without exception.

Context engineering is the discipline of managing this problem deliberately. It's broader than prompt engineering, which is mostly about the static system prompt. Context engineering covers everything the model sees at inference time: what you include, what you exclude, what you compress, where you position things, and how you preserve cache state across a long-running task.
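The include/exclude/compress/position decisions can be sketched as a single context-assembly function. This is an illustrative toy, not any particular framework's API: the helper name, the chars-divided-by-four token estimate, and the compression trigger are all assumptions for the sketch.

```python
def assemble_context(system_prompt, history, retrieved, budget=8000,
                     count=lambda s: len(s) // 4):  # crude ~chars/4 token estimate
    """Toy context assembly: stable prefix first (cache-friendly), fresh last."""
    parts = [system_prompt]  # static prefix stays put to preserve KV-cache reuse

    # Compress: summarize old turns instead of dropping them wholesale.
    if sum(count(m) for m in history) > budget // 2:
        history = ["[summary of earlier turns]"] + history[-4:]
    parts += history

    # Include/exclude: only admit retrieved chunks that fit the remaining budget.
    remaining = budget - sum(count(p) for p in parts)
    for chunk in retrieved:
        if count(chunk) <= remaining:
            parts.append(chunk)
            remaining -= count(chunk)

    return "\n\n".join(parts)
```

The point of the sketch is the ordering of concerns: position the stable material first, compress before excluding, and treat the budget as a hard constraint rather than an afterthought.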

Building a Multi-Agent Research System: Patterns from Production

· 8 min read
Tian Pan
Software Engineer

When a single-agent system fails at a research task, the instinct is to add more memory, better tools, or a smarter model. But there's a point where the problem isn't capability — it's concurrency. Deep research tasks require pursuing multiple threads simultaneously: validating claims from different angles, scanning sources across domains, cross-referencing findings in real time. A single agent doing this sequentially is like a researcher reading every book one at a time before taking notes. The multi-agent alternative feels obvious in retrospect, but getting it right in production is considerably harder than the architecture diagram suggests.

This post is about how multi-agent research systems actually get built — the architectural choices that work, the failure modes that aren't obvious until you're in production, and the engineering discipline required to keep them useful at scale.

Designing an Agent Runtime from First Principles

· 10 min read
Tian Pan
Software Engineer

Most agent frameworks make a critical mistake early: they treat the agent as a function. You call it, it loops, it returns. That mental model works for demos. It falls apart the moment a real-world task runs for 45 minutes, hits a rate limit at step 23, and you have nothing to resume from.

A production agent runtime is not a function runner. It is an execution substrate — something closer to a process scheduler or a distributed workflow engine than a Python function. Getting this distinction right from the beginning determines whether your agent system handles failures gracefully or requires a human to hit retry.
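The difference between a function runner and an execution substrate shows up in one property: state survives the process. A minimal sketch of a resumable step loop, assuming a hypothetical `step_fn` that performs one LLM or tool call and a local JSON file standing in for durable storage:

```python
import json
from pathlib import Path

CHECKPOINT = Path("agent_state.json")  # stand-in for a durable state store

def run_agent(task, step_fn, max_steps=50):
    """Resumable agent loop: if the process dies at step 23, the next
    invocation picks up from step 23 instead of starting over."""
    if CHECKPOINT.exists():
        state = json.loads(CHECKPOINT.read_text())   # resume
    else:
        state = {"task": task, "step": 0, "history": [], "done": False}

    while not state["done"] and state["step"] < max_steps:
        state = step_fn(state)                       # one agent step
        state["step"] += 1
        CHECKPOINT.write_text(json.dumps(state))     # persist before continuing

    return state
```

A real runtime layers retries, idempotency keys, and a proper workflow store on top, but the checkpoint-after-every-step contract is the part that separates a substrate from a loop.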

Why Your Agent Should Write Code, Not JSON

· 10 min read
Tian Pan
Software Engineer

Most agent frameworks default to the same action model: the LLM emits a JSON blob, the host system parses it, calls a tool, returns the result. Repeat. It's clean, auditable, and almost universally used — which is exactly the problem. For anything beyond a single tool call, this architecture forces you to write scaffolding code that solves problems the agent could solve itself, if only it were allowed to write code.

There's a different approach: give the agent a Python interpreter and let it emit executable code as its action. One published benchmark shows a 20% higher task success rate over JSON tool-calling. An internal benchmark shows 30% fewer LLM round-trips on average. A framework built around this idea hit #1 on the GAIA leaderboard (44.2% on validation) shortly after release. The tradeoff is a more complex execution environment — but the engineering required is tractable, and the behavioral gains are real.
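The contrast is easiest to see side by side. In the sketch below the tool functions, the emitted snippet, and the executor are all hypothetical; real systems run the emitted code inside an isolated sandbox, which is omitted here.

```python
# Hypothetical tools, exposed to the agent as plain Python functions.
def search(query):
    return [f"result for {query!r}"]

def word_count(text):
    return len(text.split())

# JSON action model: one round-trip per call, chaining handled by scaffolding:
#   {"tool": "search", "args": {"query": "context rot"}}

# Code action model: the LLM emits a snippet that chains both calls itself.
llm_emitted_code = """
hits = search("context rot")
summary = f"{len(hits)} hits, first has {word_count(hits[0])} words"
"""

def execute_action(code, tools):
    """Run model-emitted code against the tool namespace (sandbox omitted!)."""
    namespace = dict(tools)
    exec(code, namespace)  # production systems isolate this in a sandbox
    return namespace

result = execute_action(llm_emitted_code,
                        {"search": search, "word_count": word_count})
```

What took two JSON round-trips plus host-side glue happens in a single model turn, with intermediate values living in ordinary variables.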

Why Your AI Agent Wastes Most of Its Context Window on Tools

· 10 min read
Tian Pan
Software Engineer

You connect your agent to 50 MCP tools. It can query databases, call APIs, read files, send emails, browse the web. On paper, it has everything it needs. In practice, half your production incidents trace back to tool use—wrong parameters, blown context budgets, cascading retry loops that cost ten times what you expected.

Here's the part most tutorials skip: every tool definition you load is a token tax paid upfront, before the agent processes a single user message. With 50+ tools connected, definitions alone can consume 70,000–130,000 tokens per request. That's not a corner case—it's the default state of any agent connected to multiple MCP servers.
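The arithmetic is worth doing explicitly. Using assumed round numbers (real definitions and prices vary widely, so treat these as placeholders):

```python
# Back-of-envelope token tax for loaded tool definitions.
TOKENS_PER_TOOL = 1_500   # assumed: name + description + JSON schema
NUM_TOOLS = 50
COST_PER_MTOK = 3.00      # assumed input price, $/million tokens

overhead_tokens = TOKENS_PER_TOOL * NUM_TOOLS               # per request
overhead_cost = overhead_tokens / 1_000_000 * COST_PER_MTOK  # per request

print(f"{overhead_tokens:,} tokens (${overhead_cost:.3f}) "
      f"spent before the first user message")
```

At 1,500 tokens per definition, 50 tools cost 75,000 tokens on every single request, paid whether or not any tool is actually called.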

Why Multi-Agent AI Architectures Keep Failing (and What to Build Instead)

· 8 min read
Tian Pan
Software Engineer

Most teams that build multi-agent systems hit the same wall: the thing works in demos and falls apart in production. Not because they implemented the coordination protocol wrong. Because the protocol itself is the problem.

Multi-agent AI has an intuitive appeal. Complex tasks should be broken into parallel workstreams. Specialized agents should handle specialized work. The orchestrator ties it together and the whole becomes greater than the sum of its parts. This intuition is wrong — or more precisely, it's premature. The practical failure rates of multi-agent systems in production range from 41% to 86.7% across studied execution traces. That's not a tuning problem. That's a structural one.

Building a Generative AI Platform: Architecture, Trade-offs, and the Components That Actually Matter

· 12 min read
Tian Pan
Software Engineer

Most teams treating their GenAI stack as a model integration project eventually discover they've actually built—or need to build—a platform. The model is the easy part. The hard part is everything around it: routing queries to the right model, retrieving context reliably, filtering unsafe outputs, caching redundant calls, tracing what went wrong in a chain of five LLM calls, and keeping costs from tripling month-over-month as usage scales.

This article is about that platform layer. Not the model weights, not the prompts—the surrounding infrastructure that separates a working proof of concept from something you'd trust to serve a million users.

12-Factor Agents: A Framework for Building AI Systems That Actually Ship

· 11 min read
Tian Pan
Software Engineer

The teams actually shipping reliable AI agents to production customers are mostly not using agent frameworks. They rolled their own.

That observation, surfaced from conversations with 100+ technical founders, is the uncomfortable starting point for the 12-Factor Agents framework — a manifesto for building LLM-powered software that reaches production instead of languishing at 80% quality forever. The framework is named deliberately after the original 12-Factor App methodology that shaped a generation of web services. The analogy holds: just as the 12-factor app gave teams a principled approach to building deployable web services, 12-factor agents provides the principles for building reliable, observable AI systems.

The 19,000-star GitHub repository documents what the best-performing production teams figured out independently. Here is what they know.

The Lethal Trifecta: Why Your AI Agent Is One Email Away from a Data Breach

· 9 min read
Tian Pan
Software Engineer

In June 2025, a researcher sent a carefully crafted email to a Microsoft 365 Copilot user. No link was clicked. No attachment opened. The email arrived, Copilot read it during a routine summarization task, and within seconds the AI began exfiltrating files from OneDrive, SharePoint, and Teams — silently transmitting contents to an attacker-controlled server by encoding data into image URLs it asked to "render." The victim never knew it happened.

This wasn't a novel zero-day in the traditional sense. There was no buffer overflow, no SQL injection. The vulnerability was architectural: the system combined three capabilities that, individually, seem like obvious product features. Together, they form what's now called the Lethal Trifecta.

LLM Observability in Production: Tracing What You Can't Predict

· 10 min read
Tian Pan
Software Engineer

Your monitoring stack tells you everything about request rates, CPU, and database latency. It tells you almost nothing about whether your LLM just hallucinated a refund policy, why a customer-facing agent looped through three tool calls to answer a simple question, or which feature in your product is quietly burning $800 a day in tokens.

Traditional observability was built around deterministic systems. LLMs are structurally different — same input, different output, every time. The failure mode isn't a 500 error or a timeout; it's a confident, plausible-sounding answer that happens to be wrong. The cost isn't steady and predictable; it spikes when a single misconfigured prompt hits a traffic wave. Debugging isn't "find the exception in the stack trace"; it's "reconstruct why the agent chose this tool path at 2 AM on Tuesday."

This is the problem LLM observability solves — and the discipline has matured significantly over the past 18 months.
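At its core, the mechanism is span-per-call tracing. A minimal sketch of the idea, with an in-memory list standing in for a real observability backend and a string slice standing in for an LLM call (both assumptions of this sketch):

```python
import time
import uuid

TRACE_LOG = []  # stand-in for an observability backend

def traced(span_name):
    """Decorator recording one span per LLM/tool call: id, name, latency, status."""
    def wrap(fn):
        def inner(*args, **kwargs):
            span = {"id": str(uuid.uuid4()), "name": span_name,
                    "start": time.time()}
            try:
                result = fn(*args, **kwargs)
                span["status"] = "ok"
                return result
            except Exception as exc:
                span["status"] = f"error: {exc}"
                raise
            finally:
                span["latency_s"] = time.time() - span["start"]
                TRACE_LOG.append(span)   # emitted even when the call fails
        return inner
    return wrap

@traced("summarize")
def summarize(text):
    return text[:20]  # stand-in for a real LLM call
```

Spans for prompt text, token counts, and tool choices hang off the same structure, which is what lets you reconstruct that 2 AM tool path after the fact.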

The Hidden Costs of Context: Managing Token Budgets in Production LLM Systems

· 9 min read
Tian Pan
Software Engineer

Most teams shipping LLM applications for the first time make the same mistake: they treat context windows as free storage. The model supports 128K tokens? Great, pack it full. The model supports 1M tokens? Even better — dump everything in. What follows is a billing shock that arrives about three weeks before the product actually works well.

Context is not free. It's not even cheap. And beyond cost, blindly filling a context window actively makes your model worse. A focused 300-token context frequently outperforms an unfocused 113,000-token context. This is not an edge case — it's a documented failure mode with a name: "lost in the middle." Managing context well is one of the highest-leverage engineering decisions you'll make on an LLM product.
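Treating the budget as a hard constraint usually means some form of greedy packing. A minimal sketch, assuming upstream relevance scores and approximating token counts with whitespace words (a real system would use the model's tokenizer):

```python
def fit_to_budget(chunks, budget_tokens,
                  count_tokens=lambda s: len(s.split())):  # crude word count
    """Greedy context packing: admit highest-scored chunks until the budget
    is exhausted. chunks is a list of (relevance_score, text) pairs."""
    selected, used = [], 0
    for score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = count_tokens(text)
        if used + cost <= budget_tokens:
            selected.append(text)
            used += cost
    return selected, used
```

The deliberate choice here is dropping low-relevance material entirely rather than squeezing everything in: a small, focused selection sidesteps the lost-in-the-middle degradation that a full window invites.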