Skip to main content

340 posts tagged with "llm"

View all tags

The Agent Evaluation Readiness Checklist

· 9 min read
Tian Pan
Software Engineer

Most teams building AI agents make the same mistake: they start with the evaluation infrastructure before they understand what failure looks like. They instrument dashboards, choose metrics, wire up graders — and then discover their evals are measuring the wrong things entirely. Six weeks in, they have a green scorecard and a broken agent.

The fix is not more tooling. It is a specific sequence of steps that grounds your evaluation in reality before you automate anything. Here is that sequence.

The Anatomy of an Agent Harness

· 9 min read
Tian Pan
Software Engineer

Most engineers building AI agents spend 80% of their time thinking about which model to use and 20% thinking about everything else. That ratio should be flipped. The model is almost interchangeable at this point — the harness is what determines whether your agent actually works in production.

The equation is simple: Agent = Model + Harness. If you're not the model, you're the harness. And the harness is where nearly all the real engineering lives.

Routines and Handoffs: The Two Primitives Behind Every Reliable Multi-Agent System

· 8 min read
Tian Pan
Software Engineer

Most multi-agent systems fail not because the models are wrong, but because the plumbing is leaky. Agents drop context mid-task, hand off to the wrong specialist, or loop indefinitely when they don't know how to exit. The underlying cause is almost always the same: the system was designed around what each agent can do, without clearly defining how work moves between them.

Two primitives fix most of this: routines and handoffs. They're deceptively simple, but getting them right is the difference between a demo that works and a system you can ship.

Building Effective AI Agents: Patterns That Actually Work in Production

· 9 min read
Tian Pan
Software Engineer

Most AI agent projects fail not because the models aren't capable enough — but because the engineers building them reach for complexity before they've earned it. After studying dozens of production deployments, a clear pattern emerges: the teams shipping reliable agents start with the simplest possible system and add complexity only when metrics demand it.

This is a guide to the mental models, patterns, and practical techniques that separate robust agentic systems from ones that hallucinate, loop, and fall apart under real workloads.