32 posts tagged with "llm"

Why Multi-Agent LLM Systems Fail (and How to Build Ones That Don't)

· 8 min read
Tian Pan
Software Engineer

Most multi-agent LLM systems deployed in production fail within weeks — not from infrastructure outages or model regressions, but from coordination problems that were baked in from the start. A comprehensive analysis of 1,642 execution traces across seven open-source frameworks found failure rates ranging from 41% to 86.7% on standard benchmarks. That's not a model quality problem. That's a systems engineering problem.

The uncomfortable finding: roughly 79% of those failures trace back to specification and coordination issues, not compute limits or model capability. You can swap in a better model and still watch your multi-agent pipeline collapse in the exact same way. Understanding why requires looking at the failure taxonomy carefully.

Tool Use in Production: Function Calling Patterns That Actually Work

· 9 min read
Tian Pan
Software Engineer

The most surprising thing about LLM function calling failures in production is where they come from. Not hallucinated reasoning. Not the model picking the wrong tool. The number one cause of agent flakiness is argument construction: wrong types, missing required fields, malformed JSON, hallucinated extra fields. The model is fine. Your schema is the problem.

This is good news, because schemas are cheap to fix.
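One cheap fix is to validate model-produced arguments against the declared schema before the tool ever runs. The sketch below is illustrative, not from the post: the tool name, fields, and validator are hypothetical stand-ins, but they show the three failure classes the excerpt names (missing required fields, wrong types, hallucinated extras) being caught as correctable errors rather than runtime crashes.

```python
import json

# Hypothetical tool schema: required and optional fields with expected types.
WEATHER_TOOL = {
    "name": "get_weather",
    "required": {"city": str, "units": str},
    "optional": {"days": int},
}

def validate_args(schema, raw_json):
    """Return (ok, errors) for a model-produced argument string."""
    try:
        args = json.loads(raw_json)
    except json.JSONDecodeError as e:
        return False, [f"malformed JSON: {e}"]
    errors = []
    allowed = {**schema["required"], **schema["optional"]}
    for field, typ in schema["required"].items():
        if field not in args:
            errors.append(f"missing required field: {field}")
        elif not isinstance(args[field], typ):
            errors.append(f"wrong type for {field}: expected {typ.__name__}")
    for field in args:
        if field not in allowed:
            errors.append(f"hallucinated field: {field}")  # reject extras
    return not errors, errors

ok, errs = validate_args(WEATHER_TOOL, '{"city": "Oslo", "extra": 1}')
```

Feeding the error list back to the model as a retry prompt usually recovers the call; the validation itself costs nothing.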

Your AI Product Needs Evals

· 8 min read
Tian Pan
Software Engineer

Every AI product demo looks great. The model generates something plausible, the stakeholders nod along, and everyone leaves the meeting feeling optimistic. Then the product ships, real users appear, and things start going sideways in ways nobody anticipated. The team scrambles to fix one failure mode, inadvertently creates another, and after weeks of whack-a-mole, the prompt has grown into a 2,000-token monster that nobody fully understands anymore.

The root cause is almost always the same: no evaluation system. Teams that ship reliable AI products build evals early and treat them as infrastructure, not an afterthought. Teams that stall treat evaluation as something to worry about "once the product is more mature." By then, they're already stuck.
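"Evals as infrastructure" can start very small. This is a minimal harness sketch under assumed names: `generate` is a stub standing in for a real model call, and the cases are invented. The point is the shape, where every failure mode you fix becomes a permanent regression case instead of tribal knowledge buried in a growing prompt.

```python
def generate(prompt: str) -> str:
    # Stub standing in for an LLM call.
    return "Items may be returned within 30 days of purchase."

EVAL_CASES = [
    # (name, prompt, check) -- checks are plain predicates on the output.
    ("mentions window", "What is the refund window?",
     lambda out: "30 days" in out),
    ("no hedging", "What is the refund window?",
     lambda out: "I'm not sure" not in out),
]

def run_evals():
    """Run every case; return per-case results and the overall pass rate."""
    results = {name: check(generate(prompt))
               for name, prompt, check in EVAL_CASES}
    pass_rate = sum(results.values()) / len(results)
    return results, pass_rate
```

Run it in CI like any other test suite; a dropping pass rate is the early warning the demo never gives you.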

The Unglamorous Work Behind Rapidly Improving AI Products

· 9 min read
Tian Pan
Software Engineer

Most AI teams hit the same wall six weeks after launch. Initial demos were impressive, the prototype shipped on time, and early users said nice things. Then the gap between "good enough to show" and "good enough to keep" becomes unavoidable. The team scrambles — tweaking prompts, swapping models, adding guardrails — and the product barely moves.

The teams that actually improve quickly share one counterintuitive habit: they spend less time on architecture and more time staring at data. Not dashboards. Not aggregate metrics. The raw, ugly, individual failures that live inside conversation logs.

This is a field guide to the practices that separate fast-moving AI teams from ones that stay stuck.

Cloud Agents Are Rewriting How Software Gets Built

· 7 min read
Tian Pan
Software Engineer

The first time an AI coding agent broke a team's CI pipeline—not by writing bad code, but by generating pull requests faster than GitHub Actions could process them—it became clear something fundamental had shifted. We were no longer talking about a smarter autocomplete. We were talking about a different model of software production entirely.

The arc of AI-assisted coding has moved quickly. Autocomplete tools changed how individuals typed. Local agents changed what a single session could accomplish. Cloud agents are now changing how teams build software—parallelizing work across multiple asynchronous threads, running tests before handing off PRs, and increasingly handling 3-hour tasks while developers sleep or move on to other problems.

LLM-as-a-Judge: A Practical Guide to Building Evaluators That Actually Work

· 9 min read
Tian Pan
Software Engineer

Most AI teams are measuring the wrong things, in the wrong way, with the wrong people involved. The typical evaluation setup looks like this: a 1-to-5 Likert scale, a handful of examples, and a junior engineer running the numbers. Then someone builds an LLM judge to automate it—and wonders why the whole thing feels broken six months later.

LLM-as-a-judge is a powerful pattern when done right. But "done right" is doing a lot of work in that sentence. This post is a concrete guide to building evaluators that correlate with real quality, catch real regressions, and survive contact with production.
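One concrete alternative to the fuzzy 1-to-5 scale is a single-criterion, pass/fail judge with strict verdict parsing. The sketch below is an assumption-laden illustration, not the post's implementation: `call_model` is a stub, and the prompt wording is invented. What it shows is the pattern of one criterion per judge and a parser that refuses anything but an explicit verdict.

```python
JUDGE_PROMPT = """You are grading one criterion only.
Criterion: the answer cites the source document.
Answer to grade:
{answer}
Reply with exactly PASS or FAIL on the last line."""

def call_model(prompt: str) -> str:
    # Stub standing in for a real model call.
    return "The answer quotes section 2 of the source.\nPASS"

def judge(answer: str) -> bool:
    """Return True iff the judge emits an explicit PASS."""
    reply = call_model(JUDGE_PROMPT.format(answer=answer))
    verdict = reply.strip().splitlines()[-1].strip()
    if verdict not in ("PASS", "FAIL"):
        # Never coerce an ambiguous reply into a grade.
        raise ValueError(f"unparseable verdict: {verdict!r}")
    return verdict == "PASS"
```

Binary verdicts are easy to audit against human labels, which is how you find out whether the judge correlates with real quality before you trust it in CI.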

A Year of Building with LLMs: What the Field Has Actually Learned

· 9 min read
Tian Pan
Software Engineer

Most teams building with LLMs today are repeating mistakes that others made a year ago. The most expensive one is mistaking the model for the product.

After a year of LLM-powered systems shipping into production — codegen tools, document processors, customer-facing assistants, internal knowledge systems — practitioners have accumulated a body of hard-won knowledge that's very different from what the hype cycle suggests. The lessons aren't about which foundation model to choose or whether RAG beats finetuning. They're about the unglamorous work of building reliable systems: how to evaluate output, how to structure workflows, when to invest in infrastructure versus when to keep iterating on prompts, and how to think about differentiation.

This is a synthesis of what that field experience actually shows.

How AI Agents Actually Work: Architecture, Planning, and Failure Modes

· 10 min read
Tian Pan
Software Engineer

Most agent failures are architecture failures. The model gets blamed when a task goes sideways, but nine times out of ten, the real problem is that nobody thought hard enough about how planning, tool use, and reflection should fit together. You can swap in a better model and still get the same crashes — because the scaffolding around the model was never designed to handle what the model was being asked to do.

This post is a practical guide to how agents actually work under the hood: what the core components are, where plans go wrong, how reflection loops help (and when they hurt), and what multi-agent systems look like when you're building them for production rather than demos.
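The reflection loop mentioned above can be sketched in a few lines. Everything here is a hypothetical stand-in (`generate` and `critique` are stubs for model calls); the structural point is the iteration cap, since an unbounded generate-critique cycle is one of the ways reflection hurts rather than helps.

```python
def generate(task, feedback=None):
    # Stub for a model call; a real draft would incorporate the feedback.
    return f"draft for {task}" + (" (revised)" if feedback else "")

def critique(output):
    # Stub critic: returns None when acceptable, else a feedback string.
    return None if "(revised)" in output else "needs more detail"

def reflect(task, max_rounds=3):
    """Generate, critique, and revise -- with a hard cap on rounds."""
    output = generate(task)
    for _ in range(max_rounds):
        feedback = critique(output)
        if feedback is None:
            return output
        output = generate(task, feedback)
    return output  # return best effort even if the critic never passes it
```

The cap matters more than the critic: without it, a picky critique function can burn tokens indefinitely on a task the model cannot actually improve.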

Hard-Won Lessons from Shipping LLM Systems to Production

· 7 min read
Tian Pan
Software Engineer

Most engineers building with LLMs share a common arc: a working demo in two days, production chaos six weeks later. The technology behaves differently under real load, with real users, against real data. The lessons that emerge aren't philosophical—they're operational.

After watching teams across companies ship (and sometimes abandon) LLM-powered products, a handful of patterns appear again and again. These aren't edge cases. They're the default experience.

Building LLM Applications for Production: What Actually Breaks

· 9 min read
Tian Pan
Software Engineer

Most LLM demos work. Most LLM applications in production don't—at least not reliably. The gap between a compelling prototype and something that survives real user traffic is wider than any other software category I've worked with, and the failures are rarely where you expect them.

This is a guide to the parts that break: cost, consistency, composition, and evaluation. Not theory—the concrete problems that cause teams to quietly shelve projects three months after their first successful demo.

LLM-Powered Autonomous Agents: The Architecture Behind Real Autonomy

· 8 min read
Tian Pan
Software Engineer

Most teams that claim to have "agents in production" don't. Surveys consistently show that around 57% of engineering organizations have deployed AI agents — but when you apply rigorous criteria (the LLM must plan, act, observe feedback, and adapt based on results), only 16% of enterprise deployments and 27% of startup deployments qualify as true agents. The rest are glorified chatbots with tool calls bolted on.

This gap isn't about model capability. It's about architecture. Genuine autonomous agents require three interlocking subsystems working in concert: planning, memory, and tool use. Most implementations get one right, partially implement a second, and ignore the third. The result is a system that works beautifully in demos and fails unpredictably in production.
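The rigorous criterion above (plan, act, observe, adapt) is worth seeing as code. This is a deliberately tiny sketch with invented names: `plan` is a stub standing in for a model call and `TOOLS` is a toy registry. What qualifies it as an agent loop rather than a chatbot with tool calls is that each observation feeds back into the next planning step, under a hard step budget.

```python
# Toy tool registry; real tools would hit APIs, files, or a sandbox.
TOOLS = {
    "search": lambda q: f"results for {q}",
    "done": lambda ans: ans,
}

def plan(goal, observations):
    # Stub planner standing in for a model call: search once, then finish.
    if not observations:
        return ("search", goal)
    return ("done", observations[-1])

def run_agent(goal, max_steps=5):
    """Plan -> act -> observe -> adapt, bounded by max_steps."""
    observations = []
    for _ in range(max_steps):
        tool, arg = plan(goal, observations)   # plan using feedback so far
        result = TOOLS[tool](arg)              # act
        observations.append(result)            # observe
        if tool == "done":
            return result
    return None  # step budget exhausted
```

Most "agents in production" skip the feedback edge of this loop entirely: the model emits one tool call and never sees the result, which is exactly the glorified-chatbot pattern the excerpt describes.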

Seven Patterns for Building LLM Systems That Actually Work in Production

· 10 min read
Tian Pan
Software Engineer

The demo always works. Prompt the model with a curated example, get a clean output, ship the screenshot to the stakeholder deck. Six weeks later, the system is in front of real users, and none of the demo examples appear in production traffic.

This is the gap every LLM product team eventually crosses: the jump from "it works on my inputs" to "it works on inputs I didn't anticipate." The patterns that close that gap aren't about model selection or prompt cleverness — they're about system design. Seven patterns account for most of what separates functional prototypes from reliable production systems.