Skip to main content

56 posts tagged with "agents"

View all tags

Why Your Agent Should Write Code, Not JSON

· 10 min read
Tian Pan
Software Engineer

Most agent frameworks default to the same action model: the LLM emits a JSON blob, the host system parses it, calls a tool, returns the result. Repeat. It's clean, auditable, and almost universally used — which is exactly the problem. For anything beyond a single tool call, this architecture forces you to write scaffolding code that solves problems the agent could solve itself, if only it were allowed to write code.

There's a different approach: give the agent a Python interpreter and let it emit executable code as its action. One published benchmark shows a 20% higher task success rate over JSON tool-calling. An internal benchmark shows 30% fewer LLM round-trips on average. A framework built around this idea hit #1 on the GAIA leaderboard (44.2% on validation) shortly after release. The tradeoff is a more complex execution environment — but the engineering required is tractable, and the behavioral gains are real.

LLM Observability in Production: Tracing What You Can't Predict

· 10 min read
Tian Pan
Software Engineer

Your monitoring stack tells you everything about request rates, CPU, and database latency. It tells you almost nothing about whether your LLM just hallucinated a refund policy, why a customer-facing agent looped through three tool calls to answer a simple question, or which feature in your product is quietly burning $800 a day in tokens.

Traditional observability was built around deterministic systems. LLMs are structurally different — same input, different output, every time. The failure mode isn't a 500 error or a timeout; it's a confident, plausible-sounding answer that happens to be wrong. The cost isn't steady and predictable; it spikes when a single misconfigured prompt hits a traffic wave. Debugging isn't "find the exception in the stack trace"; it's "reconstruct why the agent chose this tool path at 2 AM on Tuesday."

This is the problem LLM observability solves — and the discipline has matured significantly over the past 18 months.

Tool Calling in Production: The Loop, the Pitfalls, and What Actually Works

· 9 min read
Tian Pan
Software Engineer

The first time your agent silently retries the same broken tool call three times before giving up, you realize that "just add tools" is not a production strategy. Tool calling unlocks genuine capabilities — external data, side effects, guaranteed-shape outputs — but the agentic loop that makes it work has sharp edges that don't show up in demos.

This post is about those edges: how the loop actually runs, the formatting rules that quietly destroy parallel execution, how to write tool descriptions that make the model choose correctly, and how to handle errors in a way that lets the model recover instead of spiral.

AI Agent Architecture: What Actually Works in Production

· 11 min read
Tian Pan
Software Engineer

One company shipped 7,949 AI agents. Fifteen percent of them worked. The rest failed silently, looped endlessly, or contradicted themselves mid-task. This is not a fringe result — enterprise analyses consistently find that 88% of AI agent projects never reach production, and 95% of generative AI pilots fail or severely underperform. The gap between a compelling demo and a reliable system is not a model problem. It is an architecture problem.

The engineers who are shipping agents that actually work have converged on a set of structural decisions that look nothing like the toy examples in framework tutorials. This post is about those decisions: where the layers are, where failures concentrate, and why the hardest problems are not about prompts.

Tool Use in Production: Function Calling Patterns That Actually Work

· 9 min read
Tian Pan
Software Engineer

The most surprising thing about LLM function calling failures in production is where they come from. Not hallucinated reasoning. Not the model picking the wrong tool. The number one cause of agent flakiness is argument construction: wrong types, missing required fields, malformed JSON, hallucinated extra fields. The model is fine. Your schema is the problem.

This is good news, because schemas are cheap to fix.

LLM-as-a-Judge: A Practical Guide to Building Evaluators That Actually Work

· 9 min read
Tian Pan
Software Engineer

Most AI teams are measuring the wrong things, in the wrong way, with the wrong people involved. The typical evaluation setup looks like this: a 1-to-5 Likert scale, a handful of examples, and a junior engineer running the numbers. Then someone builds an LLM judge to automate it—and wonders why the whole thing feels broken six months later.

LLM-as-a-judge is a powerful pattern when done right. But "done right" is doing a lot of work in that sentence. This post is a concrete guide to building evaluators that correlate with real quality, catch real regressions, and survive contact with production.

Common Pitfalls When Building Generative AI Applications

· 10 min read
Tian Pan
Software Engineer

Most generative AI projects fail — not because the models are bad, but because teams make the same predictable mistakes at every layer of the stack. A 2025 industry analysis found that 42% of companies abandoned most of their AI initiatives, and 95% of generative AI pilots yielded no measurable business impact. These aren't model failures. They're engineering and product failures that teams could have avoided.

This post catalogs the pitfalls that kill AI projects most reliably — from problem selection through evaluation — with specific examples from production systems.

Building Effective AI Agents: Patterns That Actually Work in Production

· 9 min read
Tian Pan
Software Engineer

Most AI agent projects fail not because the models aren't capable enough — but because the engineers building them reach for complexity before they've earned it. After studying dozens of production deployments, a clear pattern emerges: the teams shipping reliable agents start with the simplest possible system and add complexity only when metrics demand it.

This is a guide to the mental models, patterns, and practical techniques that separate robust agentic systems from ones that hallucinate, loop, and fall apart under real workloads.