6 posts tagged with "evaluation"

In Defense of AI Evals, for Everyone

· 7 min read
Tian Pan
Software Engineer

Every few months, a new wave of "don't bother with evals" takes hold in the AI engineering community. The argument usually goes: evals are too expensive, too brittle, too hard to define, and ultimately not worth the overhead for a fast-moving product team. Ship, iterate, and trust your instincts.

This is bad advice that produces bad software. A 2026 LangChain survey found that only 52% of organizations run offline evaluations and just 37% run online evals against live traffic — yet 32% cite quality as their number one barrier to production deployment. That is not a coincidence.

The Unglamorous Work Behind Rapidly Improving AI Products

· 9 min read
Tian Pan
Software Engineer

Most AI teams hit the same wall six weeks after launch. Initial demos were impressive, the prototype shipped on time, and early users said nice things. Then the gap between "good enough to show" and "good enough to keep" becomes unavoidable. The team scrambles — tweaking prompts, swapping models, adding guardrails — and the product barely moves.

The teams that actually improve quickly share one counterintuitive habit: they spend less time on architecture and more time staring at data. Not dashboards. Not aggregate metrics. The raw, ugly, individual failures that live inside conversation logs.

This is a field guide to the practices that separate fast-moving AI teams from ones that stay stuck.

LLM-as-a-Judge: A Practical Guide to Building Evaluators That Actually Work

· 9 min read
Tian Pan
Software Engineer

Most AI teams are measuring the wrong things, in the wrong way, with the wrong people involved. The typical evaluation setup looks like this: a 1-to-5 Likert scale, a handful of examples, and a junior engineer running the numbers. Then someone builds an LLM judge to automate it—and wonders why the whole thing feels broken six months later.

LLM-as-a-judge is a powerful pattern when done right. But "done right" is doing a lot of work in that sentence. This post is a concrete guide to building evaluators that correlate with real quality, catch real regressions, and survive contact with production.

Hard-Won Lessons from Shipping LLM Systems to Production

· 7 min read
Tian Pan
Software Engineer

Most engineers building with LLMs share a common arc: a working demo in two days, production chaos six weeks later. The technology behaves differently under real load, with real users, against real data. The lessons that emerge aren't philosophical—they're operational.

After watching teams across companies ship (and sometimes abandon) LLM-powered products, a handful of patterns appear again and again. These aren't edge cases. They're the default experience.

Building LLM Applications for Production: What Actually Breaks

· 9 min read
Tian Pan
Software Engineer

Most LLM demos work. Most LLM applications in production don't, at least not reliably. The gap between a compelling prototype and something that survives real user traffic is wider than in any other software category I've worked with, and the failures are rarely where you expect them.

This is a guide to the parts that break: cost, consistency, composition, and evaluation. Not theory—the concrete problems that cause teams to quietly shelve projects three months after their first successful demo.

Common Pitfalls When Building Generative AI Applications

· 10 min read
Tian Pan
Software Engineer

Most generative AI projects fail — not because the models are bad, but because teams make the same predictable mistakes at every layer of the stack. A 2025 industry analysis found that 42% of companies abandoned most of their AI initiatives, and 95% of generative AI pilots yielded no measurable business impact. These aren't model failures. They're engineering and product failures that teams could have avoided.

This post catalogs the pitfalls that kill AI projects most reliably — from problem selection through evaluation — with specific examples from production systems.