
269 posts tagged with "ai-engineering"


The Eval-to-Production Gap: Why 92% on Your Test Suite Means 40% User Satisfaction

· 10 min read
Tian Pan
Software Engineer

Your team spent three weeks building a rigorous eval suite. It covers edge cases. It includes adversarial examples. The LLM-as-judge scores 92% across all dimensions. You ship.

Then the support tickets start. Users say the AI "doesn't understand what they're asking." Session abandonment is up 30%. Satisfaction scores come back at 41%.

This gap — between eval performance and real-world outcomes — is the most common failure mode in production AI systems today. It's not a model problem. It's a measurement problem.

Temporal Reasoning Failures in Production AI Systems

· 10 min read
Tian Pan
Software Engineer

An agent that confidently recommends products that have been out of stock for six months. A customer service bot that tells a user there's no record of the order they placed 20 minutes ago. A coding assistant that generates working code against a library API deprecated two years ago. These aren't hallucinations in the traditional sense — the model is recalling something that was once accurate. That's a different failure mode entirely, and most teams aren't equipped to detect or defend against it.

The distinction matters because the mitigations are fundamentally different. You can't prompt-engineer your way out of staleness. You can't fine-tune your way out of it either: fine-tuning on stale knowledge makes the problem worse, not better, because the model expresses outdated information with greater authority. And as models become more fluent and confident in their delivery, their stale, confidently wrong answers become harder, not easier, for users to catch.

Test-Driven Development for LLM Applications: Where the Analogy Holds and Where It Breaks

· 10 min read
Tian Pan
Software Engineer

A team built an AI research assistant using Claude. They iterated on the prompt for three weeks, demo'd it to stakeholders, and launched it feeling confident. Two months later they discovered that the assistant had been silently hallucinating citations across roughly 30% of outputs — a failure mode no one had tested for because the eval suite was built after the prompt had already "felt right" in demos.

This pattern is the rule, not the exception. The LLM development industry has largely adopted test-driven development vocabulary — evals, regression suites, golden datasets, LLM-as-judge — while ignoring the most important rule TDD establishes: write the test before the implementation, not after.

Here is how to do that correctly, and the three places where the TDD analogy breaks down so badly that following it literally will make your system worse.

The Anatomy of an Agent Harness

· 8 min read
Tian Pan
Software Engineer

There's a 100-line Python agent that scores 74–76% on SWE-bench Verified — only 4–6 percentage points behind state-of-the-art systems built by well-funded teams. The execution loop itself isn't where the complexity lives. World-class teams invest six to twelve months building the infrastructure around that loop. That infrastructure has a name: the harness.

The formula is simple: Agent = Model + Harness. The model handles reasoning. The harness handles everything else — tool execution, context management, safety enforcement, error recovery, state persistence, and human-in-the-loop workflows. If you've been spending months optimizing prompts and model selection while shipping brittle agents, you've been optimizing the wrong thing.

LLM Evals: What Actually Works and What Wastes Your Time

· 10 min read
Tian Pan
Software Engineer

Most teams building LLM applications fall into one of two failure modes. The first is building no evals at all and shipping features on vibes. The second is building elaborate evaluation infrastructure before they understand what they're actually trying to measure. Both are expensive mistakes.

The teams that do evals well share a common approach: they start by looking at data, not by building systems. Error analysis comes before evaluation automation. Human judgment grounds the metrics before any automated judge is trusted. And they treat evaluation not as a milestone to cross but as a continuous discipline that evolves alongside the product.

This is what evals actually look like in practice — the decisions that matter, the patterns that waste effort, and the tradeoffs that aren't obvious until you've been burned.

Why Your LLM Evaluators Are Miscalibrated — and the Data-First Fix

· 9 min read
Tian Pan
Software Engineer

Most teams build their LLM evaluators in the wrong order. They write criteria, then look at data. That inversion is the root cause of miscalibrated evals, and it's almost universal in teams shipping their first AI product. The criteria sound reasonable on paper — "the response should be accurate, helpful, and concise" — but when you apply them to real model outputs, you discover the rubric doesn't match what you actually care about. You end up with an evaluator that grades things you're not measuring and misses failures that matter.

The fix isn't a better rubric. It's a different workflow: look at the data first, define criteria second, and then validate your evaluator against human judgment before trusting it to run unsupervised.
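One way to make the "validate against human judgment" step concrete is to measure chance-corrected agreement between the automated judge and a small set of human labels. The sketch below is illustrative rather than the post's own method: it assumes binary or categorical labels, and the function name `cohen_kappa` and the sample labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(human_labels, judge_labels):
    """Cohen's kappa between human labels and an automated judge's labels.

    1.0 means perfect agreement; 0.0 means agreement no better than chance.
    """
    if len(human_labels) != len(judge_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and the same length")

    n = len(human_labels)
    # Fraction of items where the judge and the human gave the same label.
    observed = sum(h == j for h, j in zip(human_labels, judge_labels)) / n

    # Agreement expected by chance, from each rater's label frequencies.
    human_counts = Counter(human_labels)
    judge_counts = Counter(judge_labels)
    expected = sum(
        human_counts[label] * judge_counts[label] for label in human_counts
    ) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels for four model outputs.
human = ["pass", "pass", "fail", "fail"]
judge = ["pass", "fail", "fail", "fail"]
kappa = cohen_kappa(human, judge)
```

Raw agreement alone can look high when most outputs pass, which is why a chance-corrected statistic is a reasonable choice here. What threshold counts as "trustworthy" is a judgment call for each team.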

Eval Engineering for Production LLM Systems

· 11 min read
Tian Pan
Software Engineer

Most teams building LLM systems start with the wrong question. They ask "how do I evaluate this?" before understanding what actually breaks. Then they spend weeks building eval infrastructure that measures the wrong things, achieve 90%+ pass rates immediately, and ship products that users hate. The evaluations weren't wrong—they just weren't measuring failure.

Effective eval engineering isn't primarily about infrastructure. It's about developing a precise, shared understanding of what "good" means for your specific system. The infrastructure is almost incidental. In mature LLM teams, 60–80% of development time goes toward error analysis and evaluation—not feature work. That ratio surprises most engineers until they've shipped a broken model to production and spent a week debugging what went wrong.

Designing an Agent Runtime from First Principles

· 10 min read
Tian Pan
Software Engineer

Most agent frameworks make a critical mistake early: they treat the agent as a function. You call it, it loops, it returns. That mental model works for demos. It falls apart the moment a real-world task runs for 45 minutes, hits a rate limit at step 23, and you have nothing to resume from.

A production agent runtime is not a function runner. It is an execution substrate — something closer to a process scheduler or a distributed workflow engine than a Python function. Getting this distinction right from the beginning determines whether your agent system handles failures gracefully or requires a human to hit retry.
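The difference between a function call and a resumable execution can be sketched in a few lines. The example below is a minimal illustration, not a real runtime: it assumes steps are plain Python callables, and `run_with_checkpoints` and the JSON checkpoint file are hypothetical names chosen for this sketch.

```python
import json
import os
from pathlib import Path


def run_with_checkpoints(steps, checkpoint_path):
    """Run steps in order, persisting progress after each one.

    If a step raises (for example on a rate limit), calling this function
    again with the same checkpoint file resumes at the failed step instead
    of starting over.
    """
    path = Path(checkpoint_path)
    state = {"next_step": 0, "results": []}
    if path.exists():
        state = json.loads(path.read_text())

    for index in range(state["next_step"], len(steps)):
        # Each step receives the results so far; it may raise.
        result = steps[index](state["results"])
        state["results"].append(result)
        state["next_step"] = index + 1

        # Write to a temp file, then rename, so a crash mid-write
        # does not leave a corrupted checkpoint behind.
        tmp_path = path.with_suffix(".tmp")
        tmp_path.write_text(json.dumps(state))
        os.replace(tmp_path, path)

    return state["results"]
```

A production runtime would also need to handle non-serializable state, concurrent runs, and steps with side effects that must not be repeated. This sketch only shows the resume-from-last-completed-step idea.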

Why Your Agent Should Write Code, Not JSON

· 10 min read
Tian Pan
Software Engineer

Most agent frameworks default to the same action model: the LLM emits a JSON blob, the host system parses it, calls a tool, returns the result. Repeat. It's clean, auditable, and almost universally used — which is exactly the problem. For anything beyond a single tool call, this architecture forces you to write scaffolding code that solves problems the agent could solve itself, if only it were allowed to write code.

There's a different approach: give the agent a Python interpreter and let it emit executable code as its action. One published benchmark shows a 20% higher task success rate over JSON tool-calling. An internal benchmark shows 30% fewer LLM round-trips on average. A framework built around this idea hit #1 on the GAIA leaderboard (44.2% on validation) shortly after release. The tradeoff is a more complex execution environment — but the engineering required is tractable, and the behavioral gains are real.
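To illustrate the idea, here is a small sketch of a host executing a single code action that would otherwise take several JSON tool-call round-trips. Everything in it is an assumption made for illustration: `run_code_action`, the `get_price` tool, and the convention that the action stores its answer in a variable named `result`. Restricting builtins as shown is not a security sandbox; a real deployment would need proper process or container isolation.

```python
def run_code_action(code, tools):
    """Execute one agent-emitted code action and return its `result` variable.

    WARNING: exec() with a trimmed builtins dict is NOT a security sandbox.
    It only keeps this illustration small.
    """
    allowed_builtins = {
        "len": len, "max": max, "min": min,
        "sum": sum, "sorted": sorted, "range": range,
    }
    # Tools are exposed to the action as ordinary global functions.
    namespace = {"__builtins__": allowed_builtins, **tools}
    exec(code, namespace)
    return namespace.get("result")


def get_price(symbol):
    # Hypothetical tool: looks up a price in a fixed table.
    return {"A": 10.0, "B": 12.5, "C": 9.0}[symbol]


# One code action replaces three separate tool calls plus a comparison step.
action = """
prices = {s: get_price(s) for s in ["A", "B", "C"]}
cheapest = min(prices, key=prices.get)
result = (cheapest, prices[cheapest])
"""

answer = run_code_action(action, {"get_price": get_price})
```

With JSON tool-calling, the same task would need one model turn per `get_price` call and another to compare the results. Here the loop and the comparison both happen inside a single action.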

Building a Generative AI Platform: Architecture, Trade-offs, and the Components That Actually Matter

· 12 min read
Tian Pan
Software Engineer

Most teams treating their GenAI stack as a model integration project eventually discover they've actually built—or need to build—a platform. The model is the easy part. The hard part is everything around it: routing queries to the right model, retrieving context reliably, filtering unsafe outputs, caching redundant calls, tracing what went wrong in a chain of five LLM calls, and keeping costs from tripling month-over-month as usage scales.

This article is about that platform layer. Not the model weights, not the prompts—the surrounding infrastructure that separates a working proof of concept from something you'd trust to serve a million users.

Prompt Engineering Deep Dive: From Basics to Advanced Techniques

· 10 min read
Tian Pan
Software Engineer

Most engineers treat prompts as magic words — tweak a phrase, hope it works, move on. That works fine for demos. In production, it produces a system where nobody knows why the model behaves differently on Tuesday than on Monday, and where a routine model update silently breaks three features. Prompt engineering done right is a discipline, not a ritual. This post covers the full stack: when to use each technique, what the benchmarks actually show, and where the traps are.

What AI Benchmarks Actually Measure (And Why You Shouldn't Trust the Leaderboard)

· 10 min read
Tian Pan
Software Engineer

When GPT-4o, Claude 3.5 Sonnet, and Llama 3.1 405B all score 88–93% on MMLU, what does that number actually tell you about which model to deploy? The uncomfortable answer: almost nothing. The benchmark that once separated capable models from mediocre ones has saturated. Every frontier model aces it, yet they behave very differently in production. The gap between benchmark performance and real-world utility has never been wider, and understanding why is now essential for any engineer building on top of LLMs.

Benchmarks feel rigorous because they produce numbers. A number looks like measurement, and measurement looks like truth. But the legitimacy of a benchmark score depends entirely on the validity of what it's measuring—and that validity breaks down in ways that are rarely surfaced on leaderboards.