30 posts tagged with "testing"

The Composition Testing Gap: Why Your Agents Pass Every Test but Fail Together

· 9 min read
Tian Pan
Software Engineer

Your planner agent passes its eval suite at 94%. Your researcher agent scores even higher. Your synthesizer agent nails every benchmark you throw at it. You compose them into a pipeline, deploy to production, and watch it produce confidently wrong answers that no individual agent would ever generate on its own.

This is the composition testing gap — the systematic blind spot where individually validated agents fail in ways that no single-agent analysis can predict. Research on multi-agent LLM systems shows that 67% of production failures stem from inter-agent interactions rather than individual agent defects. You're testing the atoms but shipping the molecule, and molecular behavior is not the sum of atomic properties.
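The gap can be made concrete with a toy pipeline (agent names and the field mismatch below are illustrative, not from any specific framework): each agent passes its own tests, but a boundary mismatch between two of them produces a confidently wrong composed result that no single-agent test sees.

```python
# Hedged sketch: per-agent tests all pass while the composed pipeline fails.
# Here the planner emits a "subtasks" field while the researcher reads "steps"
# -- a contract mismatch invisible to any single-agent eval.
def planner(task):
    return {"subtasks": [f"look up {task}"]}                 # emits "subtasks"

def researcher(plan):
    return [f"notes on {s}" for s in plan.get("steps", [])]  # expects "steps"

def synthesizer(notes):
    return " | ".join(notes) if notes else "No findings."

# Each agent behaves sensibly in isolation...
assert planner("pricing")["subtasks"]
assert researcher({"steps": ["x"]}) == ["notes on x"]
assert synthesizer(["a", "b"]) == "a | b"

# ...but composed, the boundary mismatch yields a confident empty answer.
out = synthesizer(researcher(planner("pricing")))
assert out == "No findings."   # passes as atoms, fails as a molecule
```

A composition test over the full pipeline, asserting on the end-to-end output, is what catches this class of defect.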

How to Integration-Test AI Agent Workflows in CI Without Mocking the Model Away

· 11 min read
Tian Pan
Software Engineer

Most teams building AI agents discover the same testing trap after their first production incident. You have two obvious options: make live API calls in CI (slow, expensive, non-deterministic), or mock the LLM away entirely (fast, cheap, hollow). Both approaches fail in different but predictable ways, and the failure mode of the second is worse because it's invisible.

The team that mocks the LLM away runs green CI for six months, ships to production, and then discovers that a bug in how their agent handles a malformed tool response at step 6 of an 8-step loop has been lurking in the codebase the entire time. The mock that always returns "Agent response here" never exercised the orchestration layer at all. The actual tool dispatch, retry logic, state accumulation, and fallback routing code was never tested.
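One way to see the difference is a scripted stub instead of a constant-string mock (the names `ScriptedLLM` and `run_agent_loop` are hypothetical, not from any particular framework): replaying structured, pre-recorded turns forces the dispatch, error-handling, and termination code to actually run.

```python
# Sketch: a deterministic stand-in that replays structured turns, including a
# deliberately malformed tool call, so the orchestration layer is exercised.
import json

class ScriptedLLM:
    """Deterministic stand-in for the model: returns pre-recorded turns."""
    def __init__(self, turns):
        self.turns = iter(turns)

    def complete(self, messages):
        return next(self.turns)

def run_agent_loop(llm, tools, max_steps=8):
    messages = []
    for _ in range(max_steps):
        turn = llm.complete(messages)
        if turn["type"] == "final":
            return turn["content"]
        # Tool dispatch: the code a constant-string mock never reaches.
        try:
            result = tools[turn["tool"]](**json.loads(turn["args"]))
        except (json.JSONDecodeError, KeyError, TypeError):
            result = {"error": "malformed tool call"}  # fallback path under test
        messages.append({"role": "tool", "content": result})
    return None

# The script includes a malformed args payload to exercise the error branch.
script = [
    {"type": "tool_call", "tool": "search", "args": '{"query": "llm testing"}'},
    {"type": "tool_call", "tool": "search", "args": '{"query": '},  # malformed
    {"type": "final", "content": "done"},
]
answer = run_agent_loop(ScriptedLLM(script), {"search": lambda query: {"hits": 1}})
print(answer)  # the loop survives the malformed step and terminates
```

The point is not the stub itself but what it exercises: the same retry, fallback, and state-accumulation code paths that a hollow mock silently skips.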

The good news is there's a third path. It's less a single technique and more a layered architecture of three test tiers, each designed to catch a different class of failure without the costs of the other approaches.

Non-Deterministic CI for Agentic Systems: Why Binary Pass/Fail Breaks and What Replaces It

· 9 min read
Tian Pan
Software Engineer

Your CI pipeline assumes something that hasn't been true since you added an LLM call: that running the same code twice produces the same result. Traditional CI was built for deterministic software — compile, run tests, get a green or red light. Traditional ML evaluation was built for fixed input-output mappings — run inference on a test set, compute accuracy. Agentic AI breaks both assumptions simultaneously, and the result is a CI system that either lies to you or blocks every merge with false negatives.

The core problem isn't that agents are hard to test. It's that the testing infrastructure you already have was designed for a world where non-determinism is a bug, not a feature. When your agent takes a different tool-call path to the same correct answer on consecutive runs, a deterministic assertion fails. When it produces a semantically equivalent but lexically different response, string comparison flags a regression. The testing framework itself becomes the source of noise.
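One replacement for the binary assertion is a statistical gate (a minimal sketch; `run_agent` and `is_semantically_correct` are placeholders for your agent and your equivalence check): run the same case several times and gate on pass rate, so benign path and wording variation doesn't fail the build.

```python
# Sketch: gate CI on pass rate over N runs rather than one binary assertion.
def run_agent(query, seed):
    # Placeholder: deterministic stand-in that "fails" on 2 of 20 seeds
    # to mimic run-to-run variation.
    return "The capital is Paris." if seed % 10 != 0 else "I am not sure."

def is_semantically_correct(answer, expected):
    # Placeholder for a real semantic check (LLM-as-judge, embeddings, ...).
    return expected.lower() in answer.lower()

def pass_rate(query, expected, n=20):
    hits = sum(
        is_semantically_correct(run_agent(query, seed=i), expected)
        for i in range(n)
    )
    return hits / n

rate = pass_rate("What is the capital of France?", "Paris")
assert rate >= 0.8, f"pass rate {rate:.0%} below threshold"
```

The threshold becomes a tunable quality bar: tighten it for critical paths, loosen it where variation is acceptable, but either way the gate measures distribution, not a single roll of the dice.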

Test-Driven Development for LLM Applications: Where the Analogy Holds and Where It Breaks

· 10 min read
Tian Pan
Software Engineer

A team built an AI research assistant using Claude. They iterated on the prompt for three weeks, demoed it to stakeholders, and launched it feeling confident. Two months later they discovered that the assistant had been silently hallucinating citations across roughly 30% of outputs — a failure mode no one had tested for because the eval suite was built after the prompt had already "felt right" in demos.

This pattern is the rule, not the exception. The LLM development industry has largely adopted test-driven development vocabulary — evals, regression suites, golden datasets, LLM-as-judge — while ignoring the most important rule TDD establishes: write the test before the implementation, not after.
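Written test-first, an eval for the citation failure above might look like this (a hedged sketch; the citation format and function names are illustrative, not from any eval framework): define the check and golden cases before any prompt exists, and let them fail until the assistant is built.

```python
# Sketch: an eval authored before the implementation. The rule -- every cited
# source must appear among the retrieved documents -- would have caught the
# hallucinated-citation failure described above.
import re

def cited_sources(answer: str) -> set:
    """Extract citation markers like [doc-3] from a model answer."""
    return set(re.findall(r"\[([\w-]+)\]", answer))

def citations_grounded(answer: str, retrieved_ids: set) -> bool:
    """Pass only if the answer cites something, and only retrieved docs."""
    cited = cited_sources(answer)
    return bool(cited) and cited <= retrieved_ids

# Golden cases defined up front -- red until the assistant satisfies them.
retrieved = {"doc-1", "doc-2"}
assert citations_grounded("See [doc-1] and [doc-2].", retrieved)
assert not citations_grounded("Proven in [doc-9].", retrieved)   # hallucinated
assert not citations_grounded("No sources given.", retrieved)    # uncited claim
```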

Here is how to do that correctly, and the three places where the TDD analogy breaks down so badly that following it literally will make your system worse.

Evaluating AI Agents: Why Grading Outcomes Alone Will Lie to You

· 10 min read
Tian Pan
Software Engineer

An agent you built scores 82% on final-output evaluations. You ship it. Two weeks later, your support queue fills up with users complaining that the agent is retrieving the wrong data, calling APIs with wrong parameters, and producing confident-sounding responses built on faulty intermediate work. You go back and look at the traces — and realize the agent was routing incorrectly on 40% of queries the whole time. The final-output eval never caught it because, often enough, the agent stumbled into a correct answer anyway.

This is the core trap in agent evaluation: measuring only what comes out the other end tells you nothing about how the agent got there, and "getting there" is where most failures live.
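The fix is to score the trajectory alongside the outcome (an illustrative sketch; the trace fields and `route_accuracy` helper are assumptions, not a real framework API): grade the routing decision from the trace separately, so a lucky correct output can't mask a wrong path.

```python
# Sketch: score intermediate routing from traces, not just final answers.
def route_accuracy(traces):
    """Fraction of traces where the agent chose the expected route/tool."""
    correct = sum(t["chosen_route"] == t["expected_route"] for t in traces)
    return correct / len(traces)

traces = [
    {"chosen_route": "billing_api", "expected_route": "billing_api", "final_ok": True},
    {"chosen_route": "search",      "expected_route": "billing_api", "final_ok": True},  # lucky
    {"chosen_route": "billing_api", "expected_route": "billing_api", "final_ok": True},
    {"chosen_route": "search",      "expected_route": "orders_api",  "final_ok": False},
]
outcome_score = sum(t["final_ok"] for t in traces) / len(traces)  # looks healthy
routing_score = route_accuracy(traces)                            # reveals the problem
print(outcome_score, routing_score)
```

In this toy data the outcome score is 0.75 while routing accuracy is 0.50: exactly the pattern where final-output evals stay green while the agent stumbles into right answers for wrong reasons.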

Your AI Product Needs Evals

· 8 min read
Tian Pan
Software Engineer

Every AI product demo looks great. The model generates something plausible, the stakeholders nod along, and everyone leaves the meeting feeling optimistic. Then the product ships, real users appear, and things start going sideways in ways nobody anticipated. The team scrambles to fix one failure mode, inadvertently creates another, and after weeks of whack-a-mole, the prompt has grown into a 2,000-token monster that nobody fully understands anymore.

The root cause is almost always the same: no evaluation system. Teams that ship reliable AI products build evals early and treat them as infrastructure, not an afterthought. Teams that stall treat evaluation as something to worry about "once the product is more mature." By then, they're already stuck.
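"Evals as infrastructure" can start very small (a minimal sketch; `generate` stands in for the real model call, and the golden cases and required-terms checker are illustrative): a golden dataset and a harness that runs in CI like any other test suite, failing the build on regressions.

```python
# Sketch: a tiny regression-style eval harness runnable in CI.
def generate(prompt: str) -> str:
    # Stand-in for the LLM call; swap in your provider's client here.
    return {"Summarize: refunds take 5 days": "Refunds arrive within 5 days.",
            "Summarize: plan renews monthly": "The plan renews every month."}[prompt]

# Golden dataset: each case pairs a prompt with terms the output must contain.
GOLDEN = [
    ("Summarize: refunds take 5 days", ["refund", "5"]),
    ("Summarize: plan renews monthly", ["renew", "month"]),
]

def run_evals():
    failures = []
    for prompt, required_terms in GOLDEN:
        output = generate(prompt).lower()
        missing = [t for t in required_terms if t not in output]
        if missing:
            failures.append((prompt, missing))
    return failures

failures = run_evals()
assert not failures, f"eval regressions: {failures}"
```

The harness grows with the product — new failure modes become new golden cases — which is what keeps prompt changes from turning into whack-a-mole.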