68 posts tagged with "testing"

Property-Based Testing for LLM Systems: Invariants That Hold Even When Outputs Don't

April 12, 2026 · 12 min read

Software Engineer

A product team at a fintech company shipped an LLM-powered document summarizer. Their eval dataset — 200 hand-curated examples with human ratings — scored 87% quality. In production, the system occasionally returned summaries longer than the original documents when users uploaded short memos. The eval set had no memos under 300 words. The property "output length ≤ input length for summarization tasks" was never tested. Nobody noticed until a customer screenshotted the absurdity and posted it online.

This is the fundamental gap that property-based testing (PBT) fills. Eval datasets measure accuracy on what you thought to test. Property-based tests measure whether your system obeys a contract across the entire space of what could happen.

Simulation Environments for Agent Testing: Building Sandboxes Where Consequences Are Free

April 12, 2026 · 10 min read

Tian Pan

Software Engineer

Your agent passes every test in staging. Then it hits production and sends 4,000 emails, charges a customer twice, and deletes a record it wasn't supposed to touch. The staging tests weren't wrong — they just tested the wrong things. The staging environment made the agent look safe because everything it could break was fake in the wrong way: mocked just enough to not crash, but realistic enough to fool you into thinking the test meant something.

This is the simulation fidelity trap. It's different from ordinary software testing failures. For a deterministic function, a staging environment that mirrors production schemas and APIs is usually sufficient. For an agent, behavior emerges from the interaction between reasoning, tool outputs, and accumulated state across a multi-step trajectory. A staging environment that diverges from production in any of those dimensions will produce agents that are systematically over-confident about how they'll behave under real conditions.

The Composition Testing Gap: Why Your Agents Pass Every Test but Fail Together

April 10, 2026 · 9 min read

Tian Pan

Software Engineer

Your planner agent passes its eval suite at 94%. Your researcher agent scores even higher. Your synthesizer agent nails every benchmark you throw at it. You compose them into a pipeline, deploy to production, and watch it produce confidently wrong answers that no individual agent would ever generate on its own.

This is the composition testing gap — the systematic blind spot where individually validated agents fail in ways that no single-agent analysis can predict. Research on multi-agent LLM systems shows that 67% of production failures stem from inter-agent interactions rather than individual agent defects. You're testing the atoms but shipping the molecule, and molecular behavior is not the sum of atomic properties.

How to Integration-Test AI Agent Workflows in CI Without Mocking the Model Away

April 10, 2026 · 11 min read

Tian Pan

Software Engineer

Most teams building AI agents discover the same testing trap after their first production incident. You have two obvious options: make live API calls in CI (slow, expensive, non-deterministic), or mock the LLM away entirely (fast, cheap, hollow). Both approaches fail in different but predictable ways, and the failure mode of the second is worse because it's invisible.

The team that mocks the LLM away runs green CI for six months, ships to production, and then discovers that a bug in how their agent handles a malformed tool response at step 6 of an 8-step loop has been lurking in the codebase the entire time. The mock that always returns "Agent response here" never exercised the orchestration layer at all. The actual tool dispatch, retry logic, state accumulation, and fallback routing code was never tested.

The good news is there's a third path. It's less a single technique and more a layered architecture of three test tiers, each designed to catch a different class of failure without the costs of the other approaches.

Non-Deterministic CI for Agentic Systems: Why Binary Pass/Fail Breaks and What Replaces It

April 10, 2026 · 9 min read

Tian Pan

Software Engineer

Your CI pipeline assumes something that hasn't been true since you added an LLM call: that running the same code twice produces the same result. Traditional CI was built for deterministic software — compile, run tests, get a green or red light. Traditional ML evaluation was built for fixed input-output mappings — run inference on a test set, compute accuracy. Agentic AI breaks both assumptions simultaneously, and the result is a CI system that either lies to you or blocks every merge with false negatives.

The core problem isn't that agents are hard to test. It's that the testing infrastructure you already have was designed for a world where non-determinism is a bug, not a feature. When your agent takes a different tool-call path to the same correct answer on consecutive runs, a deterministic assertion fails. When it produces a semantically equivalent but lexically different response, string comparison flags a regression. The testing framework itself becomes the source of noise.

Test-Driven Development for LLM Applications: Where the Analogy Holds and Where It Breaks

March 12, 2026 · 10 min read

Tian Pan

Software Engineer

A team built an AI research assistant using Claude. They iterated on the prompt for three weeks, demo'd it to stakeholders, and launched it feeling confident. Two months later they discovered that the assistant had been silently hallucinating citations across roughly 30% of outputs — a failure mode no one had tested for because the eval suite was built after the prompt had already "felt right" in demos.

This pattern is the rule, not the exception. The LLM development industry has largely adopted test-driven development vocabulary — evals, regression suites, golden datasets, LLM-as-judge — while ignoring the most important rule TDD establishes: write the test before the implementation, not after.

Here is how to do that correctly, and the three places where the TDD analogy breaks down so badly that following it literally will make your system worse.

Evaluating AI Agents: Why Grading Outcomes Alone Will Lie to You

February 7, 2026 · 10 min read

Tian Pan

Software Engineer

An agent you built scores 82% on final-output evaluations. You ship it. Two weeks later, your support queue fills up with users complaining that the agent is retrieving the wrong data, calling APIs with wrong parameters, and producing confident-sounding responses built on faulty intermediate work. You go back and look at the traces — and realize the agent was routing incorrectly on 40% of queries the whole time. The final-output eval never caught it because, often enough, the agent stumbled into a correct answer anyway.

This is the core trap in agent evaluation: measuring only what comes out the other end tells you nothing about how the agent got there, and "getting there" is where most failures live.

Your AI Product Needs Evals

October 8, 2025 · 8 min read

Tian Pan

Software Engineer

Every AI product demo looks great. The model generates something plausible, the stakeholders nod along, and everyone leaves the meeting feeling optimistic. Then the product ships, real users appear, and things start going sideways in ways nobody anticipated. The team scrambles to fix one failure mode, inadvertently creates another, and after weeks of whack-a-mole, the prompt has grown into a 2,000-token monster that nobody fully understands anymore.

The root cause is almost always the same: no evaluation system. Teams that ship reliable AI products build evals early and treat them as infrastructure, not an afterthought. Teams that stall treat evaluation as something to worry about "once the product is more mature." By then, they're already stuck.

About Tian Pan