14 posts tagged with "ci-cd"

How to Integration-Test AI Agent Workflows in CI Without Mocking the Model Away

April 10, 2026 · 11 min read

Software Engineer

Most teams building AI agents discover the same testing trap after their first production incident. You have two obvious options: make live API calls in CI (slow, expensive, non-deterministic), or mock the LLM away entirely (fast, cheap, hollow). Both approaches fail in different but predictable ways, and the failure mode of the second is worse because it's invisible.

The team that mocks the LLM away runs green CI for six months, ships to production, and then discovers that a bug in how their agent handles a malformed tool response at step 6 of an 8-step loop has been lurking in the codebase the entire time. The mock that always returns "Agent response here" never exercised the orchestration layer at all. The actual tool dispatch, retry logic, state accumulation, and fallback routing code was never tested.

The good news is there's a third path. It's less a single technique and more a layered architecture of three test tiers, each designed to catch a different class of failure without the costs of the other approaches.

Non-Deterministic CI for Agentic Systems: Why Binary Pass/Fail Breaks and What Replaces It

April 10, 2026 · 9 min read

Tian Pan

Software Engineer

Your CI pipeline assumes something that hasn't been true since you added an LLM call: that running the same code twice produces the same result. Traditional CI was built for deterministic software — compile, run tests, get a green or red light. Traditional ML evaluation was built for fixed input-output mappings — run inference on a test set, compute accuracy. Agentic AI breaks both assumptions simultaneously, and the result is a CI system that either lies to you or blocks every merge with false negatives.

The core problem isn't that agents are hard to test. It's that the testing infrastructure you already have was designed for a world where non-determinism is a bug, not a feature. When your agent takes a different tool-call path to the same correct answer on consecutive runs, a deterministic assertion fails. When it produces a semantically equivalent but lexically different response, string comparison flags a regression. The testing framework itself becomes the source of noise.

About Tian Pan