How to Integration-Test AI Agent Workflows in CI Without Mocking the Model Away
Most teams building AI agents discover the same testing trap after their first production incident. You have two obvious options: make live API calls in CI (slow, expensive, non-deterministic), or mock the LLM away entirely (fast, cheap, hollow). Both approaches fail in different but predictable ways, and the failure mode of the second is worse because it's invisible.
The team that mocks the LLM away runs green CI for six months, ships to production, and then discovers that a bug in how their agent handles a malformed tool response at step 6 of an 8-step loop has been lurking in the codebase the entire time. The mock that always returns "Agent response here" never exercised the orchestration layer at all. The actual tool dispatch, retry logic, state accumulation, and fallback routing code was never tested.
The good news is there's a third path. It's less a single technique and more a layered architecture of three test tiers, each designed to catch a different class of failure without the costs of the other approaches.
The Two-Sided Testing Trap
Before going into solutions, it's worth being precise about what each naive approach misses.
Live API testing runs actual calls to OpenAI or Anthropic on every PR. The costs stack up fast: a 20-scenario eval suite with LLM-as-judge scoring costs $0.50–$3.00 per run. Multiply by the number of developers, the PRs per day, and the fact that each PR typically gets pushed several times, and you're looking at $200–$1,000/month for a team of modest size — before you've even implemented nightly comprehensive runs. Latency is the other killer: a 15-second LLM call makes your CI feedback loop unusable. And crucially, LLM outputs vary even at temperature=0, due in part to hardware-level floating-point differences across provider regions, so a test that passes today can fail tomorrow without any code change.
Mocking the LLM away solves latency and cost but hollows out what you're actually testing. The danger is subtle: your test suite reports 90% coverage while leaving the entire orchestration layer untouched. Consider what a stub that returns a hardcoded string can't test:
- What happens when a tool returns an empty result?
- Does the agent correctly propagate context from tool A's output to tool B's input?
- Does the retry logic kick in on a 429, or does it silently swallow the error?
- What happens at step 7 of a 10-step loop when accumulated state exceeds the context window?
These are the failure modes that bite production systems. None of them involve model output quality — they're pure orchestration bugs. Mocking the model away means your test suite never even sees them.
Tier 1: Structural Tests With Fake LLMs (Every Commit, Zero Cost)
The first tier uses LLM test doubles — stub implementations of the LLM provider interface that respond deterministically based on the incoming prompt content.
A StubLLM implements the same interface your real LLM client uses, but instead of making network calls, it parses the prompt for test triggers and returns hardcoded tool-call responses:
```python
class StubLLM:
    # Implements the same interface as the real client; Response and
    # RateLimitError are whatever types your LLM client library defines.
    def generate(self, prompt: str) -> Response:
        if "trigger_timeout" in prompt:
            raise TimeoutError("Request timeout")
        if "trigger_rate_limit" in prompt:
            raise RateLimitError("429: Too many requests")
        # Default: request the weather tool
        return Response(tool_call="get_weather", args={"city": "NYC"})
```
The critical insight is that the stub doesn't fake a final text response. It triggers a specific tool call, which forces your real middleware to execute. Your harness dispatches the tool, handles the result, updates state, and calls the LLM again with the next turn of context. The stub's second call triggers the next step. You've now tested the entire orchestration loop without a single real API call.
This pattern is especially valuable for testing infrastructure concerns: does your harness correctly block a delete_user_account call via RBAC? Does it propagate the authenticated user identity to tool execution? Does the error recovery path execute when a tool times out at step 4? These are thousands of deterministic test cases that cost nothing and run in milliseconds.
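As a sketch of that authorization check — the harness function, role names, and tool table are all hypothetical, not a real framework API — the stub forces a privileged tool call so the real denial path runs:

```python
# Tier 1 structural test: a stub that deterministically requests a
# privileged tool, so the real RBAC check is exercised. The harness
# (run_one_step), roles, and ALLOWED_TOOLS table are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Response:
    tool_call: str
    args: dict = field(default_factory=dict)

class PrivilegedStubLLM:
    def generate(self, prompt: str) -> Response:
        # Always request the dangerous tool -- no need to wait for a real
        # model to spontaneously produce this call.
        return Response(tool_call="delete_user_account", args={"user_id": "u42"})

ALLOWED_TOOLS = {
    "viewer": {"get_weather"},
    "admin": {"get_weather", "delete_user_account"},
}

def run_one_step(llm, prompt: str, role: str) -> str:
    response = llm.generate(prompt)
    if response.tool_call not in ALLOWED_TOOLS.get(role, set()):
        return f"blocked: {response.tool_call}"   # RBAC denial path
    return f"executed: {response.tool_call}"
```

Because the stub is deterministic, both the denial and the allow path run on every commit, in milliseconds.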
The limitation is clear: stub LLMs don't test prompt injection resilience, model output quality, or whether your system prompt actually elicits the behavior you want. They test the rails around the model, not the model itself.
Tier 2: Deterministic Replay via Cassette Recording (Every PR, Near-Zero Cost)
The second tier fills the gap between structural tests and live evaluation. VCR-style cassette recording intercepts HTTP calls at the transport layer, serializes the full request/response pair to a file, commits that file to version control, and replays it on subsequent runs.
pytest-recording wraps this behind a single decorator:
```python
@pytest.mark.vcr()
def test_multi_step_research_agent():
    result = run_research_agent("What caused the 2024 semiconductor shortage?")
    assert result.steps_taken <= 8
    assert "supply chain" in result.summary.lower()
```
First run with --record-mode=once: the test makes real API calls and writes a cassette file. Every subsequent run — including every PR in CI — replays from disk. No API calls, deterministic results, and CI runs in the same time as regular unit tests.
The deeper value beyond cost savings: if the HTTP request payload changes in any way — prompt wording shifts, model parameters update, input serialization changes — the cassette match fails and the test breaks. This catches unintended prompt modifications that wouldn't surface in traditional unit tests at all.
The multi-turn problem. Plain VCR.py works cleanly for single-call tests. Multi-turn agent conversations are harder: turn N's request body includes the model's response from turn N-1. If the agent branches differently at step 2, every subsequent cassette match fails. The workarounds:
- BAML VCR operates at the BAML runtime layer rather than the HTTP layer, preserving type information across turns and handling streaming responses chunk by chunk.
- vcr-langchain patches VCR.py to capture non-network LangChain tooling, though tools initialized outside the decorator scope don't get recording applied.
- For frameworks without specialized VCR support, the pragmatic approach is to record at the turn level rather than the session level — each individual LLM call gets its own cassette, and the test stitches them together.
One gotcha: cassettes record everything including Authorization headers. Configure filter_headers: ["authorization"] before committing cassettes to any public or shared repository.
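With pytest-recording, that filter lives in a module-scoped vcr_config fixture (the fixture name is part of pytest-recording's API; pulling the options into a shared dict is just a convenience here):

```python
import pytest

# Options applied to every cassette this module records or replays.
CASSETTE_OPTIONS = {
    # Strip the Authorization header so API keys never reach version control
    "filter_headers": ["authorization"],
}

@pytest.fixture(scope="module")
def vcr_config():
    # pytest-recording looks up this fixture by name for VCR configuration
    return CASSETTE_OPTIONS
```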
What cassettes don't fix. When you upgrade a model version, all cassettes are stale by definition. The cassettes also don't protect against streaming schema changes or new model behaviors. This is expected — cassettes are regression tests for what your code does, not tests of what the model does.
Tier 3: Tool Contract Tests (Every PR, Schema Drift Detection)
Tool drift is the most insidious failure mode in production agent systems. An external API silently adds a required field. An enum gains a new value. A browser tool starts truncating results at a different character count. The individual components look fine. The agent completes tasks. But internally the orchestration is silently compensating in ways that mask the mismatch — until it can't.
The prevention pattern uses Pydantic schemas on both tool inputs and outputs, then tests the full round-trip. Rather than just testing that a tool function executes, you test that the model selected the right tool, extracted correctly typed parameters, received a valid response, and that response propagated correctly to the next tool that depends on it.
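A minimal sketch of that round-trip, assuming Pydantic and two hypothetical tools (a search tool feeding a fetch tool):

```python
# Cross-tool handoff contract sketch: Tool A's typed output must validate
# as Tool B's typed input. SearchOutput/FetchInput are hypothetical models.
from pydantic import BaseModel

class SearchOutput(BaseModel):
    urls: list[str]
    total_hits: int

class FetchInput(BaseModel):
    url: str

def handoff(search_out: SearchOutput) -> list[FetchInput]:
    # The boundary where most multi-step regressions hide: output of
    # tool A becomes input of tool B, validated at both ends.
    return [FetchInput(url=u) for u in search_out.urls]
```

A test then asserts both the happy path and the classic edge case of an empty tool result propagating cleanly.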
The contract tests worth running on every PR include:
- Schema-lock test: Replay known tool calls against live schemas; fail if required fields shift or enums change.
- Similar-tool routing test: Adversarial prompts designed to confuse semantically similar tools (testing that "message the team" doesn't route to "message the person").
- Error-semantic test: Each error class (invalid argument, permission denied, timeout, partial result) should trigger a specific intended next step.
- Cross-tool handoff test: Output from Tool A should propagate correctly typed to Tool B. Most multi-step regressions hide at this boundary.
- Tool-inventory drift test: Snapshot the full tool list (names, descriptions, scopes) and diff on every release.
- Pagination test: Agents that perform well on small results break on real datasets; test page boundaries explicitly.
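The schema-lock test from that list can be sketched with nothing but the standard library — the pinned snapshot and the "live" schema below are hypothetical stand-ins for a committed schemas/ file and a real API call:

```python
# Schema-lock sketch: diff a pinned schema snapshot against the live one,
# flagging exactly the drift classes named above. Field names hypothetical.
PINNED = {
    "name": "get_weather",
    "required": ["city"],
    "enums": {"units": ["metric", "imperial"]},
}

def schema_lock_violations(pinned: dict, live: dict) -> list[str]:
    issues = []
    # A newly required field silently breaks every existing caller
    added = set(live["required"]) - set(pinned["required"])
    if added:
        issues.append(f"new required fields: {sorted(added)}")
    # An enum gaining or losing a value changes what the model may emit
    for field, values in live.get("enums", {}).items():
        if set(values) != set(pinned.get("enums", {}).get(field, values)):
            issues.append(f"enum drift on {field}")
    return issues
```

In CI the live schema would be fetched (or replayed from a cassette) and any non-empty violation list fails the build.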
These tests are cheap because they use recorded responses and schema validation — no live LLM calls. They catch the specific failure class (schema drift between agent and tool interface) that live API tests often miss because the model adapts to surprises in ways that hide the underlying mismatch.
The Three-Tier CI Pipeline
Putting the tiers together:
| Tier | Trigger | What runs | Cost |
|---|---|---|---|
| Structural | Every commit | StubLLM tests, unit tests, schema contract checks | ~$0 |
| Replay | Every PR | VCR cassette tests, tool-drift contract tests | ~$0 |
| Live | Merge to main / nightly | Full LLM evals, LLM-as-judge scoring, multi-turn agent runs | $0.50–$50 |
The GitHub Actions structure that implements this:
```yaml
name: agent-ci
on:
  pull_request:
    paths: ["agent/**", "prompts/**", "tools/**"]
  push:
    branches: [main]
jobs:
  fast-checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pytest tests/unit/ tests/replay/ -x --record-mode=none
  live-evals:
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    env:
      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - run: pytest tests/integration/ --eval-threshold=0.85
```
The --record-mode=none flag on the fast-checks job is important: it makes any test without a matching cassette error out rather than fall back to the live API. This prevents accidentally incurring API costs on PRs and ensures that cassette-less tests fail loudly rather than silently passing with live data.
What This Architecture Prevents
Orchestration bugs at step N. A bug in how the agent handles an empty tool response at step 6 of an 8-step loop requires all 7 preceding steps to succeed before it surfaces. Replay cassettes with a full historical trace are the only practical way to reproduce this in CI without full environment setup. The regression trace test pattern — replaying high-value historical traces after any code change — is designed specifically for this.
Silent model version regressions. Model updates cause an estimated 40% of production agent failures. The pattern: pin to specific model versions, run schema-lock tests against the new model before promoting it, treat your cassette files as a regression baseline. Any cassette that breaks on model upgrade is a confirmed regression that needs triage.
Tool schema drift. An external API change that shifts a required field is caught by schema-lock tests at the boundary, before it can manifest as a model behavior change that looks like an LLM quality regression.
Authorization policy bypasses. The StubLLM pattern can request a delete_user_account tool call deterministically — something you'd need the real model to spontaneously generate to test with live calls. This makes security testing for agentic harnesses tractable in a way that live testing isn't.
The Cost Profile Over Time
A team with three developers running Tier 1 and 2 on every commit and Tier 3 only on main-branch merges typically runs 200–300 fast CI jobs and 15–20 live eval runs per month. At current API prices, the live eval runs cost $30–$80/month. The fast jobs cost nothing.
The more important metric is what the test suite catches. Teams that implement this three-tier structure consistently report that regressions which would otherwise have surfaced in live evals on main were caught earlier by cassette tests — the cassette broke on the same PR that introduced the regression. Tier 2 is doing real regression prevention work, not just saving money.
The organizational shift required is treating cassette files as first-class test artifacts: reviewed in PRs, updated deliberately when prompts change, committed with the code that generates them. That's a small process change with a disproportionate impact on how reliably you catch agent regressions before users do.
Where the Approach Breaks Down
This architecture does not solve evaluation of model output quality at scale. VCR cassettes test that your orchestration code does what it did before — they don't tell you whether what it did before was good. LLM-as-judge scoring in Tier 3 remains necessary for measuring actual output quality, and it remains expensive and noisy at scale.
The other limit is coverage of novel failure modes. Cassette-based replay only covers paths that were exercised during recording. An agent that's never been tested on a particular input combination has no cassette for it. This is why Tier 3 live evals against diverse input distributions remain essential — they discover the new failure modes, and Tier 2 regression tests lock them in once found.
The right mental model: Tier 1 and Tier 2 protect against regressions you already know about. Tier 3 discovers the regressions you don't know about yet. A team without Tier 3 will ship regressions they haven't discovered. A team without Tier 1 and Tier 2 will keep re-discovering the same regressions.
- https://anaynayak.medium.com/eliminating-flaky-tests-using-vcr-tests-for-llms-a3feabf90bc5
- https://pypi.org/project/pytest-recording/
- https://github.com/gr-b/baml_vcr
- https://github.com/amosjyng/vcr-langchain
- https://langwatch.ai/scenario/testing-guides/mocks/
- https://www.tigera.io/blog/how-to-stub-llms-for-ai-agent-security-testing-and-governance/
- https://medium.com/@duckweave/tool-drift-hides-in-the-gaps-75a68d8198d3
- https://blog.langchain.com/evaluating-deep-agents-our-learnings/
- https://www.braintrust.dev/articles/ai-agent-evaluation-framework
- https://www.braintrust.dev/articles/best-ai-evals-tools-cicd-2025
- https://labs.sogeti.com/building-a-smart-cost-efficient-llm-test-pipeline
- https://hamel.dev/blog/posts/evals-faq/how-do-i-evaluate-agentic-workflows.html
- https://docs.langchain.com/langsmith/cicd-pipeline-example
