How to Integration-Test AI Agent Workflows in CI Without Mocking the Model Away
Most teams building AI agents discover the same testing trap after their first production incident. You have two obvious options: make live API calls in CI (slow, expensive, non-deterministic), or mock the LLM away entirely (fast, cheap, hollow). Both approaches fail in different but predictable ways, and the failure mode of the second is worse because it's invisible.
The team that mocks the LLM away runs green CI for six months, ships to production, and then discovers that a bug in how their agent handles a malformed tool response at step 6 of an 8-step loop has been lurking in the codebase the entire time. The mock that always returns "Agent response here" never exercised the orchestration layer at all. The actual tool dispatch, retry logic, state accumulation, and fallback routing code was never tested.
The good news is there's a third path. It's less a single technique and more a layered architecture of three test tiers, each designed to catch a different class of failure without the costs of the other approaches.
The Two-Sided Testing Trap
Before going into solutions, it's worth being precise about what each naive approach misses.
Live API testing runs actual calls to OpenAI or Anthropic on every PR. The costs stack up fast: a 20-scenario eval suite with LLM-as-judge scoring costs $0.50–$3.00 per run. Multiply that by the number of developers, the PRs each one opens per day, and the several pushes a typical PR receives, and you're looking at $200–$1,000/month for a team of modest size, before you've added nightly comprehensive runs. Latency is the other killer: a 15-second LLM call, repeated across every step of every scenario, makes your CI feedback loop unusable. And crucially, LLM outputs can vary even at temperature=0, because floating-point arithmetic on the provider's hardware is not guaranteed to be bit-for-bit reproducible from one request to the next, so tests that pass today can fail tomorrow without any code change.
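To make the cost arithmetic concrete, here is a back-of-the-envelope sketch. The team size, PR rate, and push count are hypothetical inputs chosen for illustration, not measurements; only the per-run price range comes from the figures above.

```python
def monthly_ci_cost(cost_per_run: float, developers: int,
                    prs_per_dev_per_day: float, pushes_per_pr: float,
                    working_days: int = 20) -> float:
    """Estimated monthly spend if every push triggers the live eval suite."""
    runs = developers * prs_per_dev_per_day * pushes_per_pr * working_days
    return runs * cost_per_run

# Hypothetical team: 5 developers, 2 PRs each per day, 2 pushes per PR.
low = monthly_ci_cost(0.50, developers=5, prs_per_dev_per_day=2, pushes_per_pr=2)
high = monthly_ci_cost(3.00, developers=5, prs_per_dev_per_day=2, pushes_per_pr=2)
print(f"${low:,.0f} to ${high:,.0f} per month")  # $200 to $1,200 per month
```

Under these assumptions the suite runs 400 times a month, which lands in roughly the same range as the estimate above.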
Mocking the LLM away solves latency and cost but hollows out what you're actually testing. The danger is subtle: your test suite can report 90% coverage while the orchestration layer's real branches are never exercised. Consider what a stub that returns a hardcoded string can't test:
- What happens when a tool returns an empty result?
- Does the agent correctly propagate context from tool A's output to tool B's input?
- Does the retry logic kick in on a 429, or does it silently swallow the error?
- What happens at step 7 of a 10-step loop when accumulated state exceeds the context window?
These are the failure modes that bite production systems. None of them involve model output quality — they're pure orchestration bugs. Mocking the model away means your test suite never even sees them.
Tier 1: Structural Tests With Fake LLMs (Every Commit, Zero Cost)
The first tier uses LLM test doubles — stub implementations of the LLM provider interface that respond deterministically based on the incoming prompt content.
A StubLLM implements the same interface your real LLM client uses, but instead of making network calls, it parses the prompt for test triggers and returns hardcoded tool-call responses:
# Response and RateLimitError stand for the types your real LLM client exposes.
class StubLLM:
    def generate(self, prompt: str) -> Response:
        # Trigger strings in the prompt select which failure to simulate.
        if "trigger_timeout" in prompt:
            raise TimeoutError("Request timeout")
        if "trigger_rate_limit" in prompt:
            raise RateLimitError("429: Too many requests")
        # Default: request the weather tool
        return Response(tool_call="get_weather", args={"city": "NYC"})
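The critical insight is that the stub doesn't fake a final text response. It triggers a specific tool call, which forces your real middleware to execute. Your harness dispatches the tool, handles the result, updates state, and calls the LLM again with the next turn of context. On that second call the stub triggers the next step, so in practice it needs either a scripted sequence of responses or a trigger that recognizes the tool result in the prompt; a stub that returns the same tool call forever would never let the loop finish. You've now tested the entire orchestration loop without a single real API call.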
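A minimal sketch of that loop, assuming a scripted stub that returns one canned response per call. `ScriptedLLM`, `Response`, and `run_agent` are hypothetical names; `run_agent` is a toy stand-in for your real harness, not any framework's API.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Response:
    tool_call: Optional[str] = None
    args: dict = field(default_factory=dict)
    text: Optional[str] = None

class ScriptedLLM:
    """Returns one pre-scripted response per call and records each prompt."""
    def __init__(self, script: list[Response]):
        self._script = list(script)
        self.prompts: list[str] = []

    def generate(self, prompt: str) -> Response:
        self.prompts.append(prompt)
        return self._script.pop(0)

def run_agent(llm, tools: dict[str, Callable[..., str]], task: str,
              max_steps: int = 8) -> str:
    """Toy orchestration loop standing in for the real harness."""
    context = task
    for _ in range(max_steps):
        response = llm.generate(context)
        if response.tool_call is None:
            return response.text or ""
        # Real dispatch: the tool actually runs and its output is accumulated.
        result = tools[response.tool_call](**response.args)
        context += f"\n[{response.tool_call}] {result}"
    raise RuntimeError("agent exceeded max_steps")

def test_tool_output_reaches_next_turn():
    llm = ScriptedLLM([
        Response(tool_call="get_weather", args={"city": "NYC"}),
        Response(text="It is sunny in NYC."),
    ])
    tools = {"get_weather": lambda city: f"sunny in {city}"}
    answer = run_agent(llm, tools, "What's the weather in NYC?")
    assert answer == "It is sunny in NYC."
    # The second prompt must carry the real tool result.
    assert "[get_weather] sunny in NYC" in llm.prompts[1]
```

Recording the prompts on the stub is the design choice that matters here: it lets the test assert on what the harness sent to the model at each turn, which is where state-propagation bugs show up.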
This pattern is especially valuable for testing infrastructure concerns: does your harness correctly block a delete_user_account call via RBAC? Does it propagate the authenticated user identity to tool execution? Does the error recovery path execute when a tool times out at step 4? Questions like these expand into thousands of deterministic test cases that cost nothing and run in milliseconds.
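As an illustration of the RBAC case, the sketch below has a stub request the destructive tool and checks that the dispatch layer refuses it. The policy table, `dispatch_tool`, and `PermissionDenied` are hypothetical names standing in for whatever your harness implements.

```python
class PermissionDenied(Exception):
    """Raised when a role is not allowed to call a tool."""

# Hypothetical role-to-tool policy.
ALLOWED_TOOLS = {
    "viewer": {"get_weather"},
    "admin": {"get_weather", "delete_user_account"},
}

def dispatch_tool(role: str, tool_name: str, args: dict, tools: dict):
    """Runs a tool only if the caller's role permits it."""
    if tool_name not in ALLOWED_TOOLS.get(role, set()):
        raise PermissionDenied(f"{role!r} may not call {tool_name}")
    return tools[tool_name](**args)

class DeleteAccountStub:
    """Stub LLM that always asks for the destructive tool."""
    def generate(self, prompt: str) -> dict:
        return {"tool_call": "delete_user_account", "args": {"user_id": "u-123"}}

def test_viewer_cannot_delete_account():
    deleted = []
    tools = {"delete_user_account": lambda user_id: deleted.append(user_id)}
    request = DeleteAccountStub().generate("please clean up my account")
    try:
        dispatch_tool("viewer", request["tool_call"], request["args"], tools)
    except PermissionDenied:
        pass
    else:
        raise AssertionError("expected PermissionDenied")
    assert deleted == []  # the tool body never ran
```

The same shape works for identity propagation and timeout recovery: the stub forces the interesting branch, and the assertions target the harness rather than the model.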
The limitation is clear: stub LLMs don't test prompt injection resilience, model output quality, or whether your system prompt actually elicits the behavior you want. They test the rails around the model, not the model itself.
Tier 2: Deterministic Replay via Cassette Recording (Every PR, Near-Zero Cost)
The second tier fills the gap between structural tests and live evaluation. VCR-style cassette recording intercepts HTTP calls at the transport layer, serializes the full request/response pair to a file, commits that file to version control, and replays it on subsequent runs.
pytest-recording wraps this behind a single decorator:
import pytest

# run_research_agent is the application's own agent entry point.
@pytest.mark.vcr()
def test_multi_step_research_agent():
    result = run_research_agent("What caused the 2024 semiconductor shortage?")
    assert result.steps_taken <= 8
    assert "supply chain" in result.summary.lower()
On the first run with --record-mode=once, the test makes real API calls and writes a cassette file. Every subsequent run, including every PR in CI, replays from disk: no API calls, deterministic results, and CI time comparable to regular unit tests.
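The deeper value beyond cost savings: with request-body matching enabled, any change to the HTTP request payload (prompt wording shifts, model parameters update, input serialization changes) makes the cassette match fail and breaks the test. This catches unintended prompt modifications that wouldn't surface in traditional unit tests at all. Note that VCR.py's default matchers compare only the method and the URI components, so this protection requires adding "body" to the match_on option.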
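The matching idea can be sketched without any library. The toy `Cassette` below is a stand-in for illustration, not VCR.py's implementation; the endpoint URL and model name are placeholders. It fingerprints each request by method, URL, and canonical JSON body, so a one-word prompt change no longer finds a recording.

```python
import hashlib
import json

class CassetteMiss(Exception):
    """Raised when a request has no recorded counterpart."""

def request_key(method: str, url: str, body: dict) -> str:
    """Stable fingerprint of a request: method, URL, and canonical JSON body."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{method} {url} {canonical}".encode()).hexdigest()

class Cassette:
    def __init__(self):
        self._recorded: dict[str, dict] = {}

    def record(self, method: str, url: str, body: dict, response: dict) -> None:
        self._recorded[request_key(method, url, body)] = response

    def replay(self, method: str, url: str, body: dict) -> dict:
        key = request_key(method, url, body)
        if key not in self._recorded:
            raise CassetteMiss(f"no recording for {method} {url}")
        return self._recorded[key]

URL = "https://api.example.com/v1/chat"  # placeholder endpoint
cassette = Cassette()
original = {"model": "example-model", "temperature": 0,
            "prompt": "Summarize the report."}
cassette.record("POST", URL, original, {"text": "A short summary."})

# Identical request: served from the recording.
assert cassette.replay("POST", URL, dict(original)) == {"text": "A short summary."}

# One changed word in the prompt: replay fails loudly.
edited = {**original, "prompt": "Summarise the report."}
try:
    cassette.replay("POST", URL, edited)
except CassetteMiss:
    print("prompt drift detected")
```

Canonicalizing the body with sorted keys keeps harmless differences in key order from causing false mismatches, while real content changes still break the match.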
The multi-turn problem. Plain VCR.py works cleanly for single-call tests. Multi-turn agent conversations are harder: turn N's request body includes the model's response from turn N-1. If the agent branches differently at step 2, every subsequent cassette match fails. The workarounds:
- BAML VCR operates at the BAML runtime layer rather than the HTTP layer, preserving type information across turns and handling streaming responses chunk by chunk.
- vcr-langchain patches VCR.py to capture non-network LangChain tooling, though tools initialized outside the decorator scope don't get recording applied.
- For frameworks without specialized VCR support, the pragmatic approach is to record at the turn level rather than the session level — each individual LLM call gets its own cassette, and the test stitches them together.
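The turn-level approach in the last bullet can be sketched as follows. This is an illustration that records at the client layer rather than through VCR.py, and `TurnCassettes` is a hypothetical name. Each LLM call gets its own file keyed by turn index, so a change in one turn does not invalidate the recordings of later turns.

```python
import json
import tempfile
from pathlib import Path

class TurnCassettes:
    """Wraps an LLM callable; stores one JSON file per turn, keyed by turn index."""
    def __init__(self, directory: Path, live_llm=None):
        self.directory = Path(directory)
        self.live_llm = live_llm  # None means replay-only
        self.turn = 0

    def generate(self, prompt: str) -> dict:
        path = self.directory / f"turn_{self.turn:02d}.json"
        self.turn += 1
        if path.exists():
            return json.loads(path.read_text())["response"]
        if self.live_llm is None:
            raise FileNotFoundError(f"no cassette for {path.name}")
        response = self.live_llm(prompt)
        path.write_text(json.dumps({"prompt": prompt, "response": response}, indent=2))
        return response

# Record once against a (here: fake) live model, then replay without it.
with tempfile.TemporaryDirectory() as tmp:
    replies = iter([{"tool_call": "search"}, {"text": "done"}])
    recorder = TurnCassettes(Path(tmp), live_llm=lambda prompt: next(replies))
    recorded = [recorder.generate("step 1"), recorder.generate("step 2")]

    replayer = TurnCassettes(Path(tmp))  # no live model attached
    replayed = [replayer.generate("step 1"), replayer.generate("step 2 (reworded)")]
    assert replayed == recorded
```

Keying by turn index is a deliberate trade-off: replay survives prompt changes in earlier turns, but it no longer detects prompt drift on its own. The recorded prompt is stored alongside each response so a test can compare it explicitly where that check matters.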
One gotcha: cassettes record everything, including Authorization headers. Set VCR's filter_headers option to ["authorization"] (plus any provider-specific key header, such as Anthropic's x-api-key) before committing cassettes to any public or shared repository.
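With pytest-recording, these options are typically supplied through a `vcr_config` fixture, for example in conftest.py. The sketch below assumes the suite talks to providers that authenticate via the Authorization or x-api-key header; adjust the list for yours. It also adds "body" to match_on, since VCR.py's default matchers do not compare request bodies.

```python
import pytest

# Options are passed through to VCR.py by pytest-recording.
VCR_CONFIG = {
    # Strip credentials before the cassette is written to disk.
    "filter_headers": ["authorization", "x-api-key"],
    # Include the body so that prompt or parameter changes break the match.
    "match_on": ["method", "scheme", "host", "port", "path", "query", "body"],
}

@pytest.fixture(scope="module")
def vcr_config():
    return VCR_CONFIG
```

It is still worth searching recorded cassettes for key material before the first commit, because secrets can also appear in query strings or request bodies, which header filtering does not cover.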
What cassettes don't fix. When you upgrade a model version, all cassettes are stale by definition. The cassettes also don't protect against streaming schema changes or new model behaviors. This is expected — cassettes are regression tests for what your code does, not tests of what the model does.
Tier 3: Tool Contract Tests (Every PR, Schema Drift Detection)
- https://anaynayak.medium.com/eliminating-flaky-tests-using-vcr-tests-for-llms-a3feabf90bc5
- https://pypi.org/project/pytest-recording/
- https://github.com/gr-b/baml_vcr
- https://github.com/amosjyng/vcr-langchain
- https://langwatch.ai/scenario/testing-guides/mocks/
- https://www.tigera.io/blog/how-to-stub-llms-for-ai-agent-security-testing-and-governance/
- https://medium.com/@duckweave/tool-drift-hides-in-the-gaps-75a68d8198d3
- https://blog.langchain.com/evaluating-deep-agents-our-learnings/
- https://www.braintrust.dev/articles/ai-agent-evaluation-framework
- https://www.braintrust.dev/articles/best-ai-evals-tools-cicd-2025
- https://labs.sogeti.com/building-a-smart-cost-efficient-llm-test-pipeline
- https://hamel.dev/blog/posts/evals-faq/how-do-i-evaluate-agentic-workflows.html
- https://docs.langchain.com/langsmith/cicd-pipeline-example
