64 posts tagged with "testing"

The LLM Local Development Loop: Fast Iteration Without Burning Your API Budget

April 18, 2026 · 10 min read

Software Engineer

Most teams building LLM applications discover the same problem around week three: every time someone runs the test suite, it fires live API calls, costs real money, takes 30+ seconds, and returns different results on each run. The "just hit the API" approach that felt fine during the prototype phase becomes a serious tax on iteration speed — and a meaningful line item on the bill. One engineering team audited their monthly API spend and found $1,240 out of $2,847 (43%) was pure waste from development and test traffic hitting live endpoints unnecessarily.

The solution is not to stop testing. It is to build the right kind of development loop from the start — one where the fast path is cheap and deterministic, and the slow path (real API calls) is reserved for when it actually matters.

Model Deprecation Readiness: Auditing Your Behavioral Dependency Before the 90-Day Countdown

April 18, 2026 · 8 min read

Tian Pan

Software Engineer

When Anthropic deprecated a Claude model last year, a company noticed — but only because a downstream parser started throwing errors in production. The culprit? The new model occasionally wrapped its JSON responses in markdown code blocks. The old model never did. Nobody had documented that assumption. Nobody had tested for it. The fix took an afternoon; the diagnosis took three days.

That pattern — silent behavioral dependency breaking loudly in production — is the defining failure mode of model migrations. You update a model ID, run a quick sanity check, and ship. Six weeks later, something subtle is wrong. Your JSON parsing is 0.6% more likely to fail. Your refusal rate on edge cases doubled. Your structured extraction misses a field it used to reliably populate. The diff isn't in the code — it's in the model's behavior, and you never wrote a contract for it.

With major providers now running on 60–180 day deprecation windows, and the pace of model releases accelerating, this is no longer a theoretical concern. It's a recurring operational challenge. Here's how to get ahead of it.

Prompt Regression Tests That Actually Block PRs

April 18, 2026 · 10 min read

Tian Pan

Software Engineer

Ask any AI engineering team if they test their prompts and they'll say yes. Ask if a bad prompt can fail a pull request and block a merge, and you'll get a much quieter room. The honest answer for most teams is no — they have eval notebooks they run occasionally, maybe a shared Notion doc of known prompt quirks, and a vague sense that things are worse than they used to be. That is not testing. That is hoping.

The gap exists because prompt testing feels qualitatively different from unit testing. Code either behaves correctly or it doesn't. Prompts produce outputs on a spectrum, outputs are non-deterministic, and running enough examples to feel confident costs real money. Those are real constraints. None of them are insurmountable. Teams that have built prompt CI that actually blocks merges are not spending fifty dollars a build — they're running in under three minutes at under a dollar using a few design decisions that make the problem tractable.

AI Agents in Your CI Pipeline: How to Gate Deployments That Can't Be Unit Tested

April 17, 2026 · 10 min read

Tian Pan

Software Engineer

Shipping a feature that calls an LLM is easy. Knowing whether the next version of that feature is better or worse than the one in production is hard. Traditional CI/CD gives you a pass/fail signal on deterministic behavior: either the function returns the right value or it doesn't. But when the function wraps a language model, the output is probabilistic — the same input produces different outputs across runs, across model versions, and across days.

Most teams respond to this by skipping the problem. They run their unit tests, do a quick manual check on a few prompts, and ship. That works until it doesn't — until a model provider silently updates the underlying weights, or a prompt change that looked fine in isolation shifts the output distribution in ways that only become obvious in production at 3 AM.

The better answer isn't to pretend LLM outputs are deterministic. It's to build CI gates that operate on distributions, thresholds, and rubrics rather than exact matches.

API Contracts for Non-Deterministic Services: Versioning When Output Shape Is Stochastic

April 17, 2026 · 9 min read

Tian Pan

Software Engineer

Your content moderation service returns {"severity": "MEDIUM", "confidence": 0.85}. The downstream billing system parses severity as an enum with values ["low", "medium", "high"]. A model update causes the service to occasionally return "Medium" with a capital M. No deployment happened. No schema changed. The integration breaks in production, and nobody catches it for six days because the HTTP status codes are all 200.

This is the foundational problem with API contracts for LLM-backed services: the surface looks like a REST API, but the behavior underneath is probabilistic. Standard contract tooling assumes determinism. When that assumption breaks, it breaks silently.

The Demo-to-Production Failure Pattern: Why AI Prototypes Collapse When Real Users Arrive

April 17, 2026 · 10 min read

Tian Pan

Software Engineer

Thirty percent of generative AI projects are abandoned after proof of concept. Ninety-five percent of enterprise pilots deliver zero measurable business impact. Gartner projects 40% of agentic AI projects will be canceled before the end of 2027. These aren't failures of the underlying technology — they're failures of the gap between demo and production.

The demo-to-production failure pattern is predictable, repeatable, and almost entirely preventable. It happens because the conditions that make a demo look great are systematically different from the conditions that make production work. Teams optimize for the former and get ambushed by the latter.

Property-Based Testing for LLM Outputs: Finding the Bugs Your Eval Set Never Imagined

April 17, 2026 · 11 min read

Tian Pan

Software Engineer

Your eval suite says 94% accuracy. Users report the feature is broken for names that aren't "John" or "Alice." Both things are true, and the gap between them has a name: your curated test set encodes only the failure modes you already imagined.

Property-based testing (PBT) was invented in 1999 to expose exactly this class of blind spot in deterministic software. Applied to LLM outputs, it generates tens of thousands of adversarial input variants automatically, probing domain boundaries that hand-written test cases structurally cannot reach. A 2025 OOPSLA study found that on average each property-based test discovers approximately 50 times as many mutant bugs as the average unit test. A separate study measured that PBT and example-based testing (EBT) fail on different bugs — combining both raised detection rates from 68.75% to 81.25%. That 12.5-point gap is not rounding error; it represents an entire class of failure invisible to one approach.

This article is for engineers who already have eval suites and want to find the bugs those suites structurally cannot find.

What Semantic Versioning Actually Means for AI Agents

April 17, 2026 · 10 min read

Tian Pan

Software Engineer

Your customer service agent has been running reliably for three months. A routine model update rolls in on a Tuesday. By Wednesday afternoon, three downstream services are silently parsing the wrong fields from the agent's responses—the JSON keys shifted subtly but nothing returned an error. By Thursday you've traced a drop in order completions to a JSON field renamed from "status" to "current_state". The model updated, the agent stayed at v2.1.0, and nobody got paged.

This is the versioning gap that nobody in traditional API design had to solve. Semver works when you can deterministically reproduce outputs from a specification. AI agents can't make that promise. Yet downstream services depend on their behavior just as critically as they depend on any microservice API. The gap between "we tagged a release" and "downstream consumers are protected" has never been wider.

Testing the Untestable: Integration Contracts for LLM-Powered APIs

April 17, 2026 · 10 min read

Tian Pan

Software Engineer

Your test suite passes. The CI is green. You ship the new prompt. Three days later, a user reports that your API is returning JSON with a trailing comma — and your downstream parser has been silently dropping records for 72 hours. You never wrote a test for that because the LLM "always" returned valid JSON in development.

This is the failure mode that ruins LLM-powered products: not catastrophic model collapse, but quiet, intermittent degradation that deterministic test suites are structurally incapable of catching. The root cause isn't laziness — it's that the whole paradigm of "expected == actual" breaks when your system produces non-deterministic natural language.

Fixing this requires rethinking what you're testing and what "passing" even means for an LLM-powered API. The engineers who've figured this out aren't writing smarter equality assertions — they're writing fundamentally different kinds of tests.

The Testing Pyramid Inverts for AI: Why Unit Tests Are the Wrong Investment for LLM Features

April 17, 2026 · 10 min read

Tian Pan

Software Engineer

Your team ships a new LLM feature. The unit tests pass. CI is green. You deploy. Then users start reporting that the AI "just doesn't work right" — answers are weirdly formatted, the agent picks the wrong tool, context gets lost halfway through a multi-step task. You look at the test suite and it's still green. Every test passes. The feature is broken.

This is not bad luck. It is what happens when you apply a deterministic testing philosophy to a probabilistic system. The classic testing pyramid — wide base of unit tests, smaller middle layer of integration tests, narrow top of end-to-end tests — rests on one assumption so fundamental that nobody writes it down: the code does the same thing every time. LLMs violate this assumption at every level. The testing strategy built on top of it needs to be rebuilt from scratch.

The Eval Smell Catalog: Anti-Patterns That Make Your LLM Eval Suite Worse Than No Evals At All

April 16, 2026 · 12 min read

Tian Pan

Software Engineer

A team I worked with last year had an eval suite with 847 test cases, a green dashboard, and a shipping cadence that looked disciplined from the outside. Then their flagship summarization feature started generating confidently wrong summaries for roughly one in twenty customer support threads. The eval score for that capability had been 94% for six months straight. When we audited the suite, the problem wasn't that the evals were lying. The problem was that the evals had quietly rotted into something that measured the wrong thing, punished correct model behavior, and shared blind spots with the very model it was evaluating. The suite wasn't broken in the loud way tests break. It was broken in the way a thermometer is broken when it reads room temperature no matter where you put it.

Test smells have been studied for two decades in traditional software. The Van Deursen catalog, the xUnit patterns taxonomy, and more recent work have documented how tests that look fine can actively harm a codebase — by encoding the wrong specification, by making refactors expensive, by creating false confidence that pushes the real bugs deeper. LLM evals are new enough that the equivalent literature barely exists, but the same dynamic is already playing out in every AI team I talk to. The difference is that LLM eval smells have mechanisms traditional tests don't: training data overlap, stochastic outputs, judge-model feedback loops, capability drift. You can't just port the old taxonomy. You need a new one.

The Implicit API Contract: What Your LLM Provider Doesn't Document

April 16, 2026 · 10 min read

Tian Pan

Software Engineer

Your LLM provider's SLA covers HTTP uptime and Time to First Token. It says nothing about whether the model will still follow your formatting instructions next month, refuse requests it accepted last week, or return valid JSON under edge-case conditions you haven't tested. Most engineering teams discover this the hard way — via a production incident, not a changelog.

This is the implicit API contract problem. Traditional APIs promise stable, documented behavior. LLM providers promise a connection. Everything between the request and what your application does with the response is on you.

About Tian Pan