
55 posts tagged with "testing"


Bug Bashes for AI Features: Sampling a Distribution, Not Hunting Defects

· 11 min read
Tian Pan
Software Engineer

The classic bug bash is a deterministic ritual built for deterministic software. Ten engineers crowd a Slack channel for two hours, hammer a checklist of golden-path flows, and file tickets with crisp repro steps: "Click X, see Y, expected Z." It works because the system under test is reproducible — same input, same output, same bug, every time.

Run that exact ritual against an AI feature and you will produce two hundred tickets, close one hundred and eighty as "expected stochastic variation," and miss the twenty that signal a real cohort regression. The format isn't just stale; it's actively miscalibrated. A bug bash against an LLM-backed feature is not a defect-hunting session. It is a sampling exercise against a probability distribution, and the team that runs it like a deterministic test session is collecting noise and calling it signal.

This post is about how to redesign the bug bash for stochastic systems — what to change about the format, the participants, the triage rubric, and what counts as "done."
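One way to make "sampling a distribution" concrete is to tally failures per cohort and report an interval rather than a ticket count. The sketch below uses the standard Wilson score interval; the cohort names and tallies are made up for illustration and are not from the post.

```python
import math

def wilson_interval(failures, samples, z=1.96):
    """Approximate 95% Wilson score interval for a failure rate."""
    if samples == 0:
        raise ValueError("need at least one sample")
    p = failures / samples
    denom = 1 + z * z / samples
    center = (p + z * z / (2 * samples)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / samples + z * z / (4 * samples * samples)
    ) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)

# Hypothetical bash tallies: cohort -> (failures, sampled prompts)
tallies = {"short_queries": (4, 120), "non_english": (16, 80)}
intervals = {cohort: wilson_interval(f, n) for cohort, (f, n) in tallies.items()}

for cohort, (low, high) in intervals.items():
    print(f"{cohort}: failure rate between {low:.3f} and {high:.3f}")
```

A cohort whose interval sits clearly above the others is a candidate regression; a single odd output inside an overlapping interval is more likely ordinary variation.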

The Eval Migration Tax: Why a Prompt Schema Change Wrecks 800 Test Cases

· 11 min read
Tian Pan
Software Engineer

Every AI team I've watched ship a "small" output schema change has lived through the same week. Someone renames a field in the system prompt — say, summary becomes tldr, or the tool catalog gains a required confidence parameter — and the next CI run lights up red across 800 eval cases that have nothing to do with the change. The prompt diff is fifteen lines. The eval diff is a four-day migration project nobody scoped, owned, or budgeted.

This is the eval migration tax. It is the maintenance cost no roadmap accounts for, paid in delayed releases that get blamed on "flaky tests" rather than the architectural choice that actually caused them. Most teams pay it for years before they recognize the pattern, because each individual incident looks like ordinary churn. The compounding only becomes visible when you tally the engineering hours spent migrating evals across a quarter and realize they exceed the hours spent improving the model behavior the evals were supposed to measure.

Snapshot Tests Lie When Your Model Is Stochastic

· 11 min read
Tian Pan
Software Engineer

The first time a junior engineer on your team types --update-snapshots and pushes to main, your test suite stops being a test suite. It becomes a transcript. The diffs still render in green and red, the CI badge still flips to passing, but the signal has quietly inverted: instead of telling you whether the code is correct, the suite now tells you whether anyone bothered to look at the output. With deterministic code that trade-off is acceptable, because most diffs really are intentional. With a stochastic model on the other end of a network call, the same workflow turns every PR into a coin flip, and every reviewer into a rubber stamp.

Snapshot testing was a beautiful idea for a deterministic world. You record what render(<Button />) produced last Tuesday, you assert that this Tuesday it produces the same string, and any diff is, by definition, a behavior change worth a human eyeball. The pattern survived Jest, Vitest, Pytest, the whole React ecosystem, and a generation of UI snapshot extensions, because the underlying contract held: same input plus same code equals same output. The contract does not hold for an LLM call. Same input plus same code plus same prompt produces a different string, and the difference is not a bug — it is the product working as designed.
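As an illustration of the broken contract, the sketch below contrasts exact-string comparison with checks on invariants that should hold across every sample. The JSON shape, the `summary` field, and the required terms are assumptions chosen for the example, not details from the post.

```python
import json

def passes_invariants(raw_output, required_terms, max_words=40):
    """Return True if the output satisfies structural and content invariants."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict) or not isinstance(data.get("summary"), str):
        return False
    summary = data["summary"].lower()
    if len(summary.split()) > max_words:
        return False
    return all(term.lower() in summary for term in required_terms)

# Two hypothetical samples from the same prompt: different strings, same facts.
run_a = '{"summary": "Refund of $40 approved for order 1182."}'
run_b = '{"summary": "Order 1182: the customer gets a $40 refund."}'
terms = ["refund", "1182", "$40"]

print(run_a == run_b)                    # a snapshot comparison would flag this
print(passes_invariants(run_a, terms))   # the invariant check accepts both
print(passes_invariants(run_b, terms))
```

The invariant check still rejects an output that drops a required fact, which is the behavior change a reviewer actually needs to see.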

Eval-Author Monoculture: Why Your Benchmark Becomes a Self-Portrait

· 11 min read
Tian Pan
Software Engineer

Green CI is not the statement "this prompt works." Green CI is the statement "the engineer who wrote the evals could not think of how this prompt should break." Those are very different claims, and the gap between them is where your production incidents live. An eval suite is not a measurement of your model — it is a frozen portrait of whoever wrote it. Their dialect, their domain knowledge, their seniority, their pet failure modes, the model they happened to be using when they wrote the test cases. Everything that engineer would not think to test is, by construction, untested. And worse: they will keep extending the suite from the same vantage point, so the blind spot does not shrink as the suite grows. It calcifies.

This is the eval-author monoculture problem, and it is the most under-discussed reliability risk in AI engineering today. Teams obsess over judge bias, position bias, verbosity bias, leakage, and contamination — but the upstream bias is the bias of the human who decided what the test cases should be in the first place. Every other source of eval error gets amplified by it. If your suite was written by one person, you have a benchmark with a personality, and that personality is the silent ceiling on what your CI can ever catch.

Your Eval Harness Runs Single-User. Your Agents Don't.

· 9 min read
Tian Pan
Software Engineer

Your agent passes 92% of your eval suite. You ship it. Within an hour of real traffic, something that never appeared in any trace is happening: agents are stalling on rate-limit retry storms, a customer sees another customer's draft email in a tool response, and your provider connection pool is sitting at 100% utilization while CPU is idle. None of these failures live in the model. They live in the gap between how you tested and how production runs.

The gap has a single shape. Your eval harness loops one agent at a time through a fixed dataset. Your production loops many agents at once through shared infrastructure. Sequential evaluation hides every bug whose precondition is "two things touching the same resource." Until you build adversarial concurrency into the harness itself, those bugs will only surface as on-call pages.
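A minimal sketch of why sequential evaluation hides this class of bug, using only `asyncio`. `SharedDraftTool` is a hypothetical tool with a deliberately planted flaw (one draft slot shared by all callers); it stands in for any shared resource and is not taken from the post.

```python
import asyncio

class SharedDraftTool:
    """Hypothetical tool with a planted bug: one draft slot for all callers."""

    def __init__(self):
        self._draft = None

    async def save_and_fetch(self, user, text):
        self._draft = (user, text)
        await asyncio.sleep(0)  # yield control, as a real network call would
        return self._draft

async def run_agent(tool, user):
    owner, _ = await tool.save_and_fetch(user, f"draft for {user}")
    return owner != user  # True means this user saw someone else's draft

async def sequential_leaks(users):
    tool = SharedDraftTool()
    return sum([await run_agent(tool, u) for u in users])

async def concurrent_leaks(users):
    tool = SharedDraftTool()
    return sum(await asyncio.gather(*(run_agent(tool, u) for u in users)))

USERS = ["alice", "bob", "carol"]
sequential = asyncio.run(sequential_leaks(USERS))
concurrent = asyncio.run(concurrent_leaks(USERS))
print(f"leaks when run one at a time: {sequential}")
print(f"leaks when run together: {concurrent}")
```

The same tool and the same agents produce no leaks in the sequential loop, because nothing else touches the slot between the write and the read.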

Eval Passed, With All Tools Mocked: Why Your Agent's Hardest Failures Never Reach the Harness

· 9 min read
Tian Pan
Software Engineer

Your agent hits 94% on the eval suite. Your on-call rotation is on fire. Nobody in the room is lying; both numbers are honest. What's happening is that the harness is testing a prompt, and production is testing an agent, and those are two different artifacts that happen to share weights.

Mocked-tool evals are almost always how this gap opens. You stub search_orders, charge_card, and send_email with canned JSON, feed the model a user turn, and assert on the final response. The run is cheap, deterministic, and reproducible — every property a CI system loves. It is also silent on tool selection, latency, rate limits, partial failures, and retry behavior, which is to say silent on the set of failures that dominate post-incident reviews.
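One way to keep mocks cheap while exercising partial failures is to inject faults into the stub. The sketch below is an assumption-laden illustration: `FlakyTool` and `call_with_retries` are hypothetical names, and the retry loop stands in for whatever retry policy the agent runtime actually applies.

```python
class FlakyTool:
    """Hypothetical stub that times out on its first `fail_first` calls."""

    def __init__(self, response, fail_first=2):
        self.response = response
        self.fail_first = fail_first
        self.calls = 0

    def __call__(self, **kwargs):
        self.calls += 1
        if self.calls <= self.fail_first:
            raise TimeoutError("injected timeout")
        return self.response

def call_with_retries(tool, max_attempts, **kwargs):
    """Call the tool, retrying on timeout; return (result, attempts used)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return tool(**kwargs), attempt
        except TimeoutError:
            if attempt == max_attempts:
                raise

search_orders = FlakyTool({"status": "ok", "orders": []}, fail_first=2)
result, attempts = call_with_retries(search_orders, max_attempts=3, order_id="A1")
print(result["status"], attempts)
```

An eval built on a stub like this can assert on attempt counts and on what the agent does when the retry budget runs out, neither of which a canned-JSON mock ever exercises.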

Semantic Diff for Prompts: Why Git Diff Lies About What Your Prompt Change Will Do

· 10 min read
Tian Pan
Software Engineer

A teammate opens a pull request that rewrites your agent's system prompt from 420 lines to 380. The diff is green-and-red carnage: deleted paragraphs, moved sections, tightened language. You approve it because the cleanup looks sensible. A week later, refund-request accuracy has dropped eight points and nobody can say which line did it.

A different teammate adds the word "concise" to one instruction. One word of diff. Nobody reviews it closely because there is almost nothing to review. That edit flips tool-call behavior on 22% of queries.

Contract Testing for AI Pipelines: Schema-Validated Handoffs Between AI Components

· 10 min read
Tian Pan
Software Engineer

Most AI pipeline failures aren't model failures. The model fires fine. The output looks like JSON. The downstream stage breaks silently because a field was renamed, a type changed, or a nested object gained a new required property that the next stage doesn't know how to handle. The pipeline runs to completion and reports success. Somewhere in the data warehouse, numbers are wrong.

This is the contract testing problem for AI pipelines, and it's one of the most underaddressed reliability risks in production AI systems. According to recent infrastructure benchmarks, the average enterprise AI system experiences nearly five pipeline failures per month—each taking over twelve hours to resolve. The dominant cause isn't poor model quality. It's data quality and schema contract violations: 64% of AI risk lives at the schema layer.
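A minimal sketch of a handoff check, using only the standard library. The contract below (a `summary` string and a `confidence` float) is a hypothetical example; a real pipeline would more likely use a schema library, which the excerpt does not name.

```python
# Hypothetical contract for one stage's output: field name -> expected type.
SUMMARY_CONTRACT = {"summary": str, "confidence": float}

def contract_violations(payload, contract):
    """Return a list of human-readable violations; empty means the payload conforms."""
    problems = []
    for field, expected_type in contract.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            actual = type(payload[field]).__name__
            problems.append(f"{field}: expected {expected_type.__name__}, got {actual}")
    return problems

# An upstream stage renamed `summary` to `tldr`; the payload is still valid JSON.
renamed = {"tldr": "Order shipped.", "confidence": 0.91}
print(contract_violations(renamed, SUMMARY_CONTRACT))
```

Running the check at the boundary turns a silent downstream break into a loud failure at the stage that caused it.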

The Production Distribution Gap: Why Your Internal Testers Can't Find the Bugs Users Do

· 11 min read
Tian Pan
Software Engineer

Your AI feature passed internal testing with flying colors. Engineers loved it, product managers gave the thumbs up, and the evals showed 94% accuracy on the benchmark suite. Then you shipped it, and within two weeks users were hitting failure modes you'd never seen: wrong answers, confused outputs, edge cases that made the model look embarrassingly bad.

This is the production distribution gap. It's not a new problem, but it's dramatically worse for AI systems than for deterministic software. Understanding why — and having a concrete plan to address it — is the difference between an AI feature that quietly erodes user trust and one that improves with use.

Testing the Retrieval-Generation Seam: The Integration Test Gap in RAG Systems

· 11 min read
Tian Pan
Software Engineer

Your retriever returns the right documents 94% of the time. Your LLM correctly answers questions given good context 96% of the time. Ship it. What could go wrong?

Multiply those numbers: 0.94 × 0.96 ≈ 0.90. If the two stages fail independently, you've lost roughly 10% of your queries before accounting for any edge cases, prompt formatting issues, token truncation, or the distractor documents your retriever surfaces alongside the correct ones. But the deeper problem isn't the arithmetic; it's that your unit tests will never catch this. The retriever passes its tests in isolation. The generator passes its tests in isolation. The thing that fails is the composition, and most teams have no tests for that.

This is the retrieval-generation seam: the interface between what your retriever hands off and what your generator can actually use. It's the most under-tested boundary in production RAG systems, and it's where most failures originate.
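The arithmetic is easy to reproduce. The sketch below multiplies per-stage accuracies, which assumes the stages fail independently; that assumption is a simplification for illustration, and correlated failures would shift the real number.

```python
def end_to_end_accuracy(stage_accuracies):
    """Product of per-stage accuracies, assuming independent failures."""
    result = 1.0
    for accuracy in stage_accuracies:
        result *= accuracy
    return result

# Retriever at 94%, generator at 96% given good context.
pipeline = end_to_end_accuracy([0.94, 0.96])
print(f"{pipeline:.4f}")
```

Adding a third stage at the same quality level pushes the product down further, which is why per-component pass rates say little about the composed system.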

Synthetic Eval Bootstrapping: How to Build Ground-Truth Datasets When You Have No Labeled Data

· 10 min read
Tian Pan
Software Engineer

The common failure mode isn't building AI features that don't work. It's shipping AI features without any way to know whether they work. And the reason teams skip evaluation infrastructure isn't laziness — it's that building evals requires labeled data, and on day one you have none.

This is the cold start problem for evals. To get useful signal, you need your system running in production. To deploy with confidence, you need evaluation infrastructure first. The circular dependency is real, and it causes teams to do one of three things: ship without evals and discover failures in production, delay shipping while hand-labeling data for months, or use synthetic evals — with all the risks that entails.

This post is about the third path done correctly. Synthetic eval bootstrapping works, but only if you understand what it cannot detect and build around those blind spots from the start.

Annotation-Free Evaluation: Measuring LLM Quality Before You Have Ground Truth

· 12 min read
Tian Pan
Software Engineer

Most teams ship an LLM feature, then spend weeks arguing about whether it's actually good. The evaluation question gets deferred because building a labeled dataset feels like a separate project. By the time you have ground truth, you've also accumulated two months of silent regressions you can never diagnose. This is backwards. You can get a meaningful quality signal in week one — before a single annotation is complete — if you know which techniques to reach for and where each one breaks.

This post is a field guide to annotation-free evaluation: the reference-free methods that work, the conditions they require, and the specific failure modes that will fool you if you're not careful.