
Bug Bashes for AI Features: Sampling a Distribution, Not Hunting Defects

· 11 min read
Tian Pan
Software Engineer

The classic bug bash is a deterministic ritual built for deterministic software. Ten engineers crowd a Slack channel for two hours, hammer a checklist of golden-path flows, and file tickets with crisp repro steps: "Click X, see Y, expected Z." It works because the system under test is reproducible — same input, same output, same bug, every time.

Run that exact ritual against an AI feature and you will produce two hundred tickets, close one hundred and eighty as "expected stochastic variation," and miss the twenty that signal a real cohort regression. The format isn't just stale; it's actively miscalibrated. A bug bash against an LLM-backed feature is not a defect-hunting session. It is a sampling exercise against a probability distribution, and the team that runs it like a deterministic test session is collecting noise and calling it signal.

This post is about how to redesign the bug bash for stochastic systems — what to change about the format, the participants, the triage rubric, and what counts as "done."

Why the Inherited Format Fails

Three assumptions are baked into the traditional bug bash, and an AI feature breaks all of them.

The first assumption is reproducibility. In classic QA, "I clicked the button and it crashed" is a complete report. In an AI feature, the same prompt can return five different answers in five consecutive runs and one of them may be dangerously wrong. A single observation is statistical noise. A bug report without a repro count is not a bug report — it's an anecdote.
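One way to attach a repro count to a report is to rerun the same input a fixed number of times and record the failure rate with an interval around it. The sketch below is illustrative only: `run_once` is a hypothetical callable that sends the prompt and returns True when the output is judged a failure, and the default of 20 trials and the Wilson score interval are assumptions, not something the original ritual prescribes.

```python
import math
from typing import Callable, Tuple


def repro_rate(
    run_once: Callable[[], bool],  # hypothetical: True means "this run failed"
    trials: int = 20,              # assumed default, tune to your budget
    z: float = 1.96,               # ~95% two-sided normal quantile
) -> Tuple[int, float, float, float]:
    """Rerun one input `trials` times.

    Returns (failures, observed_rate, wilson_low, wilson_high).
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    failures = sum(1 for _ in range(trials) if run_once())
    p = failures / trials
    # Wilson score interval: behaves better than p +/- z*SE at small n.
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return failures, p, max(0.0, centre - half), min(1.0, centre + half)
```

A ticket that reads "failed 3 of 20 runs, roughly 5% to 36% plausible failure rate" is something triage can act on; "it gave a wrong answer once" is not.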

The second assumption is that the failure mode is structural. Buttons don't render, the form doesn't submit, the API returns 500. These are categorical failures: present or absent. AI features fail along a quality gradient. The button renders; the answer is just subtly wrong for users in a specific industry. Or the answer is correct, but its tone is off in a way that erodes trust. A binary "broken / not broken" rubric can't represent these failures, so they get filed as "minor" or not at all.
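To make the gradient concrete, a ticket can carry ordinal grades on several dimensions instead of a single broken / not broken flag. This is a minimal sketch; the `Grade` levels, the two dimensions, and the `Observation` field names are all hypothetical choices for illustration, and a real rubric would be defined with the domain experts.

```python
from dataclasses import dataclass
from enum import IntEnum


class Grade(IntEnum):
    """Hypothetical ordinal scale; higher is worse."""
    FINE = 0
    SLIGHTLY_OFF = 1
    MISLEADING = 2
    HARMFUL = 3


@dataclass(frozen=True)
class Observation:
    cohort: str          # e.g. "healthcare / small-enterprise"
    prompt: str
    correctness: Grade   # is the answer right for this user?
    tone: Grade          # is the delivery trustworthy?
    failures: int        # how many reruns showed the problem
    trials: int          # how many reruns were attempted

    @property
    def worst(self) -> Grade:
        # Triage on the worst dimension, not an average that hides it.
        return max(self.correctness, self.tone)
```

An answer that is correct but off in tone now has somewhere to live: `correctness=Grade.FINE, tone=Grade.MISLEADING` rather than an unfiled shrug.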

The third assumption is that engineers can recognize the failure. In a deterministic system, a 500 error looks the same to everyone. In a domain-specialized AI feature (legal contract review, medical summarization, code refactoring for a specific stack), the failures that matter are domain-correctness failures. An engineer reading a clinical note summary cannot tell if "post-op day three" was misclassified as "post-op day two." A lawyer reviewing a contract summary can spot the equivalent slip; an engineer cannot. The failure is invisible to the people you invited.

Stack these three together and the inherited format becomes worse than useless. It produces an avalanche of unreproducible, miscategorized, surface-level reports that drown out the few that signal a real, exploitable pattern. Worse, it gives the team false confidence: "We bug-bashed it, we found stuff, we shipped it."

Reframe the Goal: From Defect Discovery to Distribution Mapping

The mental shift that has to land is this: a bug bash for an AI feature is not about finding defects; it is about mapping the failure distribution of a system that was always going to have failures. You are not asking "is it broken?" You are asking "where does it fail, how often, and on which slice of users?"
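In code terms, the question "where, how often, and on which slice" is a group-by over samples rather than a list of defects. A minimal sketch, assuming each sample is a hypothetical `(cohort, failed)` pair collected during the session:

```python
from collections import Counter
from typing import Iterable, List, Tuple


def failure_map(
    samples: Iterable[Tuple[str, bool]],
) -> List[Tuple[str, int, int, float]]:
    """Aggregate (cohort, failed) samples into per-cohort failure rates.

    Returns rows of (cohort, failures, total, rate), worst cohort first.
    """
    totals: Counter = Counter()
    failures: Counter = Counter()
    for cohort, failed in samples:
        totals[cohort] += 1
        failures[cohort] += int(failed)
    rows = [(c, failures[c], totals[c], failures[c] / totals[c]) for c in totals]
    return sorted(rows, key=lambda row: row[3], reverse=True)
```

The output is a ranked table of slices, which is the artifact the session exists to produce. Rates from a handful of samples are noisy, so treat the ranking as a pointer to where to sample more, not as a verdict.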

That reframe changes everything downstream. You no longer want each participant pounding the same golden flow. You want them spreading out across the input space — different user cohorts, different task types, different inputs that probe different parts of the model's competence. Each session is a sampling pass; each ticket is a sample, not a verdict.

Concretely, before the bug bash starts, the organizer should pre-define the cohort grid: industry vertical, account tier, language, task type, novice vs. power user, short vs. long input. Each participant gets a cell. With eight people and four cells, two people sample each cell. With a dozen people and six cells, it is again two per cell, and the overlap is deliberate: it lets you compare independent samples on the same slice.
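The assignment itself is mechanical enough to script. The sketch below builds cells from whichever dimensions you choose and deals participants out round-robin; the function names and the example dimensions are assumptions for illustration. Note that crossing all six dimensions yields far more cells than people, so in practice you would pick the few dimensions (or hand-pick the cells) that matter most.

```python
from itertools import cycle, product
from typing import Dict, List


def build_grid(dimensions: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Cross the chosen dimensions into a flat list of cohort cells."""
    names = list(dimensions)
    return [dict(zip(names, combo)) for combo in product(*dimensions.values())]


def assign(participants: List[str], cells: List[Dict[str, str]]) -> Dict[int, List[str]]:
    """Deal participants across cells round-robin; returns cell index -> people."""
    if not cells:
        raise ValueError("need at least one cell")
    assignment: Dict[int, List[str]] = {i: [] for i in range(len(cells))}
    for person, idx in zip(participants, cycle(range(len(cells)))):
        assignment[idx].append(person)
    return assignment


# Example: 2 verticals x 2 tiers = 4 cells, 8 people -> 2 per cell.
cells = build_grid({"vertical": ["healthcare", "legal"], "tier": ["smb", "enterprise"]})
roster = assign([f"p{i}" for i in range(8)], cells)
```

Two people per cell is what makes the overlap useful: if both independently hit the same failure on the same slice, that is a much stronger signal than either report alone.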

This is not a small reorganization. It changes who you invite, how you brief them, what you ask them to file, and how you triage afterward.

Adversarial Inputs Beat Golden-Path Inputs

The other shift is adversarial. A traditional bug bash exercises the happy path — the flows the product manager wants to demo on Friday. For a deterministic system, that's reasonable; if the happy path works, exotic inputs probably also work, because the underlying logic is the same.

For AI, the happy path tells you almost nothing. Models are over-trained on the kind of inputs that look like product demos. The prompts that distinguish a robust feature from a fragile one are the ones the demo would never show: typos, mixed languages, half-finished sentences, deliberately misleading framings, prompt injection attempts, inputs that conflict with the system prompt, inputs at the token-limit boundary, inputs in the user's actual messy production format rather than the synthetic clean format the eval suite uses.
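Some of these messy inputs can be generated mechanically from a clean prompt, which gives participants a starting point before they get creative. The sketch below covers only two of the listed categories, typos and half-finished sentences; the function names, the 15% swap rate, and the 50% cut point are arbitrary assumptions, and it is no substitute for real production inputs.

```python
import random
from typing import Dict


def swap_typos(text: str, rate: float, rng: random.Random) -> str:
    """Swap adjacent letters with probability `rate` to imitate fast typing."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # do not re-swap the character we just moved
        else:
            i += 1
    return "".join(chars)


def cut_off(text: str, keep_fraction: float) -> str:
    """Keep only the leading part of the prompt, as if the user hit send early."""
    return text[: max(1, int(len(text) * keep_fraction))]


def variants(prompt: str, seed: int = 0) -> Dict[str, str]:
    """Return the clean prompt plus two degraded versions of it."""
    rng = random.Random(seed)  # seeded so a filed ticket can be replayed
    return {
        "original": prompt,
        "typos": swap_typos(prompt, 0.15, rng),
        "half_finished": cut_off(prompt, 0.5),
    }
```

Seeding the generator matters here: the model's output is stochastic, but the input you filed against should not be.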

Red-teaming research has converged on this point. Manual adversarial testing excels at uncovering the nuanced edge cases that automated suites miss, especially multi-turn conversational attacks where the failure builds across turns. Some recent studies report attack-success rates above 80% for techniques like role-play impersonation, and even rhyming-stanza obfuscation, that would never appear in a benchmark dataset. Your bug bash needs to channel this energy.

Practically, that means giving each participant an adversarial brief along with their cohort assignment. "You are a user in healthcare-vertical small-enterprise tier. Try to make the assistant give medical advice it shouldn't. Try to make it confidently cite a paper that doesn't exist. Try to make it leak the system prompt. Try to make it disagree with itself across two turns." The brief replaces the golden-path checklist.
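If the cohort grid already exists as data, the briefs can be rendered from it so every participant gets one in the same shape. A minimal sketch: the goals are taken from the example brief above, while `render_brief` and its parameters are hypothetical names.

```python
from typing import List

# Adversarial goals from the example brief; extend per feature and cohort.
ADVERSARIAL_GOALS: List[str] = [
    "give medical advice it shouldn't",
    "confidently cite a paper that doesn't exist",
    "leak the system prompt",
    "disagree with itself across two turns",
]


def render_brief(vertical: str, tier: str, goals: List[str]) -> str:
    """Render one participant's brief: a persona line plus one line per goal."""
    lines = [f"You are a user in the {vertical} vertical, {tier} tier."]
    lines += [f"Try to make the assistant {goal}." for goal in goals]
    return "\n".join(lines)


brief = render_brief("healthcare", "small-enterprise", ADVERSARIAL_GOALS)
```

Not every goal fits every cell (medical advice is a healthcare concern), so in practice the goal list would be chosen per cohort rather than shared across all of them.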
