
55 posts tagged with "testing"


The Testing Pyramid Inverts for AI: Why Unit Tests Are the Wrong Investment for LLM Features

· 10 min read
Tian Pan
Software Engineer

Your team ships a new LLM feature. The unit tests pass. CI is green. You deploy. Then users start reporting that the AI "just doesn't work right" — answers are weirdly formatted, the agent picks the wrong tool, context gets lost halfway through a multi-step task. You look at the test suite and it's still green. Every test passes. The feature is broken.

This is not bad luck. It is what happens when you apply a deterministic testing philosophy to a probabilistic system. The classic testing pyramid — wide base of unit tests, smaller middle layer of integration tests, narrow top of end-to-end tests — rests on one assumption so fundamental that nobody writes it down: the code does the same thing every time. LLMs violate this assumption at every level. The testing strategy built on top of it needs to be rebuilt from scratch.
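
To make the inversion concrete, here is a minimal sketch (not from the post itself) of why a deterministic-style assertion misleads for an LLM feature while a property-style check survives valid variation; the answer strings and thresholds are purely illustrative:

```python
def exact_match(output: str) -> bool:
    # Deterministic-style unit test: fails on the perfectly valid answer
    # "The capital of France is Paris."
    return output == "Paris"

def property_check(output: str) -> bool:
    # Probabilistic-style check: assert invariants, not exact strings.
    return "paris" in output.lower() and len(output) < 200

for candidate in ["Paris", "The capital of France is Paris.", "Lyon is nice."]:
    print(candidate, exact_match(candidate), property_check(candidate))
```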

The Eval Smell Catalog: Anti-Patterns That Make Your LLM Eval Suite Worse Than No Evals At All

· 12 min read
Tian Pan
Software Engineer

A team I worked with last year had an eval suite with 847 test cases, a green dashboard, and a shipping cadence that looked disciplined from the outside. Then their flagship summarization feature started generating confidently wrong summaries for roughly one in twenty customer support threads. The eval score for that capability had been 94% for six months straight. When we audited the suite, the problem wasn't that the evals were lying. The problem was that the evals had quietly rotted into something that measured the wrong thing, punished correct model behavior, and shared blind spots with the very model it was evaluating. The suite wasn't broken in the loud way tests break. It was broken in the way a thermometer is broken when it reads room temperature no matter where you put it.

Test smells have been studied for two decades in traditional software. The Van Deursen catalog, the xUnit patterns taxonomy, and more recent work have documented how tests that look fine can actively harm a codebase — by encoding the wrong specification, by making refactors expensive, by creating false confidence that pushes the real bugs deeper. LLM evals are new enough that the equivalent literature barely exists, but the same dynamic is already playing out in every AI team I talk to. The difference is that LLM eval smells have mechanisms traditional tests don't: training data overlap, stochastic outputs, judge-model feedback loops, capability drift. You can't just port the old taxonomy. You need a new one.

The Implicit API Contract: What Your LLM Provider Doesn't Document

· 10 min read
Tian Pan
Software Engineer

Your LLM provider's SLA covers HTTP uptime and Time to First Token. It says nothing about whether the model will still follow your formatting instructions next month, refuse requests it accepted last week, or return valid JSON under edge-case conditions you haven't tested. Most engineering teams discover this the hard way — via a production incident, not a changelog.

This is the implicit API contract problem. Traditional APIs promise stable, documented behavior. LLM providers promise a connection. Everything between the request and what your application does with the response is on you.
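
As a rough illustration, here is one way an application might enforce its own contract on the response; `call_llm`, the schema keys, and the retry count are assumptions for the sketch, not anything the provider documents:

```python
import json

REQUIRED_KEYS = {"title", "summary"}      # assumed schema, for illustration only

def parse_with_contract(raw: str) -> dict:
    """Enforce locally what the provider never promised: valid JSON, known keys."""
    data = json.loads(raw)                          # raises on malformed JSON
    if not isinstance(data, dict) or REQUIRED_KEYS - data.keys():
        raise ValueError(f"response violates schema: {raw[:80]!r}")
    return data

def call_with_contract(call_llm, prompt: str, attempts: int = 3) -> dict:
    """Retry when the response drifts out of contract; fail loudly after that."""
    last_error = None
    for _ in range(attempts):
        try:
            return parse_with_contract(call_llm(prompt))
        except (json.JSONDecodeError, ValueError) as err:
            last_error = err                        # behavior drifted; log and retry
    raise RuntimeError(f"contract still violated after {attempts} attempts") from last_error
```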

Keeping Synthetic Eval Data Honest

· 9 min read
Tian Pan
Software Engineer

A safety model scored 85.3% accuracy on its public benchmark test set. When researchers tested it on novel adversarial prompts not derived from public datasets, that number dropped to 33.8%. The model hadn't learned to reason about safety. It had learned to recognize the evaluation distribution.

This is the problem at the center of synthetic eval data: when the same model family generates both your training data and your test cases, passing the eval means conforming to a shared statistical prior—not demonstrating actual capability. It's a feedback loop that looks like quality assurance until production traffic arrives and the numbers don't hold.

The failure is structural, not incidental. And fixing it requires more than adding more synthetic examples.
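
One honesty check, sketched below under the assumption that your synthetic cases were prompted from a known seed corpus: screen candidates for n-gram overlap with that corpus before they enter the eval set. The helper names and threshold are illustrative, and overlap filtering alone is not a complete fix:

```python
def ngrams(text: str, n: int = 5) -> set:
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(candidate: str, corpus: list[str], n: int = 5) -> float:
    # Fraction of the candidate's n-grams that already appear in the seed corpus.
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    corpus_grams = set().union(*(ngrams(doc, n) for doc in corpus))
    return len(cand & corpus_grams) / len(cand)

def filter_synthetic_cases(cases: list[str], seed_corpus: list[str],
                           max_overlap: float = 0.2) -> list[str]:
    # Keep only cases that look distributionally distinct from the seeds.
    return [c for c in cases if overlap_ratio(c, seed_corpus) <= max_overlap]
```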

Multi-Session Eval Design: Catching the AI Feature That Gets Worse Over Time

· 11 min read
Tian Pan
Software Engineer

Your AI feature passed every eval at launch. Six weeks in, churn in the cohort that talks to it most has doubled, and your CSAT dashboard shows a flat line that no one can explain. The prompts haven't changed, the model hasn't been swapped, the retrieval index has grown but nobody thinks it's broken. What shipped was fine on turn one. What rots is what happens on turn four hundred, in session seventeen, three weeks after signup.

Most teams' eval suites can't see this failure. They test single-turn accuracy on a fixed dataset, maybe single-session multi-turn if they're ambitious, and then declare the feature shippable. The failure mode that matters — quality that degrades as the system accumulates state about a user — lives in a temporal dimension the eval harness was never built to cover. Researchers call it "self-degradation" in the memory literature: a clear, sustained performance decline after the initial phase, driven by memory inflation and the accumulation of flawed memories. Production engineers call it the reason their retention cohort silently bleeds.
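
A minimal sketch of what a longitudinal harness could look like, assuming a hypothetical `make_agent` factory whose memory persists across sessions and a `score` function that rates a single response from 0 to 1; the window and drop threshold are placeholders:

```python
from statistics import mean

def run_longitudinal_eval(make_agent, sessions, score):
    """Replay scripted sessions in order against one persistent agent,
    returning mean quality per session so drift over time becomes visible."""
    agent = make_agent()                      # state/memory persists across sessions
    per_session = []
    for session in sessions:                  # each session is a list of user turns
        scores = [score(turn, agent.respond(turn)) for turn in session]
        per_session.append(mean(scores))
    return per_session

def assert_no_degradation(per_session, window=3, max_drop=0.10):
    # Compare early-session quality to late-session quality.
    early, late = mean(per_session[:window]), mean(per_session[-window:])
    assert early - late <= max_drop, f"quality dropped {early - late:.2f} over sessions"
```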

The Agent Test Pyramid: Why the 70/20/10 Split Breaks Down for Agentic AI

· 12 min read
Tian Pan
Software Engineer

Every engineering organization that graduates from "we have a chatbot" to "we have an agent" hits the same wall: their test suite stops making sense.

The classical test pyramid — 70% unit tests, 20% integration tests, 10% end-to-end — is built on three foundational assumptions: units are cheap to run, isolated from external systems, and deterministic. Agentic AI systems violate all three at once. A "unit" is a model call that costs tokens and returns different answers each time. An end-to-end run can take several minutes and burn through more API budget than an entire sprint's worth of conventional tests. And isolation is nearly impossible when the agent's intelligence emerges precisely from interacting with external tools and state.
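
One way teams adapt the base of the pyramid is record/replay caching, so agent-level "unit" tests stay cheap and repeatable between commits. A rough sketch, with `call_model` standing in for your provider client and the cache location assumed:

```python
import hashlib
import json
import pathlib

CACHE_DIR = pathlib.Path(".llm_cache")        # assumed location, adjust per repo

def cached_call(call_model, prompt: str, live: bool = False) -> str:
    """Replay a recorded response in CI; only hit the real model when live=True."""
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(prompt.encode()).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists() and not live:
        return json.loads(path.read_text())["response"]
    response = call_model(prompt)             # costs tokens, nondeterministic
    path.write_text(json.dumps({"prompt": prompt, "response": response}))
    return response
```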

Why Your AI Demo Always Outperforms Your Launch

· 8 min read
Tian Pan
Software Engineer

The demo was spectacular. The model answered every question fluently, summarized documents without hallucination, and handled every edge case you threw at it. Stakeholders were impressed. The launch date was set.

Three weeks after shipping, accuracy was somewhere around 60%. Users were confused. Tickets were piling up. The model that aced your showcase was stumbling through production traffic.

This is not a story about a bad model. It is a story about a mismatch that almost every team building LLM features encounters: the inputs you tested on are not the inputs your users send.
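
A quick, assumption-laden sketch of how you might compare the curated demo set against sampled production traffic before trusting demo numbers; the metrics here are deliberately crude:

```python
from statistics import mean, median

def length_profile(texts):
    # Word-length profile of a set of inputs (assumes a non-empty list).
    lengths = [len(t.split()) for t in texts]
    return {"mean": mean(lengths), "median": median(lengths), "max": max(lengths)}

def vocab_coverage(demo_inputs, production_sample):
    # Share of production vocabulary that the demo inputs ever exercised.
    demo_vocab = {w for t in demo_inputs for w in t.lower().split()}
    prod_vocab = {w for t in production_sample for w in t.lower().split()}
    return len(prod_vocab & demo_vocab) / len(prod_vocab) if prod_vocab else 1.0

# If production inputs are several times longer, or share little vocabulary
# with the demo set, the demo score says little about launch behavior.
```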

Behavioral Contracts: Writing AI Requirements That Engineers Can Actually Test

· 11 min read
Tian Pan
Software Engineer

Most AI projects that die in the QA phase don't fail because the model is bad. They fail because nobody agreed on what "good" meant before the model was built. The acceptance criteria in the ticket said something like "the summarization feature should produce accurate, relevant summaries" — and when the engineer asked what "accurate" meant, the answer was "you know it when you see it." That is not a behavioral requirement. That is a hope.

The problem compounds because teams imported their existing requirements process from deterministic software and applied it unchanged to systems that are fundamentally stochastic. When you write assertTrue(output.equals("Paris")) for a database query, the test either passes or fails with complete certainty. When you write the same shape of assertion for an LLM, you get a test that fails on every valid paraphrase and passes on every confident hallucination. The unit test is lying to you, and the spec it was derived from was never designed for a system that generates distributions of outputs rather than single values.
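
Here is a hedged sketch of what a behavioral contract can look like in test code: named checks plus a pass-rate threshold over sampled outputs, instead of a single exact-match assertion. The specific checks and threshold are illustrative:

```python
def must_mention_city(output: str) -> bool:
    return "paris" in output.lower()

def must_be_concise(output: str) -> bool:
    return len(output.split()) <= 30

CONTRACT = {"mentions_paris": must_mention_city, "concise": must_be_concise}

def evaluate_contract(outputs, contract=CONTRACT, min_pass_rate=0.95):
    """Run every check over N sampled outputs and require each to pass at a rate,
    acknowledging the system produces a distribution, not a single value."""
    results = {}
    for name, check in contract.items():
        rate = sum(check(o) for o in outputs) / len(outputs)
        results[name] = rate
        assert rate >= min_pass_rate, f"{name} pass rate {rate:.2f} below threshold"
    return results
```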

The Integration Test Mirage: Why Mocked Tool Outputs Hide Your Agent's Real Failure Modes

· 11 min read
Tian Pan
Software Engineer

Your agent passes every test. The CI pipeline is green. You ship it.

A week later, a user reports that their bulk-export job silently returned 200 records instead of 14,000. The agent hit the first page of a paginated API, got a clean response, assumed there was nothing more, and moved on. Your mock returned all 200 items in one shot. The real API never told the agent there were another 69 pages.

This is not a model failure. The model reasoned correctly. This is a test infrastructure failure — and it's endemic to how teams build and test agentic systems.
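
A sketch of the alternative: a stateful paginated fake instead of a one-shot mock, so the test surface actually includes the pagination signal the agent ignored. The record counts mirror the incident above; the API shape is assumed:

```python
class PaginatedFakeAPI:
    def __init__(self, records, page_size=200):
        self.records, self.page_size = records, page_size

    def fetch(self, page: int = 1) -> dict:
        start = (page - 1) * self.page_size
        chunk = self.records[start:start + self.page_size]
        total_pages = -(-len(self.records) // self.page_size)   # ceiling division
        return {"items": chunk, "page": page, "total_pages": total_pages,
                "has_more": page < total_pages}

fake = PaginatedFakeAPI(records=list(range(14_000)))
first = fake.fetch()
assert first["has_more"]             # a one-shot mock never exposes this flag
assert first["total_pages"] == 70
```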

Goodhart's Law in Your LLM Eval Suite: When Optimizing the Score Breaks the System

· 9 min read
Tian Pan
Software Engineer

Andrej Karpathy put it bluntly: AI labs were "overfitting" to Arena rankings. One major lab privately evaluated 27 model variants before their public release, publishing only the top performer. Researchers estimated that selective submission alone could artificially inflate leaderboard scores by as much as 112 points. The crowdsourced evaluation system that everyone pointed to as ground truth had become a target — and once it became a target, it stopped being a useful measure.

This is Goodhart's Law in action: when a measure becomes a target, it ceases to be a good measure. It's been well-understood in economics and policy for decades. In LLM engineering, it's actively destroying eval suites right now, often without the teams building them realizing it.

Chaos Engineering for AI Agents: Injecting the Failures Your Agents Will Actually Face

· 9 min read
Tian Pan
Software Engineer

Your agent works perfectly in staging. It calls the right tools, reasons through multi-step plans, and returns polished results. Then production happens: the geocoding API times out at step 3 of a 7-step plan, the LLM returns a partial response mid-sentence, and your agent confidently fabricates data to fill the gap. Nobody notices until a customer does.

LLM API calls fail 1–5% of the time in production — rate limits, timeouts, server errors. For a multi-step agent making 10–20 tool calls per task, that means a meaningful percentage of tasks will hit at least one failure. The question isn't whether your agent will encounter faults. It's whether you've ever tested what happens when it does.
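
As a starting point, a fault-injection wrapper around tool calls might look something like the sketch below; the failure rates and error types are illustrative, not recommendations:

```python
import random

class ToolTimeout(Exception): ...
class RateLimited(Exception): ...

def chaos_wrap(tool_fn, timeout_rate=0.03, rate_limit_rate=0.02,
               truncate_rate=0.02, seed=None):
    """Wrap a tool call and randomly inject the failures production will produce."""
    rng = random.Random(seed)
    def wrapped(*args, **kwargs):
        roll = rng.random()
        if roll < timeout_rate:
            raise ToolTimeout("injected timeout")
        if roll < timeout_rate + rate_limit_rate:
            raise RateLimited("injected 429")
        result = tool_fn(*args, **kwargs)
        if isinstance(result, str) and rng.random() < truncate_rate:
            return result[: len(result) // 2]    # simulate a partial response
        return result
    return wrapped

# Run the agent eval suite with every tool wrapped, and assert it degrades
# gracefully (retries, reports uncertainty) rather than fabricating data.
```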

Property-Based Testing for LLM Systems: Invariants That Hold Even When Outputs Don't

· 12 min read
Tian Pan
Software Engineer

A product team at a fintech company shipped an LLM-powered document summarizer. Their eval dataset — 200 hand-curated examples with human ratings — scored 87% quality. In production, the system occasionally returned summaries longer than the original documents when users uploaded short memos. The eval set had no memos under 300 words. The property "output length ≤ input length for summarization tasks" was never tested. Nobody noticed until a customer screenshotted the absurdity and posted it online.

This is the fundamental gap that property-based testing (PBT) fills. Eval datasets measure accuracy on what you thought to test. Property-based tests measure whether your system obeys a contract across the entire space of what could happen.
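
A minimal sketch of that contract using Hypothesis, with `summarize` as a placeholder for the real LLM-backed pipeline; in practice you would cap example counts to control API cost:

```python
from hypothesis import given, settings, strategies as st

def summarize(text: str) -> str:
    # Placeholder standing in for the real LLM-backed summarizer.
    return text[:100]

@settings(max_examples=50, deadline=None)    # keep the budget bounded
@given(st.text(min_size=1, max_size=2000))
def test_summary_never_longer_than_input(document):
    # The property must hold for any input, not just the curated eval set.
    summary = summarize(document)
    assert len(summary) <= len(document)
```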