46 posts tagged with "testing"

Testing the Retrieval-Generation Seam: The Integration Test Gap in RAG Systems

· 11 min read
Tian Pan
Software Engineer

Your retriever returns the right documents 94% of the time. Your LLM correctly answers questions given good context 96% of the time. Ship it. What could go wrong?

Multiply those numbers: 0.94 × 0.96 ≈ 0.90. You've lost 10% of your queries before accounting for any edge cases, prompt formatting issues, token truncation, or the distractor documents your retriever surfaces alongside the correct ones. But the deeper problem isn't the arithmetic — it's that your unit tests will never catch this. The retriever passes its tests in isolation. The generator passes its tests in isolation. The thing that fails is the composition, and most teams have no tests for that.

This is the retrieval-generation seam: the interface between what your retriever hands off and what your generator can actually use. It's the most under-tested boundary in production RAG systems, and it's where most failures originate.
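A seam test is simpler than it sounds. Here is a minimal sketch (the retrieve() and generate() helpers are hypothetical stand-ins for your own retriever and LLM wrapper) where the assertion runs against the composed pipeline rather than either half in isolation:

```python
def answer(query: str) -> str:
    docs = retrieve(query, top_k=5)                  # the real retriever, real index
    context = "\n\n".join(d.text for d in docs)      # the exact formatting production uses
    return generate(query=query, context=context)    # the real prompt template

def test_seam_refund_policy():
    # The retriever and generator each pass their own unit tests;
    # this only passes if the handoff between them works too.
    result = answer("What is the refund window for annual plans?")
    assert "30 days" in result
```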

Synthetic Eval Bootstrapping: How to Build Ground-Truth Datasets When You Have No Labeled Data

· 10 min read
Tian Pan
Software Engineer

The common failure mode isn't building AI features that don't work. It's shipping AI features without any way to know whether they work. And the reason teams skip evaluation infrastructure isn't laziness — it's that building evals requires labeled data, and on day one you have none.

This is the cold start problem for evals. To get useful signal, you need your system running in production. To deploy with confidence, you need evaluation infrastructure first. The circular dependency is real, and it causes teams to do one of three things: ship without evals and discover failures in production, delay shipping while hand-labeling data for months, or use synthetic evals — with all the risks that entails.

This post is about the third path done correctly. Synthetic eval bootstrapping works, but only if you understand what it cannot detect and build around those blind spots from the start.

Annotation-Free Evaluation: Measuring LLM Quality Before You Have Ground Truth

· 12 min read
Tian Pan
Software Engineer

Most teams ship an LLM feature, then spend weeks arguing about whether it's actually good. The evaluation question gets deferred because building a labeled dataset feels like a separate project. By the time you have ground truth, you've also accumulated two months of silent regressions you can never diagnose. This is backwards. You can get a meaningful quality signal in week one — before a single annotation is complete — if you know which techniques to reach for and where each one breaks.

This post is a field guide to annotation-free evaluation: the reference-free methods that work, the conditions they require, and the specific failure modes that will fool you if you're not careful.

Dev/Prod Parity for AI Apps: The Seven Ways Your Staging Environment Is Lying to You

· 11 min read
Tian Pan
Software Engineer

The 12-Factor App doctrine made dev/prod parity famous: keep development, staging, and production as similar as possible. For traditional web services, this is mostly achievable. For LLM applications, it is structurally impossible — and the gap is far larger than most teams realize.

The problem is not that developers are careless. It is that LLM applications depend on a class of infrastructure (cached computation, living model weights, evolving vector indexes, and stochastic generation) where the differences between staging and production are not merely inconvenient but different in kind. A staging environment that looks correct will lie to you in at least seven specific ways.

The Long-Tail Coverage Problem: Why Your AI System Fails Where It Matters Most

· 10 min read
Tian Pan
Software Engineer

A medical AI deployed to a hospital achieves 97% accuracy in testing. It passes every internal review, gets shipped, and then quietly fails to detect parasitic infections when parasite density drops below 1% of cells — the exact scenario where early intervention matters most. Nobody notices until a physician flags an unusual miss rate on a specific patient population.

This is the long-tail coverage problem. Your aggregate metrics look fine. Your system is broken for the inputs that matter.

Shadow Traffic for AI Systems: The Safest Way to Validate Model Changes Before They Ship

· 10 min read
Tian Pan
Software Engineer

Most teams ship LLM changes the way they shipped web changes in 2005 — they run some offline evals, convince themselves the numbers look fine, and push. The surprise comes on Monday morning when a system prompt tweak that passed every benchmark silently breaks the 40% of user queries that weren't in the eval set.

Shadow traffic is the fix. The idea is simple: run your candidate model or prompt in parallel with production, feed it every real request, compare the outputs, and only expose users to the current version. Zero user exposure, real production data, and statistical confidence before anyone sees the change. But applying this to LLMs requires rethinking almost every piece of the implementation — because language models are non-deterministic, expensive to evaluate, and produce outputs that can't be compared with a simple diff.
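A minimal version of the pattern, assuming a hypothetical call_model() client and log_comparison() sink, looks something like this: the user only ever waits on the production call, and the candidate's output goes to a log for offline scoring.

```python
import asyncio

async def handle_request(request: dict) -> str:
    prod_task = asyncio.create_task(call_model("prod-model", request))
    shadow_task = asyncio.create_task(call_model("candidate-model", request))

    prod_output = await prod_task                     # user latency depends only on production
    asyncio.create_task(_record(request, prod_output, shadow_task))
    return prod_output                                # the candidate output never reaches users

async def _record(request, prod_output, shadow_task):
    try:
        shadow_output = await shadow_task
        log_comparison(request, prod=prod_output, shadow=shadow_output)
    except Exception as exc:                          # shadow failures must never surface
        log_comparison(request, prod=prod_output, shadow_error=str(exc))
```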

The LLM Local Development Loop: Fast Iteration Without Burning Your API Budget

· 10 min read
Tian Pan
Software Engineer

Most teams building LLM applications discover the same problem around week three: every time someone runs the test suite, it fires live API calls, costs real money, takes 30+ seconds, and returns different results on each run. The "just hit the API" approach that felt fine during the prototype phase becomes a serious tax on iteration speed — and a meaningful line item on the bill. One engineering team audited their monthly API spend and found $1,240 out of $2,847 (43%) was pure waste from development and test traffic hitting live endpoints unnecessarily.

The solution is not to stop testing. It is to build the right kind of development loop from the start — one where the fast path is cheap and deterministic, and the slow path (real API calls) is reserved for when it actually matters.
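One way to build that split is a record/replay cache, sketched here with a hypothetical call_llm_live() around your real client: tests read recorded responses by default, and a recording run (gated here by an LLM_RECORD environment variable) refreshes them with real API calls.

```python
import hashlib
import json
import os

CACHE_DIR = ".llm_cache"   # committed fixtures or gitignored scratch space, your call

def call_llm(prompt: str, model: str) -> str:
    key = hashlib.sha256(json.dumps({"prompt": prompt, "model": model}).encode()).hexdigest()
    path = os.path.join(CACHE_DIR, f"{key}.json")

    if os.path.exists(path) and os.environ.get("LLM_RECORD") != "1":
        with open(path) as f:
            return json.load(f)["response"]              # fast path: free and deterministic

    response = call_llm_live(prompt=prompt, model=model)  # slow path: real API call
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(path, "w") as f:
        json.dump({"prompt": prompt, "model": model, "response": response}, f)
    return response
```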

Model Deprecation Readiness: Auditing Your Behavioral Dependency Before the 90-Day Countdown

· 8 min read
Tian Pan
Software Engineer

When Anthropic deprecated a Claude model last year, a company noticed — but only because a downstream parser started throwing errors in production. The culprit? The new model occasionally wrapped its JSON responses in markdown code blocks. The old model never did. Nobody had documented that assumption. Nobody had tested for it. The fix took an afternoon; the diagnosis took three days.

That pattern — silent behavioral dependency breaking loudly in production — is the defining failure mode of model migrations. You update a model ID, run a quick sanity check, and ship. Six weeks later, something subtle is wrong. Your JSON parsing is 0.6% more likely to fail. Your refusal rate on edge cases doubled. Your structured extraction misses a field it used to reliably populate. The diff isn't in the code — it's in the model's behavior, and you never wrote a contract for it.

With major providers now running on 60–180 day deprecation windows, and the pace of model releases accelerating, this is no longer a theoretical concern. It's a recurring operational challenge. Here's how to get ahead of it.
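Getting ahead of it starts with writing the assumption down as a test. A minimal sketch of a behavioral contract, using a hypothetical extract_invoice() wrapper and SAMPLE_DOCUMENT fixture, pins exactly the assumption that broke in the story above:

```python
import json

def test_extraction_output_is_raw_json():
    raw = extract_invoice(SAMPLE_DOCUMENT)
    # Pin the undocumented assumption: the model returns bare JSON,
    # not JSON wrapped in a markdown code fence.
    assert not raw.lstrip().startswith("`"), "model wrapped its JSON in a markdown fence"
    parsed = json.loads(raw)                              # fails loudly if fenced or malformed
    assert {"invoice_id", "total", "currency"} <= parsed.keys()  # fields downstream code relies on
```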

Prompt Regression Tests That Actually Block PRs

· 10 min read
Tian Pan
Software Engineer

Ask any AI engineering team if they test their prompts and they'll say yes. Ask if a bad prompt can fail a pull request and block a merge, and you'll get a much quieter room. The honest answer for most teams is no — they have eval notebooks they run occasionally, maybe a shared Notion doc of known prompt quirks, and a vague sense that things are worse than they used to be. That is not testing. That is hoping.

The gap exists because prompt testing feels qualitatively different from unit testing. Code either behaves correctly or it doesn't. Prompts produce outputs on a spectrum, outputs are non-deterministic, and running enough examples to feel confident costs real money. Those are real constraints. None of them are insurmountable. Teams that have built prompt CI that actually blocks merges are not spending fifty dollars a build; they run the suite in under three minutes for under a dollar, thanks to a few design decisions that make the problem tractable.
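A minimal sketch of what such a gate can look like, assuming hypothetical load_eval_cases() and run_prompt() helpers, an illustrative eval file path, and a curated set of a few dozen cases (small enough to stay inside a CI time and cost budget):

```python
PASS_THRESHOLD = 0.90   # tuned against your baseline; the point is that it blocks the merge

def test_prompt_meets_quality_bar():
    cases = load_eval_cases("evals/support_prompt.jsonl")
    passed = sum(1 for case in cases if case.check(run_prompt(case.input)))
    pass_rate = passed / len(cases)
    assert pass_rate >= PASS_THRESHOLD, (
        f"prompt pass rate {pass_rate:.0%} is below {PASS_THRESHOLD:.0%}; merge blocked"
    )
```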

AI Agents in Your CI Pipeline: How to Gate Deployments That Can't Be Unit Tested

· 10 min read
Tian Pan
Software Engineer

Shipping a feature that calls an LLM is easy. Knowing whether the next version of that feature is better or worse than the one in production is hard. Traditional CI/CD gives you a pass/fail signal on deterministic behavior: either the function returns the right value or it doesn't. But when the function wraps a language model, the output is probabilistic — the same input produces different outputs across runs, across model versions, and across days.

Most teams respond to this by skipping the problem. They run their unit tests, do a quick manual check on a few prompts, and ship. That works until it doesn't — until a model provider silently updates the underlying weights, or a prompt change that looked fine in isolation shifts the output distribution in ways that only become obvious in production at 3 AM.

The better answer isn't to pretend LLM outputs are deterministic. It's to build CI gates that operate on distributions, thresholds, and rubrics rather than exact matches.
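Concretely, a gate like that compares score distributions with a margin instead of diffing strings. A minimal sketch, with a hypothetical run_eval() that returns per-case rubric scores between 0 and 1:

```python
from statistics import mean

REGRESSION_MARGIN = 0.03   # tolerate run-to-run noise, block real regressions

def gate(baseline_scores: list[float], candidate_scores: list[float]) -> None:
    baseline, candidate = mean(baseline_scores), mean(candidate_scores)
    if candidate < baseline - REGRESSION_MARGIN:
        raise SystemExit(
            f"blocked: candidate mean {candidate:.3f} vs baseline {baseline:.3f} "
            f"(allowed margin {REGRESSION_MARGIN})"
        )

if __name__ == "__main__":
    gate(run_eval("prod-prompt"), run_eval("candidate-prompt"))
```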

API Contracts for Non-Deterministic Services: Versioning When Output Shape Is Stochastic

· 9 min read
Tian Pan
Software Engineer

Your content moderation service returns {"severity": "MEDIUM", "confidence": 0.85}. The downstream billing system parses severity as an enum with values ["low", "medium", "high"]. A model update causes the service to occasionally return "Medium" with a capital M. No deployment happened. No schema changed. The integration breaks in production, and nobody catches it for six days because the HTTP status codes are all 200.

This is the foundational problem with API contracts for LLM-backed services: the surface looks like a REST API, but the behavior underneath is probabilistic. Standard contract tooling assumes determinism. When that assumption breaks, it breaks silently.
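One pragmatic response is a tolerant-but-explicit validation layer at the boundary: normalize the drift you are willing to accept, and fail loudly on anything else. A minimal sketch, mirroring the field names from the example above (everything else is hypothetical):

```python
ALLOWED_SEVERITIES = {"low", "medium", "high"}

def validate_moderation_response(payload: dict) -> dict:
    severity = str(payload.get("severity", "")).strip().lower()   # "Medium" -> "medium"
    if severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"severity {payload.get('severity')!r} violates the contract")
    confidence = float(payload["confidence"])
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence {confidence} is out of range")
    return {"severity": severity, "confidence": confidence}
```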

The Demo-to-Production Failure Pattern: Why AI Prototypes Collapse When Real Users Arrive

· 10 min read
Tian Pan
Software Engineer

Thirty percent of generative AI projects are abandoned after proof of concept. Ninety-five percent of enterprise pilots deliver zero measurable business impact. Gartner projects 40% of agentic AI projects will be canceled before the end of 2027. These aren't failures of the underlying technology — they're failures of the gap between demo and production.

The demo-to-production failure pattern is predictable, repeatable, and almost entirely preventable. It happens because the conditions that make a demo look great are systematically different from the conditions that make production work. Teams optimize for the former and get ambushed by the latter.