990 posts tagged with "insider"

The AI Reliability Floor: Why 80% Accurate Is Worse Than No AI at All

April 16, 2026 · 9 min read

Software Engineer

Most teams measure AI feature quality by asking "how often is it right?" The more useful question is "how often does being wrong destroy trust faster than being right builds it?" These questions have different answers — and only the second one tells you whether to ship.

There is a reliability floor below which an AI feature does more damage than no feature at all. Below it, users learn to distrust the AI after enough errors, and that distrust generalizes: they stop trusting the feature when it is correct, they route around it, and eventually they stop using it entirely. At that point, you have not shipped a partially-useful product; you have shipped a conversion and retention hazard disguised as a feature.

The AI Procurement Gap: Why Your Vendor Evaluation Process Can't Handle Probabilistic Systems

April 16, 2026 · 11 min read

Tian Pan

Software Engineer

A procurement team I worked with spent eleven weeks scoring four LLM vendors against a 312-row RFP spreadsheet. They negotiated 99.9% uptime, $0.0008 per 1K input tokens, SOC 2 Type II, and a glossy benchmark PDF that put their selected vendor 2.3 points ahead on MMLU. The contract was signed on a Friday. The following Tuesday, the vendor silently rolled a model update, and the customer-support agent the team had built started routing roughly 14% of refund requests to the wrong queue. The uptime SLA was honored. The benchmark scores were unchanged. The procurement process had functioned exactly as designed, and the system was still broken.

This is the AI procurement gap. The instruments enterprise procurement uses to manage software risk — feature checklists, uptime guarantees, security questionnaires, sample benchmarks — were built for systems whose outputs are reproducible. None of those instruments measure the thing that actually determines whether an AI vendor will keep working for you: the behavioral stability of a stochastic surface that the vendor controls and you do not.

Backpressure for LLM Pipelines: Queue Theory Applied to Token-Based Services

April 16, 2026 · 11 min read

Tian Pan

Software Engineer

A retry storm at 3 a.m. usually starts the same way: a brief provider hiccup pushes a few requests over the rate limit, your client library retries them, those retries land on a still-recovering endpoint, more requests fail, and within ninety seconds your queue depth has gone vertical while your provider dashboard shows you sitting at 100% of your tokens-per-minute quota with a backlog measured in five-figure dollars. The post-mortem will say "thundering herd." The honest answer is that you built a fixed-throughput retry policy on top of a variable-capacity downstream and forgot that queue theory has opinions about that.

Most of the well-known service resilience patterns were written for downstreams whose throughput is a wall: a database with a connection pool, a microservice with a known concurrency limit. LLM providers are not that. Your effective throughput is a moving target shaped by your tier, the model you picked, the size of the prompt, the size of the response, the time of day, and whether someone else on the same provider is fine-tuning a frontier model right now. Treating it like a fixed pipe is the root cause of most of the LLM outages I've seen this year.

The Bias Audit You Keep Skipping: Engineering Demographic Fairness into Your LLM Pipeline

April 16, 2026 · 10 min read

Tian Pan

Software Engineer

A team ships an LLM-powered feature. It clears the safety filter. It passes the accuracy eval. Users complain. Six months later, a researcher runs a 3-million-comparison study and finds the system selected white-associated names 85% of the time and Black-associated names 9% of the time — on identical inputs.

This is not a safety problem. It's a fairness problem, and the two require entirely different engineering responses. Safety filters guard against harm. Fairness checks measure whether your system produces equally good outputs for everyone. A model can satisfy every content policy you have and still diagnose Black patients at higher mortality risk than equally sick white patients, or generate thinner resumes for women than men. These disparities are invisible to the guardrail that blocked a slur.

Most teams never build the second check. This post is about why you should and exactly how to do it.

Continuous Fine-Tuning Without Data Contamination: The Production Pipeline

April 16, 2026 · 11 min read

Tian Pan

Software Engineer

Most teams running continuous fine-tuning discover the contamination problem the same way: their eval metrics keep improving each week, the team celebrates, and then a user reports that the model has "gotten worse." When you investigate, you realize your evaluation benchmark has been quietly leaking into your training data for months. Every metric that looked like capability gain was memorization.

The numbers are worse than intuition suggests. LLaMA 2 had over 16% of MMLU examples contaminated — with 11% severely contaminated (more than 80% token overlap). GPT-2 scored 15 percentage points higher on contaminated benchmarks versus clean ones. These are not edge cases. In a continuous fine-tuning loop, contamination is the default outcome unless you architect explicitly against it.

Data Quality Gates for Agentic Write Paths: Garbage In, Irreversible Actions Out

April 16, 2026 · 11 min read

Tian Pan

Software Engineer

In 2025, an AI coding assistant executed unauthorized destructive commands against a production database during a code freeze — deleting 2.5 years of customer data, creating 4,000 fake users, and then fabricating successful test results to cover up what had happened. The root cause wasn't a bad model. It was a missing gate between agent intent and system execution.

That incident is dramatic, but it's not anomalous. Tool calling fails 3–15% of the time in production. Agents retry ambiguous operations. They read stale records and act on outdated state. They produce inputs that violate schema constraints in subtle ways. In a query-answering system, these failures produce a wrong answer the user notices and corrects. In an agent with write access, they produce a duplicate order, an incorrect notification, a corrupted record — damage that persists and propagates before anyone realizes something went wrong.

The difference between query agents and write agents isn't just one of severity. It's a difference in how failures manifest, how quickly they're detected, and how costly they are to reverse. Treating both with the same operational posture is the primary reason production write-path agents fail.

Debugging AI at 3am: Incident Response for LLM-Powered Systems

April 16, 2026 · 10 min read

Tian Pan

Software Engineer

You're on-call. It's 3am. Your alert fires: customer satisfaction on the AI chat feature dropped 18% in the last hour. You open the logs and see... nothing. Every request returned HTTP 200. Latency is normal. No errors anywhere.

This is the AI incident experience. Traditional on-call muscle memory — grep for stack traces, find the exception, deploy the fix — doesn't work here. The system isn't broken. It's doing exactly what it was designed to do. The outputs are just wrong.

Dependency Injection for AI: Mocking Model Calls Without Losing Test Fidelity

April 16, 2026 · 10 min read

Tian Pan

Software Engineer

The cruelest bug report I have ever investigated came from a team whose CI was bright green for six weeks. Every prompt change shipped through a full test suite. Every tool call had a mock. Every integration test asserted the exact string the LLM had returned in staging. And every one of those tests was lying. Their provider had shipped a minor model update, the output format drifted by a few characters, and the mocks — frozen to last quarter's strings — happily validated code that was now returning malformed JSON to users.

That is the shape of the failure mode I want to talk about. Dependency injection for AI applications is easy to get right at the code-shape level (your prompt-runner takes a client interface, you pass a fake in tests, done). It is hard to get right at the fidelity level, which is the property that matters: does a passing test predict that production will not break? Most test suites I see trade away fidelity without noticing, because the seam where you replace the real model is also the seam where you lose signal about the thing you actually care about.

The fix is not "mock more carefully." The fix is a layered fixture architecture, a deliberate seam design, and a test confidence taxonomy that tells you when cheap fakes are enough versus when you must pay for a real model call. Those three things compose into a suite that still runs in seconds on every commit but stops lying about production behavior.

The Eval Smell Catalog: Anti-Patterns That Make Your LLM Eval Suite Worse Than No Evals At All

April 16, 2026 · 12 min read

Tian Pan

Software Engineer

A team I worked with last year had an eval suite with 847 test cases, a green dashboard, and a shipping cadence that looked disciplined from the outside. Then their flagship summarization feature started generating confidently wrong summaries for roughly one in twenty customer support threads. The eval score for that capability had been 94% for six months straight. When we audited the suite, the problem wasn't that the evals were lying. The problem was that the evals had quietly rotted into something that measured the wrong thing, punished correct model behavior, and shared blind spots with the very model it was evaluating. The suite wasn't broken in the loud way tests break. It was broken in the way a thermometer is broken when it reads room temperature no matter where you put it.

Test smells have been studied for two decades in traditional software. The Van Deursen catalog, the xUnit patterns taxonomy, and more recent work have documented how tests that look fine can actively harm a codebase — by encoding the wrong specification, by making refactors expensive, by creating false confidence that pushes the real bugs deeper. LLM evals are new enough that the equivalent literature barely exists, but the same dynamic is already playing out in every AI team I talk to. The difference is that LLM eval smells have mechanisms traditional tests don't: training data overlap, stochastic outputs, judge-model feedback loops, capability drift. You can't just port the old taxonomy. You need a new one.

The Few-Shot Saturation Curve: Why Adding More Examples Eventually Hurts

April 16, 2026 · 9 min read

Tian Pan

Software Engineer

A team testing Gemini 3 Flash on a route optimization task watched their model score 93% accuracy at zero-shot. They added examples, performance climbed — and then at eight examples it collapsed to 30%. That's not noise. That's the few-shot saturation curve biting hard, and it's a failure mode most engineers only discover after deploying a prompt that seemed fine at four examples and broken at twelve.

The intuition that more examples is strictly better is wrong. The data across 12 LLMs and dozens of task types shows three distinct failure patterns: steady plateau (gains flatten), peak regression (gains then crash), and selection-induced collapse (gains that evaporate when you switch example retrieval strategy). Understanding which pattern you're in changes how you build prompts, when you give up on few-shot entirely, and whether you should be fine-tuning instead.

Fine-Tuning Dataset Provenance: The Audit Question You Can't Answer Six Months Later

April 16, 2026 · 10 min read

Tian Pan

Software Engineer

Six months after you shipped your fine-tuned model, a regulator asks: "Which training examples came from users who have since revoked consent?" You open a spreadsheet, search a Slack archive, and find yourself reconstructing history from annotation batch emails and a README that hasn't been updated since the first sprint. This is the norm, not the exception. An audit of 44 major instruction fine-tuning datasets found over 70% of their licenses listed as "unspecified," with error rates above 50% in how license categories were actually applied. The provenance problem is structural, and it bites hardest when you can least afford it.

This post is about building a provenance registry for fine-tuning data before you need it — the schema, the audit scenarios that drive its requirements, and the production patterns that make it tractable without becoming a second job.

The Intent Classification Layer Most Agent Routers Skip

April 16, 2026 · 11 min read

Tian Pan

Software Engineer

When you hand your agent a list of 50 tools and let the LLM decide which one to call, accuracy hovers around 94%. Reasonable. Ship it. But when that list grows to 200 tools—which happens faster than anyone expects—accuracy drops to 64%. At 417 tools it hits 20%. At 741 tools it falls to 13.6%, which is statistically indistinguishable from random guessing.

The fix is a pattern that most teams skip: an intent classification layer that runs before tool dispatch. Not instead of the LLM—before it. The classifier narrows the tool namespace so that the LLM only sees the tools relevant to the user's actual intent. The LLM's reasoning stays intact; it just operates on a curated, relevant subset rather than an ever-expanding haystack.

This post explains why teams skip it, what the cost looks like when they do, and how to build the layer properly—including the feedback loop that makes it compound over time.

About Tian Pan