
567 posts tagged with "llm"


The AI Incident Severity Taxonomy: When Is a Hallucination a Sev-0?

11 min read
Tian Pan
Software Engineer

A legal team's AI-powered research assistant fabricated three case citations and slipped them into a court filing. The citations looked plausible — real courts, real-sounding case names, coherent holdings. Nobody caught them before the brief was submitted. The incident cost the firm an emergency hearing, a public apology, and a bar inquiry.

Was that a sev-0? A sev-2? The answer depends on which framework you use — and traditional severity models will give you the wrong answer almost every time.

Software incident severity classification was built for deterministic systems. A service is either responding or it isn't. A database query either succeeds or throws an error. The failure modes are binary, the blame is traceable to a commit, and the fix is a rollback or a patch. AI systems break all three of those assumptions simultaneously, and organizations that apply traditional severity frameworks to LLM failures end up either panicking over noise or dismissing structural failures as one-off quirks.

Stop Writing Prompts by Hand: Automated Optimization with DSPy and MIPRO

9 min read
Tian Pan
Software Engineer

You are going to spend an afternoon tuning a prompt. You'll move a sentence around, swap "classify" for "categorize," add a note about edge cases, and run spot-checks against a handful of examples you keep in a notebook. By end of day the prompt is marginally better — you think. You can't prove it. You don't have a reproducible baseline. A week later a colleague changes a few words and the whole thing regresses.

This is the current state of prompt engineering at most teams. DSPy is Stanford's answer to it. Rather than hand-authoring instruction prose, you declare what your LLM program should do, define a metric, and let an optimizer compile the actual prompts for you. MIPRO — the Multi-prompt Instruction PRoposal Optimizer — is the algorithm that makes this approach competitive with (and often better than) the human-crafted alternative.
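
To make the shape of that workflow concrete, here is a minimal sketch of the declare-metric-compile loop, assuming recent DSPy APIs (`dspy.LM`, `dspy.MIPROv2`); the task, signature, and two-example trainset are illustrative placeholders, not from the post.

```python
import dspy

# Configure any supported LM; the provider string is illustrative.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# Declare WHAT the program does; no hand-written instruction prose.
class TicketTriage(dspy.Signature):
    """Classify a support ticket as billing, bug, or feature."""
    ticket: str = dspy.InputField()
    label: str = dspy.OutputField()

program = dspy.Predict(TicketTriage)

# A tiny labeled set; real runs want a few dozen examples.
trainset = [
    dspy.Example(ticket="I was charged twice", label="billing").with_inputs("ticket"),
    dspy.Example(ticket="App crashes on login", label="bug").with_inputs("ticket"),
]

# Define a metric, then let MIPRO propose and select the prompts.
def exact_match(example, pred, trace=None):
    return example.label == pred.label

optimizer = dspy.MIPROv2(metric=exact_match, auto="light")
compiled = optimizer.compile(program, trainset=trainset)
```

The point of the exercise: `compiled` carries optimizer-chosen instructions and demonstrations you can version, re-run, and compare against a baseline, instead of an afternoon of unprovable hand-edits.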

Backpressure for LLM Pipelines: Queue Theory Applied to Token-Based Services

11 min read
Tian Pan
Software Engineer

A retry storm at 3 a.m. usually starts the same way: a brief provider hiccup pushes a few requests over the rate limit, your client library retries them, those retries land on a still-recovering endpoint, more requests fail, and within ninety seconds your queue depth has gone vertical while your provider dashboard shows you sitting at 100% of your tokens-per-minute quota with a backlog measured in five-figure dollars. The post-mortem will say "thundering herd." The honest answer is that you built a fixed-throughput retry policy on top of a variable-capacity downstream and forgot that queue theory has opinions about that.

Most of the well-known service resilience patterns were written for downstreams whose throughput is a wall: a database with a connection pool, a microservice with a known concurrency limit. LLM providers are not that. Your effective throughput is a moving target shaped by your tier, the model you picked, the size of the prompt, the size of the response, the time of day, and whether someone else on the same provider is fine-tuning a frontier model right now. Treating it like a fixed pipe is the root cause of most of the LLM outages I've seen this year.
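
One standard antidote to the retry storm above is a retry budget combined with jittered exponential backoff, so retries can never become more than a fixed fraction of total traffic. A hedged sketch; the class and error type are illustrative stand-ins for whatever your client raises on a 429.

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for whatever 429/overload error your client library raises."""

class RetryBudget:
    """Permit retries only while they stay below a fixed fraction of traffic."""
    def __init__(self, ratio: float = 0.1):
        self.ratio, self.requests, self.retries = ratio, 0, 0

    def can_retry(self) -> bool:
        return self.retries < self.ratio * max(self.requests, 1)

def call_with_backoff(send, budget: RetryBudget, max_attempts: int = 4):
    for attempt in range(max_attempts):
        budget.requests += 1
        try:
            return send()
        except RateLimitError:
            # Shed load rather than join the storm once the budget is spent.
            if attempt == max_attempts - 1 or not budget.can_retry():
                raise
            budget.retries += 1
            time.sleep(min(2 ** attempt, 30) * random.random())  # full jitter
```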

The Bias Audit You Keep Skipping: Engineering Demographic Fairness into Your LLM Pipeline

10 min read
Tian Pan
Software Engineer

A team ships an LLM-powered feature. It clears the safety filter. It passes the accuracy eval. Users complain. Six months later, a researcher runs a 3-million-comparison study and finds the system selected white-associated names 85% of the time and Black-associated names 9% of the time — on identical inputs.

This is not a safety problem. It's a fairness problem, and the two require entirely different engineering responses. Safety filters guard against harm. Fairness checks measure whether your system produces equally good outputs for everyone. A model can satisfy every content policy you have and still diagnose Black patients at higher mortality risk than equally sick white patients, or generate thinner resumes for women than men. These disparities are invisible to the guardrail that blocked a slur.

Most teams never build the second check. This post is about why you should and exactly how to do it.
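
The simplest version of that second check is a paired name-swap audit: run identical inputs through the system with only a demographically associated name changed, and compare outcomes across groups. A hedged sketch; the name lists and template are illustrative, not the study's actual materials.

```python
from statistics import mean

# Illustrative name groups; real audits use validated, larger lists.
NAME_GROUPS = {
    "white_assoc": ["Emily Walsh", "Greg Baker"],
    "black_assoc": ["Lakisha Robinson", "Jamal Carter"],
}

TEMPLATE = (
    "Rate this resume 0-10 for the Staff Engineer role.\n"
    "Name: {name}\nExperience: 9 years backend.\n"
)

def mean_score(model, names: list[str]) -> float:
    # `model` is any callable returning a numeric score for a prompt.
    return mean(model(TEMPLATE.format(name=n)) for n in names)

def fairness_gap(model) -> dict:
    rates = {group: mean_score(model, names) for group, names in NAME_GROUPS.items()}
    rates["gap"] = max(rates.values()) - min(rates.values())
    return rates  # alert when the gap exceeds your chosen threshold
```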

Context Compression Changes What Your Model Actually Sees

11 min read
Tian Pan
Software Engineer

When your API costs spike and someone suggests "just compress the context," the pitch sounds clean: feed fewer tokens in, pay less, get equivalent output. LLMLingua benchmarks show 20x compression on math reasoning with only 1.5% accuracy loss. What's not to like?

The problem is that those benchmarks measure what the compressed context scores on clean, curated test sets. They don't measure what happens when your agent quietly drops the constraint it was given three turns ago, or resolves a pronoun to the wrong entity, or confabulates an exact file path because the original tool output was summarized away. Context compression doesn't just reduce tokens — it changes what your model actually sees. And the gaps between the original context and the compressed version are reliably where your system will fail.
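
A cheap guardrail against exactly that failure is a must-keep check: verify that constraints and exact identifiers survived compression, and fall back to the original context if they didn't. A compressor-agnostic sketch; both regexes are illustrative starting points, not a complete pattern set.

```python
import re

# Spans that must survive compression: file paths and hard constraints.
MUST_KEEP = [r"/[\w./-]+\.\w+", r"\bmust (?:not )?[^.]{3,80}\."]

def dropped_spans(original: str, compressed: str) -> list[str]:
    """Return must-keep spans present in the original but lost in compression."""
    return [
        span
        for pattern in MUST_KEEP
        for span in re.findall(pattern, original)
        if span not in compressed
    ]

def safe_compress(compress, context: str) -> str:
    # `compress` is any compressor: LLMLingua, a summarizer, truncation.
    compressed = compress(context)
    if dropped_spans(context, compressed):
        return context  # fall back rather than silently lose a constraint
    return compressed
```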

Continuous Fine-Tuning Without Data Contamination: The Production Pipeline

11 min read
Tian Pan
Software Engineer

Most teams running continuous fine-tuning discover the contamination problem the same way: their eval metrics keep improving each week, the team celebrates, and then a user reports that the model has "gotten worse." When you investigate, you realize your evaluation benchmark has been quietly leaking into your training data for months. Every metric that looked like capability gain was memorization.

The numbers are worse than intuition suggests. LLaMA 2 had over 16% of MMLU examples contaminated — with 11% severely contaminated (more than 80% token overlap). GPT-2 scored 15 percentage points higher on contaminated benchmarks versus clean ones. These are not edge cases. In a continuous fine-tuning loop, contamination is the default outcome unless you architect explicitly against it.
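
The minimal defense is an n-gram decontamination gate between incoming training data and your frozen eval set. The sketch below uses the 13-gram overlap convention popularized by GPT-3-era contamination reports, with whitespace tokenization as an illustrative stand-in for the model tokenizer.

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    # Whitespace tokens for illustration; production uses the model tokenizer.
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(max(len(toks) - n + 1, 0))}

def build_eval_index(eval_texts: list[str], n: int = 13) -> set:
    """Precompute every long n-gram that appears in the benchmark."""
    index: set = set()
    for text in eval_texts:
        index |= ngrams(text, n)
    return index

def is_contaminated(candidate: str, eval_index: set, n: int = 13) -> bool:
    """Reject a training example if it shares any long n-gram with the benchmark."""
    return bool(ngrams(candidate, n) & eval_index)
```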

Debugging AI at 3am: Incident Response for LLM-Powered Systems

10 min read
Tian Pan
Software Engineer

You're on-call. It's 3am. Your alert fires: customer satisfaction on the AI chat feature dropped 18% in the last hour. You open the logs and see... nothing. Every request returned HTTP 200. Latency is normal. No errors anywhere.

This is the AI incident experience. Traditional on-call muscle memory — grep for stack traces, find the exception, deploy the fix — doesn't work here. The system isn't broken. It's doing exactly what it was designed to do. The outputs are just wrong.

The Dependency Injection Pattern for AI Applications: Writing Code That Survives Model Swaps

9 min read
Tian Pan
Software Engineer

When OpenAI retired text-davinci-003 in January 2024, teams that had woven that model name into their business logic spent weeks untangling it. Not because swapping a model is technically hard — it's a string and an API call — but because the model was entangled with everything else: prompt construction, response parsing, error handling, retry logic, all intertwined with the assumption that one specific provider would answer. The engineering cost of that kind of migration has been estimated at $50K–$100K for mid-size production systems, plus a month or more of diverted engineering attention.

The fix isn't exotic. It's a pattern every backend engineer already knows: dependency injection. The insight is that your business logic should depend on an abstraction of a language model, not a concrete client from OpenAI or Anthropic. Inject the concrete implementation at startup. The rest of the code never knows which provider is behind the interface.
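
A minimal sketch of that pattern in Python, using a `Protocol` for the abstraction and the OpenAI chat-completions client as one concrete adapter; the class and service names are illustrative.

```python
from typing import Protocol

class LanguageModel(Protocol):
    def complete(self, prompt: str) -> str: ...

class OpenAIModel:
    """Concrete adapter; the only place that touches a vendor SDK."""
    def __init__(self, client, model: str = "gpt-4o-mini"):
        self._client, self._model = client, model

    def complete(self, prompt: str) -> str:
        resp = self._client.chat.completions.create(
            model=self._model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

class SummaryService:
    def __init__(self, lm: LanguageModel):  # injected at startup
        self._lm = lm

    def summarize(self, text: str) -> str:
        return self._lm.complete(f"Summarize in three sentences:\n{text}")
```

Swapping providers then means writing one new adapter, not touching `SummaryService` or anything downstream of it.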

Documenting Probabilistic Features: The Missing Layer Between Model Behavior and Developer Onboarding

10 min read
Tian Pan
Software Engineer

Your documentation says the /summarize endpoint returns a concise summary. That is true. It returns a different concise summary every time, sometimes misses a key point, occasionally returns structured JSON when you forget to specify a format in the prompt, and degrades silently after a model update you didn't know happened. None of this appears in the docs.

Traditional API documentation captures contracts: given input X, expect output Y. AI-powered features break that model at its foundation. There is no stable contract to document. The same prompt, same model, same parameters — different output. And yet teams ship these features with the same style of documentation they'd write for a database query: a function signature, a return type, maybe a sentence about error codes.

The gap between what your docs say and what your feature actually does is where developer trust goes to die.
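
One way to close that gap is to document measured behavior alongside the signature. A sketch of the idea; the numbers are placeholders you would source from your own eval runs, not real measurements.

```python
def summarize(text: str) -> str:
    """Return a concise summary of `text`.

    Behavioral notes (measured on eval set v12):
    - Non-deterministic: identical inputs yield different summaries.
    - Key-point recall: 0.91 +/- 0.03; degrades on threads over 4k tokens.
    - Format: plain prose unless the prompt requests JSON explicitly.
    - Provider model updates can shift behavior; re-run the eval suite
      and update these numbers whenever the upstream model changes.
    """
    ...
```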

The Eval Smell Catalog: Anti-Patterns That Make Your LLM Eval Suite Worse Than No Evals At All

12 min read
Tian Pan
Software Engineer

A team I worked with last year had an eval suite with 847 test cases, a green dashboard, and a shipping cadence that looked disciplined from the outside. Then their flagship summarization feature started generating confidently wrong summaries for roughly one in twenty customer support threads. The eval score for that capability had been 94% for six months straight. When we audited the suite, the problem wasn't that the evals were lying. The problem was that the evals had quietly rotted into something that measured the wrong thing, punished correct model behavior, and shared blind spots with the very model it was evaluating. The suite wasn't broken in the loud way tests break. It was broken in the way a thermometer is broken when it reads room temperature no matter where you put it.

Test smells have been studied for two decades in traditional software. The Van Deursen catalog, the xUnit patterns taxonomy, and more recent work have documented how tests that look fine can actively harm a codebase — by encoding the wrong specification, by making refactors expensive, by creating false confidence that pushes the real bugs deeper. LLM evals are new enough that the equivalent literature barely exists, but the same dynamic is already playing out in every AI team I talk to. The difference is that LLM eval smells have mechanisms traditional tests don't: training data overlap, stochastic outputs, judge-model feedback loops, capability drift. You can't just port the old taxonomy. You need a new one.
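
The thermometer metaphor suggests a smoke test you can run today: a live eval should score a deliberately degraded model well below the real one. A hedged sketch of that check; the function names and threshold are illustrative.

```python
def thermometer_check(eval_fn, model, degraded_model, min_gap: float = 0.10) -> bool:
    """A working eval should separate the real model from a sabotaged one.

    `degraded_model` can be a smaller model, a truncated-context variant,
    or one with a corrupted system prompt; `eval_fn` returns a score in [0, 1].
    If the gap is near zero, the eval reads room temperature everywhere.
    """
    return eval_fn(model) - eval_fn(degraded_model) >= min_gap
```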

Building LLM Evals from Sparse Annotations: You Don't Need 10,000 Examples

12 min read
Tian Pan
Software Engineer

Teams building LLM applications consistently make the same mistake: they wait for enough labeled data before investing in evaluation infrastructure. They tell themselves they need 5,000 examples. Or 10,000. The eval system stays on the backlog while "vibe checks" substitute for measurement. A ZenML analysis of 1,200 production deployments found that informal vibe checks remain common even in mature deployments — and many teams never graduate to systematic evals at all.

The data-size intuition is borrowed from classical ML, where more labeled examples reliably improved model performance. For LLM evaluation, it is largely wrong. Research on sparse benchmarks demonstrates that 20–40 carefully selected items reliably estimate full-benchmark rankings, and 100 items produce mean absolute error below 1% compared to thousands. The problem is not data volume. The problem is that most teams skip the structured process that makes small evaluation sets trustworthy.

This post covers what that process actually looks like: how to select the right examples through active learning, how to generate noisy labels at scale with weak supervision, how to bootstrap with LLM judges, and how to know when your small eval set is ready to use.
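
Two of those steps fit in a few lines each. Below is a hedged sketch of weak-label combination by majority vote and disagreement-based item selection; `disagreement` stands in for any per-item signal such as variance across an LLM-judge ensemble or self-consistency spread.

```python
from collections import Counter

def majority_label(weak_labels: list[str]) -> tuple[str, float]:
    """Combine noisy labels (heuristics, LLM judges) by vote, with confidence."""
    (label, count), = Counter(weak_labels).most_common(1)
    return label, count / len(weak_labels)

def select_eval_items(candidates: list, disagreement, k: int = 40) -> list:
    """Active-learning flavor: keep the k items the pipeline agrees on least.

    These are the examples worth spending scarce human annotation on.
    """
    return sorted(candidates, key=disagreement, reverse=True)[:k]
```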

The Few-Shot Saturation Curve: Why Adding More Examples Eventually Hurts

9 min read
Tian Pan
Software Engineer

A team testing Gemini 3 Flash on a route optimization task watched their model score 93% accuracy at zero-shot. They added examples, performance climbed — and then at eight examples it collapsed to 30%. That's not noise. That's the few-shot saturation curve biting hard, and it's a failure mode most engineers only discover after deploying a prompt that seemed fine at four examples and fell apart at twelve.

The intuition that more examples is strictly better is wrong. The data across 12 LLMs and dozens of task types shows three distinct failure patterns: steady plateau (gains flatten), peak regression (gains then crash), and selection-induced collapse (gains that evaporate when you switch example retrieval strategy). Understanding which pattern you're in changes how you build prompts, when you give up on few-shot entirely, and whether you should be fine-tuning instead.
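
The practical takeaway is to sweep shot counts instead of assuming more is better. A hedged sketch; `run_eval(k)` is assumed to build a k-shot prompt and return accuracy on a held-out set.

```python
def saturation_curve(run_eval, shot_counts=(0, 1, 2, 4, 8, 16)) -> tuple[dict, int]:
    """Measure accuracy at each shot count and find the best one.

    Read the curve before shipping: a plateau means stop adding examples,
    a peak followed by a crash means cap k below the crash, and results
    that swing with retrieval strategy mean the examples, not the count,
    are the problem.
    """
    curve = {k: run_eval(k) for k in shot_counts}
    best_k = max(curve, key=curve.get)
    return curve, best_k
```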