
720 posts tagged with "llm"


The Bias Audit You Keep Skipping: Engineering Demographic Fairness into Your LLM Pipeline

· 10 min read
Tian Pan
Software Engineer

A team ships an LLM-powered feature. It clears the safety filter. It passes the accuracy eval. Users complain. Six months later, a researcher runs a 3-million-comparison study and finds the system selected white-associated names 85% of the time and Black-associated names 9% of the time — on identical inputs.

This is not a safety problem. It's a fairness problem, and the two require entirely different engineering responses. Safety filters guard against harmful content. Fairness checks measure whether your system produces equally good outputs for everyone. A model can satisfy every content policy you have and still assign Black patients a higher mortality risk than equally sick white patients, or generate thinner resumes for women than for men. These disparities are invisible to the guardrail that blocked a slur.

Most teams never build the second check. This post is about why you should and exactly how to do it.
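A fairness check of this kind can start very small. The sketch below is illustrative only: the group labels, the toy audit log, and the helper names `selection_rates` and `disparity_ratio` are assumptions, not part of any study mentioned here. It computes a per-group selection rate over otherwise identical inputs and reports the ratio of the lowest rate to the highest.

```python
from collections import Counter

def selection_rates(decisions):
    """decisions: iterable of (group, was_selected) pairs.
    Returns the fraction of selected items per group."""
    totals = Counter()
    selected = Counter()
    for group, was_selected in decisions:
        totals[group] += 1
        if was_selected:
            selected[group] += 1
    return {group: selected[group] / totals[group] for group in totals}

def disparity_ratio(rates):
    """Lowest selection rate divided by the highest; 1.0 means parity."""
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # nobody was selected, so there is no gap to report
    return min(rates.values()) / highest

# Made-up audit log: same inputs, only the name-associated group differs.
audit_log = (
    [("group_a", True)] * 8 + [("group_a", False)] * 2
    + [("group_b", True)] * 4 + [("group_b", False)] * 6
)

rates = selection_rates(audit_log)
ratio = disparity_ratio(rates)
print(rates, ratio)
```

A ratio well below 1.0 on matched inputs is the signal a safety filter never surfaces. Which threshold counts as a failure is a policy decision the code does not make for you.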

Context Compression Changes What Your Model Actually Sees

· 11 min read
Tian Pan
Software Engineer

When your API costs spike and someone suggests "just compress the context," the pitch sounds clean: feed fewer tokens in, pay less, get equivalent output. LLMLingua benchmarks show 20x compression on math reasoning with only 1.5% accuracy loss. What's not to like?

The problem is that those benchmarks measure what the compressed context scores on clean, curated test sets. They don't measure what happens when your agent quietly drops the constraint it was given three turns ago, or resolves a pronoun to the wrong entity, or confabulates an exact file path because the original tool output was summarized away. Context compression doesn't just reduce tokens — it changes what your model actually sees. And the gaps between the original context and the compressed version are reliably where your system will fail.

Continuous Fine-Tuning Without Data Contamination: The Production Pipeline

· 11 min read
Tian Pan
Software Engineer

Most teams running continuous fine-tuning discover the contamination problem the same way: their eval metrics keep improving each week, the team celebrates, and then a user reports that the model has "gotten worse." When you investigate, you realize your evaluation benchmark has been quietly leaking into your training data for months. Every metric that looked like capability gain was memorization.

The numbers are worse than intuition suggests. LLaMA 2 had over 16% of MMLU examples contaminated — with 11% severely contaminated (more than 80% token overlap). GPT-2 scored 15 percentage points higher on contaminated benchmarks versus clean ones. These are not edge cases. In a continuous fine-tuning loop, contamination is the default outcome unless you architect explicitly against it.
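To make the 80% overlap threshold concrete, here is a minimal sketch of a decontamination filter. It assumes whitespace tokenization and a simple unigram overlap measure, which may differ from how the cited contamination figures were computed; the function names and sample strings are hypothetical.

```python
def token_overlap(eval_text, train_text):
    """Fraction of the eval example's tokens that also appear in the training doc.
    Assumption: lowercase whitespace tokens; real pipelines often use n-grams."""
    eval_tokens = eval_text.lower().split()
    if not eval_tokens:
        return 0.0
    train_vocab = set(train_text.lower().split())
    hits = sum(1 for token in eval_tokens if token in train_vocab)
    return hits / len(eval_tokens)

def filter_contaminated(train_docs, eval_examples, threshold=0.8):
    """Keep only training docs that stay at or below the overlap threshold
    against every eval example."""
    clean = []
    for doc in train_docs:
        if any(token_overlap(example, doc) > threshold for example in eval_examples):
            continue  # too close to a benchmark item: drop before fine-tuning
        clean.append(doc)
    return clean

eval_examples = ["what is the capital of france"]
train_docs = [
    "Q: what is the capital of france A: paris",
    "the weather is nice today",
]
clean_docs = filter_contaminated(train_docs, eval_examples)
print(clean_docs)
```

Running a filter like this on every new training batch, rather than once at project start, is what keeps the loop from re-admitting benchmark items later.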

Debugging AI at 3am: Incident Response for LLM-Powered Systems

· 10 min read
Tian Pan
Software Engineer

You're on-call. It's 3am. Your alert fires: customer satisfaction on the AI chat feature dropped 18% in the last hour. You open the logs and see... nothing. Every request returned HTTP 200. Latency is normal. No errors anywhere.

This is the AI incident experience. Traditional on-call muscle memory — grep for stack traces, find the exception, deploy the fix — doesn't work here. The system isn't broken. It's doing exactly what it was designed to do. The outputs are just wrong.

The Dependency Injection Pattern for AI Applications: Writing Code That Survives Model Swaps

· 9 min read
Tian Pan
Software Engineer

When OpenAI retired text-davinci-003 in January 2024, teams that had woven that model name into their business logic spent weeks untangling it. Not because swapping a model is technically hard — it's a string and an API call — but because the model was entangled with everything else: prompt construction, response parsing, error handling, retry logic, all intertwined with the assumption that one specific provider would answer. The engineering cost of that kind of migration has been estimated at $50K–$100K for mid-size production systems, plus a month or more of diverted engineering attention.

The fix isn't exotic. It's a pattern every backend engineer already knows: dependency injection. The insight is that your business logic should depend on an abstraction of a language model, not a concrete client from OpenAI or Anthropic. Inject the concrete implementation at startup. The rest of the code never knows which provider is behind the interface.
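As a minimal sketch of the pattern, assuming a single `complete` method is enough for the feature in question: `LanguageModel`, `TicketSummarizer`, and `FakeModel` are hypothetical names, and a real adapter would wrap whichever provider SDK you use behind the same interface.

```python
from dataclasses import dataclass
from typing import Protocol

class LanguageModel(Protocol):
    """The abstraction business logic depends on (hypothetical interface)."""
    def complete(self, prompt: str) -> str: ...

@dataclass
class TicketSummarizer:
    model: LanguageModel  # injected at startup; no provider name appears here

    def summarize(self, ticket_text: str) -> str:
        prompt = f"Summarize this support ticket in one sentence:\n{ticket_text}"
        return self.model.complete(prompt).strip()

class FakeModel:
    """Stand-in implementation for tests; satisfies LanguageModel structurally."""
    def complete(self, prompt: str) -> str:
        return "  Customer cannot log in.  "

# Composition root: the only place that chooses a concrete implementation.
summarizer = TicketSummarizer(model=FakeModel())
summary = summarizer.summarize("I can't log in since yesterday.")
print(summary)
```

Swapping providers then means writing one new adapter class and changing the line that constructs `TicketSummarizer`. Prompt construction and response handling stay untouched.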

Documenting Probabilistic Features: The Missing Layer Between Model Behavior and Developer Onboarding

· 10 min read
Tian Pan
Software Engineer

Your documentation says the /summarize endpoint returns a concise summary. That is true. It returns a different concise summary every time, sometimes misses a key point, occasionally returns structured JSON when you forgot to specify format in the prompt, and degrades silently after a model update you didn't know happened. None of this appears in the docs.

Traditional API documentation captures contracts: given input X, expect output Y. AI-powered features break that model at its foundation. There is no stable contract to document. The same prompt, same model, same parameters — different output. And yet teams ship these features with the same style of documentation they'd write for a database query: a function signature, a return type, maybe a sentence about error codes.

The gap between what your docs say and what your feature actually does is where developer trust goes to die.

The Eval Smell Catalog: Anti-Patterns That Make Your LLM Eval Suite Worse Than No Evals At All

· 12 min read
Tian Pan
Software Engineer

A team I worked with last year had an eval suite with 847 test cases, a green dashboard, and a shipping cadence that looked disciplined from the outside. Then their flagship summarization feature started generating confidently wrong summaries for roughly one in twenty customer support threads. The eval score for that capability had been 94% for six months straight. When we audited the suite, the problem wasn't that the evals were lying. The problem was that the evals had quietly rotted into something that measured the wrong thing, punished correct model behavior, and shared blind spots with the very model it was evaluating. The suite wasn't broken in the loud way tests break. It was broken in the way a thermometer is broken when it reads room temperature no matter where you put it.

Test smells have been studied for two decades in traditional software. The Van Deursen catalog, the xUnit patterns taxonomy, and more recent work have documented how tests that look fine can actively harm a codebase — by encoding the wrong specification, by making refactors expensive, by creating false confidence that pushes the real bugs deeper. LLM evals are new enough that the equivalent literature barely exists, but the same dynamic is already playing out in every AI team I talk to. The difference is that LLM eval smells have mechanisms traditional tests don't: training data overlap, stochastic outputs, judge-model feedback loops, capability drift. You can't just port the old taxonomy. You need a new one.

Building LLM Evals from Sparse Annotations: You Don't Need 10,000 Examples

· 12 min read
Tian Pan
Software Engineer

Teams building LLM applications consistently make the same mistake: they wait for enough labeled data before investing in evaluation infrastructure. They tell themselves they need 5,000 examples. Or 10,000. The eval system stays on the backlog while "vibe checks" substitute for measurement. A ZenML analysis of 1,200 production deployments found that informal vibe checks remain common even in mature deployments — and many teams never graduate to systematic evals at all.

The data-size intuition is borrowed from classical ML, where more labeled examples reliably improved model performance. For LLM evaluation, it is largely wrong. Research on sparse benchmarks demonstrates that 20–40 carefully selected items reliably estimate full-benchmark rankings, and 100 items produce a mean absolute error below 1% relative to scores computed on thousands of items. The problem is not data volume. The problem is that most teams skip the structured process that makes small evaluation sets trustworthy.

This post covers what that process actually looks like: how to select the right examples through active learning, how to generate noisy labels at scale with weak supervision, how to bootstrap with LLM judges, and how to know when your small eval set is ready to use.

The Few-Shot Saturation Curve: Why Adding More Examples Eventually Hurts

· 9 min read
Tian Pan
Software Engineer

A team testing Gemini 3 Flash on a route optimization task watched their model score 93% accuracy at zero-shot. They added examples, performance climbed — and then at eight examples it collapsed to 30%. That's not noise. That's the few-shot saturation curve biting hard, and it's a failure mode most engineers only discover after deploying a prompt that seemed fine at four examples and broken at twelve.

The intuition that more examples are strictly better is wrong. The data across 12 LLMs and dozens of task types shows three distinct failure patterns: steady plateau (gains flatten), peak regression (gains then crash), and selection-induced collapse (gains that evaporate when you switch example retrieval strategy). Understanding which pattern you're in changes how you build prompts, when you give up on few-shot entirely, and whether you should be fine-tuning instead.

Graceful AI Feature Sunset: How to Deprecate a Model-Powered Feature Without Breaking User Trust

· 11 min read
Tian Pan
Software Engineer

When one provider announced the retirement of a widely-used model variant, engineering forums filled with farewell posts, petitions, and migration guides written by users who had built daily workflows around a specific model's behavioral fingerprint. That's not how software deprecation usually goes. When you remove a button from a UI, users are annoyed. When you remove an AI feature they've come to depend on, they grieve.

This asymmetry reveals something important: deprecating an AI-powered feature is categorically harder than deprecating a conventional feature. The behavioral envelope of an LLM — its tone, latency profile, formatting tendencies, response length — becomes as load-bearing as the feature's functional output. Users don't just rely on what the AI does. They rely on how it does it. If your sunset plan treats AI retirement the same as API endpoint retirement, you will pay for the mismatch in churn.

Grammar-Constrained Generation: The Output Reliability Technique Most Teams Skip

· 10 min read
Tian Pan
Software Engineer

Most teams that need structured LLM output follow the same playbook: write a prompt that says "respond only with valid JSON," parse the response, run Pydantic validation, and if it fails, retry with the error message appended. This works often enough to ship. It also fails in production at exactly the worst moments — under load, on edge-case inputs, and with cheaper models that don't follow instructions as reliably as GPT-4.

Grammar-constrained generation is a fundamentally different approach. Instead of asking the model nicely and checking afterward, it makes structurally invalid outputs mathematically impossible. The model cannot drop a closing brace, emit a non-existent enum value, or omit a required field, because tokens that would violate the grammar are filtered out before sampling. Not unlikely. Impossible.

Most teams skip it. They shouldn't.
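The core mechanism, masking disallowed tokens before one is chosen, fits in a few lines. This is a toy sketch under loud assumptions: the seven-token vocabulary, the position-indexed "grammar", and the `stubborn_model` stub are all invented for illustration. Real implementations typically compile a grammar or JSON schema into a state machine over the tokenizer's full vocabulary.

```python
import math

VOCAB = ['{', '}', '"status"', ':', '"ok"', '"error"', 'hello']

# Hypothetical grammar for exactly: { "status" : ("ok" | "error") }
# Entry i lists the tokens allowed after i tokens have been emitted.
ALLOWED_BY_POSITION = [
    {'{'},
    {'"status"'},
    {':'},
    {'"ok"', '"error"'},
    {'}'},
]

def mask_logits(logits, allowed):
    """Set every disallowed token's score to -inf so it can never be picked."""
    return [score if token in allowed else -math.inf
            for token, score in zip(VOCAB, logits)]

def constrained_greedy_decode(logits_fn):
    """Greedy decoding where each step only considers grammar-valid tokens."""
    output = []
    for allowed in ALLOWED_BY_POSITION:
        masked = mask_logits(logits_fn(output), allowed)
        best_index = max(range(len(VOCAB)), key=lambda i: masked[i])
        output.append(VOCAB[best_index])
    return output

def stubborn_model(prefix):
    """Fake model that always prefers the invalid token 'hello'."""
    return [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 9.0]

tokens = constrained_greedy_decode(stubborn_model)
print(" ".join(tokens))  # → { "status" : "error" }
```

Even though the stub scores `hello` highest at every step, it never appears in the output. The model's preferences only matter among the tokens the grammar permits, which is why the result parses every time.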

Hiring for LLM Engineering: What the Interview Actually Needs to Test

· 10 min read
Tian Pan
Software Engineer

Most engineering teams that hire for LLM roles run roughly the same interview: two rounds of LeetCode, a system design question, maybe a quiz on transformer internals. They're assessing for the wrong things — and they know it. The candidates who ace those screens often struggle to ship working AI features, while the ones who stumble on binary search can build an eval suite from scratch and debug a hallucinating pipeline in an afternoon.

The skills that predict success in LLM engineering have almost no overlap with what traditional ML or software interviews test. Hiring managers who haven't updated their process are generating false negatives at a high rate — rejecting engineers who would succeed — while false positives walk in with solid LeetCode scores and no intuition for when a model is confidently wrong.