Skip to main content

702 posts tagged with "llm"

View all tags

The Eval Smell Catalog: Anti-Patterns That Make Your LLM Eval Suite Worse Than No Evals At All

· 12 min read
Tian Pan
Software Engineer

A team I worked with last year had an eval suite with 847 test cases, a green dashboard, and a shipping cadence that looked disciplined from the outside. Then their flagship summarization feature started generating confidently wrong summaries for roughly one in twenty customer support threads. The eval score for that capability had been 94% for six months straight. When we audited the suite, the problem wasn't that the evals were lying. The problem was that the evals had quietly rotted into something that measured the wrong thing, punished correct model behavior, and shared blind spots with the very model it was evaluating. The suite wasn't broken in the loud way tests break. It was broken in the way a thermometer is broken when it reads room temperature no matter where you put it.

Test smells have been studied for two decades in traditional software. The Van Deursen catalog, the xUnit patterns taxonomy, and more recent work have documented how tests that look fine can actively harm a codebase — by encoding the wrong specification, by making refactors expensive, by creating false confidence that pushes the real bugs deeper. LLM evals are new enough that the equivalent literature barely exists, but the same dynamic is already playing out in every AI team I talk to. The difference is that LLM eval smells have mechanisms traditional tests don't: training data overlap, stochastic outputs, judge-model feedback loops, capability drift. You can't just port the old taxonomy. You need a new one.

Building LLM Evals from Sparse Annotations: You Don't Need 10,000 Examples

· 12 min read
Tian Pan
Software Engineer

Teams building LLM applications consistently make the same mistake: they wait for enough labeled data before investing in evaluation infrastructure. They tell themselves they need 5,000 examples. Or 10,000. The eval system stays on the backlog while "vibe checks" substitute for measurement. A ZenML analysis of 1,200 production deployments found that informal vibe checks remain common even in mature deployments — and many teams never graduate to systematic evals at all.

The data-size intuition is borrowed from classical ML, where more labeled examples reliably improved model performance. For LLM evaluation, it is largely wrong. Research on sparse benchmarks demonstrates that 20–40 carefully selected items reliably estimate full-benchmark rankings, and 100 items produce mean absolute error below 1% compared to thousands. The problem is not data volume. The problem is that most teams skip the structured process that makes small evaluation sets trustworthy.

This post covers what that process actually looks like: how to select the right examples through active learning, how to generate noisy labels at scale with weak supervision, how to bootstrap with LLM judges, and how to know when your small eval set is ready to use.

The Few-Shot Saturation Curve: Why Adding More Examples Eventually Hurts

· 9 min read
Tian Pan
Software Engineer

A team testing Gemini 3 Flash on a route optimization task watched their model score 93% accuracy at zero-shot. They added examples, performance climbed — and then at eight examples it collapsed to 30%. That's not noise. That's the few-shot saturation curve biting hard, and it's a failure mode most engineers only discover after deploying a prompt that seemed fine at four examples and broken at twelve.

The intuition that more examples is strictly better is wrong. The data across 12 LLMs and dozens of task types shows three distinct failure patterns: steady plateau (gains flatten), peak regression (gains then crash), and selection-induced collapse (gains that evaporate when you switch example retrieval strategy). Understanding which pattern you're in changes how you build prompts, when you give up on few-shot entirely, and whether you should be fine-tuning instead.

Graceful AI Feature Sunset: How to Deprecate a Model-Powered Feature Without Breaking User Trust

· 11 min read
Tian Pan
Software Engineer

When one provider announced the retirement of a widely-used model variant, engineering forums filled with farewell posts, petitions, and migration guides written by users who had built daily workflows around a specific model's behavioral fingerprint. That's not how software deprecation usually goes. When you remove a button from a UI, users are annoyed. When you remove an AI feature they've come to depend on, they grieve.

This asymmetry reveals something important: deprecating an AI-powered feature is categorically harder than deprecating a conventional feature. The behavioral envelope of an LLM — its tone, latency profile, formatting tendencies, response length — becomes as load-bearing as the feature's functional output. Users don't just rely on what the AI does. They rely on how it does it. If your sunset plan treats AI retirement the same as API endpoint retirement, you will pay for the mismatch in churn.

Grammar-Constrained Generation: The Output Reliability Technique Most Teams Skip

· 10 min read
Tian Pan
Software Engineer

Most teams that need structured LLM output follow the same playbook: write a prompt that says "respond only with valid JSON," parse the response, run Pydantic validation, and if it fails, retry with the error message appended. This works often enough to ship. It also fails in production at exactly the worst moments — under load, on edge-case inputs, and with cheaper models that don't follow instructions as reliably as GPT-4.

Grammar-constrained generation is a fundamentally different approach. Instead of asking the model nicely and checking afterward, it makes structurally invalid outputs mathematically impossible. The model cannot emit a missing brace, a non-existent enum value, or a required field it forgot — because those tokens are filtered out before sampling. Not unlikely. Impossible.

Most teams skip it. They shouldn't.

Hiring for LLM Engineering: What the Interview Actually Needs to Test

· 10 min read
Tian Pan
Software Engineer

Most engineering teams that hire for LLM roles run roughly the same interview: two rounds of LeetCode, a system design question, maybe a quiz on transformer internals. They're assessing for the wrong things — and they know it. The candidates who ace those screens often struggle to ship working AI features, while the ones who stumble on binary search can build an eval suite from scratch and debug a hallucinating pipeline in an afternoon.

The skills that predict success in LLM engineering have almost no overlap with what traditional ML or software interviews test. Hiring managers who haven't updated their process are generating false negatives at a high rate — rejecting engineers who would succeed — while false positives walk in with solid LeetCode scores and no intuition for when a model is confidently wrong.

Hot-Path vs. Cold-Path AI: The Architectural Decision That Decides Your p99

· 10 min read
Tian Pan
Software Engineer

Every AI feature you ship makes an architectural choice before it makes a product one: does this model call live inside the user's request, or does it run somewhere the user isn't waiting for it? The choice is usually made by whoever writes the first prototype, never revisited, and silently determines your p99 latency for the rest of the feature's life. When the post-mortem asks why a shipping dashboard became unusable at 10 a.m. every Monday, the answer is almost always that something which should have been cold-path got welded into the hot path — and a model that is fine at p50 becomes catastrophic at p99 when traffic fans out.

The hot-path / cold-path distinction is older than LLMs. CQRS, streaming architectures, lambda architectures — they all draw the same line between "must respond now" and "can arrive eventually." What's different about AI workloads is that the cost of crossing the line in the wrong direction is an order of magnitude higher than it used to be. A synchronous database query that takes 50 ms turning into 200 ms is a regression. A synchronous LLM call that takes 1.2 s at p50 turning into 11 s at p99 is a business decision you didn't know you made.

The Implicit API Contract: What Your LLM Provider Doesn't Document

· 10 min read
Tian Pan
Software Engineer

Your LLM provider's SLA covers HTTP uptime and Time to First Token. It says nothing about whether the model will still follow your formatting instructions next month, refuse requests it accepted last week, or return valid JSON under edge-case conditions you haven't tested. Most engineering teams discover this the hard way — via a production incident, not a changelog.

This is the implicit API contract problem. Traditional APIs promise stable, documented behavior. LLM providers promise a connection. Everything between the request and what your application does with the response is on you.

The Intent Classification Layer Most Agent Routers Skip

· 11 min read
Tian Pan
Software Engineer

When you hand your agent a list of 50 tools and let the LLM decide which one to call, accuracy hovers around 94%. Reasonable. Ship it. But when that list grows to 200 tools—which happens faster than anyone expects—accuracy drops to 64%. At 417 tools it hits 20%. At 741 tools it falls to 13.6%, which is statistically indistinguishable from random guessing.

The fix is a pattern that most teams skip: an intent classification layer that runs before tool dispatch. Not instead of the LLM—before it. The classifier narrows the tool namespace so that the LLM only sees the tools relevant to the user's actual intent. The LLM's reasoning stays intact; it just operates on a curated, relevant subset rather than an ever-expanding haystack.

This post explains why teams skip it, what the cost looks like when they do, and how to build the layer properly—including the feedback loop that makes it compound over time.

Keeping Synthetic Eval Data Honest

· 9 min read
Tian Pan
Software Engineer

A safety model scored 85.3% accuracy on its public benchmark test set. When researchers tested it on novel adversarial prompts not derived from public datasets, that number dropped to 33.8%. The model hadn't learned to reason about safety. It had learned to recognize the evaluation distribution.

This is the problem at the center of synthetic eval data: when the same model family generates both your training data and your test cases, passing the eval means conforming to a shared statistical prior—not demonstrating actual capability. It's a feedback loop that looks like quality assurance until production traffic arrives and the numbers don't hold.

The failure is structural, not incidental. And fixing it requires more than adding more synthetic examples.

Knowledge Graphs as a RAG Alternative: When Structured Retrieval Beats Embeddings

· 9 min read
Tian Pan
Software Engineer

Most RAG implementations fail in exactly the same way: the vector search retrieves something plausible but not what the user actually needed, the LLM wraps it in confident prose, and the user gets an answer that's approximately right but specifically wrong. The frustrating part is that the failure mode is invisible — cosine similarity scores look fine, the retrieved passages mention the right topics, but the answer is still wrong because the question required reasoning across relationships, not just semantic proximity.

Vector embeddings are excellent at one thing: finding text that sounds like your query. That's a powerful capability, and it covers an enormous range of production use cases. But it breaks predictably when the question depends on how entities connect to each other rather than how closely their descriptions match. For those queries, a knowledge graph — a property graph you traverse with Cypher or SPARQL — is not an optimization. It's a fundamentally different kind of retrieval that solves a different class of problem.

LLM Confidence Calibration in Production: Measuring and Fixing the Overconfidence Problem

· 10 min read
Tian Pan
Software Engineer

Your model says "I'm highly confident" and is wrong 40% of the time. That's not a hallucination — that's a calibration failure, and it's a harder problem to detect, measure, and fix in production.

Hallucination gets all the press. But overconfident wrong answers are often more dangerous: the model produces a plausible, fluent response with high expressed confidence, and there is no signal to the downstream consumer that anything is wrong. Hallucination detectors, RAG grounding checks, and fact-verification pipelines all help with fabricated content. They do almost nothing for the scenario where the model knows a fact but has systematically miscalibrated beliefs about how certain it is.

Most teams shipping LLM-powered features treat confidence as an afterthought. This post covers why calibration fails, how to measure it, and the production patterns that actually move the metric.