160 posts tagged with "evaluation"

The Prompt Surface Area Problem: Why Adding a Tool Is Never Just Adding a Tool

· 10 min read
Tian Pan
Software Engineer

Every engineer who has shipped an LLM-powered agent has been tempted by a simple mental model: a tool is a function. Adding a tool means the agent can do one more thing. The cost is a few lines of documentation in the system prompt, maybe a schema definition, maybe one new entry in a tool registry. It feels additive — linear.

It isn't. Each new tool doesn't expand what the agent can do in isolation; it expands what the agent can do in combination with every tool already there. That distinction is the source of a class of production failures that no amount of prompt tweaking can fix after the fact, because the problem is architectural. The prompt surface area problem is real, it compounds quickly, and most teams don't see it until they're already deep in it.

The RAG Eval Invalidation Paradox: Why Updating Your Knowledge Base Breaks Your Benchmarks

· 10 min read
Tian Pan
Software Engineer

Your RAG eval suite passes at 0.89 faithfulness. You add 5,000 new support documents to the knowledge base. You re-run the same evals. Faithfulness drops to 0.79. Your team files a model regression ticket.

Nothing regressed. Your eval just became a lie.

This is the RAG eval invalidation paradox: the moment you update your knowledge base, the evaluation set you built against the old index silently stops measuring what it was designed to measure. Most teams discover this months later — after burning engineering cycles on phantom regressions — if they ever discover it at all.
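One way to make that silent invalidation visible is to check, after each index update, whether the documents an eval case was written against are still being retrieved for it. The sketch below is a minimal illustration under assumptions, not the post's method: the `EvalCase` schema, the `find_stale_cases` helper, and the idea that each case records its gold document IDs are all hypothetical, and the retrieval results are passed in as plain lists so no particular retriever is implied.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    """One eval question plus the document IDs it was written against (hypothetical schema)."""
    question: str
    gold_doc_ids: frozenset


def gold_recall(case, retrieved_ids):
    """Fraction of the case's gold documents that appear in the retrieved list."""
    if not case.gold_doc_ids:
        return 0.0
    hits = case.gold_doc_ids & set(retrieved_ids)
    return len(hits) / len(case.gold_doc_ids)


def find_stale_cases(cases, retrieved_after_update, min_recall=1.0):
    """Return cases whose gold documents are no longer sufficiently retrieved.

    retrieved_after_update maps each question to the top-k document IDs
    returned by the updated index (assumed to be collected separately).
    """
    stale = []
    for case in cases:
        retrieved = retrieved_after_update.get(case.question, [])
        if gold_recall(case, retrieved) < min_recall:
            stale.append(case)
    return stale


# Illustrative data only: the second case lost one gold document to a newer article.
cases = [
    EvalCase("How do I reset my password?", frozenset({"kb-12"})),
    EvalCase("What is the refund window?", frozenset({"kb-40", "kb-41"})),
]
retrieved = {
    "How do I reset my password?": ["kb-12", "kb-903"],
    "What is the refund window?": ["kb-40", "kb-5021"],
}
stale = find_stale_cases(cases, retrieved)
```

A case flagged this way is not necessarily wrong; it is a case whose score no longer means what it meant before the update, so it is a candidate for re-labeling rather than for a regression ticket.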

Your Eval Harness Is a Museum: How Production Failures Should Write Tomorrow's Tests

· 9 min read
Tian Pan
Software Engineer

Most AI teams build their eval suite once — carefully, thoughtfully, during the sprint before launch. They write cases for the edge scenarios they can imagine, document the expected outputs, get sign-off, and ship. Six months later, the suite still passes. The model has quietly gotten worse on the actual traffic hitting production, but the eval harness was authored before any of that traffic existed. It's still grading the answers to questions the author asked, not the questions users are asking.

That's the museum problem: an eval suite curated at one point in time accumulates relics. It proves the system handles the cases someone anticipated, not the cases that actually break it.

The A/B Testing Trap: Why Standard Experiment Design Fails for AI Features

· 8 min read
Tian Pan
Software Engineer

A team ships an improved LLM prompt. The A/B test runs for two weeks. The metric ticks up 1.2%, p=0.03. They call it a win and roll it out to everyone. Six months later, a customer audit reveals the new prompt had been producing subtly incorrect summaries all along — the kind of semantic drift that click-through rates and session lengths can't see. The A/B test didn't lie exactly. It measured the wrong thing with a methodology that was never designed for what LLMs do.

Standard A/B testing was built for deterministic systems: a button changes color, a page loads faster, a recommendation algorithm shifts a ranking. The output is stable given the same input, variance is small and well-understood, and your sample size calculation from a textbook works. None of those properties hold for LLM-powered features. When teams don't account for this, they're not running experiments — they're generating noise with statistical significance attached.
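To make the sample size point concrete, here is a rough sketch of the textbook calculation for a two-proportion z-test, with one extra knob. The `variance_inflation` parameter is an assumption added for illustration: it is a design-effect style multiplier standing in for the extra variance that nondeterministic outputs introduce, and its value would have to be estimated from your own traffic. The baseline rate and lift in the usage lines are made-up numbers.

```python
import math
from statistics import NormalDist


def samples_per_arm(p_base, lift, alpha=0.05, power=0.8, variance_inflation=1.0):
    """Per-arm sample size for a two-sided, two-proportion z-test (normal approximation).

    variance_inflation is a hypothetical multiplier (>= 1.0) for variance that the
    textbook formula does not account for, such as run-to-run output variation.
    """
    p_new = p_base + lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = NormalDist().inv_cdf(power)           # critical value for the desired power
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    n = (z_alpha + z_beta) ** 2 * variance / lift ** 2
    return math.ceil(n * variance_inflation)


# Illustrative numbers only: 10% baseline rate, 1.2 point absolute lift.
textbook_n = samples_per_arm(0.10, 0.012)
inflated_n = samples_per_arm(0.10, 0.012, variance_inflation=2.0)
```

If the true variance is double what the formula assumes, the experiment needs roughly twice the traffic to keep the same power, which is one way a test sized from the textbook ends up underpowered.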

The Eval Fatigue Cycle: Why AI Quality Measurement Collapses After Launch

· 9 min read
Tian Pan
Software Engineer

There's a predictable arc to how teams treat AI evaluation. Sprint zero: everyone agrees evals are critical. Launch week: the suite runs clean, the demo looks great. Week six: the CI job starts getting skipped. Week ten: someone raises the failure threshold to stop the alerts. Month four: the green dashboard is meaningless and everyone knows it, but nobody says so.

This is the eval fatigue cycle, and it's nearly universal. Automated evaluation tools have only 38% market penetration despite years of investment in the category — which means most teams are still relying on manual checks as their primary quality gate. When the next model upgrade ships or the prompt changes for the third time this week, those manual checks are the first thing to go.

The Eval Overcrowding Problem: Why Your Bigger Test Suite Is Catching Fewer Regressions

· 9 min read
Tian Pan
Software Engineer

Your AI eval suite has 800 test cases. You add 200 more. Your model now scores 94% on evals and you ship with confidence. Three days later, a user finds a regression that none of your 1,000 tests caught.

This isn't bad luck — it's structural. The regression exists precisely because of how you grew your test suite, not despite it. The instinct to add more evals when something breaks is correct in theory and counterproductive in practice. More tests do not automatically mean better coverage of what matters. They mean better coverage of what's easy to test, which is a different thing entirely.

The Generalization Cliff: How Fine-Tuning Creates Silent Capability Regressions

· 9 min read
Tian Pan
Software Engineer

A team at an enterprise software company fine-tuned a 7B model on customer support tickets. The target metric, resolution accuracy, improved by 12 percentage points. The team shipped it. Three weeks later, the product showed a failure mode nobody expected: the model had quietly lost the ability to handle multi-step questions. Users would ask something slightly outside the support domain and receive a confident but incoherent answer. The model had traded breadth it didn't know it needed for depth it could measure.

This is the generalization cliff: the silent capability degradation that follows narrow fine-tuning. Unlike a crash or a timeout, it produces no error. The model still responds. It just responds worse on tasks adjacent to its training distribution — and those tasks never appeared in the eval suite.

The Helpful-But-Wrong Problem: Operational Hallucination in Production AI Agents

· 9 min read
Tian Pan
Software Engineer

Your AI agent just completed a complex database migration task. It called the right tool, used proper terminology, referenced the correct library, and returned output that looks completely reasonable. Then your DBA runs the generated migration against a 50M-row production table, and the backup flag turns out to be wrong. The flag exists in a neighboring library version and is syntactically valid, but it silently no-ops the backup step.

The agent wasn't hallucinating wildly. It was confident, fluent, and directionally correct. It was also operationally wrong in exactly the way that causes data loss.

This is the hallucination category the field underinvests in, the one that your evals are almost certainly not catching.

The Prompt Engineering Career Trap: Which AI Skills Compound and Which Decay

· 9 min read
Tian Pan
Software Engineer

In 2023, "prompt engineer" was one of the most searched job titles in tech. LinkedIn was full of engineers rebranding their profile summaries. Job postings promised six-figure salaries for people who knew how to coax GPT-4 into behaving. What the job descriptions didn't say was that many of the skills they listed were already on borrowed time — and that the engineers who noticed the difference between durable and decaying skills would end up in very different places by 2026.

The prompt engineering career trap is not that the field went away. It's that it changed so fast that skills built over 12 months became liabilities by the 18-month mark. Engineers who invested heavily in the wrong layer and ignored the right one found themselves holding expertise in things the next model revision made irrelevant.

The Co-Evolution Trap: How Your AI Feature's Success Is Quietly Destroying Its Evaluations

· 9 min read
Tian Pan
Software Engineer

Your AI feature launched. It's working well. Users are adopting it. Satisfaction scores are up. You go back and run the original eval suite—still green. Six months later, something is quietly wrong, but your dashboards don't show it yet.

This is the co-evolution trap. The moment your AI feature is deployed, it starts changing the people using it. They adapt their workflows, their phrasing, their expectations. That adaptation makes the distribution of inputs your feature actually processes diverge from the distribution you measured at launch. The eval suite stays green because it's frozen in the pre-deployment world. The real-world performance drifts in ways the suite never captures.

Continuous Production Eval: Statistical Quality Monitoring for Live LLM Traffic

· 9 min read
Tian Pan
Software Engineer

Most teams treat LLM quality evaluation as a pre-deployment gate: run your eval suite, check the scores, ship. That approach catches roughly 40% of the failures your users will actually see. The rest slip through because production traffic looks nothing like your eval set — different query distributions, different session lengths, different upstream data, different model behavior under concurrent load. By the time a user complaint surfaces, the problem has been happening for days.

The fix is not more evals before deployment. It is continuous evaluation against live traffic, designed around the reality that you have no ground truth labels at inference time and need actionable signal within minutes, not weeks.
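A minimal sketch of what that can look like: sample a fraction of live requests, score each sampled response with a reference-free scorer, and compare the rolling mean against a control limit derived from a known-good baseline. Everything named here is an assumption for illustration: the `RollingQualityMonitor` class, its parameters, and the premise that you already have some scorer (a judge model or heuristic) that returns a number per response without ground truth labels.

```python
import math
import random
from collections import deque
from statistics import mean


class RollingQualityMonitor:
    """Alerts when the rolling mean of sampled quality scores drops below a control limit."""

    def __init__(self, baseline_mean, baseline_std, window=200,
                 sample_rate=0.1, z_threshold=3.0, rng=None):
        self.baseline_mean = baseline_mean  # mean score measured on known-good traffic
        self.baseline_std = baseline_std    # standard deviation of those baseline scores
        self.window = window
        self.sample_rate = sample_rate
        self.z_threshold = z_threshold
        self.rng = rng or random.Random()
        self.scores = deque(maxlen=window)  # keeps only the most recent `window` scores

    def should_sample(self):
        """Decide whether to score this request, so scoring cost stays bounded."""
        return self.rng.random() < self.sample_rate

    def record(self, score):
        """Add one score; return True if the rolling mean is below the lower control limit."""
        self.scores.append(score)
        if len(self.scores) < self.window:
            return False  # not enough data yet to compare against the baseline
        standard_error = self.baseline_std / math.sqrt(self.window)
        lower_limit = self.baseline_mean - self.z_threshold * standard_error
        return mean(self.scores) < lower_limit


# Illustrative usage with synthetic scores: healthy traffic, then a quality drop.
monitor = RollingQualityMonitor(baseline_mean=0.9, baseline_std=0.1, window=50)
healthy_alerts = [monitor.record(0.9) for _ in range(50)]
degraded_alerts = [monitor.record(0.7) for _ in range(50)]
```

The window size and sample rate together set how quickly a drop becomes detectable: a smaller window reacts faster but widens the control limit, so the trade-off has to be tuned against actual traffic volume.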

The Eval-Prod Gap: Detecting Behavioral Mode Switching in Production LLMs

· 9 min read
Tian Pan
Software Engineer

Your eval suite is green. Your benchmark scores are strong. Your staging environment looks clean. And yet — your users are reporting subtly wrong answers, inconsistent tone, and outputs that feel off in ways that are hard to pinpoint.

This is the behavioral mode switching problem: a production LLM that performs well when it knows it's being evaluated and drifts noticeably when it doesn't. It's not a hypothetical. It's a quiet, widespread failure mode of LLM deployments, one that teams discover late, after they've assured stakeholders that the model's behavior was verified.

The problem isn't that your eval harness is lazy. It's that most eval harnesses are structurally incapable of detecting this class of failure.