160 posts tagged with "evaluation"

The Eval Ceiling: When Your Golden Test Cases Stop Discriminating

· 10 min read
Tian Pan
Software Engineer

A year ago, your eval suite did its job beautifully. Candidate models came back with scores spread between 60 and 80, and the ranking told you something. The new fine-tune beat the baseline by six points; the cheaper model lost three. Decisions flowed from the numbers. Today, every candidate scores 95 or 96 or 97 on the same suite, and the spread has collapsed into noise. Your team is still running the eval, still reading the report, still using it to green-light migrations — but the report has stopped containing information.

This is not benchmark contamination. It is not world-drift decay. It is a measurement-instrument problem: your test cases were calibrated for a difficulty level that the platform has since passed. The ruler hasn't broken; the things you're measuring have outgrown it. And the team that doesn't notice keeps making model decisions with a tool whose discriminating range no longer overlaps the candidates being compared.
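One rough way to check whether a suite has hit this ceiling is to count how many test cases still separate the candidates at all. The sketch below is a minimal illustration, not a method from the post: the function names and the pass/fail data are hypothetical, and it assumes each candidate produces a simple 0/1 result per case.

```python
def discriminating_fraction(results):
    """Fraction of test cases on which the candidates disagree.

    results: dict mapping model name -> list of 0/1 pass flags,
    one flag per test case, in the same case order for every model.
    """
    per_case = list(zip(*results.values()))
    # A case carries ranking information only if some candidate
    # passes it and some other candidate fails it.
    mixed = sum(1 for outcomes in per_case if len(set(outcomes)) > 1)
    return mixed / len(per_case)


def score_spread(results):
    """Gap in percentage points between the best and worst candidate."""
    scores = [100 * sum(flags) / len(flags) for flags in results.values()]
    return max(scores) - min(scores)


# Hypothetical results for three candidates on a ten-case suite.
results = {
    "baseline":  [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    "fine_tune": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "cheap":     [1, 1, 1, 1, 1, 1, 1, 1, 0, 1],
}

fraction = discriminating_fraction(results)
spread = score_spread(results)
```

If only a small fraction of cases discriminate, the headline spread rests on a handful of items, and a one-case flip can reorder the ranking.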

The Latency Budget Negotiation: How to Tell Product That 'Real-Time' Costs Capability

· 11 min read
Tian Pan
Software Engineer

A product manager walks into a planning meeting with a one-line requirement: "responses under two seconds, like ChatGPT." The agent under discussion makes six tool calls, hits two retrieval indexes, runs a reasoning model with a thinking budget, and validates its output with a second-pass critic. End-to-end p50 is currently nine seconds. The engineering team has three options: say yes and quietly degrade the agent into something worse, say no and watch the PM go shopping for a vendor whose demo video promises the moon, or do the thing nobody teaches in onboarding — open a structured negotiation where every second of latency is convertible to a capability the agent gives up.

Most teams pick option one. The agent ships at two seconds, accuracy drops twelve points, the launch is called a success because the headline latency number was met, and three months later the team is fighting a quality regression that nobody can attribute to a single change because the regression was the launch itself. The latency target was never priced. It was inherited from a product spec that treated speed as free.
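Pricing the target can be as simple as writing down what each stage costs in seconds and what it buys in accuracy. The sketch below is an illustration under loud assumptions: the stage names echo the agent described above, but every latency and accuracy number is invented for the example, and real stages rarely contribute independently.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    latency_s: float     # hypothetical p50 contribution, in seconds
    accuracy_pts: float  # hypothetical accuracy lost if the stage is cut


def price_latency_target(stages, target_s):
    """Greedily cut the stages that cost the least accuracy per second
    saved until the total latency fits the target.

    Returns (remaining latency, accuracy points given up, names cut).
    """
    total = sum(s.latency_s for s in stages)
    cut = []
    for stage in sorted(stages, key=lambda s: s.accuracy_pts / s.latency_s):
        if total <= target_s:
            break
        cut.append(stage)
        total -= stage.latency_s
    return total, sum(s.accuracy_pts for s in cut), [s.name for s in cut]


# Illustrative numbers only; they sum to the nine-second p50 in the example.
stages = [
    Stage("tool_calls", 3.0, 6.0),
    Stage("retrieval", 1.5, 5.0),
    Stage("thinking_budget", 3.0, 4.0),
    Stage("critic_pass", 1.5, 2.0),
]

total, lost, cut = price_latency_target(stages, target_s=2.0)
```

The output is the price tag to bring to the negotiation: which capabilities a two-second target removes and roughly how much quality goes with them.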

Multimodal Channel Disagreement: When One Model Contradicts Itself Across Vision and Text

· 11 min read
Tian Pan
Software Engineer

The image is a photograph of a red octagonal stop sign. Someone has stuck a small sticker reading "YIELD" over the word in the middle. You ask the multimodal model: "What does this sign say?" The model answers: "The sign instructs drivers to yield to oncoming traffic at the intersection." Confident, fluent, and loyal to neither the visual evidence nor the textual evidence. It is a hybrid that splits the difference between channels that disagreed about what was true.

This failure mode does not have a settled name yet. Researchers studying multimodal hallucination call it "semantic hallucination," or "cross-modal bias," or "modality dominance," depending on which subfield is writing the paper. Practitioners shipping document AI, screenshot agents, and defect inspection systems run into it every week and describe it in their incident retros as "the model just made something up." It is not made up. It is the predictable output of an architecture that fuses two channels in its final layers without any primitive for representing the case where the channels say different things.

The Prompt Bench Press: Stress-Testing Prompts Outside the Happy Path

· 10 min read
Tian Pan
Software Engineer

A prompt that scores 92% on your eval set and 60% on real production traffic is not a prompt with a bug. It is a prompt whose evaluation set was structurally incapable of finding the bug. The gap is not noise. It is the consequence of optimizing against examples that share a register, a length distribution, a language, and a politeness level with the prompt's design intent — the very same intent that wrote the eval cases.

Real users do not cooperate with your design intent. They send three-word fragments, twelve-paragraph essays, code blocks pasted as questions, casual register that drops articles, formal register that adds honorifics, and queries in languages your few-shot examples never used. None of this is adversarial. It is just the input distribution. And if your eval set was curated by the same person who wrote the prompt, it almost certainly looks nothing like that distribution.

The discipline that closes this gap is not "more evals." It is a different kind of eval — a stress matrix that deliberately varies the dimensions your curated set holds constant, and that grades degradation curves rather than a single accuracy number. Call it the prompt bench press: you are not testing whether the prompt can do the work. You are testing how it fails as the input gets harder.
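A stress matrix like this can be graded with very little machinery. The sketch below is one possible shape, assuming each eval record is tagged with the dimension being varied and a difficulty level; the dimension names, levels, and results are hypothetical.

```python
from collections import defaultdict


def degradation_curves(records):
    """Group pass/fail results by stressed dimension and difficulty level.

    records: iterable of (dimension, level, passed) tuples, where level
    is an integer that grows as the input moves away from the happy path.
    Returns {dimension: [(level, accuracy), ...]} sorted by level.
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for dimension, level, passed in records:
        buckets[dimension][level].append(passed)
    return {
        dim: [(lvl, sum(flags) / len(flags)) for lvl, flags in sorted(levels.items())]
        for dim, levels in buckets.items()
    }


# Hypothetical results: accuracy holds across register but falls with length.
records = [
    ("length", 0, True), ("length", 0, True),
    ("length", 1, True), ("length", 1, False),
    ("length", 2, False), ("length", 2, False),
    ("register", 0, True), ("register", 1, True),
]

curves = degradation_curves(records)
```

Reading the curve per dimension, rather than one pooled accuracy, shows where the prompt starts to fail and how steeply.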

Sampling Drift: When Temperature and Top-P Become Tribal Knowledge

· 9 min read
Tian Pan
Software Engineer

Open the production config of any AI feature that has been live for more than a year and you will find an archaeological dig site. temperature: 0.7 because someone needed the demo to feel less robotic. top_p: 0.85 because a customer complained the outputs were too generic. frequency_penalty: 0.4 because there was a bad week in 2024 where a now-retired model kept repeating itself. None of these decisions are documented. None of them have been re-tested against the current foundation model. They run on every request, in every eval, in every A/B, shaping behavior nobody has consciously chosen since the original ticket got closed.

This is sampling drift. It is the slow accumulation of expedient sampler tweaks whose original justifications evaporate while their effects compound. The values in your config are not "tuned" — they are a fossil record of past incidents, scaled to the volume of your current traffic.

The reason it is invisible is structural. Every eval you run scores against the current sampling config, so the headline number always looks fine. There is no alarm that fires when a temperature value is two foundation-model versions out of date. There is no calendar invite that says "re-grid sampling parameters this quarter." The decay is silent until somebody runs a clean experiment and finds a quality lift, a token reduction, or both, sitting in plain sight at no engineering cost.
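A "clean experiment" here can be a plain grid sweep over the sampler settings against the current model. The sketch below assumes a hypothetical `evaluate` callable that runs your eval suite at a given temperature and top_p and returns a single score; the stand-in scorer exists only so the example runs on its own.

```python
from itertools import product


def regrid(evaluate, temperatures, top_ps):
    """Score every (temperature, top_p) pair and return them best-first.

    evaluate: callable taking temperature= and top_p= keyword arguments
    and returning a score where higher is better (an assumption here).
    """
    scored = [
        ((t, p), evaluate(temperature=t, top_p=p))
        for t, p in product(temperatures, top_ps)
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def fake_evaluate(temperature, top_p):
    # Stand-in for a real eval run; the peak location is invented.
    return 1.0 - abs(temperature - 0.3) - abs(top_p - 0.95)


ranking = regrid(fake_evaluate, temperatures=[0.0, 0.3, 0.7], top_ps=[0.85, 0.95, 1.0])
best_params, best_score = ranking[0]
```

Comparing the best cell against the inherited config is what exposes the drift: if the fossil values are not at or near the top, nobody chose them for the current model.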

The Phantom Skill: When Your Agent Demonstrates Capabilities You Never Tested For

· 11 min read
Tian Pan
Software Engineer

A customer posts a screenshot in your support channel. They've been using your scheduling agent to negotiate three-way meeting times across timezones in mixed English and Japanese, with the agent producing suggested slots in both languages and reasoning about Japanese business etiquette. It works. Leadership shares it on Slack with a fire emoji. The PM updates the marketing copy.

Nobody on the team wrote that capability. No eval covers it. No prompt instruction mentions Japanese, etiquette, or three-way coordination. The behavior is real, but it was never engineered, never measured, and is now in your product surface area.

This is a phantom skill: a capability your agent demonstrates that no test ever verified. It isn't a bug. It isn't quite a feature either. It's load-bearing behavior with no contract, and it's the failure mode that quietly defines what your "AI product" actually is.

Production Bias Auditing: Catching AI Discrimination Before Your Users Do

· 11 min read
Tian Pan
Software Engineer

The most expensive bias bug I've seen in production was discovered by a Twitter thread, not a dashboard. A small team had shipped a credit-scoring assistant. They'd run the standard pre-launch audit: balanced training set, adversarial debiasing, equalized-odds gap under five percent on the holdout. A month after launch, a user posted screenshots showing women in their household consistently received lower limits than men with identical financials. By the time the team's monitoring caught up, the regulator had already opened an inquiry.

The lesson isn't that the team was lazy. They ran exactly the audit the literature recommends. The lesson is that pre-launch audits measure a snapshot of a model that no longer exists by the time real users hit it. Distribution shifts. New populations show up. A prompt-template change introduces a phrasing artifact that interacts with names. A model upgrade quietly trades calibration for a fluency win. The audit you ran in November does not protect the model running in production in May.
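One way to turn the snapshot into a recurring check is to compute the same fairness gap on each window of production traffic. The sketch below uses a common definition of the equalized-odds gap (the larger of the true-positive-rate and false-positive-rate differences across groups); the function names and data are hypothetical, and it assumes binary labels and predictions with ground truth available for the window.

```python
def _rate(y_true, y_pred, label):
    """Share of examples with the given true label that were predicted positive."""
    preds = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(preds) / len(preds) if preds else 0.0


def equalized_odds_gap(y_true, y_pred, groups):
    """Largest between-group difference in TPR or FPR."""
    tprs, fprs = [], []
    for g in set(groups):
        idx = [i for i, grp in enumerate(groups) if grp == g]
        t = [y_true[i] for i in idx]
        p = [y_pred[i] for i in idx]
        tprs.append(_rate(t, p, 1))  # true positive rate for this group
        fprs.append(_rate(t, p, 0))  # false positive rate for this group
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))


# Hypothetical window of decisions: group "b" is approved less often
# than group "a" despite identical ground truth.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

gap = equalized_odds_gap(y_true, y_pred, groups)
```

Run per week or per release, the same number that passed at launch becomes a monitored metric instead of a certificate.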

The Thumbs-Down on the Right Answer: When User Feedback Trains Sycophancy

· 9 min read
Tian Pan
Software Engineer

A tax assistant tells the user they owe $4,200. The user clicks thumbs-down. A code reviewer flags a real bug in the user's PR. Thumbs-down. A calendar agent correctly says no slot is available before Friday. Thumbs-down. Six months later, the team's prompt iteration has converged on an agent that hedges, equivocates, and cheerfully suggests the math might be off — and CSAT is up.

The thumbs-down button does not measure quality. It measures the conjunction of quality and palatability, and a feedback-driven optimization loop that does not separate those two things will train sycophancy and call it product-market fit. This is not a hypothetical risk. In April 2025, OpenAI rolled back a GPT-4o update after admitting that a new reward signal based on thumbs-up/down feedback "weakened the influence of our primary reward signal, which had been holding sycophancy in check." A model that endorsed stopping medication and praised obvious nonsense had passed every internal preference metric.

The 80% Trap: How Aggregate RAG Metrics Hide Systematic Long-Tail Failures

· 9 min read
Tian Pan
Software Engineer

Your RAG pipeline hit 80% retrieval accuracy on the eval set. The team ships it. Three weeks later, a customer complains that the system confidently answers questions about your product's legacy integration in ways that are flatly wrong. You investigate, run the query through your pipeline, and it retrieves perfectly relevant documents — for the general topic. The three specific documents that cover the legacy integration edge case are sitting in your corpus, never surfaced.

That 80% number was real. It was also nearly useless as a signal for what just happened.
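The arithmetic behind this is easy to reproduce. The sketch below assumes each eval query carries a slice tag and a flag for whether the relevant document was retrieved; the slice names and counts are hypothetical, chosen so the aggregate lands on 80%.

```python
from collections import defaultdict


def hit_rates(records):
    """Return the aggregate retrieval hit rate and the rate per slice.

    records: iterable of (slice_name, hit) pairs, where hit is True when
    the relevant document was retrieved for that query.
    """
    by_slice = defaultdict(list)
    for slice_name, hit in records:
        by_slice[slice_name].append(hit)
    all_hits = [hit for hits in by_slice.values() for hit in hits]
    aggregate = sum(all_hits) / len(all_hits)
    per_slice = {name: sum(hits) / len(hits) for name, hits in by_slice.items()}
    return aggregate, per_slice


# Hypothetical eval set: eight general queries succeed, two legacy ones fail.
records = [("general", True)] * 8 + [("legacy_integration", False)] * 2

aggregate, per_slice = hit_rates(records)
```

The aggregate reads as a healthy 0.8 while one slice sits at zero, which is the failure the headline number cannot show.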

Ensemble vs. Debate: The Two Multi-Model Verification Paradigms and When Each Fails

· 9 min read
Tian Pan
Software Engineer

When a single LLM gives you the wrong answer, the instinct is to ask more models. Run three in parallel and take the majority — that's ensemble. Or put them in a room and let them argue it out — that's debate. Both feel rigorous. Both have peer-reviewed results behind them. And both fail in exactly the same way when the conditions aren't right, which is the part practitioners rarely discuss.

The failure mode isn't subtle: when all your models learned from the same data, carry the same biases, or were trained by people with the same worldview, asking more of them doesn't give you more signal. It gives you more confident noise. Recent research has put a number on this: the pairwise error correlation between top frontier models sits around r = 0.77. That means roughly 60% of error variance is shared. Three models from different providers are effectively 1.3 independent models, not 3.0.
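The shared-variance figure follows directly from squaring the correlation. The effective-model count depends on which definition the underlying research uses, and the post does not say. The sketch below uses the standard equal-correlation formula n / (1 + (n - 1) * r) as an assumption; it gives a value slightly below the quoted 1.3 but points to the same conclusion.

```python
def shared_variance(r):
    """Fraction of variance two equally correlated error signals share."""
    return r ** 2


def effective_models(n, r):
    """Effective number of independent voters among n models whose errors
    all have pairwise correlation r (equal-correlation assumption)."""
    return n / (1 + (n - 1) * r)


r = 0.77
shared = shared_variance(r)     # about 0.59, i.e. roughly 60%
effective = effective_models(3, r)  # about 1.18 under this formula
```

With uncorrelated errors the same formula returns 3.0 for three models, so the gap between that and a value near 1.2 is the signal lost to shared training data and shared biases.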

The Feedback Signal Timing Problem: Why Your AI Metrics Are Lying to You

· 9 min read
Tian Pan
Software Engineer

When Klarna deployed its AI customer service chatbot in early 2024, it processed 2.3 million conversations in the first month. Satisfaction scores matched human agents. Executives declared victory. By 2025, the company was quietly hiring back the human agents it had replaced.

What went wrong? The metrics told one story while users experienced another. The chatbot aced simple, transactional queries—order status, payment questions—but fell apart on complex disputes, fraud claims, and emotionally difficult conversations. CSAT scores averaged across all interaction types couldn't detect this. The system appeared to be working even as it was slowly eroding user trust.

This isn't a Klarna-specific failure. It's a pattern that repeats across AI product development: teams collect satisfaction signals, optimize against them, and discover too late that the signals were measuring something other than actual value. The problem isn't the tools—it's the timing mismatch between when feedback arrives and when the consequences of a response become clear.

LLM-as-Judge Adversarial Failures: When Your Eval Harness Gets Gamed

· 9 min read
Tian Pan
Software Engineer

Your LLM-as-judge gave your new model a clean bill of health. Win rates are up, rubric scores improved across the board, and the automated eval pipeline ran green. Then you shipped — and user satisfaction dropped.

This is not an edge case. Researchers built constant-output "null models" that produce the exact same response regardless of input and gamed AlpacaEval 2.0 to an 86.5% length-controlled win rate. The verified state of the art at the time was 57.5%. When a model with no task capability at all can top your leaderboard, your eval harness has a problem that's worth understanding systematically.