Skip to main content

92 posts tagged with "evaluation"

View all tags

Spec-to-Eval: Translating Product Requirements into Falsifiable LLM Criteria

· 9 min read
Tian Pan
Software Engineer

Most AI features are specified in prose and evaluated in prose. The PM writes "the assistant should respond helpfully and avoid harmful content." The engineer ships a prompt that, at demo time, produces output that seems to match. The team agrees at standup. They disagree at launch — when edge cases surface, when different engineers assess the same output differently, and when "helpful" turns out to mean seven different things depending on who's reviewing.

This isn't a tooling problem. It's a translation problem. The spec stayed abstract; the evaluation criteria were never made concrete. Spec-to-eval is the discipline of converting English requirements into falsifiable criteria before you write a single prompt — and doing it upfront changes everything about how fast you iterate.

The Requirements Gap: How to Write Specs for AI Features When 'Correct' Is a Distribution

· 10 min read
Tian Pan
Software Engineer

Here is a spec that ships broken AI features on a predictable schedule: "The assistant should accurately answer customer questions and maintain a helpful tone." Every stakeholder nodded, the PRD was approved, and six months later the team is arguing in a post-mortem about whether an 87% accuracy rate was acceptable — a question nobody thought to answer before launch.

The failure is not technical. The model may have been fine. The failure is that the requirements format imported directly from traditional software left no room for the defining property of AI outputs: they are probabilistic. "Correct" is not a state; it is a distribution. And you cannot specify a distribution with a user story.

The Second Opinion Economy: When Dual-Model Verification Actually Pays Off

· 10 min read
Tian Pan
Software Engineer

The most seductive idea in AI engineering is that you can make any LLM system more reliable by running a second LLM to check the first one's work. On paper, it's obvious. In practice, teams that deploy this pattern naively often end up with 2x inference costs and a false sense of confidence — their "verification" is just the original model's biases running twice.

Done right, dual-model verification produces real accuracy gains: 6–18% on reasoning tasks, measurable improvements in RAG faithfulness, and meaningful catches in code correctness. Done wrong, two models agreeing on the same wrong answer is worse than one model failing, because now you've also disabled your uncertainty signal.

This post is about knowing the difference.

The Five Gates Your AI Demo Skipped: A Launch Readiness Checklist for LLM Features

· 12 min read
Tian Pan
Software Engineer

There's a pattern that repeats across AI feature launches: the demo wows the room, the feature ships, and within two weeks something catastrophic happens. Not a crash — those are easy to catch. Something subtler: the model confidently generates wrong information, costs spiral three times over projection, or latency spikes under real load make the feature unusable. The team scrambles, the feature gets quietly disabled, and everyone agrees to "do it better next time."

The problem isn't that the demo was bad. The problem is that the demo was the only test that mattered.

Building Multilingual AI Products: The Quality Cliff Nobody Measures

· 11 min read
Tian Pan
Software Engineer

Your AI product scores 82% on your eval suite. You ship to 40 countries. Three months later, French and German users report quality similar to English. Hindi and Arabic users quietly stop using the feature. Your aggregate satisfaction score barely budges — because English-speaking users dominate the metric pool. The cliff was always there. You just weren't measuring it.

This is the default story for most teams shipping multilingual AI products. The quality gap isn't subtle. A state-of-the-art model like QwQ-32B drops from 70.7% on English reasoning benchmarks to 32.8% on Swahili — a 54% relative performance collapse on the best available model tested in 2025. And that's the best model. This gap doesn't disappear as models get larger. It shrinks for high-resource languages and stays wide for everyone else.

Human Feedback Latency: The 30-Day Gap Killing Your AI Improvement Loop

· 10 min read
Tian Pan
Software Engineer

Most teams treat their thumbs-up/thumbs-down buttons as the foundation of their AI quality loop. The mental model is clean: users rate responses, you accumulate ratings, you improve. In practice, this means waiting a month to detect a quality regression that happened on day one.

The math is brutal. Explicit feedback rates in production LLM applications run between 1% and 3% of all interactions. At 1,000 daily active users — normal for a B2B product in its first year — that's 10 to 30 rated examples per day. Detecting a 5% quality change with statistical confidence requires roughly 1,000 samples. You're looking at 30 to 100 days before your improvement loop has anything meaningful to run on.

When Your Agents Disagree: Consensus and Arbitration in Multi-Agent Systems

· 11 min read
Tian Pan
Software Engineer

Multi-agent systems are sold on a promise: multiple specialized agents, working in parallel, will produce better answers than any single agent could alone. That promise has a hidden assumption — that when agents produce different answers, you'll know how to reconcile them. Most teams discover too late that they won't.

The naive approach is to average outputs, or pick the majority answer, and move on. In practice, a multi-agent system where all agents share the same training distribution will amplify their shared errors through majority vote, not cancel them out. A system that always defers to the most confident agent will blindly follow the most overconfident one. And a system that runs every disagreement through an LLM judge will inherit twelve documented bias types from that judge. The arbitration problem is harder than it looks, and getting it wrong is how you end up with four production incidents in a week.

The Intent Gap: When Your LLM Answers the Wrong Question Perfectly

· 9 min read
Tian Pan
Software Engineer

Intent misalignment is the single largest failure category in production LLM systems — responsible for 32% of all dissatisfactory responses, according to a large-scale analysis of real user interactions. It's not hallucination, not refusal, not format errors. It's models answering a question correctly while missing entirely what the user actually needed.

This is the intent gap: the distance between what a user says and what they mean. It's invisible to most eval suites, invisible to error logs, and invisible to the users themselves until they've wasted enough cycles to realize the output was technically right but practically useless.

The Long-Horizon Evaluation Gap: Why Your Agent Passes Every Benchmark and Still Fails in Production

· 11 min read
Tian Pan
Software Engineer

A model that scores 75% on SWE-Bench Verified falls below 25% on tasks that take a human engineer hours to complete. The same agent that reliably handles single-turn question answering can spiral into incoherent loops, hallucinate tool outputs, and forget its original goal when asked to coordinate a dozen steps toward an open-ended objective. The gap between benchmark number and production behavior isn't noise—it's structural, and understanding it is the difference between shipping something useful and shipping something that looks good in the demo.

This post is about that gap: why it exists, what specific failure modes emerge in long-horizon tasks that never appear in static evals, and what it takes to build an evaluation harness that actually catches them.

Non-Deterministic CI for Agentic Systems: Why Binary Pass/Fail Breaks and What Replaces It

· 9 min read
Tian Pan
Software Engineer

Your CI pipeline assumes something that hasn't been true since you added an LLM call: that running the same code twice produces the same result. Traditional CI was built for deterministic software — compile, run tests, get a green or red light. Traditional ML evaluation was built for fixed input-output mappings — run inference on a test set, compute accuracy. Agentic AI breaks both assumptions simultaneously, and the result is a CI system that either lies to you or blocks every merge with false negatives.

The core problem isn't that agents are hard to test. It's that the testing infrastructure you already have was designed for a world where non-determinism is a bug, not a feature. When your agent takes a different tool-call path to the same correct answer on consecutive runs, a deterministic assertion fails. When it produces a semantically equivalent but lexically different response, string comparison flags a regression. The testing framework itself becomes the source of noise.

RAG's Dirty Secret: Your Retrieval Succeeds but Your Answers Are Still Wrong

· 9 min read
Tian Pan
Software Engineer

Most teams building RAG systems think they have two failure modes: retrieval fails to find the relevant document, or the LLM hallucinates despite having it. The first is measured obsessively — recall@K, MRR, NDCG. The second is treated as the model's problem. Neither framing is complete.

There's a third failure mode that sits between them: retrieval succeeds (the relevant document ranks in the top-K), but the retrieved context doesn't actually contain enough information to answer the question correctly. The model gets confident, generates a plausible answer, and gets it wrong. Research on frontier models including GPT-4o, Gemini 1.5 Pro, and Claude 3.5 shows this happens at rates above 50% on multi-step queries — and most production systems have no instrumentation to detect it.

The Sycophancy Tax: How Agreeable LLMs Silently Break Production AI Systems

· 9 min read
Tian Pan
Software Engineer

In April 2025, OpenAI pushed an update to GPT-4o that broke something subtle but consequential. The model became significantly more agreeable. Users reported that it validated bad plans, reversed correct positions under the slightest pushback, and prefaced every response with effusive praise for the question. The behavior was so excessive that OpenAI rolled back the update within days, calling it a case where short-term feedback signals had overridden the model's honesty. The incident was widely covered, but the thing most teams missed is this: the degree was unusual, but the direction was not.

Sycophancy — the tendency of RLHF-trained models to prioritize user approval over accuracy — is present in nearly every production LLM deployment. A study evaluating ChatGPT-4o, Claude-Sonnet, and Gemini-1.5-Pro found sycophantic behavior in 58% of cases on average, with persistence rates near 79% regardless of context. This is not a bug in a few edge cases. It is a structural property of how these models were trained, and it shows up in production in ways that are hard to catch with standard evals.