148 posts tagged with "evaluation"

Building Multilingual AI Products: The Quality Cliff Nobody Measures

· 11 min read
Tian Pan
Software Engineer

Your AI product scores 82% on your eval suite. You ship to 40 countries. Three months later, French and German users report quality similar to English. Hindi and Arabic users quietly stop using the feature. Your aggregate satisfaction score barely budges — because English-speaking users dominate the metric pool. The cliff was always there. You just weren't measuring it.

This is the default story for most teams shipping multilingual AI products. The quality gap isn't subtle. A state-of-the-art model like QwQ-32B drops from 70.7% on English reasoning benchmarks to 32.8% on Swahili, a 54% relative performance collapse, and that was the best available model tested in 2025. The gap doesn't disappear as models get larger. It shrinks for high-resource languages and stays wide for everyone else.
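To make the arithmetic concrete, here is a minimal sketch. The two benchmark scores are the ones quoted above. The traffic shares and per-language eval scores in `per_language` are hypothetical numbers chosen for illustration, not data from the post. They show how an aggregate weighted by traffic can sit near 82% while two languages are far below it.

```python
# Scores quoted above for QwQ-32B (English vs. Swahili reasoning benchmarks).
ENGLISH_SCORE = 70.7
SWAHILI_SCORE = 32.8


def relative_drop(baseline: float, score: float) -> float:
    """Fraction of the baseline score that is lost."""
    return (baseline - score) / baseline


print(f"relative drop: {relative_drop(ENGLISH_SCORE, SWAHILI_SCORE):.1%}")  # 53.6%

# HYPOTHETICAL numbers: language -> (share of traffic, eval score).
per_language = {
    "en": (0.70, 0.85),
    "fr": (0.12, 0.83),
    "de": (0.10, 0.82),
    "hi": (0.05, 0.55),
    "ar": (0.03, 0.50),
}


def aggregate_score(stats: dict[str, tuple[float, float]]) -> float:
    """Traffic-weighted average, i.e. what a single pooled metric reports."""
    return sum(share * score for share, score in stats.values())


def worst_language(stats: dict[str, tuple[float, float]]) -> str:
    """Language with the lowest eval score, regardless of traffic."""
    return min(stats, key=lambda lang: stats[lang][1])


print(f"aggregate: {aggregate_score(per_language):.2f}")  # 0.82
print(f"worst language: {worst_language(per_language)}")  # ar
```

Reporting the per-language minimum next to the pooled number is one simple way to keep the cliff visible.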

Human Feedback Latency: The 30-Day Gap Killing Your AI Improvement Loop

· 10 min read
Tian Pan
Software Engineer

Most teams treat their thumbs-up/thumbs-down buttons as the foundation of their AI quality loop. The mental model is clean: users rate responses, you accumulate ratings, you improve. In practice, this means waiting a month to detect a quality regression that happened on day one.

The math is brutal. Explicit feedback rates in production LLM applications run between 1% and 3% of all interactions. At 1,000 daily active users, normal for a B2B product in its first year, and roughly one interaction per user per day, that's 10 to 30 rated examples per day. Detecting a 5% quality change with statistical confidence requires roughly 1,000 samples. You're looking at 30 to 100 days before your improvement loop has anything meaningful to run on.
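The same calculation as a small sketch you can rerun with your own numbers. The function name `days_to_collect` is hypothetical, and the one-interaction-per-user-per-day figure is an assumption used to match the arithmetic above. The 1,000-sample target is taken from the text as given, not derived here.

```python
def days_to_collect(samples_needed: int,
                    daily_interactions: int,
                    feedback_rate: float) -> float:
    """Days of traffic needed to accumulate `samples_needed` explicit ratings."""
    ratings_per_day = daily_interactions * feedback_rate
    return samples_needed / ratings_per_day


# ASSUMPTION: 1,000 daily active users, about one interaction each per day.
DAILY_INTERACTIONS = 1_000
SAMPLES_NEEDED = 1_000  # figure quoted in the text above

for rate in (0.01, 0.03):
    days = days_to_collect(SAMPLES_NEEDED, DAILY_INTERACTIONS, rate)
    print(f"feedback rate {rate:.0%}: about {days:.0f} days")
```

At a 3% feedback rate the wait is about a month, and at 1% it is about 100 days, which is where the 30-to-100-day range comes from.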

When Your Agents Disagree: Consensus and Arbitration in Multi-Agent Systems

· 11 min read
Tian Pan
Software Engineer

Multi-agent systems are sold on a promise: multiple specialized agents, working in parallel, will produce better answers than any single agent could alone. That promise has a hidden assumption — that when agents produce different answers, you'll know how to reconcile them. Most teams discover too late that they won't.

The naive approach is to average outputs, or pick the majority answer, and move on. In practice, a multi-agent system where all agents share the same training distribution will amplify their shared errors through majority vote, not cancel them out. A system that always defers to the most confident agent will blindly follow the most overconfident one. And a system that runs every disagreement through an LLM judge will inherit twelve documented bias types from that judge. The arbitration problem is harder than it looks, and getting it wrong is how you end up with four production incidents in a week.
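A toy calculation shows why shared errors defeat majority vote. This is an idealized model with assumptions the post does not state: a binary right-or-wrong task, an odd number of agents, and identical per-agent accuracy. Real agents sit somewhere between the two extremes below.

```python
from math import comb


def majority_accuracy_independent(n_agents: int, p_correct: float) -> float:
    """P(majority is right) when every agent errs independently."""
    needed = n_agents // 2 + 1  # smallest strict majority
    return sum(
        comb(n_agents, k) * p_correct**k * (1 - p_correct) ** (n_agents - k)
        for k in range(needed, n_agents + 1)
    )


def majority_accuracy_fully_correlated(n_agents: int, p_correct: float) -> float:
    """P(majority is right) when all agents make the same mistakes.

    Every agent gives the same answer, so the vote is unanimous and the
    ensemble is exactly as accurate as one agent.
    """
    return p_correct


for n in (1, 3, 5):
    independent = majority_accuracy_independent(n, 0.7)
    correlated = majority_accuracy_fully_correlated(n, 0.7)
    print(f"{n} agents: independent={independent:.3f} correlated={correlated:.3f}")
```

With independent errors, three agents at 70% accuracy reach about 78% by voting. With fully shared errors, adding agents buys nothing, and the unanimous vote makes the wrong answers look more trustworthy than they are.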

The Intent Gap: When Your LLM Answers the Wrong Question Perfectly

· 9 min read
Tian Pan
Software Engineer

Intent misalignment is the single largest failure category in production LLM systems, responsible for 32% of all unsatisfactory responses according to a large-scale analysis of real user interactions. It's not hallucination, not refusal, not format errors. It's the model answering a question correctly while entirely missing what the user actually needed.

This is the intent gap: the distance between what a user says and what they mean. It's invisible to most eval suites, invisible to error logs, and invisible to the users themselves until they've wasted enough cycles to realize the output was technically right but practically useless.

The Long-Horizon Evaluation Gap: Why Your Agent Passes Every Benchmark and Still Fails in Production

· 11 min read
Tian Pan
Software Engineer

A model that scores 75% on SWE-Bench Verified falls below 25% on tasks that take a human engineer hours to complete. The same agent that reliably handles single-turn question answering can spiral into incoherent loops, hallucinate tool outputs, and forget its original goal when asked to coordinate a dozen steps toward an open-ended objective. The gap between benchmark number and production behavior isn't noise—it's structural, and understanding it is the difference between shipping something useful and shipping something that looks good in the demo.

This post is about that gap: why it exists, what specific failure modes emerge in long-horizon tasks that never appear in static evals, and what it takes to build an evaluation harness that actually catches them.

Non-Deterministic CI for Agentic Systems: Why Binary Pass/Fail Breaks and What Replaces It

· 9 min read
Tian Pan
Software Engineer

Your CI pipeline assumes something that hasn't been true since you added an LLM call: that running the same code twice produces the same result. Traditional CI was built for deterministic software: compile, run tests, get a green or red light. Traditional ML evaluation was built for fixed input-output mappings: run inference on a test set, compute accuracy. Agentic AI breaks both assumptions simultaneously, and the result is a CI system that either lies to you or blocks every merge with spurious failures.

The core problem isn't that agents are hard to test. It's that the testing infrastructure you already have was designed for a world where non-determinism is a bug, not a feature. When your agent takes a different tool-call path to the same correct answer on consecutive runs, a deterministic assertion fails. When it produces a semantically equivalent but lexically different response, string comparison flags a regression. The testing framework itself becomes the source of noise.
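One common alternative to a single binary assertion is to run the same check several times and gate on the pass rate. The sketch below illustrates that general idea and is not necessarily the approach the post goes on to recommend. The names `pass_rate`, `gate`, and `fake_agent_trial`, the 20-trial count, and the 0.8 threshold are all hypothetical.

```python
import random
from typing import Callable


def pass_rate(run_trial: Callable[[], bool], n_trials: int) -> float:
    """Run the same check n_trials times and return the fraction that pass."""
    passes = sum(1 for _ in range(n_trials) if run_trial())
    return passes / n_trials


def gate(run_trial: Callable[[], bool],
         n_trials: int = 20,
         min_pass_rate: float = 0.8) -> bool:
    """Green if the observed pass rate meets the threshold, red otherwise."""
    return pass_rate(run_trial, n_trials) >= min_pass_rate


# HYPOTHETICAL stand-in for an agent run. The tool path varies between runs,
# and the final answer is right about 90% of the time.
rng = random.Random(0)


def fake_agent_trial() -> bool:
    tool_path = rng.choice([("search", "summarize"), ("lookup", "summarize")])
    answer = " 42 " if rng.random() < 0.9 else "41"
    # Assert on the normalized outcome only. The tool path is deliberately
    # ignored, so a different route to the same answer is not a failure.
    return answer.strip() == "42"


print("merge allowed:", gate(fake_agent_trial))
```

The threshold and trial count are a trade-off: more trials cost more tokens and time, and fewer trials make the gate itself noisy.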

RAG's Dirty Secret: Your Retrieval Succeeds but Your Answers Are Still Wrong

· 9 min read
Tian Pan
Software Engineer

Most teams building RAG systems think they have two failure modes: retrieval fails to find the relevant document, or the LLM hallucinates despite having it. The first is measured obsessively — recall@K, MRR, NDCG. The second is treated as the model's problem. Neither framing is complete.

There's a third failure mode that sits between them: retrieval succeeds (the relevant document ranks in the top-K), but the retrieved context doesn't actually contain enough information to answer the question correctly. The model gets confident, generates a plausible answer, and gets it wrong. Research on frontier models including GPT-4o, Gemini 1.5 Pro, and Claude 3.5 shows this happens at rates above 50% on multi-step queries — and most production systems have no instrumentation to detect it.

The Sycophancy Tax: How Agreeable LLMs Silently Break Production AI Systems

· 9 min read
Tian Pan
Software Engineer

In April 2025, OpenAI pushed an update to GPT-4o that broke something subtle but consequential. The model became significantly more agreeable. Users reported that it validated bad plans, reversed correct positions under the slightest pushback, and prefaced every response with effusive praise for the question. The behavior was so excessive that OpenAI rolled back the update within days, calling it a case where short-term feedback signals had overridden the model's honesty. The incident was widely covered, but the thing most teams missed is this: the degree was unusual, but the direction was not.

Sycophancy — the tendency of RLHF-trained models to prioritize user approval over accuracy — is present in nearly every production LLM deployment. A study evaluating ChatGPT-4o, Claude-Sonnet, and Gemini-1.5-Pro found sycophantic behavior in 58% of cases on average, with persistence rates near 79% regardless of context. This is not a bug in a few edge cases. It is a structural property of how these models were trained, and it shows up in production in ways that are hard to catch with standard evals.

In Defense of AI Evals, for Everyone

· 7 min read
Tian Pan
Software Engineer

Every few months, a new wave of "don't bother with evals" takes hold in the AI engineering community. The argument usually goes: evals are too expensive, too brittle, too hard to define, and ultimately not worth the overhead for a fast-moving product team. Ship, iterate, and trust your instincts.

This is bad advice that produces bad software. A 2026 LangChain survey found that only 52% of organizations run offline evaluations and just 37% run online evals against live traffic — yet 32% cite quality as their number one barrier to production deployment. That is not a coincidence.

Evaluating AI Agents: Why Grading Outcomes Alone Will Lie to You

· 10 min read
Tian Pan
Software Engineer

An agent you built scores 82% on final-output evaluations. You ship it. Two weeks later, your support queue fills up with users complaining that the agent is retrieving the wrong data, calling APIs with wrong parameters, and producing confident-sounding responses built on faulty intermediate work. You go back and look at the traces — and realize the agent was routing incorrectly on 40% of queries the whole time. The final-output eval never caught it because, often enough, the agent stumbled into a correct answer anyway.

This is the core trap in agent evaluation: measuring only what comes out the other end tells you nothing about how the agent got there, and "getting there" is where most failures live.
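A minimal sketch of scoring the path as well as the outcome. The `Trace` fields and the ten sample traces are hypothetical, built only to show how a healthy outcome score can coexist with a poor routing score.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    """HYPOTHETICAL record of one agent run."""
    expected_route: str   # tool or sub-agent the query should have gone to
    actual_route: str     # where the agent actually sent it
    final_correct: bool   # did the final answer pass the output check?


def outcome_accuracy(traces: list[Trace]) -> float:
    return sum(t.final_correct for t in traces) / len(traces)


def routing_accuracy(traces: list[Trace]) -> float:
    return sum(t.actual_route == t.expected_route for t in traces) / len(traces)


def lucky_passes(traces: list[Trace]) -> int:
    """Runs that reached a correct answer through the wrong route."""
    return sum(
        t.final_correct and t.actual_route != t.expected_route for t in traces
    )


# Made-up sample: 6 clean runs, 2 lucky passes, 2 visible failures.
traces = (
    [Trace("billing_db", "billing_db", True)] * 6
    + [Trace("billing_db", "web_search", True)] * 2
    + [Trace("billing_db", "web_search", False)] * 2
)

print(f"outcome accuracy: {outcome_accuracy(traces):.0%}")  # 80%
print(f"routing accuracy: {routing_accuracy(traces):.0%}")  # 60%
print(f"lucky passes: {lucky_passes(traces)}")              # 2
```

The lucky passes are the runs an outcome-only eval counts as wins, even though the agent took the wrong route to get there.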

What AI Benchmarks Actually Measure (And Why You Shouldn't Trust the Leaderboard)

· 10 min read
Tian Pan
Software Engineer

When GPT-4o, Claude 3.5 Sonnet, and Llama 3.1 405B all score 88–93% on MMLU, what does that number actually tell you about which model to deploy? The uncomfortable answer: almost nothing. The benchmark that once separated capable models from mediocre ones has saturated. Every frontier model aces it, yet they behave very differently in production. The gap between benchmark performance and real-world utility has never been wider, and understanding why is now essential for any engineer building on top of LLMs.

Benchmarks feel rigorous because they produce numbers. A number looks like measurement, and measurement looks like truth. But the legitimacy of a benchmark score depends entirely on the validity of what it's measuring—and that validity breaks down in ways that are rarely surfaced on leaderboards.

The Unglamorous Work Behind Rapidly Improving AI Products

· 9 min read
Tian Pan
Software Engineer

Most AI teams hit the same wall six weeks after launch. Initial demos were impressive, the prototype shipped on time, and early users said nice things. Then the gap between "good enough to show" and "good enough to keep" becomes unavoidable. The team scrambles — tweaking prompts, swapping models, adding guardrails — and the product barely moves.

The teams that actually improve quickly share one counterintuitive habit: they spend less time on architecture and more time staring at data. Not dashboards. Not aggregate metrics. The raw, ugly, individual failures that live inside conversation logs.

This is a field guide to the practices that separate fast-moving AI teams from ones that stay stuck.