
678 posts tagged with "ai-engineering"


Goodhart's Law in Your LLM Eval Suite: When Optimizing the Score Breaks the System

· 9 min read
Tian Pan
Software Engineer

Andrej Karpathy put it bluntly: AI labs were "overfitting" to Arena rankings. One major lab privately evaluated 27 model variants before their public release, publishing only the top performer. Researchers estimated that selective submission alone could artificially inflate leaderboard scores by up to 112%. The crowdsourced evaluation system that everyone pointed to as ground truth had become a target — and once it became a target, it stopped being a useful measure.

This is Goodhart's Law in action: when a measure becomes a target, it ceases to be a good measure. It's been well-understood in economics and policy for decades. In LLM engineering, it's actively destroying eval suites right now, often without the teams building them realizing it.
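The mechanics are plain statistics. Here is a minimal simulation of selective submission, with invented numbers rather than the researchers' data: every variant has the same true skill, and the only thing submitting the best of 27 buys you is the noise.

```python
import random

# Minimal sketch of selective submission bias (hypothetical numbers,
# not the methodology from the cited research). Every variant has the
# same true skill; measured scores differ only by evaluation noise.
random.seed(0)

TRUE_SKILL = 1200   # "real" Arena-style rating of the model family
EVAL_NOISE = 40     # standard deviation of the crowdsourced measurement
N_VARIANTS = 27     # variants evaluated privately before release

def measured_score() -> float:
    return random.gauss(TRUE_SKILL, EVAL_NOISE)

honest = measured_score()                                     # submit one variant
best_of_n = max(measured_score() for _ in range(N_VARIANTS))  # submit only the winner

print(f"single submission:   {honest:.0f}")
print(f"best of {N_VARIANTS} variants: {best_of_n:.0f}")
print(f"inflation from selection alone: {best_of_n - TRUE_SKILL:.0f} points")
```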

Machine-Readable Project Context: Why Your CLAUDE.md Matters More Than Your Model

· 8 min read
Tian Pan
Software Engineer

Most teams that adopt AI coding agents spend the first week arguing about which model to use. They benchmark Opus vs. Sonnet vs. GPT-4o on contrived examples, obsess over the leaderboard, and eventually pick something. Then they spend the next three months wondering why the agent keeps rebuilding the wrong abstractions, ignoring their test strategy, and repeatedly asking which package manager to use.

The model wasn't the problem. The context file was.

Every AI coding tool — Claude Code, Cursor, GitHub Copilot, Windsurf — reads a project-specific markdown file at the start of each session. These files go by different names: CLAUDE.md, .cursor/rules/, .github/copilot-instructions.md, AGENTS.md. But they share the same purpose: teaching the agent what it cannot infer from reading the code alone. The quality of this file now predicts output quality more reliably than the model behind it. Yet most teams write them once, badly, and never touch them again.
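For a rough sense of what "what the agent cannot infer from the code" means, here is a hypothetical CLAUDE.md fragment, not a template from any vendor:

```markdown
# CLAUDE.md (hypothetical example, not a vendor template)

## Project conventions the code does not reveal
- Package manager: pnpm (never npm or yarn); run `pnpm install`, `pnpm test`.
- Tests: colocated `*.test.ts` files run with Vitest; do not add Jest.
- Errors: services return typed `Result` objects; throwing is reserved for
  programmer errors, not expected failures.

## Architecture decisions
- `packages/core` must stay framework-free; UI concerns live in `packages/web`.
- Database access goes through the repository layer in `packages/core/db`;
  never import the ORM client directly in route handlers.

## Things the agent should not do
- Do not introduce new dependencies without asking.
- Do not rewrite existing migrations; add new ones.
```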

Measuring Real AI Coding Productivity: The Metrics That Survive the 90-Day Lag

· 9 min read
Tian Pan
Software Engineer

Most teams adopting AI coding tools hit the same wall. Month one looks like a success story: PR throughput is up, sprint velocity is climbing, and the engineering manager is putting together a slide deck to share with leadership. By month three, something has quietly gone wrong. Incidents creep up. Senior engineers are spending more time in review. A simple bug fix now requires understanding code nobody on the team actually wrote. The productivity gains have evaporated — but the measurement system never caught it.

The problem is that the metrics most teams reach for first — lines generated, PRs merged, story points burned — are the wrong unit of measurement for AI-assisted development. They measure the cost of producing code, not the cost of owning it. And AI has made production nearly free while leaving ownership costs untouched.
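One ownership-side metric that does survive the lag is churn: how much merged work gets reverted or substantially rewritten within 90 days. A sketch, with hypothetical field names standing in for data from your Git host:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Sketch of an ownership-side metric: how much merged work gets reverted or
# rewritten within 90 days. Field names are hypothetical; in practice this
# data comes from your Git host and review tooling.

@dataclass
class MergedChange:
    merged_at: datetime
    ai_assisted: bool
    reverted: bool            # change was reverted outright
    rework_lines: int         # lines of this change later rewritten by others
    total_lines: int

def churn_rate(changes: list[MergedChange], window_days: int = 90) -> float:
    """Share of merged lines that did not survive the window unchanged."""
    cutoff = datetime.now() - timedelta(days=window_days)
    # Only count changes old enough for the full window to have elapsed.
    in_window = [c for c in changes if c.merged_at <= cutoff]
    if not in_window:
        return 0.0
    reworked = sum(c.total_lines if c.reverted else c.rework_lines for c in in_window)
    total = sum(c.total_lines for c in in_window)
    return reworked / total

# Compare AI-assisted vs. hand-written changes over the same window:
# churn_rate([c for c in changes if c.ai_assisted]) vs. the rest.
```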

Quality-Aware Model Routing: Why Optimizing for Cost Alone Wrecks Your AI Product

· 9 min read
Tian Pan
Software Engineer

Every team that ships LLM routing starts the same way: sort models by price, send easy queries to the cheap one, hard queries to the expensive one, celebrate the 60% cost reduction. Six weeks later, someone notices that contract analysis accuracy dropped from 94% to 79%, the coding assistant started hallucinating API endpoints that don't exist, and customer satisfaction on complex support tickets fell off a cliff — all while the routing dashboard showed "95% quality maintained."

The problem isn't routing itself. Cost-optimized routing treats all quality degradation as equal, when in practice the queries you're downgrading are disproportionately the ones where quality matters most.
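The alternative is to route against a per-task quality floor rather than price alone. A minimal sketch, with invented model names, scores, and thresholds:

```python
# Sketch of quality-aware routing: every task class carries a quality floor,
# and the router picks the cheapest model that clears it. Model names,
# eval scores, and thresholds below are illustrative, not measured.

MODELS = [
    # (name, cost per 1M tokens, offline eval score per task class)
    ("small",  0.15, {"faq": 0.93, "contract_analysis": 0.71, "code": 0.78}),
    ("medium", 1.00, {"faq": 0.95, "contract_analysis": 0.88, "code": 0.90}),
    ("large",  8.00, {"faq": 0.97, "contract_analysis": 0.95, "code": 0.96}),
]

QUALITY_FLOOR = {"faq": 0.90, "contract_analysis": 0.93, "code": 0.90}

def route(task_class: str) -> str:
    """Cheapest model whose measured quality clears the floor for this task."""
    floor = QUALITY_FLOOR[task_class]
    eligible = [(cost, name) for name, cost, scores in MODELS
                if scores[task_class] >= floor]
    if not eligible:
        raise RuntimeError(f"no model meets the quality floor for {task_class!r}")
    return min(eligible)[1]

print(route("faq"))                # -> small: the cheap model is genuinely good enough
print(route("contract_analysis"))  # -> large: downgrading here is what broke accuracy
```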

Spec-to-Eval: Translating Product Requirements into Falsifiable LLM Criteria

· 9 min read
Tian Pan
Software Engineer

Most AI features are specified in prose and evaluated in prose. The PM writes "the assistant should respond helpfully and avoid harmful content." The engineer ships a prompt that, at demo time, produces output that seems to match. The team agrees at standup. They disagree at launch — when edge cases surface, when different engineers assess the same output differently, and when "helpful" turns out to mean seven different things depending on who's reviewing.

This isn't a tooling problem. It's a translation problem. The spec stayed abstract; the evaluation criteria were never made concrete. Spec-to-eval is the discipline of converting English requirements into falsifiable criteria before you write a single prompt — and doing it upfront changes everything about how fast you iterate.
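Falsifiable means a check that can fail on a concrete example. A sketch of what the prose requirement above might decompose into, with illustrative checks and thresholds:

```python
# Sketch: turning "respond helpfully and avoid harmful content" into criteria
# that can actually fail. The checks and thresholds are illustrative, and the
# helpers stand in for whatever judges or classifiers you build.

def contains_refusal(text: str) -> bool:
    return any(p in text.lower() for p in ("i can't help", "i cannot help"))

def cites_source(text: str) -> bool:
    return "[source:" in text.lower()

def evaluate(question: str, answer: str) -> dict[str, bool]:
    return {
        # "Helpful" pinned down: answers the question rather than refusing.
        "answers_instead_of_refusing": not contains_refusal(answer),
        # "Accurate" pinned down: factual claims carry a citation marker.
        "cites_a_source": cites_source(answer),
        # "Concise" pinned down: under 120 words unless detail was requested.
        "under_length_budget": len(answer.split()) <= 120,
    }

# Launch criterion, stated before the prompt exists:
# at least 95% of eval-set answers must pass every check above.
```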

The Ambient AI Coherence Problem: When Every Feature Is AI-Powered, Nothing Feels Like One Product

· 9 min read
Tian Pan
Software Engineer

Most AI products get the individual features right and the product wrong. Search returns plausible results. The summary is coherent. The chat assistant gives reasonable advice. But when a user searches for "best plan for small teams," gets a recommendation in the sidebar, asks the assistant a follow-up question, and then reads an auto-generated summary of their options — and all four contradict each other — none of the features feel trustworthy anymore. This is the ambient AI coherence problem: not hallucination in isolation, but contradiction at the product level.

The failure mode is subtle enough that teams often miss it entirely. Individual feature evals look fine. The search team measures recall and precision. The summarization team measures faithfulness. The chat team measures task completion. Nobody measures whether the AI-powered features of the product tell the same story about the same facts.
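A sketch of what that missing measurement could look like, with hypothetical surfaces and a placeholder extraction step: ask every feature for the same underlying fact and diff the claims.

```python
# Sketch of a product-level coherence check: collect the concrete claim each
# AI surface makes about the same question and flag disagreement. The surfaces
# and the extraction step are hypothetical; a real system would normalize
# claims with an LLM judge rather than string matching.

def extract_recommended_plan(surface_output: str) -> str:
    """Placeholder for whatever pulls the concrete claim out of a feature's output."""
    return surface_output.strip().lower()

def coherence_check(outputs_by_surface: dict[str, str]) -> list[str]:
    claims = {s: extract_recommended_plan(o) for s, o in outputs_by_surface.items()}
    if len(set(claims.values())) <= 1:
        return []
    return [f"{surface} says {claim!r}" for surface, claim in claims.items()]

conflicts = coherence_check({
    "search":  "Team plan",
    "sidebar": "Business plan",
    "chat":    "Team plan",
    "summary": "Enterprise plan",
})
print(conflicts or "coherent")   # three different answers to one question
```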

The Inference Cost Paradox: Why Your AI Bill Goes Up as Models Get Cheaper

· 10 min read
Tian Pan
Software Engineer

In 2021, GPT-3 cost $60 per million tokens. By early 2026, you could buy equivalent performance for $0.06. That is a 1,000x reduction in three years. During the same period, enterprise AI spending grew 320%, from $11.5 billion to $37 billion. The organizations spending the most on AI are overwhelmingly the ones that benefited most from falling prices.

This is not a contradiction. It is the Jevons Paradox, and it is running your AI budget.
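The arithmetic is worth seeing once. A back-of-the-envelope sketch with illustrative numbers: when cheaper inference unlocks more than proportionally more usage, the bill rises even as the unit price collapses.

```python
# Back-of-the-envelope Jevons arithmetic with illustrative numbers:
# price per token falls 1,000x, but cheaper inference unlocks workloads
# (agents, long context, always-on features) that grow usage even faster.

price_2021 = 60.00      # $ per million tokens
price_2026 = 0.06       # $ per million tokens, roughly 1,000x cheaper

tokens_2021 = 2_000          # million tokens/month a product could justify at $60
tokens_2026 = 2_000 * 3_500  # usage grows 3,500x once inference is nearly free

bill_2021 = price_2021 * tokens_2021
bill_2026 = price_2026 * tokens_2026

print(f"2021 bill: ${bill_2021:,.0f}/month")   # $120,000
print(f"2026 bill: ${bill_2026:,.0f}/month")   # $420,000: cheaper tokens, bigger bill
```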

The Inference-Time Personalization Trap: When User Context Costs More Than It Earns

· 9 min read
Tian Pan
Software Engineer

There's a pattern that shows up in nearly every AI product once it hits a few hundred thousand active users: the team adds personalization — injects user history, preference signals, behavioral data into every prompt — and watches the product get slightly better while the infrastructure bill gets significantly worse. When they finally pull the logs and measure the quality delta per added token, the curve is almost always the same shape: steep gains early, then a long plateau, then diminishing returns you're paying full price for.

Most teams don't run that analysis until they're already in the hole. This post is about why the trap exists, where personalization stops paying, and what the architectures that actually work look like in production.
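The analysis itself is cheap to run. A sketch with invented data points: measure quality at a few context budgets, then compare the marginal value of each additional slice of context against its marginal cost.

```python
# Sketch of a marginal-returns analysis on personalization context.
# The (tokens, quality) points and the two unit-economics constants are
# invented; in practice they come from an offline eval run at several
# context-budget levels plus your own cost and value model.

measurements = [
    (0,    0.70),   # no personalization
    (250,  0.78),   # recent activity only
    (1000, 0.82),   # plus preference signals
    (4000, 0.83),   # plus long-tail behavioral history
    (8000, 0.832),  # plus everything we have
]

COST_PER_TOKEN = 2e-6            # blended input-token cost, illustrative
VALUE_PER_QUALITY_POINT = 0.004  # value of +0.01 quality per request, illustrative

for (t0, q0), (t1, q1) in zip(measurements, measurements[1:]):
    marginal_cost = (t1 - t0) * COST_PER_TOKEN
    marginal_value = (q1 - q0) / 0.01 * VALUE_PER_QUALITY_POINT
    verdict = "worth it" if marginal_value > marginal_cost else "paying for a plateau"
    print(f"{t0:>5} -> {t1:>5} tokens: value ${marginal_value:.4f} "
          f"vs cost ${marginal_cost:.4f}  ({verdict})")
```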

The LLM Forgery Problem: When Your Model Builds a Convincing Case for the Wrong Answer

· 10 min read
Tian Pan
Software Engineer

Your model wrote a detailed, well-structured analysis. Every sentence was grammatically correct and internally consistent. The individual facts it cited were accurate. And yet the conclusion was wrong — not because the model lacked the information to get it right, but because it had already decided on the answer before it started reasoning.

This is not hallucination. Hallucination is when a model fabricates facts. The forgery problem is subtler and, in production systems, harder to catch: the model reaches a conclusion first, then constructs a plausible-sounding chain of evidence to support it. The facts are real. The synthesis is a lie.

Engineers who haven't encountered this failure mode yet will. It shows up in every domain where LLMs are asked to do analysis — code review, document summarization, risk assessment, question answering over a knowledge base. The model sounds authoritative. It cites real evidence. And it has quietly ignored everything that pointed the other way.

The Metered AI Pricing Death Spiral: Why Per-Token Billing Punishes Your Best Features

· 8 min read
Tian Pan
Software Engineer

Token costs dropped 280x in two years. Enterprise AI bills went up 320%. If that sounds like a paradox, you haven't looked closely at how per-token billing interacts with the features that actually make AI products valuable.

The most useful AI workflows — deep research, multi-step reasoning, iterative refinement, agentic tool use — are precisely the ones that consume the most tokens. Under pure usage-based pricing, your best features are your worst margin killers. This isn't a temporary scaling problem. It's a structural misalignment between how AI creates value and how it gets billed.
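One common shape of the problem, sketched with invented numbers: inference is metered per token upstream while revenue is a flat seat price, so the features that deliver the most value carry the worst margins.

```python
# Sketch of the margin math behind the death spiral (all numbers invented):
# with flat seat revenue and metered upstream inference, high-token features
# quietly consume the margin that low-token features earn.

PRICE_PER_SEAT = 30.00      # $/month flat subscription
COST_PER_M_TOKENS = 3.00    # blended inference cost, $ per million tokens

features = {
    # feature: million tokens consumed per active user per month
    "autocomplete":     0.4,
    "chat_qna":         2.0,
    "deep_research":   25.0,   # the feature users love most
    "agentic_workflow": 60.0,
}

for name, m_tokens in features.items():
    cost = m_tokens * COST_PER_M_TOKENS
    margin = PRICE_PER_SEAT - cost
    print(f"{name:>16}: inference ${cost:>7.2f}/user  ->  margin ${margin:>8.2f}")
```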

The Requirements Gap: How to Write Specs for AI Features When 'Correct' Is a Distribution

· 10 min read
Tian Pan
Software Engineer

Here is a spec that ships broken AI features on a predictable schedule: "The assistant should accurately answer customer questions and maintain a helpful tone." Every stakeholder nodded, the PRD was approved, and six months later the team is arguing in a post-mortem about whether an 87% accuracy rate was acceptable — a question nobody thought to answer before launch.

The failure is not technical. The model may have been fine. The failure is that the requirements format imported directly from traditional software left no room for the defining property of AI outputs: they are probabilistic. "Correct" is not a state; it is a distribution. And you cannot specify a distribution with a user story.
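A sketch of what a distributional requirement can look like instead, with hypothetical thresholds and slices: a pass rate over a fixed eval set, a floor on the worst segment, and a cap on severe errors, all decided before launch.

```python
# Sketch: an AI requirement written as a distribution rather than a state.
# Thresholds and slice definitions are hypothetical placeholders.

REQUIREMENT = {
    "overall_pass_rate": 0.90,   # at least 90% of eval cases graded correct
    "worst_slice_floor": 0.80,   # no customer segment below 80%
    "max_severe_errors": 0.01,   # at most 1% of cases with a severity-1 error
}

def check_release(results: list[dict]) -> bool:
    """results: [{'slice': str, 'correct': bool, 'severe_error': bool}, ...]"""
    overall = sum(r["correct"] for r in results) / len(results)
    severe = sum(r["severe_error"] for r in results) / len(results)
    worst = min(
        sum(r["correct"] for r in results if r["slice"] == s)
        / sum(1 for r in results if r["slice"] == s)
        for s in {r["slice"] for r in results}
    )
    return (overall >= REQUIREMENT["overall_pass_rate"]
            and worst >= REQUIREMENT["worst_slice_floor"]
            and severe <= REQUIREMENT["max_severe_errors"])
```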

The Second Opinion Economy: When Dual-Model Verification Actually Pays Off

· 10 min read
Tian Pan
Software Engineer

The most seductive idea in AI engineering is that you can make any LLM system more reliable by running a second LLM to check the first one's work. On paper, it's obvious. In practice, teams that deploy this pattern naively often end up with 2x inference costs and a false sense of confidence — their "verification" is just the original model's biases running twice.

Done right, dual-model verification produces real accuracy gains: 6–18% on reasoning tasks, measurable improvements in RAG faithfulness, and meaningful catches in code correctness. Done wrong, two models agreeing on the same wrong answer is worse than one model failing, because now you've also disabled your uncertainty signal.

This post is about knowing the difference.
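As a preview, here is a minimal sketch of the pattern with the uncertainty signal kept intact; the `ask` function and model names are placeholders, and disagreement routes to escalation instead of being averaged away.

```python
# Sketch of dual-model verification that keeps disagreement as a signal.
# `ask` is a placeholder for your model-calling code; model names are illustrative,
# and the verifier should come from a different model family than the primary.
from typing import Callable

def verified_answer(
    question: str,
    ask: Callable[[str, str], str],   # (model, prompt) -> answer
    primary: str = "model-a",
    verifier: str = "model-b",
) -> dict:
    draft = ask(primary, question)
    critique_prompt = (
        f"Question: {question}\n"
        f"Proposed answer: {draft}\n"
        "Independently solve the question, then state AGREE or DISAGREE "
        "with the proposed answer."
    )
    review = ask(verifier, critique_prompt)

    if "DISAGREE" in review.upper():
        # The valuable output of verification is not a corrected answer;
        # it is an honest "we don't know" that a human or stronger model resolves.
        return {"answer": None, "status": "escalate", "draft": draft, "review": review}
    return {"answer": draft, "status": "verified", "review": review}
```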