
678 posts tagged with "ai-engineering"


The Selective Abstention Problem: Why AI Systems That Always Answer Are Broken

· 10 min read
Tian Pan
Software Engineer

Here is a pattern that appears in almost every production AI deployment: the team ships a feature that handles 90% of queries well. Then they start getting complaints. A user asked something outside the training distribution; the model confidently produced a wrong answer. A RAG pipeline retrieved a stale document; the model answered as though it were current. A legal query hit an edge case the prompt didn't cover; the model speculated its way through it. The fix, in each case, wasn't a better model. It was teaching the system to say "I don't know."

Abstention — the principled decision to not answer — is one of the hardest and most undervalued capabilities in AI system design. Virtually all product effort goes toward making answers better. Almost none goes toward making the system reliably know when to withhold one. That asymmetry is a design debt that compounds in production.
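As a concrete illustration (a minimal sketch, not code from the post), an abstention gate can be as simple as a wrapper that withholds the answer when a self-consistency signal is weak; the threshold and the agreement heuristic below are assumptions you would tune against labeled "should have abstained" data.

```python
from collections import Counter

CONFIDENCE_THRESHOLD = 0.7  # assumption: tune against labeled abstention data

def answer_or_abstain(sampled_answers: list[str]) -> tuple[str | None, float]:
    """Return (answer, confidence), with answer=None when the system should abstain.

    Confidence here is a crude self-consistency signal: sample the model several
    times and measure how often the answers agree. Low agreement is treated as
    "the system does not know".
    """
    counts = Counter(a.strip().lower() for a in sampled_answers)
    best_answer, best_count = counts.most_common(1)[0]
    confidence = best_count / len(sampled_answers)
    if confidence < CONFIDENCE_THRESHOLD:
        return None, confidence  # abstain: surface "I don't know" to the caller
    return best_answer, confidence

# 3 of 5 samples agree -> confidence 0.6 -> abstain rather than guess
print(answer_or_abstain(["Paris", "Paris", "paris", "Lyon", "Marseille"]))
```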

Staffing AI Engineering Teams: Who Owns What When Every Feature Has an AI Component

· 11 min read
Tian Pan
Software Engineer

Three years ago, "AI team" meant a group of specialists tucked into a corner of the org chart, mostly invisible to product engineers. Today, a senior software engineer at a fintech company ships a fraud-scoring feature using a fine-tuned model on Monday, wires up a RAG pipeline for customer support on Wednesday, and debugs LLM latency on Friday. The specialists didn't go away—but the boundary between "AI work" and "product engineering" dissolved faster than almost anyone planned for.

Most teams responded by bolting new titles onto existing job descriptions and calling it done. That's the wrong answer, and the dysfunction shows up quickly: unclear ownership, duplicated tooling, and an ML platform team that spends half its time explaining why product teams can't just call the OpenAI API directly.

This post is about getting the structure right—not in the abstract, but for the actual stages of AI adoption most engineering organizations go through.

Your LLM Eval Is Lying to You: The Statistical Power Problem

· 9 min read
Tian Pan
Software Engineer

Your team spent three days iterating on a system prompt. The eval score went from 82% to 85%. You ship it. Three weeks later, production metrics are flat. What happened?

The short answer: your eval lied to you. Not through malice, but through insufficient sample size and ignored variance. A 3-point accuracy lift on a 100-example test set is well within the noise floor of most LLM systems. You cannot tell signal from randomness at that scale — but almost no one does the math to verify this before acting on results.
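A back-of-the-envelope check makes the claim concrete. The sketch below is illustrative only (the 82%/85% figures come from the scenario above): it computes normal-approximation confidence intervals for each run and the standard error of their difference.

```python
import math

def accuracy_ci(p: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% normal-approximation confidence interval for an accuracy estimate."""
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

n = 100                      # eval set size from the example above
before, after = 0.82, 0.85   # the "3-point lift"

print(accuracy_ci(before, n))  # ~ (0.745, 0.895)
print(accuracy_ci(after, n))   # ~ (0.780, 0.920)

# Standard error of the difference between the two runs (treated as independent):
se_diff = math.sqrt(before * (1 - before) / n + after * (1 - after) / n)
print(1.96 * se_diff)          # ~ 0.10 -> a 0.03 lift is deep inside the noise
```

At n = 100, each estimate carries roughly a ±7-point interval and the difference between runs roughly ±10 points, so a 3-point lift is indistinguishable from noise.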

This is the statistical power problem in LLM evaluation, and it is quietly corrupting the iteration loops of most teams building AI products.

The Curriculum Trap: Why Fine-Tuning on Your Best Examples Produces Mediocre Models

· 10 min read
Tian Pan
Software Engineer

Every fine-tuning effort eventually hits the same intuition: better data means better models, and better data means higher-quality examples. So teams build elaborate annotation pipelines to filter out the mediocre outputs, keep only the gold-standard responses, and train on a dataset they're proud of. The resulting model then underperforms on the exact use cases that motivated the project. This failure is so common it deserves a name: the curriculum trap.

The trap is this — curating only your best, most confident, most authoritative outputs doesn't teach the model to be better. It teaches the model to perform confidence regardless of whether confidence is warranted. You produce something that looks impressive in demos and falls apart in production, because production is full of the messy edge cases your curation process systematically excluded.

The Integration Test Mirage: Why Mocked Tool Outputs Hide Your Agent's Real Failure Modes

· 11 min read
Tian Pan
Software Engineer

Your agent passes every test. The CI pipeline is green. You ship it.

A week later, a user reports that their bulk-export job silently returned 200 records instead of 14,000. The agent hit the first page of a paginated API, got a clean response, assumed there was nothing more, and moved on. Your mock returned all 200 items in one shot. The real API never told the agent there were 69 more pages.

This is not a model failure. The model reasoned correctly. This is a test infrastructure failure — and it's endemic to how teams build and test agentic systems.
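To see how the fixture hides the bug, compare a one-shot mock with a pagination-aware fake (hypothetical names; a sketch, not the post's actual test suite). Only the second reproduces the cursor contract the real API exposes, so a completeness check can catch an agent that stops after the first page.

```python
def naive_mock_export(cursor=None):
    # What many test fixtures do: hand back everything at once, no pagination.
    return {"records": [{"id": i} for i in range(200)], "next_cursor": None}

def paginated_fake_export(cursor=None, page_size=200, total=14_000):
    # Closer to the real contract: one page per call plus a cursor for the next.
    start = int(cursor or 0)
    end = min(start + page_size, total)
    return {
        "records": [{"id": i} for i in range(start, end)],
        "next_cursor": str(end) if end < total else None,
    }

def agent_bulk_export(fetch_page):
    # The buggy behavior from the incident: take the first page and stop.
    return fetch_page()["records"]

print(len(agent_bulk_export(naive_mock_export)))      # 200, and 200 is "everything", so tests pass
print(len(agent_bulk_export(paginated_fake_export)))  # still 200, but the fake holds 14,000,
                                                      # so a completeness assertion now catches the bug
```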

The Overclaiming Trap: When Being Right for the Wrong Reasons Destroys AI Product Trust

· 10 min read
Tian Pan
Software Engineer

Most AI product post-mortems focus on the same story: the model was wrong, users noticed, trust eroded. The fix is obvious — improve accuracy. But there is a more insidious failure mode that post-mortems rarely capture because standard accuracy metrics don't surface it: the model was right, but for the wrong reasons, and the power users who checked the reasoning never came back.

Call it the overclaiming trap. It is the failure mode where correct final answers are backed by fabricated, retrofitted, or structurally unsound reasoning chains. It is more dangerous than ordinary wrongness because it looks like success until your most sophisticated users start quietly leaving.

Tokenizer Arithmetic: The Hidden Layer That Bites You in Production

· 10 min read
Tian Pan
Software Engineer

A team ships a JSON extraction pipeline. It works perfectly in development: 98% accuracy, clean structured output, predictable token counts. They push to production. The model starts hallucinating extra whitespace, the JSON parser chokes on malformed keys, and the API bill is 2.3x what the prototype suggested. The model hasn't changed. The prompts haven't changed.

The tokenizer changed — or more precisely, their assumptions about it were wrong from the start.

Tokenization is the first transformation your input undergoes and the last one engineers think about when debugging. Most teams treat it as a solved problem: text goes in, tokens come out, the model does its thing. But Byte Pair Encoding (BPE), the tokenization algorithm behind most production LLMs, makes decisions that cascade through structured output generation, prefix caching, cost estimation, and multilingual deployment in ways that are entirely predictable once you know to look.
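A small experiment shows the effect. The snippet below uses OpenAI's `tiktoken` library with the `cl100k_base` encoding as an example; your production model may use a different tokenizer, and the exact counts will vary.

```python
import json
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

payload = {"user_id": 12345, "status": "active", "notes": "renewal pending"}

compact = json.dumps(payload, separators=(",", ":"))
pretty = json.dumps(payload, indent=2)

# The same structured data tokenizes to different lengths depending on whitespace
# and formatting, which feeds directly into cost estimates and prefix-cache hits.
print(len(enc.encode(compact)), len(enc.encode(pretty)))

# Non-English text of similar meaning often costs more tokens under a BPE
# vocabulary trained mostly on English.
print(len(enc.encode("The invoice is overdue.")),
      len(enc.encode("Die Rechnung ist überfällig.")))
```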

The Trust Calibration Gap: Why AI Features Get Ignored or Blindly Followed

· 9 min read
Tian Pan
Software Engineer

You shipped an AI feature. The model is good — you measured it. Precision is 91%, recall is solid, the P99 latency is under 400ms. Three months later, product analytics tell a grim story: power users have turned it off entirely, while a different cohort is accepting every suggestion without changing a word, including the ones that are clearly wrong.

This is the trust calibration gap. It's not a model problem. It's a design problem — and it's more common than most AI product teams admit.

Zero-Downtime AI Deployments: It's a Distributed Systems Problem

· 10 min read
Tian Pan
Software Engineer

In April 2025, OpenAI shipped a system prompt update to GPT-4o. Within hours, users across ChatGPT's roughly 180-million-strong base noticed it had become obsequiously flattering. The failure wasn't caught by monitoring. It was caught by Twitter. Rollback took three days.

That incident revealed something the AI industry had been quietly avoiding: prompt changes are production deployments. And most teams treat them like config file edits.

The core problem with AI deployments is that you're not deploying one thing — you're deploying four: model weights, prompt text, tool schemas, and the context structure they all assume. Each can drift independently. Each can be partially rolled out. And unlike a broken API endpoint, AI failures are often probabilistic, gradual, and invisible until they've already affected a large fraction of your traffic.
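One way to make those four components deploy and roll back as a unit is to version them together. The sketch below uses hypothetical field names rather than a prescribed schema; it fingerprints the whole bundle so rollout, monitoring, and rollback all reference the same artifact.

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class AIDeployment:
    """Hypothetical sketch: pin all four moving parts as one versioned artifact."""
    model_id: str                  # a specific dated model snapshot
    prompt_sha256: str             # hash of the exact prompt text being shipped
    tool_schema_version: str
    context_template_version: str

    def fingerprint(self) -> str:
        # One identifier for the whole bundle, so a partial rollout or rollback
        # can never mix the prompt from one release with the tools of another.
        blob = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()[:12]

prompt_text = "You are a support assistant..."  # placeholder
deployment = AIDeployment(
    model_id="gpt-4o-2024-08-06",
    prompt_sha256=hashlib.sha256(prompt_text.encode()).hexdigest(),
    tool_schema_version="tools-v14",
    context_template_version="ctx-v3",
)
print(deployment.fingerprint())
```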

This is the distributed systems consistency problem, wearing an AI hat.

Agent Memory Garbage Collection: Engineering Strategic Forgetting at Scale

· 10 min read
Tian Pan
Software Engineer

Every production agent team eventually builds the same thing: a memory store that grows without bound, retrieval that degrades silently, and a frantic sprint to add forgetting after users report that the agent is referencing their old job, a deprecated API, or a project that was cancelled three months ago. The industry has poured enormous effort into giving agents memory. The harder engineering problem — garbage collecting that memory — is where the real production reliability lives.

The parallel to software garbage collection is more than metaphorical. Agent memory systems face the same fundamental tension: you need to reclaim resources (context budget, retrieval relevance) without destroying data that's still reachable (semantically relevant to future queries). The algorithms that solve this look surprisingly similar to the ones your runtime already uses.
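As a rough sketch of the analogy (illustrative scoring only, with made-up constants), a simple collector might score each memory by recency and retrieval frequency as a crude stand-in for reachability, then sweep away whatever falls below a liveness threshold.

```python
import time
from dataclasses import dataclass

HALF_LIFE_DAYS = 30.0   # assumption: how fast unused memories decay
EVICTION_SCORE = 0.05   # assumption: liveness threshold for the sweep

@dataclass
class Memory:
    text: str
    last_retrieved_at: float
    retrieval_count: int = 0

def liveness(memory: Memory, now: float) -> float:
    """Recently-used, frequently-used memories score high; stale ones decay toward zero."""
    age_days = (now - memory.last_retrieved_at) / 86_400
    recency = 0.5 ** (age_days / HALF_LIFE_DAYS)        # exponential decay
    frequency = 1 - 1 / (1 + memory.retrieval_count)    # saturating usage boost
    return recency * (0.5 + 0.5 * frequency)

def collect(store: list[Memory], now: float | None = None) -> list[Memory]:
    """Sweep phase: keep only memories above the liveness threshold."""
    if now is None:
        now = time.time()
    return [m for m in store if liveness(m, now) >= EVICTION_SCORE]
```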

Your Code Review Process Is Optimized for the Wrong Failure Mode

· 8 min read
Tian Pan
Software Engineer

Your code review checklist was designed for a world where the primary defect was a misplaced semicolon or a forgotten null check. That world is gone. AI-generated code rarely has typos. It almost always compiles. And it is quietly degrading your codebase in ways your review process was never built to catch.

Analysis of hundreds of thousands of GitHub pull requests reveals that AI-generated code creates 1.7x more issues than human-written code — roughly 10.8 issues per PR versus 6.5. But the defect distribution has shifted fundamentally. Logic errors are up 75%. Performance issues appear nearly 8x more often. Security vulnerabilities are 1.5–2x more frequent. The bugs that matter most are exactly the ones your traditional review gates miss.

Data Provenance for AI Systems: Why Tracking Answer Origins Is Now an Engineering Requirement

· 10 min read
Tian Pan
Software Engineer

A production LLM answers a user's question incorrectly. A support ticket arrives. You pull the logs. They show the prompt, the completion, and the latency — but nothing about which documents the retrieval system surfaced, which chunks landed in the context window, or which passage the model leaned on most heavily when it synthesized the answer. You're left doing archaeology: re-running the query against a corpus that has since been updated, hoping the same results come back, wondering if the bug is in retrieval, in chunking, in the document itself, or in the model's reasoning.
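One shape such a record could take (hypothetical fields, a sketch rather than a prescribed schema) is below: it captures the pieces the scenario above was missing, so a support ticket can be replayed against the exact retrieval state instead of a reconstruction of it.

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass
class ProvenanceRecord:
    """Hypothetical shape of what the logs above were missing."""
    request_id: str
    corpus_snapshot: str              # version/commit of the corpus at query time
    retrieved_doc_ids: list[str]      # what the retriever surfaced, in rank order
    context_chunk_hashes: list[str]   # exactly which chunks landed in the window
    prompt_sha256: str
    completion_sha256: str
    timestamp: float

def chunk_hash(chunk_text: str) -> str:
    return hashlib.sha256(chunk_text.encode()).hexdigest()[:16]

def log_provenance(record: ProvenanceRecord, sink) -> None:
    # Append-only JSON lines, written alongside the prompt/completion log.
    sink.write(json.dumps(asdict(record)) + "\n")
```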

This is the data provenance gap, and most AI teams don't notice it until they're already in it.