780 posts tagged with "ai-engineering"

Tokenizer Arithmetic: The Hidden Layer That Bites You in Production

· 10 min read
Tian Pan
Software Engineer

A team ships a JSON extraction pipeline. It works perfectly in development: 98% accuracy, clean structured output, predictable token counts. They push to production. The model starts hallucinating extra whitespace, the JSON parser chokes on malformed keys, and the API bill is 2.3x what the prototype suggested. The model hasn't changed. The prompts haven't changed.

The tokenizer changed — or more precisely, their assumptions about it were wrong from the start.

Tokenization is the first transformation your input undergoes and the last one engineers think about when debugging. Most teams treat it as a solved problem: text goes in, tokens come out, the model does its thing. But Byte Pair Encoding (BPE), the tokenization algorithm behind most production LLMs, makes decisions that cascade through structured output generation, prefix caching, cost estimation, and multilingual deployment in ways that are entirely predictable once you know to look.
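To make the whitespace effect concrete, here is a toy sketch of BPE-style encoding. The merge table is hypothetical (hand-written as if it had been learned from compact JSON) and is far smaller than any real vocabulary; real tokenizers also pre-split text and work on bytes, so treat this as an illustration of the mechanism rather than a model of any production tokenizer.

```python
# Hypothetical merge table in priority order, as if learned from compact JSON.
MERGES = [
    ("i", "d"),
    ('"', "id"),
    ('"id', '"'),
    ('"id"', ":"),
    ("{", '"id":'),
]


def apply_merge(tokens, pair):
    """Replace every adjacent occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def encode(text, merges=MERGES):
    """Start from single characters and apply each merge in priority order."""
    tokens = list(text)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


compact = encode('{"id":7}')
spaced = encode('{ "id" : 7 }')
print(len(compact), compact)
print(len(spaced), spaced)
```

In this toy setup the spaced variant is only four characters longer, but the spaces interrupt the later merges, so it costs six more tokens than the compact form.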

The Trust Calibration Gap: Why AI Features Get Ignored or Blindly Followed

· 9 min read
Tian Pan
Software Engineer

You shipped an AI feature. The model is good — you measured it. Precision is 91%, recall is solid, the P99 latency is under 400ms. Three months later, product analytics tell a grim story: power users have turned it off entirely, while a different cohort is accepting every suggestion without changing a word, including the ones that are clearly wrong.

This is the trust calibration gap. It's not a model problem. It's a design problem — and it's more common than most AI product teams admit.

Zero-Downtime AI Deployments: It's a Distributed Systems Problem

· 10 min read
Tian Pan
Software Engineer

In April 2025, OpenAI shipped a system prompt update to GPT-4o. Within hours, 180 million users noticed ChatGPT had become obsequiously flattering. The failure wasn't caught by monitoring. It was caught by Twitter. Rollback took three days.

That incident revealed something the AI industry had been quietly avoiding: prompt changes are production deployments. And most teams treat them like config file edits.

The core problem with AI deployments is that you're not deploying one thing — you're deploying four: model weights, prompt text, tool schemas, and the context structure they all assume. Each can drift independently. Each can be partially rolled out. And unlike a broken API endpoint, AI failures are often probabilistic, gradual, and invisible until they've already affected a large fraction of your traffic.

This is the distributed systems consistency problem, wearing an AI hat.
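One way to treat those four artifacts as a single deployable unit is to fingerprint them together, so a change to any one of them yields a new release identifier. The sketch below is an assumption about how that could look; the `Release` class, its field names, and the example values are hypothetical and not taken from any specific deployment tool.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Release:
    """Hypothetical bundle of the four things that ship together."""
    model_id: str
    prompt_text: str
    tool_schemas: tuple  # JSON schema strings, one per tool
    context_layout: str  # e.g. the ordering of context sections

    def fingerprint(self) -> str:
        # Canonical JSON so the same content always hashes to the same value.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


def changed_components(old: Release, new: Release) -> list[str]:
    """Name the fields that differ between two releases."""
    before, after = asdict(old), asdict(new)
    return [name for name in before if before[name] != after[name]]


v1 = Release("model-a", "You are a helpful assistant.", ('{"name": "search"}',), "system,tools,history")
v2 = Release("model-a", "You are a very helpful assistant.", ('{"name": "search"}',), "system,tools,history")
print(v1.fingerprint() != v2.fingerprint(), changed_components(v1, v2))
```

Logging the fingerprint with every request makes it possible to tell which combination of model, prompt, tools, and context layout actually served a given response during a partial rollout.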

Agent Memory Garbage Collection: Engineering Strategic Forgetting at Scale

· 10 min read
Tian Pan
Software Engineer

Every production agent team eventually builds the same thing: a memory store that grows without bound, retrieval that degrades silently, and a frantic sprint to add forgetting after users report that the agent is referencing their old job, a deprecated API, or a project that was cancelled three months ago. The industry has poured enormous effort into giving agents memory. The harder engineering problem — garbage collecting that memory — is where the real production reliability lives.

The parallel to software garbage collection is more than metaphorical. Agent memory systems face the same fundamental tension: you need to reclaim resources (context budget, retrieval relevance) without destroying data that's still reachable (semantically relevant to future queries). The algorithms that solve this look surprisingly similar to the ones your runtime already uses.
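As one possible reading of that analogy, the sketch below borrows mark-and-sweep: recently used or pinned memories act as roots, anything they reference stays live, and the rest becomes eligible for eviction. The `Memory` structure, the `refs` field, and the idle-day threshold are all assumptions made for illustration, not a description of any particular agent framework.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    key: str
    last_used_day: int
    refs: list[str] = field(default_factory=list)  # keys this memory depends on


def collect(memories: dict[str, Memory], pinned: set[str], today: int, max_idle_days: int) -> set[str]:
    """Return the keys that are unreachable from any root and can be evicted."""
    # Mark phase: roots are pinned memories plus anything used recently.
    stack = [
        key for key, mem in memories.items()
        if key in pinned or today - mem.last_used_day <= max_idle_days
    ]
    live: set[str] = set()
    while stack:
        key = stack.pop()
        if key in live or key not in memories:
            continue
        live.add(key)
        stack.extend(memories[key].refs)
    # Sweep phase: everything not marked is garbage.
    return set(memories) - live


store = {
    "current_employer": Memory("current_employer", last_used_day=200),
    "old_job": Memory("old_job", last_used_day=10),
    "active_project": Memory("active_project", last_used_day=195, refs=["api_notes"]),
    "api_notes": Memory("api_notes", last_used_day=20),
}
print(collect(store, pinned=set(), today=200, max_idle_days=30))
```

Here `api_notes` survives despite being stale because a live memory still references it, while `old_job` has no path from any root and is swept.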

Your Code Review Process Is Optimized for the Wrong Failure Mode

· 8 min read
Tian Pan
Software Engineer

Your code review checklist was designed for a world where the primary defect was a misplaced semicolon or a forgotten null check. That world is gone. AI-generated code rarely has typos. It almost always compiles. And it is quietly degrading your codebase in ways your review process was never built to catch.

Analysis of hundreds of thousands of GitHub pull requests reveals that AI-generated code produces roughly 1.7x as many issues as human-written code (about 10.8 issues per PR versus 6.5). But the defect distribution has shifted fundamentally. Logic errors are up 75%. Performance issues appear nearly 8x more often. Security vulnerabilities are 1.5–2x more frequent. The bugs that matter most are exactly the ones your traditional review gates miss.

Data Provenance for AI Systems: Why Tracking Answer Origins Is Now an Engineering Requirement

· 10 min read
Tian Pan
Software Engineer

A production LLM answers a user's question incorrectly. A support ticket arrives. You pull the logs. They show the prompt, the completion, and the latency — but nothing about which documents the retrieval system surfaced, which chunks landed in the context window, or which passage the model leaned on most heavily when it synthesized the answer. You're left doing archaeology: re-running the query against a corpus that has since been updated, hoping the same results come back, wondering if the bug is in retrieval, in chunking, in the document itself, or in the model's reasoning.

This is the data provenance gap, and most AI teams don't notice it until they're already in it.
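A minimal sketch of what closing that gap could look like: alongside the prompt and completion, log which chunks were retrieved, from which document version, and a hash of the exact text the model saw. The `RetrievedChunk` fields and the `provenance_record` helper are hypothetical names chosen for this example.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievedChunk:
    doc_id: str
    doc_version: str
    chunk_index: int
    score: float
    text: str


def provenance_record(query: str, chunks: list[RetrievedChunk], completion: str) -> dict:
    """Build a JSON-serializable log entry tying an answer to its sources."""
    return {
        "query": query,
        "completion": completion,
        "chunks": [
            {
                "doc_id": c.doc_id,
                "doc_version": c.doc_version,
                "chunk_index": c.chunk_index,
                "score": c.score,
                # Hash of the exact text placed in context, so later corpus
                # edits can be detected without storing the full chunk.
                "text_sha256": hashlib.sha256(c.text.encode("utf-8")).hexdigest(),
            }
            for c in chunks
        ],
    }


chunk = RetrievedChunk("refund-policy", "v3", 2, 0.87, "Refunds are issued within 14 days.")
record = provenance_record("How long do refunds take?", [chunk], "Within 14 days.")
print(json.dumps(record, indent=2))
```

With a record like this, a support ticket can be traced to a specific document version and chunk instead of re-running the query against a corpus that has since moved on.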

Goodhart's Law in Your LLM Eval Suite: When Optimizing the Score Breaks the System

· 9 min read
Tian Pan
Software Engineer

Andrej Karpathy put it bluntly: AI labs were "overfitting" to Arena rankings. One major lab privately evaluated 27 model variants before their public release, publishing only the top performer. Researchers estimated that selective submission alone could artificially inflate leaderboard scores by up to 112%. The crowdsourced evaluation system that everyone pointed to as ground truth had become a target — and once it became a target, it stopped being a useful measure.

This is Goodhart's Law in action: when a measure becomes a target, it ceases to be a good measure. It's been well-understood in economics and policy for decades. In LLM engineering, it's actively destroying eval suites right now, often without the teams building them realizing it.
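The selection effect is easy to reproduce in a small simulation. The sketch below assumes every variant has the same true skill and that each evaluation adds Gaussian noise; both the noise model and the parameter values are assumptions for illustration, and the output is not meant to reproduce the figures quoted above.

```python
import random
import statistics


def best_of_n(true_score, noise_sd, n_variants, rng):
    """Score n equally skilled variants with noisy evals and report only the best."""
    return max(rng.gauss(true_score, noise_sd) for _ in range(n_variants))


def mean_inflation(n_variants, trials=1000, true_score=0.0, noise_sd=1.0, seed=0):
    """Average gap between the published (best) score and the true score."""
    rng = random.Random(seed)
    best = [best_of_n(true_score, noise_sd, n_variants, rng) for _ in range(trials)]
    return statistics.mean(best) - true_score


print("publish every variant:", round(mean_inflation(1), 2))
print("publish best of 27:  ", round(mean_inflation(27), 2))
```

Nothing about the underlying model improves between the two runs; the only difference is how many noisy measurements were taken before choosing which one to publish.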

Machine-Readable Project Context: Why Your CLAUDE.md Matters More Than Your Model

· 8 min read
Tian Pan
Software Engineer

Most teams that adopt AI coding agents spend the first week arguing about which model to use. They benchmark Opus vs. Sonnet vs. GPT-4o on contrived examples, obsess over the leaderboard, and eventually pick something. Then they spend the next three months wondering why the agent keeps rebuilding the wrong abstractions, ignoring their test strategy, and repeatedly asking which package manager to use.

The model wasn't the problem. The context file was.

Every AI coding tool — Claude Code, Cursor, GitHub Copilot, Windsurf — reads a project-specific markdown file at the start of each session. These files go by different names: CLAUDE.md, .cursor/rules/, .github/copilot-instructions.md, AGENTS.md. But they share the same purpose: teaching the agent what it cannot infer from reading the code alone. The quality of this file now predicts output quality more reliably than the model behind it. Yet most teams write them once, badly, and never touch them again.

Measuring Real AI Coding Productivity: The Metrics That Survive the 90-Day Lag

· 9 min read
Tian Pan
Software Engineer

Most teams adopting AI coding tools hit the same wall. Month one looks like a success story: PR throughput is up, sprint velocity is climbing, and the engineering manager is putting together a slide deck to share with leadership. By month three, something has quietly gone wrong. Incidents creep up. Senior engineers are spending more time in review. A simple bug fix now requires understanding code nobody on the team actually wrote. The productivity gains have evaporated — but the measurement system never caught it.

The problem is that the metrics most teams reach for first — lines generated, PRs merged, story points burned — are the wrong unit of measurement for AI-assisted development. They measure the cost of producing code, not the cost of owning it. And AI has made production nearly free while leaving ownership costs untouched.

Quality-Aware Model Routing: Why Optimizing for Cost Alone Wrecks Your AI Product

· 9 min read
Tian Pan
Software Engineer

Every team that ships LLM routing starts the same way: sort models by price, send easy queries to the cheap one, hard queries to the expensive one, celebrate the 60% cost reduction. Six weeks later, someone notices that contract analysis accuracy dropped from 94% to 79%, the coding assistant started hallucinating API endpoints that don't exist, and customer satisfaction on complex support tickets fell off a cliff — all while the routing dashboard showed "95% quality maintained."

The problem isn't routing itself. Cost-optimized routing treats all quality degradation as equal, when in practice the queries you're downgrading are disproportionately the ones where quality matters most.
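A short sketch shows how an aggregate quality number can hide this. The segment names and counts below are made up for illustration; the point is only that a healthy overall average is compatible with a sharp drop in a small, high-stakes segment.

```python
from collections import defaultdict


def accuracy_by_segment(results):
    """results: iterable of (segment, correct) pairs."""
    totals = defaultdict(lambda: [0, 0])  # segment -> [hits, count]
    for segment, correct in results:
        totals[segment][0] += int(correct)
        totals[segment][1] += 1
    return {segment: hits / count for segment, (hits, count) in totals.items()}


def overall_accuracy(results):
    results = list(results)
    return sum(int(correct) for _, correct in results) / len(results)


# Illustrative data: 900 easy queries, 100 contract-analysis queries.
results = (
    [("simple_faq", True)] * 873 + [("simple_faq", False)] * 27
    + [("contract_analysis", True)] * 79 + [("contract_analysis", False)] * 21
)

print("overall:", overall_accuracy(results))
print("by segment:", accuracy_by_segment(results))
```

With these numbers the dashboard would report about 95% overall while contract analysis sits at 79%, which is why a routing policy needs per-segment quality thresholds rather than a single global one.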

Spec-to-Eval: Translating Product Requirements into Falsifiable LLM Criteria

· 9 min read
Tian Pan
Software Engineer

Most AI features are specified in prose and evaluated in prose. The PM writes "the assistant should respond helpfully and avoid harmful content." The engineer ships a prompt that, at demo time, produces output that seems to match. The team agrees at standup. They disagree at launch — when edge cases surface, when different engineers assess the same output differently, and when "helpful" turns out to mean seven different things depending on who's reviewing.

This isn't a tooling problem. It's a translation problem. The spec stayed abstract; the evaluation criteria were never made concrete. Spec-to-eval is the discipline of converting English requirements into falsifiable criteria before you write a single prompt — and doing it upfront changes everything about how fast you iterate.
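As a rough sketch of what "falsifiable" can mean in practice, each requirement becomes a named check that returns true or false for a given output. The three criteria below are hypothetical examples for a support assistant, not criteria from any real spec, and real suites would usually mix checks like these with model-graded ones.

```python
import re
from typing import Callable, NamedTuple


class Criterion(NamedTuple):
    name: str
    check: Callable[[str], bool]


# Hypothetical criteria replacing "respond helpfully and avoid harmful content".
CRITERIA = [
    Criterion("at most 120 words", lambda out: len(out.split()) <= 120),
    Criterion(
        "contains a numbered step list",
        lambda out: re.search(r"^\s*1[.)]\s", out, re.MULTILINE) is not None,
    ),
    Criterion("never mentions passwords", lambda out: "password" not in out.lower()),
]


def evaluate(output: str) -> dict[str, bool]:
    """Run every criterion and report pass/fail by name."""
    return {criterion.name: criterion.check(output) for criterion in CRITERIA}


good = "To reset your device:\n1. Open Settings\n2. Choose Reset"
bad = "Please send me your password and I will fix it."
print(evaluate(good))
print(evaluate(bad))
```

Two engineers running `evaluate` on the same output get the same verdict, which is the property the prose spec never had.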

The Ambient AI Coherence Problem: When Every Feature Is AI-Powered, Nothing Feels Like One Product

· 9 min read
Tian Pan
Software Engineer

Most AI products get the individual features right and the product wrong. Search returns plausible results. The summary is coherent. The chat assistant gives reasonable advice. But when a user searches for "best plan for small teams," gets a recommendation in the sidebar, asks the assistant a follow-up question, and then reads an auto-generated summary of their options — and all four contradict each other — none of the features feel trustworthy anymore. This is the ambient AI coherence problem: not hallucination in isolation, but contradiction at the product level.

The failure mode is subtle enough that teams often miss it entirely. Individual feature evals look fine. The search team measures recall and precision. The summarization team measures faithfulness. The chat team measures task completion. Nobody measures whether the AI-powered features of the product tell the same story about the same facts.
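One way to start measuring this, sketched under the assumption that each feature's output can be reduced to structured claims, is to group claims by subject and attribute and flag any pair where features disagree. The claim format and the feature names here are hypothetical.

```python
from collections import defaultdict


def find_contradictions(claims):
    """claims: iterable of (feature, subject, attribute, value) tuples.

    Returns {(subject, attribute): {value: [features]}} for every
    subject/attribute pair where features reported different values.
    """
    seen = defaultdict(dict)
    for feature, subject, attribute, value in claims:
        seen[(subject, attribute)].setdefault(value, []).append(feature)
    return {key: values for key, values in seen.items() if len(values) > 1}


claims = [
    ("search", "small teams", "recommended_plan", "Team"),
    ("sidebar", "small teams", "recommended_plan", "Starter"),
    ("chat", "small teams", "recommended_plan", "Team"),
    ("summary", "Team", "seat_limit", 10),
    ("chat", "Team", "seat_limit", 10),
]
print(find_contradictions(claims))
```

The hard part in practice is extracting comparable claims from free-form output; once that exists, the cross-feature check itself is simple.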