Skip to main content

780 posts tagged with "ai-engineering"

View all tags

The Eval Budget Your CFO Cannot See on a Spreadsheet

· 8 min read
Tian Pan
Software Engineer

Open any quarterly planning spreadsheet and you can find every feature your team shipped, every contractor invoice, every cloud line item. What you will not find is a row for the outage that never happened, the hallucinated refund that was caught before it reached a customer, or the prompt regression that an eval blocked at 2 a.m. Those non-events have no SKU. They generate no ticket, no postmortem, no Slack thread. And so, when the eval budget comes up for renewal, it is competing for headcount against a feature that has a demo — and it loses, almost every time.

This is not a failure of nerve. It is a measurement problem. Eval investment behaves like a safety net and a test suite at the same time: it compounds quietly, it pays out in disasters avoided, and its entire value is counterfactual. Finance is structurally blind to counterfactuals. If you lead an AI team, your job is not to argue that evals are important — everyone already nods at that. Your job is to make a compounding, invisible return legible to people who only trust spreadsheets.

Hiring for AI Roles That Have No Career Ladder Yet

· 9 min read
Tian Pan
Software Engineer

You open a requisition for an "eval engineer." A week later your recruiter asks the obvious question: what level is this, and what does a good resume look like? You don't have an answer. The title didn't exist two years ago. There is no leveling rubric, no canonical interview loop, no pool of people with the words "eval engineer" already on their LinkedIn. You are hiring for a job the industry has not agreed exists.

This is the quiet bottleneck in shipping AI systems. The model is available. The infrastructure is rentable. What you cannot buy off the shelf is the person whose actual job is keeping a prompt-driven system honest — and your hiring machinery, built for roles with decades of precedent, has no slot for them.

The instinct is to wait. Wait for the title to standardize, for the bootcamps to mint candidates, for someone else to write the leveling guide you can copy. That instinct is wrong. The work exists now whether or not the title does, and the teams staffing it now are the ones learning what "good" looks like before their competitors even open the req.

The Postmortem Where the Root Cause Was a Prompt Nobody Owned

· 9 min read
Tian Pan
Software Engineer

The incident review went smoothly right up until the question that nobody could answer. Structured-output errors had spiked at 2:14pm, a revenue workflow had stalled for ninety minutes, and the timeline reconstructed cleanly: a system prompt had been edited three weeks earlier, and a few extra words about "conversational tone" had quietly pushed the model off its JSON contract under certain inputs. The fix was a one-line revert. The hard part came next. Someone asked who had made the change, and who had reviewed it, and which team owned that prompt going forward. The room went quiet. There was no pull request. There was no reviewer. The edit had been made in a vendor dashboard at 11pm by someone who no longer remembered doing it.

That silence is the actual incident. The JSON contract breaking was a symptom. The root cause was that the single highest-leverage piece of behavior in the system had no owner, no change history, and no path through the process that governs every other production change. The model didn't fail. The model did exactly what it was told. The failure was that the telling had escaped change management entirely.

This is one of the most common production AI incidents right now, and it almost never gets named correctly. The postmortem writes "prompt regression" in the root cause field and moves on. But "prompt regression" describes the code. The real root cause is an org chart with a hole in it.

The Token Budget Is a Product Decision, Not a Config Value

· 10 min read
Tian Pan
Software Engineer

Somewhere in your codebase there is a line that looks like retriever.search(query, top_k=8). An engineer wrote that 8 in an afternoon. It was never reviewed by anyone outside the team, never appeared in a spec, and has never been revisited. That single integer decides how much of your context window goes to retrieved documents instead of conversation history, how much each request costs, how slow the response feels, and — because of how language models actually behave at length — how accurate the answer is.

That is a product decision. It is sitting in an f-string.

Your Voice Agent Trusts Every Transcription Error as Fact

· 10 min read
Tian Pan
Software Engineer

A user calls your insurance voice agent and asks about their deductible. The speech recognizer hears "the duck tibble." Your language model receives the string "the duck tibble," finds nothing coherent to do with it, and either asks a confused follow-up question or — worse — confabulates an answer about a product that does not exist. The user hangs up. Your logs show a successful turn: audio in, transcript produced, response generated, no error thrown.

That is the quiet failure at the heart of nearly every voice agent in production. The speech-to-text system did its job — it produced its single best guess. The language model did its job — it reasoned over the text it was handed. The bug lives in the gap between them, in a handoff that takes a probabilistic guess and relabels it as a fact.

The Agent Feedback Loop You Never Built

· 9 min read
Tian Pan
Software Engineer

Every day your agent ships failures back to you, gift-wrapped. A user clicks thumbs-down. Another reads the answer, says nothing, and closes the tab. A third rephrases the same question three times until the agent finally gets it. Each of those is a labeled failure case — a real input, a real context, a real moment where the system fell short — handed to you for free by the people who care most about getting it right.

Most teams throw all of it away. Not deliberately. The thumbs-down increments a dashboard counter. The abandonment shows up as a dip in a retention chart. The rephrasing looks like ordinary usage. Nothing captures the signal together with the context that produced it, so nothing can be replayed, triaged, or turned into a test. The richest source of evaluation data you will ever have flows past untouched, and the team keeps writing synthetic eval cases by hand.

This is the agent feedback loop you never built. It is not a tool you forgot to buy. It is a pipeline — from user signal, to triaged failure, to new eval case — and the reason it stays unbuilt has very little to do with technology.

Why You Can't Budget an AI Feature With a Single Number

· 9 min read
Tian Pan
Software Engineer

Finance asks one question about every feature you ship: "What does it cost per user?" For a traditional feature, the answer is a number. A page render, a database query, a push notification — each has a marginal cost that barely moves from one request to the next. You measure it once, multiply by your user count, and the forecast holds.

An AI feature breaks that contract. Ask "what does this agent cost per request" and the honest answer is not a number, it's a histogram. The same agent that resolves one ticket for two cents will burn four dollars on the next one, because that user asked a vague question, the agent looped through eleven tool calls, and each call dragged the entire growing conversation back through the model. The mean of those two requests — two dollars — describes neither of them, and it definitely doesn't describe the bill.

That is the trap. When you hand finance a single average cost, you are not simplifying a messy reality. You are reporting a number that is wrong in a specific, expensive direction.

What a Coding Interview Measures When the Candidate Has an Agent

· 9 min read
Tian Pan
Software Engineer

The coding interview was built to isolate a single variable. Put a person in a room, give them a problem, take away their references, and watch whether they can turn the problem into working code by themselves. Everything about the format — the whiteboard, the blank editor, the prohibition on looking things up — exists to strip away collaborators and tools so you measure one isolated skill: can this person, alone, write correct code under pressure.

That skill is no longer the one the job exercises. Day-to-day engineering in 2026 is a collaboration between an engineer and an agent. The engineer decides what to build, the agent drafts most of the code, and the engineer's real work is reviewing, correcting, and deciding when the agent is confidently wrong. The interview measures solo code production. The job rewards directing a tireless, fast, occasionally hallucinating collaborator. The proxy and the target have come apart, and most hiring pipelines haven't noticed.

This is not a complaint about cheating, though cheating is the symptom everyone fixates on. It's a measurement problem. When you can no longer observe the variable your test was designed to isolate, the test stops producing signal — and a test that produces no signal while everyone still trusts it is worse than no test at all.

The Embedding Upgrade That Silently Re-Ranks Your Entire Corpus

· 9 min read
Tian Pan
Software Engineer

A new embedding model lands on the leaderboard. It scores higher than the one you shipped eighteen months ago, the API is a one-line change, and the dimensions even match. Someone files a ticket: "upgrade embedding model." It looks like swapping a logging library.

It is not. The embedding model is not a component of your retrieval system — it is the coordinate system your retrieval system lives in. Changing it does not improve your index. It invalidates it. And the cruelest part is that nothing crashes. No exception, no failed health check. Your search just starts returning subtly different results, and "subtly different" in a RAG pipeline means a different document feeds the model, which means a different answer reaches the user.

The Eval Set That Got Easier While You Weren't Looking

· 9 min read
Tian Pan
Software Engineer

You wrote the eval set eighteen months ago. Back then it was a useful instrument: the cheap model scored 71%, the better one scored 84%, and when a regression slipped in the number dropped and somebody noticed. The suite earned its place in CI. You stopped thinking about it.

Run it today and every candidate model scores 96, 97, 98. The new release scores the same as the old one. The model you suspect is worse scores the same as the model you suspect is better. The number still renders green in the dashboard, the check still passes, and it tells you exactly nothing. Your eval set didn't break. It got easier — because the models got better underneath it — and nobody was watching the moment it stopped discriminating.

This is eval saturation, and it is not a failure mode you might hit. It is the guaranteed end state of any static suite given a long enough timeline. A test that everything passes has stopped being a test.

The Eval Set Is a Lagging Indicator: Your Green Dashboard Only Knows Last Quarter's Failures

· 8 min read
Tian Pan
Software Engineer

Every mature AI team builds its eval suite the same way, and almost nobody says the quiet part out loud. A failure shows up in production. Someone writes a postmortem. An engineer distills the incident into a test case, adds it to the eval suite, and the dashboard goes green again. Repeat this loop for a year and you have a few hundred cases, a satisfying pass rate, and a deeply comforting number to put on a slide.

Here is the quiet part: that suite is a museum. Every exhibit is a failure class the team has already survived. A 98% pass rate certifies your system against the past — against the specific ways it has already broken — and says almost nothing about the novel failure mode that a model migration, a prompt edit, or a shift in user behavior is about to introduce. The eval set is a lagging indicator wearing the costume of a leading one.

Your Eval Set Is a Frozen Photograph of Traffic Your Users Already Left

· 10 min read
Tian Pan
Software Engineer

You shipped a model upgrade. The eval suite went from 87% to 91%. The release notes wrote themselves, leadership clapped, and then the dashboards that actually matter — user satisfaction, escalation rate, thumbs-down ratio — did nothing. Flat. Maybe slightly worse.

This is one of the most disorienting failure modes in AI engineering, because nothing is broken. The eval ran correctly. The numbers are real. The model genuinely improved on the 600 examples you tested it against. The problem is that those 600 examples are a photograph of traffic from the week you built the suite, and your users have spent the months since then walking out of frame.