Skip to main content

763 posts tagged with "ai-engineering"

View all tags

The Agent Feedback Loop You Never Built

· 9 min read
Tian Pan
Software Engineer

Every day your agent ships failures back to you, gift-wrapped. A user clicks thumbs-down. Another reads the answer, says nothing, and closes the tab. A third rephrases the same question three times until the agent finally gets it. Each of those is a labeled failure case — a real input, a real context, a real moment where the system fell short — handed to you for free by the people who care most about getting it right.

Most teams throw all of it away. Not deliberately. The thumbs-down increments a dashboard counter. The abandonment shows up as a dip in a retention chart. The rephrasing looks like ordinary usage. Nothing captures the signal together with the context that produced it, so nothing can be replayed, triaged, or turned into a test. The richest source of evaluation data you will ever have flows past untouched, and the team keeps writing synthetic eval cases by hand.

This is the agent feedback loop you never built. It is not a tool you forgot to buy. It is a pipeline — from user signal, to triaged failure, to new eval case — and the reason it stays unbuilt has very little to do with technology.

Why You Can't Budget an AI Feature With a Single Number

· 9 min read
Tian Pan
Software Engineer

Finance asks one question about every feature you ship: "What does it cost per user?" For a traditional feature, the answer is a number. A page render, a database query, a push notification — each has a marginal cost that barely moves from one request to the next. You measure it once, multiply by your user count, and the forecast holds.

An AI feature breaks that contract. Ask "what does this agent cost per request" and the honest answer is not a number, it's a histogram. The same agent that resolves one ticket for two cents will burn four dollars on the next one, because that user asked a vague question, the agent looped through eleven tool calls, and each call dragged the entire growing conversation back through the model. The mean of those two requests — two dollars — describes neither of them, and it definitely doesn't describe the bill.

That is the trap. When you hand finance a single average cost, you are not simplifying a messy reality. You are reporting a number that is wrong in a specific, expensive direction.

What a Coding Interview Measures When the Candidate Has an Agent

· 9 min read
Tian Pan
Software Engineer

The coding interview was built to isolate a single variable. Put a person in a room, give them a problem, take away their references, and watch whether they can turn the problem into working code by themselves. Everything about the format — the whiteboard, the blank editor, the prohibition on looking things up — exists to strip away collaborators and tools so you measure one isolated skill: can this person, alone, write correct code under pressure.

That skill is no longer the one the job exercises. Day-to-day engineering in 2026 is a collaboration between an engineer and an agent. The engineer decides what to build, the agent drafts most of the code, and the engineer's real work is reviewing, correcting, and deciding when the agent is confidently wrong. The interview measures solo code production. The job rewards directing a tireless, fast, occasionally hallucinating collaborator. The proxy and the target have come apart, and most hiring pipelines haven't noticed.

This is not a complaint about cheating, though cheating is the symptom everyone fixates on. It's a measurement problem. When you can no longer observe the variable your test was designed to isolate, the test stops producing signal — and a test that produces no signal while everyone still trusts it is worse than no test at all.

The Embedding Upgrade That Silently Re-Ranks Your Entire Corpus

· 9 min read
Tian Pan
Software Engineer

A new embedding model lands on the leaderboard. It scores higher than the one you shipped eighteen months ago, the API is a one-line change, and the dimensions even match. Someone files a ticket: "upgrade embedding model." It looks like swapping a logging library.

It is not. The embedding model is not a component of your retrieval system — it is the coordinate system your retrieval system lives in. Changing it does not improve your index. It invalidates it. And the cruelest part is that nothing crashes. No exception, no failed health check. Your search just starts returning subtly different results, and "subtly different" in a RAG pipeline means a different document feeds the model, which means a different answer reaches the user.

The Eval Set That Got Easier While You Weren't Looking

· 9 min read
Tian Pan
Software Engineer

You wrote the eval set eighteen months ago. Back then it was a useful instrument: the cheap model scored 71%, the better one scored 84%, and when a regression slipped in the number dropped and somebody noticed. The suite earned its place in CI. You stopped thinking about it.

Run it today and every candidate model scores 96, 97, 98. The new release scores the same as the old one. The model you suspect is worse scores the same as the model you suspect is better. The number still renders green in the dashboard, the check still passes, and it tells you exactly nothing. Your eval set didn't break. It got easier — because the models got better underneath it — and nobody was watching the moment it stopped discriminating.

This is eval saturation, and it is not a failure mode you might hit. It is the guaranteed end state of any static suite given a long enough timeline. A test that everything passes has stopped being a test.

The Eval Set Is a Lagging Indicator: Your Green Dashboard Only Knows Last Quarter's Failures

· 8 min read
Tian Pan
Software Engineer

Every mature AI team builds its eval suite the same way, and almost nobody says the quiet part out loud. A failure shows up in production. Someone writes a postmortem. An engineer distills the incident into a test case, adds it to the eval suite, and the dashboard goes green again. Repeat this loop for a year and you have a few hundred cases, a satisfying pass rate, and a deeply comforting number to put on a slide.

Here is the quiet part: that suite is a museum. Every exhibit is a failure class the team has already survived. A 98% pass rate certifies your system against the past — against the specific ways it has already broken — and says almost nothing about the novel failure mode that a model migration, a prompt edit, or a shift in user behavior is about to introduce. The eval set is a lagging indicator wearing the costume of a leading one.

Your Eval Set Is a Frozen Photograph of Traffic Your Users Already Left

· 10 min read
Tian Pan
Software Engineer

You shipped a model upgrade. The eval suite went from 87% to 91%. The release notes wrote themselves, leadership clapped, and then the dashboards that actually matter — user satisfaction, escalation rate, thumbs-down ratio — did nothing. Flat. Maybe slightly worse.

This is one of the most disorienting failure modes in AI engineering, because nothing is broken. The eval ran correctly. The numbers are real. The model genuinely improved on the 600 examples you tested it against. The problem is that those 600 examples are a photograph of traffic from the week you built the suite, and your users have spent the months since then walking out of frame.

The Eval Suite That Became the Spec Nobody Agreed To

· 8 min read
Tian Pan
Software Engineer

Open any mature agent codebase and ask a simple question: where is the requirements document? Not the pitch deck, not the launch doc, not the Notion page that was last touched in Q3. Where is the artifact that says, concretely and unambiguously, what this agent is supposed to do?

For most teams, the honest answer is the eval suite. There is a folder of test cases — inputs paired with expected outputs, rubrics, judge prompts — and a CI gate that says pass or fail. That folder is the only place where "correct" is defined precisely enough to be executed. Everything else is prose, and prose drifts.

This is not inherently bad. An executable spec is more honest than a PRD that nobody reads. The problem is that almost nobody treats the eval suite as a spec. It was assembled by one engineer, under deadline, to make a release gate go green. It encodes a hundred judgment calls that were never written down, never reviewed, and never agreed to. And the model is now optimized precisely to it.

The Fallback Model You Never Load-Tested

· 8 min read
Tian Pan
Software Engineer

Every resilient LLM design has a line in the config that names a secondary model. It is there because someone, during a design review, asked the right question — "what happens when the primary is down?" — and someone else answered it with a fallback: key. Everyone nodded. The architecture diagram got a second box with a dotted arrow. The compliance doc got a sentence about graceful degradation.

And then nobody touched it again.

The fallback model is the most confidently asserted, least exercised component in most production AI systems. It is named, documented, and diagrammed — and on the day it actually carries traffic, it is also the day it has its first encounter with a real request. You did not build a safety net. You built a second model with an unknown breaking strain, and you will discover that strain at the worst possible moment.

Your Fallback Path Is the Only Untested Code in Production

· 9 min read
Tian Pan
Software Engineer

Every serious AI system ships with a fallback. When the primary model is rate-limited, route to a cheaper one. When the provider returns 5xx, serve a cached answer. When confidence drops below a threshold, fall back to a hand-written heuristic. The architecture diagram has a clean little branch labeled "degraded mode," and everyone feels safer for it.

Here is the uncomfortable part. That branch is the only code in your system that almost never runs. The primary path executes millions of times a day and gets debugged, profiled, and battle-tested by sheer traffic volume. The fallback executes approximately never — until the day it executes for everyone at once, under load, during an incident, while three engineers watch a dashboard turn red.

A fallback you do not exercise is not redundancy. It is a second, unmonitored system whose debut is statistically guaranteed to happen at the worst possible moment.

The LLM Judge Is a Versioned Dependency, Not Neutral Infrastructure

· 9 min read
Tian Pan
Software Engineer

Most teams treat their LLM judge the way they treat a unit-test runner: neutral infrastructure that produces a number you can trust. You write a rubric, point a model at your outputs, and the judge returns scores. The scores go on a dashboard. The dashboard's trendline drives the roadmap. Nobody thinks of the judge as a thing that has behavior, because the whole point of automation was to take behavior out of the loop.

But the judge is a model. It has a version. It has biases. And the day it changes — because your eval-platform team swapped it for something cheaper, or because the provider silently rolled the weights behind a -latest alias — every historical score it produced becomes incomparable to every new one. Your quarter-over-quarter quality trend is now denominated in two different currencies, and no one printed an exchange rate.

This is not a hypothetical edge case. It is the default outcome of using an LLM as a measurement instrument without versioning it like one.

The Semantic Cache That Confidently Returns the Wrong Answer

· 9 min read
Tian Pan
Software Engineer

Two support users ask your agent almost the same question within a minute of each other. The first asks, "What's our refund window for EU orders?" The second asks, "What's our refund window for US orders?" The embeddings of those two sentences sit a hair's breadth apart — same length, same structure, one two-letter token of difference. Your semantic cache, tuned to a similarity threshold that looked perfectly reasonable in the demo, scores them as a match. The second user gets the first user's answer. The EU's 14-day cooling-off period is presented to a US customer as fact, in fluent prose, with no asterisk.

Nobody gets paged for this. The cache returned a 200. Latency was great. The cost dashboard shows a hit, which is the outcome everyone wanted. The only signal that anything went wrong is a customer acting on policy that does not apply to them — and that signal arrives days later, through a refund dispute, not through your monitoring.

This is the failure mode that makes semantic caching different from every cache you have built before. An exact-match cache can be stale, but it is never wrong — the key either matches or it doesn't. A semantic cache trades that guarantee away on purpose. It is designed to return answers for keys it has never seen, and the price of that latency win is a correctness risk that most teams never put a number on.