780 posts tagged with "ai-engineering"

The Eval Suite That Became the Spec Nobody Agreed To

· 8 min read
Tian Pan
Software Engineer

Open any mature agent codebase and ask a simple question: where is the requirements document? Not the pitch deck, not the launch doc, not the Notion page that was last touched in Q3. Where is the artifact that says, concretely and unambiguously, what this agent is supposed to do?

For most teams, the honest answer is the eval suite. There is a folder of test cases — inputs paired with expected outputs, rubrics, judge prompts — and a CI gate that says pass or fail. That folder is the only place where "correct" is defined precisely enough to be executed. Everything else is prose, and prose drifts.

This is not inherently bad. An executable spec is more honest than a PRD that nobody reads. The problem is that almost nobody treats the eval suite as a spec. It was assembled by one engineer, under deadline, to make a release gate go green. It encodes a hundred judgment calls that were never written down, never reviewed, and never agreed to. And the model is now optimized precisely to it.

The Fallback Model You Never Load-Tested

· 8 min read
Tian Pan
Software Engineer

Every resilient LLM design has a line in the config that names a secondary model. It is there because someone, during a design review, asked the right question ("what happens when the primary is down?") and someone else answered it with a "fallback:" key. Everyone nodded. The architecture diagram got a second box with a dotted arrow. The compliance doc got a sentence about graceful degradation.

And then nobody touched it again.

The fallback model is the most confidently asserted, least exercised component in most production AI systems. It is named, documented, and diagrammed, yet the day it first carries traffic is also the day it first meets a real request. You did not build a safety net. You built a second model with an unknown breaking strain, and you will discover that strain at the worst possible moment.

Your Fallback Path Is the Only Untested Code in Production

· 9 min read
Tian Pan
Software Engineer

Every serious AI system ships with a fallback. When the primary model is rate-limited, route to a cheaper one. When the provider returns 5xx, serve a cached answer. When confidence drops below a threshold, fall back to a hand-written heuristic. The architecture diagram has a clean little branch labeled "degraded mode," and everyone feels safer for it.

Here is the uncomfortable part. That branch is the only code in your system that almost never runs. The primary path executes millions of times a day and gets debugged, profiled, and battle-tested by sheer traffic volume. The fallback executes approximately never — until the day it executes for everyone at once, under load, during an incident, while three engineers watch a dashboard turn red.

A fallback you do not exercise is not redundancy. It is a second, unmonitored system whose debut is statistically guaranteed to happen at the worst possible moment.
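
One way to keep the degraded branch exercised is to send a small, deliberate slice of healthy traffic through it. The sketch below is an assumption, not something the post prescribes: `pick_route`, `drill_rate`, and the injectable `rng` are hypothetical names chosen for illustration.

```python
import random


def pick_route(primary_healthy, drill_rate=0.01, rng=random.random):
    """Choose a path for one request.

    drill_rate is a hypothetical knob: the fraction of healthy traffic
    that is deliberately sent through the fallback so it runs every day.
    """
    if not primary_healthy:
        return "fallback"  # the real outage case
    if rng() < drill_rate:
        return "fallback"  # a drill: the primary is fine, exercise the branch anyway
    return "primary"
```

Passing `rng` as a parameter keeps the routing decision deterministic in tests, so the drill path itself can be covered without relying on chance.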

The LLM Judge Is a Versioned Dependency, Not Neutral Infrastructure

· 9 min read
Tian Pan
Software Engineer

Most teams treat their LLM judge the way they treat a unit-test runner: neutral infrastructure that produces a number you can trust. You write a rubric, point a model at your outputs, and the judge returns scores. The scores go on a dashboard. The dashboard's trendline drives the roadmap. Nobody thinks of the judge as a thing that has behavior, because the whole point of automation was to take behavior out of the loop.

But the judge is a model. It has a version. It has biases. And the day it changes — because your eval-platform team swapped it for something cheaper, or because the provider silently rolled the weights behind a -latest alias — every historical score it produced becomes incomparable to every new one. Your quarter-over-quarter quality trend is now denominated in two different currencies, and no one printed an exchange rate.

This is not a hypothetical edge case. It is the default outcome of using an LLM as a measurement instrument without versioning it like one.
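
A minimal sketch of what versioning the judge could look like: every score carries the judge's pinned model id and a hash of the rubric, and trend math refuses to mix scores whose fingerprints differ. The `JudgeScore` record, its field names, and the model id used in any example are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeScore:
    case_id: str
    score: float
    judge_model: str  # a pinned model id (hypothetical), never a "-latest" alias
    rubric_sha: str   # fingerprint of the exact rubric text the judge saw


def rubric_fingerprint(rubric_text: str) -> str:
    """Short, stable hash of the rubric so rubric edits are visible in the data."""
    return hashlib.sha256(rubric_text.encode("utf-8")).hexdigest()[:12]


def mean_delta(old: list[JudgeScore], new: list[JudgeScore]) -> float:
    """Mean score change from old to new, refusing to mix judge versions."""
    if not old or not new:
        raise ValueError("both score sets must be non-empty")
    reference = (old[0].judge_model, old[0].rubric_sha)
    for s in old + new:
        if (s.judge_model, s.rubric_sha) != reference:
            raise ValueError(f"score for {s.case_id} came from a different judge or rubric")
    old_mean = sum(s.score for s in old) / len(old)
    new_mean = sum(s.score for s in new) / len(new)
    return new_mean - old_mean
```

Raising instead of silently averaging is a design choice: a missing trendline is easier to notice than one quietly computed across two judges.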

The Semantic Cache That Confidently Returns the Wrong Answer

· 9 min read
Tian Pan
Software Engineer

Two support users ask your agent almost the same question within a minute of each other. The first asks, "What's our refund window for EU orders?" The second asks, "What's our refund window for US orders?" The embeddings of those two sentences sit a hair's breadth apart — same length, same structure, one two-letter token of difference. Your semantic cache, tuned to a similarity threshold that looked perfectly reasonable in the demo, scores them as a match. The second user gets the first user's answer. The EU's 14-day cooling-off period is presented to a US customer as fact, in fluent prose, with no asterisk.

Nobody gets paged for this. The cache returned a 200. Latency was great. The cost dashboard shows a hit, which is the outcome everyone wanted. The only signal that anything went wrong is a customer acting on policy that does not apply to them — and that signal arrives days later, through a refund dispute, not through your monitoring.

This is the failure mode that makes semantic caching different from every cache you have built before. An exact-match cache can be stale, but it never answers a question other than the one that was asked: the key either matches or it doesn't. A semantic cache trades that guarantee away on purpose. It is designed to return answers for keys it has never seen, and the price of that latency win is a correctness risk that most teams never put a number on.
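
To make the EU/US collision concrete, here is a small sketch of a cache lookup that requires more than a similarity score. The guard is an assumption chosen for illustration: it treats all-caps tokens such as region codes as terms that must match exactly. The vectors are supplied by the caller, so no particular embedding model is implied, and `key_terms` and `lookup` are hypothetical names.

```python
import math
import re


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def key_terms(text):
    """Hypothetical guard: all-caps tokens (EU, US, ...) must match exactly."""
    return set(re.findall(r"\b[A-Z]{2,}\b", text))


def lookup(cache, query_text, query_vec, threshold=0.95):
    """Return the best cached answer, or None on a miss.

    cache is a list of (text, vector, answer) tuples. A hit needs both a
    similarity at or above the threshold and identical key terms.
    """
    best_answer, best_sim = None, threshold
    for entry_text, entry_vec, answer in cache:
        sim = cosine(query_vec, entry_vec)
        if sim >= best_sim and key_terms(entry_text) == key_terms(query_text):
            best_answer, best_sim = answer, sim
    return best_answer
```

A regex over capital letters is a crude stand-in for real entity extraction. The point is only that the similarity score alone does not decide the hit.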

The Streaming Token the User Acted On Too Soon

· 9 min read
Tian Pan
Software Engineer

A user asked your assistant whether a config change was safe to ship. The model streamed back: "Yes, you can deploy this safely." Three hundred milliseconds later it continued: "— except in the us-east region, where the old connection pool is still draining." But the user had already read the first half, felt the relief of a green light, and clicked deploy. The qualification arrived to an empty room.

Nobody made a mistake here. The model was correct. The user read what was on screen. The renderer faithfully displayed every token the moment it arrived. And yet the outcome was a bad deploy, because streaming turned the model's intermediate state into something the user treated as final.

Structured Output Is Not Validated Output

· 9 min read
Tian Pan
Software Engineer

The day your team turns on schema-constrained decoding feels like a milestone. The parsing errors stop. The JSONDecodeError alerts go quiet. The flaky regex that scraped fields out of prose gets deleted. Someone says "the model returns valid JSON now" in standup, and the structured-output ticket gets closed.

That sentence is where the trouble starts. "The model returns valid JSON now" is the beginning of correctness work, not the end of it. JSON mode and constrained decoding guarantee the shape of a response — that quantity is an integer, that status is one of three enum values, that the object has the keys you asked for. They guarantee nothing about whether quantity is the right number, whether status reflects what actually happened, or whether the sku field points at a product that exists in your catalog.
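
A minimal sketch of the layer that has to sit after schema-valid JSON. The catalog contents, the quantity ceiling, and the `semantic_errors` name are hypothetical. Checking whether status reflects what actually happened would need a lookup against a system of record, which is omitted here.

```python
import json

# Hypothetical catalog; in practice this would be a database or service lookup.
CATALOG = {"SKU-1001", "SKU-2002"}


def semantic_errors(raw, catalog=CATALOG, max_quantity=100):
    """Check a schema-valid order for values that are well-formed but wrong."""
    order = json.loads(raw)  # shape is assumed to be guaranteed upstream
    errors = []
    if order["sku"] not in catalog:
        errors.append(f"unknown sku: {order['sku']}")
    if not 1 <= order["quantity"] <= max_quantity:
        errors.append(f"implausible quantity: {order['quantity']}")
    return errors
```

An order such as `{"sku": "SKU-9999", "quantity": 0, "status": "shipped"}` satisfies a schema that asks for a string, an integer, and an enum value, and still fails both checks.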

The Test the Agent Wrote That Tests Nothing

· 10 min read
Tian Pan
Software Engineer

Ask a coding agent to "add tests for this module" and you will get tests. They will be neatly formatted, they will follow your project's conventions, and they will pass. Coverage will tick up. The PR will look like diligence. And a meaningful fraction of those tests will not be able to catch a single bug you might plausibly introduce.

This is not a story about a model being dumb. The agent did exactly what it was asked. The problem is that "add tests" and "add tests that constrain the behavior" are different requests, and only one of them is easy to verify at a glance. A green checkmark looks identical whether it sits on top of a real assertion or a tautology.

The result is a suite that grows in line count and shrinks in power. You end up with more files, more CI minutes, more things to maintain — and roughly the same probability of shipping a regression as before you started.
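
The difference between the two kinds of test is easiest to see against a deliberately broken implementation. Everything in this sketch is hypothetical: `apply_discount` stands in for the module under test, and `buggy_discount` is a hand-written mutant.

```python
def apply_discount(price, percent):
    """Code under test (hypothetical): price after a percentage discount."""
    return round(price * (1 - percent / 100), 2)


def buggy_discount(price, percent):
    """A plausible regression: the division by 100 was dropped."""
    return round(price * (1 - percent), 2)


def check_tautology(fn):
    """Passes for almost any implementation: it compares fn with itself."""
    result = fn(200.0, 25)
    assert result == fn(200.0, 25)
    assert result is not None


def check_behavior(fn):
    """Pins concrete expected values, so a wrong formula fails."""
    assert fn(200.0, 25) == 150.0
    assert fn(80.0, 0) == 80.0


def catches(check, fn):
    """True if the check fails, meaning it detected a problem in fn."""
    try:
        check(fn)
    except AssertionError:
        return True
    return False
```

Both checks are green against `apply_discount`. Only `check_behavior` goes red against `buggy_discount`, which is the property a coverage number cannot show.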

The Eval That Quietly Went Stale: When Your Test Suite Measures a World That No Longer Exists

· 9 min read
Tian Pan
Software Engineer

Your eval suite passed. All 240 cases green, same as last week. You ship. Two days later support tickets spike, and when you read the transcripts you find a failure mode your suite has no opinion about at all — not a case that flipped from pass to fail, but a question your users started asking that your suite never thought to ask.

This is the quiet failure of evals. We treat a green run as a statement about the present: "the system works." It is actually a statement about a past — the moment the eval cases were written. An eval authored six months ago encodes three things as they were that day: the product's scope, the model's failure modes, and the way real users phrase their requests. All three move. The feature grew a new surface. The model got upgraded twice. The input distribution drifted as users learned what the product could do. The suite did not move with them, so a green run increasingly certifies a world that no longer exists.

Nobody notices, because nothing breaks. A stale eval does not throw an error. It keeps passing, confidently, while measuring less and less of what matters.

Your Tool Descriptions Are an Instruction Channel the Model Obeys

· 8 min read
Tian Pan
Software Engineer

When a security team reviews a new tool integration, they read the code. They check what the function does, what it touches, what scopes it needs, whether it logs secrets. They almost never read the one sentence that decides whether the model calls it at all — the tool's description. That sentence is not documentation. It is an instruction the model treats as authoritative, and in most agent stacks nobody reviews it.

A tool description is written for the model to read. The model uses it to decide when the tool is relevant, what arguments to pass, and how to interpret what comes back. That makes the description a control channel into the model's behavior. And the moment a tool arrives from a third-party registry, a Model Context Protocol (MCP) server you don't operate, or a plugin a teammate installed last week, that control channel is authored by someone you never agreed to trust.

This is the gap. Input sanitization inspects what users type. Code review inspects what functions execute. The tool description sits between them — it is configuration that behaves like input — and it falls through both nets.
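
One possible way to pull descriptions into review is to pin a fingerprint of each approved description and flag any tool whose current text no longer matches. This is a sketch under that assumption; `fingerprint`, `unreviewed_tools`, and the plain dictionaries are hypothetical and not part of MCP or any agent framework.

```python
import hashlib


def fingerprint(description: str) -> str:
    """Stable hash of the exact description text the model will read."""
    return hashlib.sha256(description.encode("utf-8")).hexdigest()


def unreviewed_tools(tools: dict[str, str], approved: dict[str, str]) -> list[str]:
    """Names of tools whose current description was never approved.

    tools maps tool name to its current description.
    approved maps tool name to the fingerprint recorded at review time.
    """
    return sorted(
        name
        for name, description in tools.items()
        if approved.get(name) != fingerprint(description)
    )
```

A hash only detects that the text changed since review. Deciding whether the new text is safe is still a human step.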

When Two Agents Share a Tool: Concurrency Bugs in Multi-Agent Systems

· 9 min read
Tian Pan
Software Engineer

The moment you typed "spin up another agent to handle that in parallel," you became a distributed systems engineer. You probably didn't notice. The framework made it a one-line change, the demo worked, and the latency dropped. But under the hood you just introduced two processes that read and write shared state with no coordination — and every race condition, lost update, and dirty read that has haunted databases for fifty years is now sitting in your agent stack, waiting.

The reason this bites so hard is that the failure doesn't look like a concurrency bug. It looks like one agent being wrong. The output is syntactically valid, the pipeline is green, no exception is thrown — and yet a customer got charged twice, or a file is missing half its expected content, or an agent confidently acted on a number that another agent had already overwritten. You go debug "the dumb agent" and find nothing wrong with its prompt, because the prompt was never the problem.
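
The lost update described above can be reproduced, and rejected, with a few lines of optimistic concurrency control. This is a sketch of one standard remedy rather than the post's own prescription; `VersionedStore` and `StaleWrite` are hypothetical names.

```python
import threading


class StaleWrite(Exception):
    """Raised when a writer's view of the data is out of date."""


class VersionedStore:
    """Shared state where every write must name the version it read."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def read(self, key):
        """Return (value, version); a missing key reads as (None, 0)."""
        with self._lock:
            return self._data.get(key, (None, 0))

    def write(self, key, value, expected_version):
        """Compare-and-set: succeed only if nobody wrote since our read."""
        with self._lock:
            _, current = self._data.get(key, (None, 0))
            if current != expected_version:
                raise StaleWrite(
                    f"{key}: expected v{expected_version}, found v{current}"
                )
            self._data[key] = (value, current + 1)
            return current + 1
```

If two agents both read version 0 and both try to write, the second write raises `StaleWrite` instead of silently overwriting the first. The conflict becomes an explicit error that can be retried, not a wrong number downstream.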

Your Vector Index Is a Cache With No Invalidation Strategy

· 9 min read
Tian Pan
Software Engineer

A vector index feels like a database. You write documents into it, you query it, it returns results. But it is not a database — it is a derived, denormalized copy of data that lives somewhere else. Your source of truth is a wiki, a ticket system, a CRM, a folder of PDFs. The embeddings are a projection of that truth, frozen at the moment you ran the ingestion job.

That makes your vector index a cache. And like every cache, it goes stale. The difference is that most teams build a caching layer on purpose, with a TTL and an invalidation hook, while almost nobody builds a vector index on purpose as a cache. They build it as a "knowledge base" and then act surprised when it serves knowledge that stopped being true three weeks ago.
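
Treating the index as a cache implies having some way to ask which entries are stale. A minimal sketch, assuming a content hash is stored next to each embedding at ingestion time: `plan_reindex` and the dictionary shapes are hypothetical, and no particular vector database is implied.

```python
import hashlib


def content_hash(text: str) -> str:
    """Fingerprint of the source text an embedding was computed from."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(source_docs: dict[str, str], indexed_hashes: dict[str, str]):
    """Compare the source of truth with what the index believes.

    source_docs maps doc id to current text.
    indexed_hashes maps doc id to the hash stored at ingestion time.
    Returns (ids to re-embed, ids to delete from the index).
    """
    stale = sorted(
        doc_id
        for doc_id, text in source_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
    deleted = sorted(doc_id for doc_id in indexed_hashes if doc_id not in source_docs)
    return stale, deleted
```

New documents show up in the first list because they have no stored hash, and documents removed from the source show up in the second, which is the invalidation hook a TTL-based cache would otherwise provide.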