763 posts tagged with "ai-engineering"

The Streaming Token the User Acted On Too Soon

· 9 min read
Tian Pan
Software Engineer

A user asked your assistant whether a config change was safe to ship. The model streamed back: "Yes, you can deploy this safely." Three hundred milliseconds later it continued: "— except in the us-east region, where the old connection pool is still draining." But the user had already read the first half, felt the relief of a green light, and clicked deploy. The qualification arrived to an empty room.

Nobody made a mistake here. The model was correct. The user read what was on screen. The renderer faithfully displayed every token the moment it arrived. And yet the outcome was a bad deploy, because streaming turned the model's intermediate state into something the user treated as final.

Structured Output Is Not Validated Output

· 9 min read
Tian Pan
Software Engineer

The day your team turns on schema-constrained decoding feels like a milestone. The parsing errors stop. The JSONDecodeError alerts go quiet. The flaky regex that scraped fields out of prose gets deleted. Someone says "the model returns valid JSON now" in standup, and the structured-output ticket gets closed.

That sentence is where the trouble starts. "The model returns valid JSON now" is the beginning of correctness work, not the end of it. JSON mode and constrained decoding guarantee the shape of a response — that quantity is an integer, that status is one of three enum values, that the object has the keys you asked for. They guarantee nothing about whether quantity is the right number, whether status reflects what actually happened, or whether the sku field points at a product that exists in your catalog.
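A minimal sketch of that gap, assuming a hypothetical in-memory `CATALOG` and a payload that has already passed schema validation. The field names `sku`, `quantity`, and `status` come from the paragraph above; the catalog contents and the `semantic_errors` helper are illustrative assumptions, not part of any library.

```python
# Hypothetical catalog: sku -> units in stock.
# A real system would query its product database instead.
CATALOG = {"SKU-100": 25, "SKU-200": 0}


def semantic_errors(order: dict) -> list[str]:
    """Return business-rule violations for an order that is already schema-valid."""
    errors = []
    sku, quantity = order["sku"], order["quantity"]
    if quantity <= 0:
        errors.append("quantity must be positive")
    if sku not in CATALOG:
        errors.append(f"unknown sku: {sku}")
    elif quantity > CATALOG[sku]:
        errors.append(f"quantity {quantity} exceeds stock {CATALOG[sku]} for {sku}")
    return errors


# Both payloads have the right shape; only the first is a valid order.
good = {"sku": "SKU-100", "quantity": 3, "status": "pending"}
bad = {"sku": "SKU-999", "quantity": 3, "status": "pending"}
print(semantic_errors(good))
print(semantic_errors(bad))
```

Constrained decoding would accept both payloads. Only a check against data the model cannot see, here the catalog, separates them.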

The Test the Agent Wrote That Tests Nothing

· 10 min read
Tian Pan
Software Engineer

Ask a coding agent to "add tests for this module" and you will get tests. They will be neatly formatted, they will follow your project's conventions, and they will pass. Coverage will tick up. The PR will look like diligence. And a meaningful fraction of those tests will not be able to catch a single bug you might plausibly introduce.

This is not a story about a model being dumb. The agent did exactly what it was asked. The problem is that "add tests" and "add tests that constrain the behavior" are different requests, and only one of them is easy to verify at a glance. A green checkmark looks identical whether it sits on top of a real assertion or a tautology.

The result is a suite that grows in line count and shrinks in power. You end up with more files, more CI minutes, more things to maintain — and roughly the same probability of shipping a regression as before you started.
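To make the difference between a tautology and a real assertion concrete, here is a small sketch. The `apply_discount` function, its buggy variant, and the two checks are hypothetical examples invented for illustration. The first check derives its expected value from the code under test, so it passes for any implementation; the second pins the behavior to a value worked out independently.

```python
def apply_discount(price: float, percent: float) -> float:
    """Correct implementation: percent is given as a whole number, e.g. 10 for 10%."""
    return round(price * (1 - percent / 100), 2)


def buggy_apply_discount(price: float, percent: float) -> float:
    """Plausible regression: forgets to divide the percentage by 100."""
    return round(price * (1 - percent), 2)


def tautological_check(fn) -> bool:
    # Expected value is recomputed with the function under test,
    # so this passes no matter what fn does.
    expected = fn(200.0, 10)
    return fn(200.0, 10) == expected


def constraining_check(fn) -> bool:
    # Expected value is worked out by hand: 10% off 200.00 is 180.00.
    return fn(200.0, 10) == 180.0


for fn in (apply_discount, buggy_apply_discount):
    print(fn.__name__, tautological_check(fn), constraining_check(fn))
```

Both checks are green on the correct code. Only the second turns red when the regression is introduced, which is the property a reviewer cannot see from the checkmark alone.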

The Eval That Quietly Went Stale: When Your Test Suite Measures a World That No Longer Exists

· 9 min read
Tian Pan
Software Engineer

Your eval suite passed. All 240 cases green, same as last week. You ship. Two days later support tickets spike, and when you read the transcripts you find a failure mode your suite has no opinion about at all — not a case that flipped from pass to fail, but a question your users started asking that your suite never thought to ask.

This is the quiet failure of evals. We treat a green run as a statement about the present: "the system works." It is actually a statement about a past — the moment the eval cases were written. An eval authored six months ago encodes three things as they were that day: the product's scope, the model's failure modes, and the way real users phrase their requests. All three move. The feature grew a new surface. The model got upgraded twice. The input distribution drifted as users learned what the product could do. The suite did not move with them, so a green run increasingly certifies a world that no longer exists.

Nobody notices, because nothing breaks. A stale eval does not throw an error. It keeps passing, confidently, while measuring less and less of what matters.

Your Tool Descriptions Are an Instruction Channel the Model Obeys

· 8 min read
Tian Pan
Software Engineer

When a security team reviews a new tool integration, they read the code. They check what the function does, what it touches, what scopes it needs, whether it logs secrets. They almost never read the one sentence that decides whether the model calls it at all — the tool's description. That sentence is not documentation. It is an instruction the model treats as authoritative, and in most agent stacks nobody reviews it.

A tool description is written for the model to read. The model uses it to decide when the tool is relevant, what arguments to pass, and how to interpret what comes back. That makes the description a control channel into the model's behavior. And the moment a tool arrives from a third-party registry, a Model Context Protocol (MCP) server you don't operate, or a plugin a teammate installed last week, that control channel is authored by someone you never agreed to trust.

This is the gap. Input sanitization inspects what users type. Code review inspects what functions execute. The tool description sits between them — it is configuration that behaves like input — and it falls through both nets.

When Two Agents Share a Tool: Concurrency Bugs in Multi-Agent Systems

· 9 min read
Tian Pan
Software Engineer

The moment you typed "spin up another agent to handle that in parallel," you became a distributed systems engineer. You probably didn't notice. The framework made it a one-line change, the demo worked, and the latency dropped. But under the hood you just introduced two processes that read and write shared state with no coordination — and every race condition, lost update, and dirty read that has haunted databases for fifty years is now sitting in your agent stack, waiting.

The reason this bites so hard is that the failure doesn't look like a concurrency bug. It looks like one agent being wrong. The output is syntactically valid, the pipeline is green, no exception is thrown — and yet a customer got charged twice, or a file is missing half its expected content, or an agent confidently acted on a number that another agent had already overwritten. You go debug "the dumb agent" and find nothing wrong with its prompt, because the prompt was never the problem.
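A deterministic sketch of the lost update described above, with the interleaving written out by hand rather than left to real threads. `VersionedStore` and its methods are hypothetical names for illustration; the version check is one common remedy (optimistic concurrency), offered here as an assumption about what a fix could look like rather than what any particular agent framework does.

```python
class VersionedStore:
    """A shared value with a version counter (a stand-in for a row, file, or ticket)."""

    def __init__(self, value: int) -> None:
        self.value = value
        self.version = 0

    def read(self) -> tuple[int, int]:
        return self.value, self.version

    def write(self, value: int) -> None:
        # Blind write: overwrites whatever another agent did since our read.
        self.value = value
        self.version += 1

    def compare_and_set(self, value: int, expected_version: int) -> bool:
        # Refuse the write if the value changed since the caller read it.
        if self.version != expected_version:
            return False
        self.value = value
        self.version += 1
        return True


def lost_update() -> int:
    store = VersionedStore(100)
    a_seen, _ = store.read()   # agent A reads 100
    b_seen, _ = store.read()   # agent B reads 100 before A writes
    store.write(a_seen + 10)   # A writes 110
    store.write(b_seen + 10)   # B also writes 110; A's update is lost
    return store.value


def checked_update() -> int:
    store = VersionedStore(100)
    a_seen, a_ver = store.read()
    b_seen, b_ver = store.read()
    store.compare_and_set(a_seen + 10, a_ver)            # A succeeds
    if not store.compare_and_set(b_seen + 10, b_ver):    # B is rejected as stale
        b_seen, b_ver = store.read()                     # B re-reads A's result
        store.compare_and_set(b_seen + 10, b_ver)        # and retries
    return store.value


print(lost_update(), checked_update())
```

In the unchecked version nothing raises and both agents report success, which is why the bug reads as "one agent being wrong" rather than as a race.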

Your Vector Index Is a Cache With No Invalidation Strategy

· 9 min read
Tian Pan
Software Engineer

A vector index feels like a database. You write documents into it, you query it, it returns results. But it is not a database — it is a derived, denormalized copy of data that lives somewhere else. Your source of truth is a wiki, a ticket system, a CRM, a folder of PDFs. The embeddings are a projection of that truth, frozen at the moment you ran the ingestion job.

That makes your vector index a cache. And like every cache, it goes stale. The difference is that most teams build a caching layer on purpose, with a TTL and an invalidation hook, while almost nobody deliberately builds a vector index as a cache. They build it as a "knowledge base" and then act surprised when it serves knowledge that stopped being true three weeks ago.
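One way to give the index an invalidation hook is to store a content hash next to each embedding at ingestion time and compare it against the source on a schedule. The sketch below assumes documents are addressable by a stable id and that the hash recorded at ingestion is available; `content_hash` and `plan_refresh` are hypothetical helper names, and no specific vector database API is implied.

```python
import hashlib


def content_hash(text: str) -> str:
    """Fingerprint of the source text as it was (or is) at a point in time."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_refresh(source: dict[str, str], indexed_hashes: dict[str, str]) -> dict[str, list[str]]:
    """Compare the source of truth with what the index was built from.

    source:         doc_id -> current text in the system of record
    indexed_hashes: doc_id -> hash recorded when the doc was embedded
    """
    reembed = sorted(
        doc_id for doc_id, text in source.items()
        if indexed_hashes.get(doc_id) != content_hash(text)  # new or changed
    )
    delete = sorted(doc_id for doc_id in indexed_hashes if doc_id not in source)
    return {"reembed": reembed, "delete": delete}


source = {"a": "policy v2", "b": "unchanged", "d": "brand new page"}
indexed = {
    "a": content_hash("policy v1"),
    "b": content_hash("unchanged"),
    "c": content_hash("page that was deleted"),
}
print(plan_refresh(source, indexed))
```

The `delete` list matters as much as `reembed`: a document removed from the source keeps answering queries until its vectors are removed too.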

When the Cheap Model Is the Expensive One

· 9 min read
Tian Pan
Software Engineer

A finance team flags that the LLM bill is up 18% this quarter. An engineer pulls the usage dashboard, sees that 70% of traffic now hits the budget model instead of the frontier one, and is briefly confused: the routing change was supposed to cut spend. The per-token price went down exactly as the spreadsheet promised. The bill went up anyway.

This is not a billing error. It is the most common way a cost optimization quietly inverts itself. The spreadsheet that justified the downgrade priced one thing — tokens — and the production system pays for something else entirely: finished tasks. A weaker model does not just produce cheaper tokens. It changes the behavior of every component around it, and those second-order effects land on the same invoice.

The trap is seductive because the first-order math is genuinely correct. A budget model can be 10x to 30x cheaper per token than a frontier model, and for a large fraction of traffic it returns an answer that is indistinguishable in quality. The mistake is not the routing decision. The mistake is measuring the routing decision at the wrong boundary.
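Moving the measurement boundary from tokens to finished tasks can be sketched as a small calculation. Every number below is hypothetical and chosen only to illustrate the shape of the inversion: a budget model priced 10x cheaper per token that, in this made-up scenario, consumes six times the tokens per attempt and succeeds half the time. The function assumes failed attempts are retried independently until one succeeds, which is a simplification.

```python
def cost_per_finished_task(
    price_per_1k_tokens: float,
    tokens_per_attempt: int,
    success_rate: float,
) -> float:
    """Expected spend to get one finished task, retrying failures until success."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    cost_per_attempt = price_per_1k_tokens * tokens_per_attempt / 1000
    expected_attempts = 1 / success_rate  # mean of a geometric distribution
    return cost_per_attempt * expected_attempts


# Hypothetical inputs, not measurements.
frontier = cost_per_finished_task(price_per_1k_tokens=0.010, tokens_per_attempt=2_000, success_rate=0.95)
budget = cost_per_finished_task(price_per_1k_tokens=0.001, tokens_per_attempt=12_000, success_rate=0.50)

print(f"frontier: ${frontier:.4f} per finished task")
print(f"budget:   ${budget:.4f} per finished task")
```

With these assumed inputs the budget model costs more per finished task despite the lower token price. With a higher success rate or fewer tokens per attempt, it would win comfortably, which is why the comparison has to be run on measured values rather than the price sheet.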

The 14-Month Half-Life of Your Prompt Expert

· 9 min read
Tian Pan
Software Engineer

Every company shipping AI features in production has one or two engineers it cannot afford to lose, and most of them do not know who those engineers are until the resignation email arrives.

The person in question is rarely the loudest in the room. They are the one who remembers that the customer-support summarizer's tone got fixed by a three-line system-prompt edit after the Q2 escalation, that the eval suite added six cases the week the model provider quietly changed its default sampling, and that the judge calibration drifted the last time someone "cleaned up" the rubric. None of this is written down in a place a successor would find. It lives in one head, and roughly every two weeks a recruiter messages that head with a 25% raise attached.

The Confidence-Score Tax: Why Asking the Model How Sure It Is Costs More Than Being Wrong

· 10 min read
Tian Pan
Software Engineer

Somewhere in the evolution of every AI feature, a reviewer asks a reasonable-sounding question: "Can we have the model tell us how confident it is, so we can route the low-confidence answers to a human or a fallback?" It sounds like free insurance. You add a confidence field to the output schema, the model dutifully fills it in, and now you have a dial to turn. Ship it.

That dial is not free, and worse, it is usually not wired to anything. The confidence number is a token sequence the model is happy to produce and under no obligation to mean. Teams pay real tokens and real latency to acquire it, never check whether it correlates with correctness, and then route production traffic on it as if "0.9" were a 90% reliability estimate. It is a gauge bolted to the dashboard with nothing behind the glass.

This post is about the two costs nobody priced: the per-request tax of generating the confidence field at all, and the much larger cost of trusting an uncalibrated number to make routing decisions.
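Checking whether the dial is wired to anything takes a labeled sample and a few lines of code. The sketch below is a simple calibration table, assuming you have logged pairs of (stated confidence in the range 0 to 1, whether the answer was actually correct). The function name, the bin count, and the sample records are all hypothetical.

```python
def calibration_table(records: list[tuple[float, bool]], n_bins: int = 5) -> list[dict]:
    """Group (confidence, was_correct) pairs into equal-width bins and compare
    stated confidence with observed accuracy in each non-empty bin."""
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for confidence, was_correct in records:
        index = min(int(confidence * n_bins), n_bins - 1)  # 1.0 lands in the top bin
        bins[index].append((confidence, was_correct))

    table = []
    for index, members in enumerate(bins):
        if not members:
            continue
        mean_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        table.append({
            "bin": index,
            "count": len(members),
            "mean_confidence": mean_confidence,
            "accuracy": accuracy,
            "gap": mean_confidence - accuracy,  # positive means overconfident
        })
    return table


# Made-up sample: the model says roughly 0.9 but is right half the time.
sample = [(0.9, True), (0.9, False), (0.95, True), (0.9, False), (0.3, False), (0.3, True)]
for row in calibration_table(sample):
    print(row)
```

If the gap in the high-confidence bins is large, a routing threshold on that field is not separating reliable answers from unreliable ones.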

The PM-Eval Translation Gap: When Ship Decisions Outrun the Vocabulary

· 8 min read
Tian Pan
Software Engineer

The go/no-go meeting for an AI feature is, on the surface, a data-driven ritual. Engineering brings a slate of eval numbers — judge score deltas, slice accuracies, regression-against-baseline percentages — and the room decides. It looks rigorous. It usually isn't.

Here is the failure mode in one sentence: the person with the literacy to weight the eval slices does not have the authority to make the call, and the person with the authority cannot read the slices. The product manager owns the launch. The engineer owns the meaning of the numbers. Between them sits a translation gap, and into that gap rushes whoever speaks most confidently in the meeting.

The tell is that "ship at 87%" and "hold at 87%" are both defensible from the same scorecard, depending on which slice you weight. When a single dataset supports opposite conclusions and the deciding factor is rhetorical confidence rather than evidence, you do not have a data-driven process. You have a debate with a spreadsheet in the background.

The Retry That Changed the Answer: Idempotency Keys for Nondeterministic LLM Calls

· 9 min read
Tian Pan
Software Engineer

Every distributed system you have ever built leans on one quiet assumption: a retry after a timeout is safe. The operation is idempotent, so if the client gives up waiting and re-sends, the worst case is duplicate work that converges to the same state. Two PUTs land the same row. Two DELETEs leave the same absence. The retry is a no-op dressed as a second attempt.

LLM calls break this assumption, and they break it silently. A retry does not re-fetch the same answer — it samples a new one. When a client times out at the network layer because the response was lost in transit, but the provider actually finished the generation, the retry produces a second, different answer. Now two distinct outputs exist for one logical request, and nothing in your stack knows which one is canonical.

This is not a rare edge case. Practitioners running models behind timeouts report that 5–10% of requests hit the full timeout-plus-retry cycle even when the underlying call eventually succeeds. Every one of those is a coin flip your system was never designed to adjudicate.
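A minimal sketch of the idea in the title: the caller attaches a key to each logical request, and the layer that actually runs the generation stores the first completed result under that key. `IdempotentCompletions` and `fake_model` are hypothetical names; the fake model returns a different string on every call to stand in for sampling. The in-memory dict is an assumption for brevity; it does not handle concurrent in-flight duplicates, expiry, or persistence.

```python
import itertools
from typing import Callable


class IdempotentCompletions:
    """Returns the first completed result for a key instead of sampling again."""

    def __init__(self, generate: Callable[[str], str]) -> None:
        self._generate = generate
        self._results: dict[str, str] = {}

    def complete(self, idempotency_key: str, prompt: str) -> str:
        if idempotency_key in self._results:
            # A retry of a request that already finished: replay the canonical answer.
            return self._results[idempotency_key]
        result = self._generate(prompt)
        self._results[idempotency_key] = result
        return result


_counter = itertools.count(1)


def fake_model(prompt: str) -> str:
    # Stand-in for a nondeterministic LLM call: every invocation differs.
    return f"answer #{next(_counter)} to {prompt!r}"


service = IdempotentCompletions(fake_model)
first = service.complete("req-42", "Is this config change safe to ship?")
retry = service.complete("req-42", "Is this config change safe to ship?")
other = service.complete("req-43", "Is this config change safe to ship?")
print(first == retry, first == other)
```

The store has to live on the side of the timeout where the generation completed. A key kept only in the client that gave up cannot recover the answer it never received.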