780 posts tagged with "ai-engineering"

When the Cheap Model Is the Expensive One

May 17, 2026 · 9 min read

Tian Pan

Software Engineer

A finance team flags that the LLM bill is up 18% this quarter. An engineer pulls the usage dashboard, sees that 70% of traffic now hits the budget model instead of the frontier one, and is briefly confused: the routing change was supposed to cut spend. The per-token price went down exactly as the spreadsheet promised. The bill went up anyway.

This is not a billing error. It is the most common way a cost optimization quietly inverts itself. The spreadsheet that justified the downgrade priced one thing — tokens — and the production system pays for something else entirely: finished tasks. A weaker model does not just produce cheaper tokens. It changes the behavior of every component around it, and those second-order effects land on the same invoice.

The trap is seductive because the first-order math is genuinely correct. A budget model can be 10x to 30x cheaper per token than a frontier model, and for a large fraction of traffic it returns an answer that is indistinguishable in quality. The mistake is not the routing decision. The mistake is measuring the routing decision at the wrong boundary.
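To make the boundary concrete, here is a minimal sketch that prices finished tasks rather than tokens. Every number in it is hypothetical and chosen only to illustrate the arithmetic: the function names, the token counts, and the success rates are assumptions, not measurements from any real system. It also assumes that a failed budget-model attempt escalates to the frontier model.

```python
def attempt_cost(price_per_1k_tokens: float, tokens: int) -> float:
    """Dollar cost of one model call."""
    return price_per_1k_tokens * tokens / 1000


def cost_retry_until_success(cost: float, success_rate: float) -> float:
    """Expected cost per finished task when failures retry on the same model.

    The number of attempts is geometric, so its expectation is 1 / success_rate.
    """
    return cost / success_rate


def cost_with_fallback(cheap_cost: float, cheap_success: float,
                       fallback_task_cost: float) -> float:
    """Expected cost per finished task: one cheap attempt, failures escalate."""
    return cheap_cost + (1 - cheap_success) * fallback_task_cost


# Hypothetical figures: the budget model is 10x cheaper per token, but
# longer prompts and extra repair turns inflate its token count per attempt.
frontier_attempt = attempt_cost(price_per_1k_tokens=0.010, tokens=2_000)
budget_attempt = attempt_cost(price_per_1k_tokens=0.001, tokens=14_000)

frontier_task = cost_retry_until_success(frontier_attempt, success_rate=0.95)
budget_task = cost_with_fallback(budget_attempt, cheap_success=0.60,
                                 fallback_task_cost=frontier_task)

print(f"frontier: ${frontier_task:.4f} per finished task")
print(f"budget:   ${budget_task:.4f} per finished task")
```

Under these assumed inputs the budget route costs more per finished task even though its per-token price is a tenth of the frontier price. With different inputs the comparison can go the other way, which is the point: the answer depends on numbers measured at the task boundary, not the token boundary.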

The 14-Month Half-Life of Your Prompt Expert

May 16, 2026 · 9 min read

Tian Pan

Software Engineer

Every company shipping AI features in production has one or two engineers it cannot afford to lose, and most of them do not know who those engineers are until the resignation email arrives.

The person in question is rarely the loudest in the room. They are the one who remembers that the customer-support summarizer's tone got fixed by a three-line system-prompt edit after the Q2 escalation, that the eval suite added six cases the week the model provider quietly changed its default sampling, and that the judge calibration drifted the last time someone "cleaned up" the rubric. None of this is written down in a place a successor would find. It lives in one head, and roughly every two weeks a recruiter messages that head with a 25% raise attached.

The Confidence-Score Tax: Why Asking the Model How Sure It Is Costs More Than Being Wrong

May 16, 2026 · 10 min read

Tian Pan

Software Engineer

Somewhere in the evolution of every AI feature, a reviewer asks a reasonable-sounding question: "Can we have the model tell us how confident it is, so we can route the low-confidence answers to a human or a fallback?" It sounds like free insurance. You add a confidence field to the output schema, the model dutifully fills it in, and now you have a dial to turn. Ship it.

That dial is not free, and worse, it is usually not wired to anything. The confidence number is a token sequence the model is happy to produce and under no obligation to mean. Teams pay real tokens and real latency to acquire it, never check whether it correlates with correctness, and then route production traffic on it as if "0.9" were a 90% reliability estimate. It is a gauge bolted to the dashboard with nothing behind the glass.

This post is about the two costs nobody priced: the per-request tax of generating the confidence field at all, and the much larger cost of trusting an uncalibrated number to make routing decisions.
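One way to check whether the gauge is wired to anything is to compare stated confidence against observed correctness on a labeled sample. The sketch below computes expected calibration error (ECE) with equal-width bins. The function name, the bin count, and the sample data are illustrative assumptions, not part of any particular eval framework.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Weighted average gap between stated confidence and observed accuracy."""
    total = len(confidences)
    if total == 0:
        return 0.0

    # Group each (confidence, outcome) pair into an equal-width bin on [0, 1].
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# Hypothetical sample: the model says 0.9 every time but is right half the time.
stated = [0.9, 0.9, 0.9, 0.9]
graded = [True, False, True, False]
gap = expected_calibration_error(stated, graded)
```

A large gap on a held-out labeled set suggests the field should not drive routing as if it were a reliability estimate. A small gap is necessary but not sufficient, since calibration can hold on average and still fail on the slices that matter.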

The PM-Eval Translation Gap: When Ship Decisions Outrun the Vocabulary

May 16, 2026 · 8 min read

Tian Pan

Software Engineer

The go/no-go meeting for an AI feature is, on the surface, a data-driven ritual. Engineering brings a slate of eval numbers — judge score deltas, slice accuracies, regression-against-baseline percentages — and the room decides. It looks rigorous. It usually isn't.

Here is the failure mode in one sentence: the person with the literacy to weight the eval slices does not have the authority to make the call, and the person with the authority cannot read the slices. The product manager owns the launch. The engineer owns the meaning of the numbers. Between them sits a translation gap, and into that gap rushes whoever speaks most confidently in the meeting.

The tell is that "ship at 87%" and "hold at 87%" are both defensible from the same scorecard, depending on which slice you weight. When a single dataset supports opposite conclusions and the deciding factor is rhetorical confidence rather than evidence, you do not have a data-driven process. You have a debate with a spreadsheet in the background.
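The arithmetic behind that tell is short enough to sketch. The slice names, accuracies, and weightings below are hypothetical; the only claim is that one scorecard yields different headline numbers depending on whose weights are applied.

```python
from typing import Dict


def weighted_score(slice_accuracy: Dict[str, float],
                   weights: Dict[str, float]) -> float:
    """Aggregate per-slice accuracy under a given weighting."""
    total_weight = sum(weights[name] for name in slice_accuracy)
    weighted_sum = sum(slice_accuracy[name] * weights[name]
                       for name in slice_accuracy)
    return weighted_sum / total_weight


# One hypothetical scorecard.
scorecard = {
    "common_queries": 0.93,
    "long_documents": 0.81,
    "refund_disputes": 0.62,
}

# Two hypothetical ways to weight it: by traffic share, or by business risk.
by_traffic = {"common_queries": 0.70, "long_documents": 0.25, "refund_disputes": 0.05}
by_risk = {"common_queries": 0.20, "long_documents": 0.20, "refund_disputes": 0.60}

traffic_view = weighted_score(scorecard, by_traffic)
risk_view = weighted_score(scorecard, by_risk)
```

With these assumed numbers the traffic-weighted view lands well above the risk-weighted one. Writing the weights down before the meeting, and agreeing on who owns them, turns the disagreement into a question about weights rather than a contest of confidence.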

The Retry That Changed the Answer: Idempotency Keys for Nondeterministic LLM Calls

May 16, 2026 · 9 min read

Tian Pan

Software Engineer

Every distributed system you have ever built leans on one quiet assumption: a retry after a timeout is safe. The operation is idempotent, so if the client gives up waiting and re-sends, the worst case is duplicate work that converges to the same state. Two PUTs land the same row. Two DELETEs leave the same absence. The retry is a no-op dressed as a second attempt.

LLM calls break this assumption, and they break it silently. A retry does not re-fetch the same answer; it samples a new one. When the provider finishes the generation but the response is lost in transit, the client times out at the network layer, and its retry produces a second, different answer. Now two distinct outputs exist for one logical request, and nothing in your stack knows which one is canonical.

This is not a rare edge case. Practitioners running models behind timeouts report that 5–10% of requests hit the full timeout-plus-retry cycle even when the underlying call eventually succeeds. Every one of those is a coin flip your system was never designed to adjudicate.
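A minimal sketch of the idea in the title, assuming a gateway that sits between the client and the provider and survives the client's timeout. The class name, the `generate` callable, and the in-memory dictionary are all hypothetical stand-ins; a real deployment would need a durable shared store and handling for concurrent in-flight requests, which this sketch omits.

```python
import hashlib
import json
from typing import Callable, Dict


class IdempotentLLMGateway:
    """Returns the first completed answer for a logical request on every retry."""

    def __init__(self, generate: Callable[[str], str]) -> None:
        self._generate = generate            # hypothetical provider call
        self._results: Dict[str, str] = {}   # stand-in for a durable store

    @staticmethod
    def key_for(request_id: str, prompt: str) -> str:
        """Derive the idempotency key from the caller's request id and prompt."""
        payload = json.dumps({"request_id": request_id, "prompt": prompt},
                             sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def complete(self, request_id: str, prompt: str) -> str:
        key = self.key_for(request_id, prompt)
        if key in self._results:
            # A retry of a request that already finished: replay, do not resample.
            return self._results[key]
        output = self._generate(prompt)
        # The first stored answer is canonical; later writes never replace it.
        self._results.setdefault(key, output)
        return self._results[key]


# Usage with a fake nondeterministic model that answers differently each call.
calls = {"count": 0}


def fake_generate(prompt: str) -> str:
    calls["count"] += 1
    return f"answer #{calls['count']} to {prompt!r}"


gateway = IdempotentLLMGateway(fake_generate)
first = gateway.complete("req-123", "Summarize the ticket")
retry = gateway.complete("req-123", "Summarize the ticket")
```

The design choice here is that the caller supplies a stable request id for the logical request, so a retry maps to the same key and receives the stored answer instead of a fresh sample. A new request id with the same prompt is treated as a new request and sampled again.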

When 'Can the Agent Do X?' Becomes a Ship Commitment

May 15, 2026 · 10 min read

Tian Pan

Software Engineer

An engineer spends an afternoon poking at a question: can the agent reconcile a customer's invoice against their contract terms? They wire up a quick prompt, run it on five real invoices, and three come back correct. The other two are wrong in ways they don't fully characterize — they close the laptop and move on. In standup the next morning they say "yeah, invoice reconciliation basically works." A PM in the room writes it down. Two weeks later it's a line item on the Q3 roadmap. A month after that, a sales rep promises it to an enterprise account in a renewal call.

Nobody lied. Nobody made a bad decision in isolation. But the team is now contractually committed to a behavior whose eval set does not exist, whose failure modes were never written down, and whose reliability budget was set by a director who saw a demo and interpreted it as a contract. This is the most common way AI features acquire scope: not through a planning meeting, but through a capability probe that nobody ever explicitly promoted.

The industry has a name for the downstream symptom — "POC purgatory," the state where 70 to 80 percent of AI initiatives stall between a working sandbox and a shippable product. But purgatory is the wrong metaphor, because it implies the projects are stuck. They aren't stuck. They're moving — they were committed before anyone checked whether they were ready, and now the team is trying to retrofit reliability onto a promise.

The Agent Debugger Has No Breakpoints: Why Trace-First Workflows Replace Step-Through

May 14, 2026 · 10 min read

Tian Pan

Software Engineer

The first time you try to debug an agent the way you'd debug a service, you discover that the muscle memory has nothing to grip. You set a hypothetical breakpoint — there's no IDE pane to put it in, but you imagine one — at the step where the planner picked the wrong tool. You rerun with the same input. The planner picks the right tool this time. You rerun again. It picks a third tool you've never seen before. The bug is real, your colleague reproduced it twice this morning, and the debugger you've used for fifteen years is suddenly a museum piece.

The mental model that breaks here isn't "use a debugger." It's the much deeper assumption underneath: that a program, given the same inputs, produces the same execution. Every affordance in a modern debugger — breakpoints, step-over, watch expressions, conditional breaks, hot reload — is built on top of that determinism. You pause execution because pausing is meaningful. You step forward because the next step is knowable. You inspect a variable because its value is a fact, not a draw from a distribution.

The AI Accessibility Audit Nobody Runs

May 14, 2026 · 11 min read

Tian Pan

Software Engineer

Open your agent product, turn on VoiceOver, and hit send on any prompt. If you have a typical streaming UI with an inline reasoning trace, what you will hear in the next thirty seconds is not your product. It is a torrent of partial tokens, mid-word reflows, status changes nobody announced, and a reasoning monologue your sighted users opted into but your blind users cannot escape. The interface that demoed beautifully on stage is, to a screen reader, a denial-of-service attack delivered as speech.

This is the audit nobody on the AI team runs. The design review approved the streaming animation. The eval suite measured answer quality. The latency dashboard tracked time-to-first-token. None of those instruments noticed that the affordance making the product feel fast and thoughtful for one cohort makes it unusable for another. And that omission is starting to show up in pro-se lawsuit filings — the same federal courts that have been processing accessibility complaints against e-commerce sites for a decade are now seeing AI-interface complaints rise sharply, with one tracker reporting a 40% year-over-year increase in 2025 alone.

The AI Feature Sunset Playbook Nobody Writes

May 14, 2026 · 13 min read

Tian Pan

Software Engineer

Every AI org has a graveyard. Not of services — those get a runbook, a deprecation banner, a 30-day migration window, and a slot on the platform team's quarterly roadmap. The graveyard is of features: the smart-summary beta that never graduated, the auto-categorizer that two enterprise customers actually built workflows around, the agentic flow that demoed beautifully and shipped behind a flag that nobody flipped off. The endpoint is easy to deprecate. The four other things attached to it — the prompt, the judge, the regression set, and the incident memory — are what actually take a quarter, and nobody on the team has written the playbook because nobody has been promoted for retiring something.

This is the gap. Most of the public discourse on "model deprecation" is about vendor-side retirements: GPT-4o leaves on a date, Assistants API beta sunsets on August 26, DALL-E 3 retires on May 12, and your platform team has a notification period to migrate. That problem has playbooks because vendors publish dates, because the migration is forced, and because the work fits in a sprint. The internal version — when you decide a feature you built didn't graduate, and you have to actually take it out — has none of those forcing functions. The deprecation date is whatever you say it is. The migration path is whatever you build. And the artifacts you have to retire are not a single endpoint but a tangled stack of model-adjacent assets that your monitoring barely knows exist.

The AI Told Me So Defense: When Code Review Quietly Stops Pushing Back

May 14, 2026 · 11 min read

Tian Pan

Software Engineer

The single most expensive sentence in a 2026 code review thread is "the agent wrote it this way." Not because it's wrong — sometimes it isn't — but because it ends a conversation that used to start one. The reviewer types a question, the author quotes the model's reasoning back at them, and the thread resolves before anyone has actually argued about the change. The social cost of disagreeing with a confident, well-spoken model has quietly become higher than the cost of merging a subtle bug, and most teams won't see the trade in their metrics for another two quarters.

This is not a story about whether AI writes good code. It writes code, some of it good. This is a story about what happens to a quality gate when the friction at composition time collapses. Review velocity rises, defect rate rises in lockstep, and the correlation isn't obvious because nobody is tracking review-time-to-defect with the author class attached. The senior engineer who used to be the gravity well of taste in the codebase becomes the lone holdout in a culture quietly recalibrating around model deference.

The Composability Tax: Why Adding Tools Makes Your Planner Worse

May 14, 2026 · 9 min read

Tian Pan

Software Engineer

The team starts with five tools and a planner that hits the right one 95% of the time on production traffic. Eighteen months later they have fifty-one, the planner is sitting at 26%, and the simple cases the original five handled cleanly — book a meeting, look up a customer, file a ticket — now sometimes route to the wrong tool because there are three plausible-sounding lookalikes in the catalog. Nobody decided to make the planner worse. Every tool addition was individually defensible. The cumulative bill is the composability tax, and it is paid by every product whose tool catalog grows without a retirement discipline.

The tax is a curve, not a cliff. The Berkeley Function Calling Leaderboard measured it directly: on calendar scheduling, accuracy fell from 43% with four tools to 2% with fifty-one across multiple domains. On customer-support style tasks, GPT-4o dropped from 58% (single domain, nine tools) to 26% (seven domains, fifty-one tools). Llama-3.3-70B went from 21% to 0% over the same expansion. The shape repeats across models and task types: every additional tool moves the planner down the curve, and the marginal damage gets worse as the catalog gets larger because new entries are increasingly indistinguishable from incumbents.

The Customer-Facing AI Postmortem When Nothing Crashed

May 14, 2026 · 12 min read

Tian Pan

Software Engineer

Your status page is green. Your error rate is zero. Your uptime dashboard reads 100% for the seventh consecutive month. And yet at 9:14 AM on a Tuesday, your account team is forwarding you a message from a Fortune 500 customer that says, "Our team noticed the assistant has been worse this week. Can you tell us what changed?" Twelve more like it land before lunch. None of them will be answered by the existing incident-comms playbook, because that playbook was built for outages, and nothing has crashed.

This is the customer-facing AI postmortem problem, and it is the single most consistent gap I see across teams shipping LLM features into enterprise contracts. The reliability surface has shifted from "is it up" to "is it as good as it was last week," and almost none of the comms infrastructure has caught up. Status pages don't have a tile for it. Severity rubrics don't grade it. Support macros default to "we identified an issue and resolved it," which reads as either dismissive or alarming depending on the customer's mood that day.
