Skip to main content

861 posts tagged with "insider"

View all tags

Warm Pools and Cold Truths: The Hidden Latency Floor of Serverless LLM Inference

· 9 min read
Tian Pan
Software Engineer

Autoscaling your GPU inference to zero looks like obvious cost discipline. The GPU is the most expensive line item on the bill, traffic is bursty, and the idle hours are pure waste. So you turn on scale-to-zero, watch the cloud invoice drop, and congratulate yourself.

Then a user shows up after a quiet stretch, and their first request takes sixty seconds to return a single token. Production deployments running serverless LLM inference routinely report cold starts exceeding 40 seconds before the first token appears — against roughly 30 milliseconds per token once the model is warm. That is a thousand-fold latency gap between the cold path and the warm path, and it is entirely a function of how idle your traffic happens to be.

This is the trade nobody puts on the slide. Scale-to-zero does not eliminate cost; it converts a steady dollar cost into a spiky latency cost, and then hides that latency cost in the p99 tail where the dashboard rarely looks.

When No One Answers the Escalation: Human-in-the-Loop Is a Staffing Problem

· 10 min read
Tian Pan
Software Engineer

Every agent architecture diagram has a box labeled "escalate to human." It is drawn with a clean arrow, it satisfies the reviewer, and it makes the system feel safe. What the diagram never shows is the person on the other end of that arrow — whether they exist, whether they are awake, and whether they will answer before the agent's patience runs out.

Human-in-the-loop is sold as a design pattern. In production it behaves like a staffing problem. The pattern assumes a human is standing by; the staffing reality is that escalations do not arrive when humans are available — they arrive on their own schedule. A burst at 2am when an overnight batch job trips a guardrail. A long tail through lunch when half the reviewers are away from their desks. A steady drip that quietly outgrows the two-person team that looked sufficient during the demo, when the agent handled ten requests a day instead of ten thousand.

The gap between "we have an escalation path" and "escalations get answered" is where agentic systems fail in ways no eval catches. The eval measures whether the agent escalates correctly. It never measures whether anyone was there.

Who Gets Paged When the Agent Is Wrong: On-Call for Non-Deterministic Systems

· 9 min read
Tian Pan
Software Engineer

The on-call rotation was built around a promise: failures reproduce. An alert fires, you re-run the request, you watch the bug happen, you find the bad commit, you roll back the deploy. Every part of that loop assumes determinism. The same input produces the same output, and the output is either right or wrong in a way you can stare at.

An agent fleet quietly breaks every link in that chain. The failure happened once, at a sampling temperature you can't replay, on a context window that has since been garbage-collected. There is no bad commit, because the code never changed — the model did, or the retrieved documents did, or the user phrased the request in a way nobody anticipated. You roll back the deploy and the deploy was never the problem.

So the page goes out, an engineer picks it up, and they discover the most uncomfortable fact about operating agents in production: they have been handed a system they cannot single-step, and the runbook in front of them was written for a different kind of machine.

The Confidence-Score Tax: Why Asking the Model How Sure It Is Costs More Than Being Wrong

· 10 min read
Tian Pan
Software Engineer

Somewhere in the evolution of every AI feature, a reviewer asks a reasonable-sounding question: "Can we have the model tell us how confident it is, so we can route the low-confidence answers to a human or a fallback?" It sounds like free insurance. You add a confidence field to the output schema, the model dutifully fills it in, and now you have a dial to turn. Ship it.

That dial is not free, and worse, it is usually not wired to anything. The confidence number is a token sequence the model is happy to produce and under no obligation to mean. Teams pay real tokens and real latency to acquire it, never check whether it correlates with correctness, and then route production traffic on it as if "0.9" were a 90% reliability estimate. It is a gauge bolted to the dashboard with nothing behind the glass.

This post is about the two costs nobody priced: the per-request tax of generating the confidence field at all, and the much larger cost of trusting an uncalibrated number to make routing decisions.

Deleting an Eval Case Is a Decision, Not Cleanup

· 10 min read
Tian Pan
Software Engineer

Every eval suite eventually gets pruned. Someone notices the suite takes nine minutes to run, costs $40 a pass, and is full of cases nobody remembers writing. They open a PR titled "clean up stale eval cases," delete forty entries that "don't seem relevant anymore," and the CI run drops to four minutes. The PR gets a thumbs-up. Nobody objects, because deleting tests looks like maintenance.

It is not maintenance. Every eval case is a guarantee the team made to itself: this failure mode will not recur silently. Deleting the case retires the guarantee. The pass rate does not change, the dashboard stays green, and the only thing that disappears is the team's memory that the guarantee ever existed. Six months later a model migration reintroduces exactly the regression a deleted case was guarding, the postmortem rediscovers a lesson the team already paid for once, and someone writes "we should add a test for this" — the test that was deleted in the cleanup PR.

The Retry That Changed the Answer: Idempotency Keys for Nondeterministic LLM Calls

· 9 min read
Tian Pan
Software Engineer

Every distributed system you have ever built leans on one quiet assumption: a retry after a timeout is safe. The operation is idempotent, so if the client gives up waiting and re-sends, the worst case is duplicate work that converges to the same state. Two PUTs land the same row. Two DELETEs leave the same absence. The retry is a no-op dressed as a second attempt.

LLM calls break this assumption, and they break it silently. A retry does not re-fetch the same answer — it samples a new one. When a client times out at the network layer because the response was lost in transit, but the provider actually finished the generation, the retry produces a second, different answer. Now two distinct outputs exist for one logical request, and nothing in your stack knows which one is canonical.

This is not a rare edge. Practitioners running models behind timeouts report that 5–10% of requests hit the full timeout-plus-retry cycle even when the underlying call eventually succeeds. Every one of those is a coin flip your system was never designed to adjudicate.

When 'Can the Agent Do X?' Becomes a Ship Commitment

· 10 min read
Tian Pan
Software Engineer

An engineer spends an afternoon poking at a question: can the agent reconcile a customer's invoice against their contract terms? They wire up a quick prompt, run it on five real invoices, and three come back correct. The other two are wrong in ways they don't fully characterize — they close the laptop and move on. In standup the next morning they say "yeah, invoice reconciliation basically works." A PM in the room writes it down. Two weeks later it's a line item on the Q3 roadmap. A month after that, a sales rep promises it to an enterprise account in a renewal call.

Nobody lied. Nobody made a bad decision in isolation. But the team is now contractually committed to a behavior whose eval set does not exist, whose failure modes were never written down, and whose reliability budget was set by a director who saw a demo and interpreted it as a contract. This is the most common way AI features acquire scope: not through a planning meeting, but through a capability probe that nobody ever explicitly promoted.

The industry has a name for the downstream symptom — "POC purgatory," the state where 70 to 80 percent of AI initiatives stall between a working sandbox and a shippable product. But purgatory is the wrong metaphor, because it implies the projects are stuck. They aren't stuck. They're moving — they were committed before anyone checked whether they were ready, and now the team is trying to retrofit reliability onto a promise.

The Agent Debugger Has No Breakpoints: Why Trace-First Workflows Replace Step-Through

· 10 min read
Tian Pan
Software Engineer

The first time you try to debug an agent the way you'd debug a service, you discover that the muscle memory has nothing to grip. You set a hypothetical breakpoint — there's no IDE pane to put it in, but you imagine one — at the step where the planner picked the wrong tool. You rerun with the same input. The planner picks the right tool this time. You rerun again. It picks a third tool you've never seen before. The bug is real, your colleague reproduced it twice this morning, and the debugger you've used for fifteen years is suddenly a museum piece.

The mental model that breaks here isn't "use a debugger." It's the much deeper assumption underneath: that a program, given the same inputs, produces the same execution. Every affordance in a modern debugger — breakpoints, step-over, watch expressions, conditional breaks, hot reload — is built on top of that determinism. You pause execution because pausing is meaningful. You step forward because the next step is knowable. You inspect a variable because its value is a fact, not a draw from a distribution.

Bring-Your-Own-Key for AI Features: The Sales-Driven Re-Architecture Nobody Costed

· 10 min read
Tian Pan
Software Engineer

The procurement team you're selling to will eventually ask the one question that resets your architecture: "Can we bring our own model API key?" Saying yes wins the deal. Saying yes also moves your trust boundary, your cost boundary, and your operational boundary at the same time — and most product teams discover this only after the contract is signed and the first month of usage produces a support ticket nobody knows how to answer.

BYOK is sold internally as a toggle. The customer pastes a key, your code reads it from the vault instead of from your own account, and inference flows the same way it always did. It is not a toggle. It is a sales-driven re-architecture that ripples through cost attribution, security incident response, observability, rate limiting, model-version pinning, and on-call accountability. The teams that ship it without acknowledging this end up rebuilding their entire platform layer a year later while a paying enterprise customer waits for fixes.

The Composability Tax: Why Adding Tools Makes Your Planner Worse

· 9 min read
Tian Pan
Software Engineer

The team starts with five tools and a planner that hits the right one 95% of the time on production traffic. Eighteen months later they have fifty-one, the planner is sitting at 26%, and the simple cases the original five handled cleanly — book a meeting, look up a customer, file a ticket — now sometimes route to the wrong tool because there are three plausible-sounding lookalikes in the catalog. Nobody decided to make the planner worse. Every tool addition was individually defensible. The cumulative bill is the composability tax, and it is paid by every product whose tool catalog grows without a retirement discipline.

The tax is a curve, not a cliff. The Berkeley Function Calling Leaderboard measured it directly: on calendar scheduling, accuracy fell from 43% with four tools to 2% with fifty-one across multiple domains. On customer-support style tasks, GPT-4o dropped from 58% (single domain, nine tools) to 26% (seven domains, fifty-one tools). Llama-3.3-70B went from 21% to 0% over the same expansion. The shape repeats across models and task types: every additional tool moves the planner down the curve, and the marginal damage gets worse as the catalog gets larger because new entries are increasingly indistinguishable from incumbents.

The Customer-Facing AI Postmortem When Nothing Crashed

· 12 min read
Tian Pan
Software Engineer

Your status page is green. Your error rate is zero. Your uptime dashboard reads 100% for the seventh consecutive month. And yet at 9:14 AM on a Tuesday, your account team is forwarding you a message from a Fortune 500 customer that says, "Our team noticed the assistant has been worse this week. Can you tell us what changed?" Twelve more like it land before lunch. None of them will be answered by the existing incident-comms playbook, because that playbook was built for outages, and nothing has crashed.

This is the customer-facing AI postmortem problem, and it is the single most consistent gap I see across teams shipping LLM features into enterprise contracts. The reliability surface has shifted from "is it up" to "is it as good as it was last week," and almost none of the comms infrastructure has caught up. Status pages don't have a tile for it. Severity rubrics don't grade it. Support macros default to "we identified an issue and resolved it," which reads as either dismissive or alarming depending on the customer's mood that day.

The First 90 Days for an AI Engineer: An Onboarding Playbook That Survives the Six-Week Doc Rot

· 12 min read
Tian Pan
Software Engineer

The new hire opens the onboarding doc. It points at a service architecture diagram from eleven months ago, a Confluence page titled "Our LLM Stack" last edited in October, and a Notion table of "model providers we use." Nothing in any of these documents tells them which prompt was tuned against which failure mode, which eval cases were added after which incident, which judge was recalibrated when the model bumped from 4.5 to 4.6, or why the system prompt for the support agent has a strange three-line preamble nobody wants to touch. Two weeks in, they ship a "small prompt cleanup" PR that removes the preamble. The eval suite passes. Production accuracy drops four points within a day.

The standard new-hire onboarding playbook — read the architecture doc, set up your laptop, do your first PR by week two — was built for engineers who join services. AI engineers join a different artifact. The thing they're going to be editing isn't a 5,000-line Go service that some staff engineer wrote; it's a 30-line prompt that survived eleven incidents and seventeen eval-driven rewrites, and the meaning of those thirty lines lives in the heads of two people on the team. Your onboarding doc cannot capture that, and trying to write a longer doc is the wrong fix.