Skip to main content

639 posts tagged with "llm"

View all tags

Abstain or Escalate: The Two-Threshold Problem in Confidence-Gated AI

· 13 min read
Tian Pan
Software Engineer

Most production AI features ship with a single confidence threshold. Above the line, the model answers. Below it, the user gets a flat "I'm not sure." That single number is doing two completely different jobs at once, and it's why your trust metric has been sliding for two quarters even though your accuracy on answered queries looks fine.

The right design has at least two cutoffs. An abstain threshold sits low: below it, the model declines because no answer is worth more than silence. An escalate threshold sits in the middle: between the two cutoffs, the system hands the case to a human reviewer instead of dropping it on the floor. Collapse them into a single dial and you ship a product that feels equally useless when it's wrong and when it's uncertain — which is the worst possible position to occupy in a market where users have a free alternative one tab away.

This isn't a new idea. The reject-option classifier literature has been arguing for split thresholds since the 1970s, distinguishing ambiguity rejects (the input is between known classes) from distance rejects (the input is far from any training data). Production AI teams keep rediscovering the same lesson the hard way, usually about six months after their first launch, when the support queue is full of people typing "is this thing broken or what."

The Vendor-Portability Tax: Why 'We Can Swap Models' Is a Quarterly Cost Line, Not a Checkbox

· 11 min read
Tian Pan
Software Engineer

Every team I have audited in the last six months claims to be vendor-agnostic. None of them are. The system prompt that scored highest on the eval suite did so because it leaned into a single vendor's tokenizer behavior, JSON-mode contract, refusal cadence, and stop-sequence handling — and the team that wrote it could not name which of those biases were doing the work. When the CFO asks why the cheaper model on the procurement deck cannot just be dropped in, the honest answer is two engineer-quarters of prompt re-tuning and a complete re-baseline of every eval. That is not a checkbox. It is a quarterly cost line.

The mental model that keeps biting teams is treating vendor portability as a one-time architecture decision. You add an abstraction layer, you write a model: field in your config, you congratulate yourself, and you move on. Then a year later the vendor raises prices, ships a deprecation notice, or has a bad week of refusals on a category you care about, and you discover that the abstraction was a thin wrapper around a prompt that only works on one model. The portability you bought was syntactic. The portability you needed was behavioral, and behavioral portability decays the moment you stop paying for it.

Your Model Update Is a Breaking Change: The Behavioral Changelog You Owe Your Integrators

· 12 min read
Tian Pan
Software Engineer

A vendor pushes a "minor refresh" to a model alias on a Tuesday afternoon. By Thursday, four customer companies are running incident response. None of them deployed code that week. None of their dashboards show a regression in latency, error rate, or any other infra-shaped metric. What changed is that the model behind their pinned alias started returning slightly different sentences, slightly different JSON, and slightly different refusals — and every prompt their team wrote against the old behavior is now a contract that nobody honored.

The asymmetry is the entire story. The provider treated the rollout as a deploy: tested internally, gated on a few aggregate evals, ramped to 100% within a maintenance window. The consumer surface received it as a semver violation: a dependency upgraded itself in production without changing its version string, and the bug reports started rolling in from end users with the cheerful subject line "nothing changed on our side."

The Inference Budget Committee: Governance When Token Spend Crosses Seven Figures

· 12 min read
Tian Pan
Software Engineer

At $50,000 a month, the "compute + tokens" line on your infra bill is rounding error. At $5,000,000 a month, it is a CFO question. The transition between those two states is not gradual — it is a phase change in how an organization talks about model spend, and most engineering orgs are unprepared for the social and political work that follows. The bill stays a single line; the conversation around it does not.

What changes is who has standing to ask "why." When three product teams share one API key and one capacity reservation, every quota argument has the same structure: someone is currently winning at the expense of someone else, and there is no neutral party to call it. The first time a team's launch is throttled because another team shipped a chatty agent, the absence of a governance body is felt by the entire engineering org at once. Calling a meeting and inventing a process under pressure is the worst time to design one.

The Local-Maximum Trap in Prompt Iteration: How to Tell You're Tweaking the Wrong Thing

· 10 min read
Tian Pan
Software Engineer

There is a moment, six weeks into a serious LLM project, where the prompt iteration log starts to look like a therapy journal. Each tweak swaps one failure mode for another. Add a stricter "do not" clause and the model becomes evasive on cases it used to handle. Soften the tone and a different category of hallucination returns. The eval scoreboard hovers in a band three or four points wide, refusing to break out. Someone says, "let me try one more reordering," and another half day evaporates.

This is the local-maximum trap. The team is climbing a hill, but the hill does not go higher. The cruel part is that the hill is real — every prompt change does produce a measurable delta on some subset of cases, which is exactly the signal that keeps everyone tweaking. What's missing is the recognition that the ceiling above is not a prompt ceiling at all.

Sovereignty Collapse: Logging Where Your Prompt Actually Went

· 9 min read
Tian Pan
Software Engineer

A regulator asks a simple question. "For this specific user prompt, submitted at 14:32 UTC last Tuesday, prove which jurisdictions the request and its derived state passed through."

Your application logs say model=claude-sonnet-4-5, region=eu-west-1, latency=2.1s. Your gateway logs say the same. Your provider's invoice confirms the request happened. None of these answer the question. The request entered an EU-hosted gateway, was forwarded to a US-region primary endpoint that failed over to Singapore during a regional incident, and warmed a KV cache on a third-party GPU pool whose residency claims live in a vendor footnote. The audit trail you needed lives at a layer your team does not own.

This is sovereignty collapse: the gap between what your contracts promise about data location and what your runtime can actually prove after the fact. The compliance claim is only as strong as the weakest log line in the chain.

The Query Rewriting Layer Your RAG Pipeline Skipped

· 10 min read
Tian Pan
Software Engineer

When a RAG system answers wrong, the first instinct on most teams is to blame the encoder. Swap to a bigger embedding model. Try a domain-tuned one. Bump the dimension count. Three sprints later the recall curve has nudged a few points and the user complaints look the same.

The diagnosis was wrong. Most retrieval failures aren't embedding failures. They're query-shape failures — and no amount of vector tuning fixes a vocabulary mismatch that exists before the encoder ever runs.

A user types "how do I cancel." The relevant document is titled "Subscription Lifecycle Management" and uses words like "termination," "billing cycle close," and "service deactivation." There is no encoder in the world that pulls those two strings into the same neighborhood by lexical luck. The cosine similarity gap is real, and it lives in the input, not the model. The query rewriting layer that goes ahead of retrieval is the thing most pipelines skip and then spend a quarter trying to compensate for downstream.

Trace Sampling for Agents: Which of 10 Million Daily Spans Are Worth Keeping

· 11 min read
Tian Pan
Software Engineer

A web service request produces five spans on a busy day. A modern agent session produces fifty, sometimes a thousand if the planner decides to recurse. The uniform 1% sampler your platform team copy-pasted from the microservices era will, by definition, drop the rare failure you actually care about — because the failure is rare, and uniform sampling has no opinion about rarity.

The honest version of "we have full observability on our agents" sounds different than the marketing version. It sounds like: we keep the traces that matter, drop the ones that don't, and we know in advance which is which. Every word in that sentence is load-bearing, and the platform teams that ignored sampling design until the bill arrived are now learning the discipline backwards — under cost pressure, after a quarter of incidents that were "in the data" but evicted before anyone looked.

The Deadlock Your Agent Can't See: Circular Tool Dependencies in Generated Plans

· 11 min read
Tian Pan
Software Engineer

A planner agent emits seven steps. Each looks reasonable. The orchestrator dispatches them, the first three return values, the fourth waits on the fifth, the fifth waits on the seventh, and the seventh — buried three lines deep in the planner's prose — quietly waits on the fourth. Nothing is locked. No EDEADLK ever fires. The agent burns 40,000 tokens reasoning about why the fourth step "is taking longer than expected" and ultimately gives up with a soft, plausible apology to the user.

This is the deadlock your agent can't see. It is not the textbook deadlock from operating systems class — there are no mutexes, no resource graphs the kernel can introspect, no holders or waiters anyone in your stack would recognize. The dependencies live inside English sentences that the planner produced, the cycles form in latent semantics rather than in any data structure, and the failure mode looks indistinguishable from "the model is thinking hard." Classic deadlock detection is useless here, but the cost is identical: the workflow halts, tokens evaporate, and your trace tells you nothing.

Cold-Start Evaluation: How to Ship an AI Feature With Zero Production Traces

· 10 min read
Tian Pan
Software Engineer

Every AI feature launch has the same quiet moment before the first user sees it: someone on the team asks "how do we know this is good?" and the honest answer is "we don't, yet." You have no traces because you have no users. You have no users because you haven't shipped. The loop is real, and the two failure modes it produces are both fatal — ship blind and let the first week of escalations be your eval dataset, or wait for "real data" and watch the roadmap slide for a quarter while a competitor publishes a demo.

The way out is not to pretend cold-start evaluation is the same problem as post-launch evaluation with a smaller sample size. It isn't. You are not sampling a distribution; you are constructing a prior. Every day-1 signal is an artifact of a choice you made about what to measure, whose behavior to simulate, and which failures to care about. Teams that ship AI features well treat the pre-launch eval stack as a first-class deliverable — not a spreadsheet hacked together the night before the gate review, but a layered system of dogfooding, simulation, expert annotation, and adversarial probes, each contributing a different kind of signal and each weighted with an explicit story about what it can and cannot tell you.

Conversation History Is a Liability Your Prompt Never Admits

· 10 min read
Tian Pan
Software Engineer

Read your product's analytics the next time a user says "the AI got dumber today." Filter to sessions over twenty turns. You will find the same U-shape every time: early turns score well, middle turns score well, late turns fall off a cliff. The prompt hasn't changed. The model hasn't changed. What changed is that every one of those late turns is carrying a payload of user typos, false starts, model hedges, corrections that were later reversed, tool outputs nobody re-read, and the fossilized remains of a goal that the user abandoned on turn four. Your prompt template treats this sediment as signal. The model does too. It shouldn't.

Chat history is not free context. It is a liability you are paying to re-send on every turn, and the dirtier it gets, the more it corrupts the answer you are billing the user for. The chat metaphor is the source of the confusion. Chat interfaces habituate users and engineers to treat the transcript as sacred — scrollable, append-only, never reset. That habit is imported wholesale into LLM applications even though it has no physical basis in how models process context. The model is stateless. The transcript is just a string you chose to grow. You can shrink it. You often should.

Cost Per Feature, Not Cost Per Token: The Allocation Gap in AI Budgets

· 10 min read
Tian Pan
Software Engineer

Your finance team can tell you, to the dollar, what you spent on Anthropic and OpenAI last month. Your product team can tell you which features users touched the most. Nobody in the building can tell you whether Draft-Email is profitable, whether Summarize-Thread should stay in the free tier, or whether the new Rewrite-Tone feature is eating Draft-Email's lunch on a per-user basis. You have two dashboards that claim to track the same dollars and neither answers the question that actually drives product decisions.

This is the allocation gap. You measure token spend per endpoint because that is what the provider API gives you. But /chat serves twelve features that happen to share a prompt template, and "per endpoint" collapses all twelve into one line item. Pricing tiers, feature gating, deprecation calls, and the "do we ship this?" conversation all float on gut feel until someone does the plumbing to route token costs back to the features that incurred them.

The plumbing is not glamorous. It is request-level tagging, trace-to-telemetry joins, and a disciplined refusal to ship an AI feature without its own cost label. Teams that treat this as infrastructure investment end up with per-feature margin reports segmented by user cohort. Teams that defer it to next quarter end up making pricing decisions from vibes for eighteen months and discovering, after the fact, that a single customer segment was responsible for half the inference bill at negative margins.