702 posts tagged with "llm"

Cold-Start Evaluation: How to Ship an AI Feature With Zero Production Traces

· 10 min read
Tian Pan
Software Engineer

Every AI feature launch has the same quiet moment before the first user sees it: someone on the team asks "how do we know this is good?" and the honest answer is "we don't, yet." You have no traces because you have no users. You have no users because you haven't shipped. The loop is real, and the two failure modes it produces are both fatal — ship blind and let the first week of escalations be your eval dataset, or wait for "real data" and watch the roadmap slide for a quarter while a competitor publishes a demo.

The way out is not to pretend cold-start evaluation is the same problem as post-launch evaluation with a smaller sample size. It isn't. You are not sampling a distribution; you are constructing a prior. Every day-1 signal is an artifact of a choice you made about what to measure, whose behavior to simulate, and which failures to care about. Teams that ship AI features well treat the pre-launch eval stack as a first-class deliverable — not a spreadsheet hacked together the night before the gate review, but a layered system of dogfooding, simulation, expert annotation, and adversarial probes, each contributing a different kind of signal and each weighted with an explicit story about what it can and cannot tell you.

Conversation History Is a Liability Your Prompt Never Admits

· 10 min read
Tian Pan
Software Engineer

Read your product's analytics the next time a user says "the AI got dumber today." Filter to sessions over twenty turns. You will find the same shape every time: early turns score well, middle turns score well, late turns fall off a cliff. The prompt hasn't changed. The model hasn't changed. What changed is that every one of those late turns is carrying a payload of user typos, false starts, model hedges, corrections that were later reversed, tool outputs nobody re-read, and the fossilized remains of a goal that the user abandoned on turn four. Your prompt template treats this sediment as signal. The model does too. It shouldn't.

Chat history is not free context. It is a liability you are paying to re-send on every turn, and the dirtier it gets, the more it corrupts the answer you are billing the user for. The chat metaphor is the source of the confusion. Chat interfaces habituate users and engineers to treat the transcript as sacred — scrollable, append-only, never reset. That habit is imported wholesale into LLM applications even though it has no physical basis in how models process context. The model is stateless. The transcript is just a string you chose to grow. You can shrink it. You often should.
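As a rough illustration of "you can shrink it", here is a minimal sketch of a transcript trimmer. The `Turn` shape, the `trim_history` name, and the default window of six turns are assumptions made for this example, not something the post prescribes; a real system might summarize old turns rather than drop them.

```python
from typing import TypedDict


class Turn(TypedDict):
    role: str      # "system", "user", or "assistant" (assumed roles)
    content: str


def trim_history(turns: list[Turn], keep_last: int = 6) -> list[Turn]:
    """Keep every system turn plus the most recent `keep_last` other turns.

    System turns are moved to the front; everything older than the window
    is dropped rather than summarized.
    """
    system = [t for t in turns if t["role"] == "system"]
    rest = [t for t in turns if t["role"] != "system"]
    if keep_last <= 0:
        return system
    return system + rest[-keep_last:]
```

Because the model is stateless, the trimmed list is simply what gets sent on the next request; nothing on the provider side needs to be reset.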

Cost Per Feature, Not Cost Per Token: The Allocation Gap in AI Budgets

· 10 min read
Tian Pan
Software Engineer

Your finance team can tell you, to the dollar, what you spent on Anthropic and OpenAI last month. Your product team can tell you which features users touched the most. Nobody in the building can tell you whether Draft-Email is profitable, whether Summarize-Thread should stay in the free tier, or whether the new Rewrite-Tone feature is eating Draft-Email's lunch on a per-user basis. You have two dashboards that claim to track the same dollars and neither answers the question that actually drives product decisions.

This is the allocation gap. You measure token spend per endpoint because that is what the provider API gives you. But /chat serves twelve features that happen to share a prompt template, and "per endpoint" collapses all twelve into one line item. Pricing tiers, feature gating, deprecation calls, and the "do we ship this?" conversation all float on gut feel until someone does the plumbing to route token costs back to the features that incurred them.

The plumbing is not glamorous. It is request-level tagging, trace-to-telemetry joins, and a disciplined refusal to ship an AI feature without its own cost label. Teams that treat this as infrastructure investment end up with per-feature margin reports segmented by user cohort. Teams that defer it to next quarter end up making pricing decisions from vibes for eighteen months and discovering, after the fact, that a single customer segment was responsible for half the inference bill at negative margins.
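A minimal sketch of what request-level tagging buys you, assuming each logged call already carries a feature label. The `LLMCall` record, the feature names, and the per-million-token prices are hypothetical placeholders, not real provider pricing.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LLMCall:
    feature: str        # label attached at request time, e.g. "draft_email"
    input_tokens: int
    output_tokens: int


# Hypothetical prices in USD per million tokens; substitute your provider's.
PRICE_IN_PER_M = 3.00
PRICE_OUT_PER_M = 15.00


def cost_by_feature(calls: list[LLMCall]) -> dict[str, float]:
    """Sum token cost per feature label instead of per endpoint."""
    totals: dict[str, float] = defaultdict(float)
    for c in calls:
        usd = (c.input_tokens * PRICE_IN_PER_M
               + c.output_tokens * PRICE_OUT_PER_M) / 1_000_000
        totals[c.feature] += usd
    return dict(totals)
```

The aggregation is trivial once the label exists; the hard part is refusing to let a call reach the provider without one.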

The 'We'll Add Evals Later' Trap: How Measurement Debt Compounds

· 9 min read
Tian Pan
Software Engineer

Every team that ships an AI feature without evals tells itself the same story: we'll add measurement later, after we find product-market fit, after the prompt stabilizes, after the next release. Six months later, the prompt has been touched by four engineers and two product managers, the behavior is load-bearing for three customer integrations, and the team discovers that "adding evals later" means reconstructing intent from production logs they never structured for that purpose. The quarter that was supposed to be new features becomes a quarter of archaeology.

This isn't a planning mistake. It's a compounding one. The team that skipped evals to ship faster is the same team that will spend twelve weeks rebuilding eval infrastructure from incomplete traces, disagreeing about what "correct" meant in February, and quietly removing features nobody can prove still work. The cost of catching up exceeds the cost of building in — not by a little, but by a multiplier that grows with every prompt edit that shipped without a regression check.

The Hallucinated Success Problem: When Your Agent Says Done and Means Nothing

· 9 min read
Tian Pan
Software Engineer

The most dangerous failure in agent systems is not the loud one. It is the agent that confidently declares "Task complete" and returns a polished summary of work it never did. The file was never written. The webhook never fired. The database row is still the way it was an hour ago. But the trace is green, the completion counter ticks up, and the dashboard tells leadership the new feature is working.

This is the hallucinated success problem, and it is the single hardest bug class to catch in production because it evades every cheap signal you have. The agent did not crash. It did not time out. It did not return an error. It narrated a plausible, coherent, and completely fabricated account of a successful execution. Your observability stack was built to catch noisy failures. Silent success looks identical to real success until a user notices the output is wrong.
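One way to make silent success visible is to check the claimed side effect directly instead of trusting the agent's narration. The sketch below assumes the task was "write a file"; `AgentResult` and `verified_done` are hypothetical names, and a real check would be specific to each tool (query the row, confirm the webhook delivery, and so on).

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class AgentResult:
    claimed_done: bool   # what the agent said
    output_path: str     # where it says it wrote the file


def verified_done(result: AgentResult) -> bool:
    """Count a task as complete only if the claimed file exists and is non-empty."""
    if not result.claimed_done:
        return False
    path = Path(result.output_path)
    return path.is_file() and path.stat().st_size > 0
```

Feeding `verified_done` rather than the agent's own status into the completion counter keeps a fabricated "Task complete" from turning the dashboard green.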

Your P99 Is Following a Stranger's Traffic: The Noisy-Neighbor Tax in Hosted LLM Inference

· 10 min read
Tian Pan
Software Engineer

Your dashboards are clean. The deployment from yesterday rolled back cleanly. The model version is pinned. The prompt didn't change. But your TTFT p99 just doubled, your customer success channel is on fire, and the only honest answer you can give is "it's the provider." That answer feels small — like a shrug — and it usually leads to a follow-up question that nobody on your team can answer: prove it.

This is the part of hosted LLM inference that the marketing pages do not discuss. When you call a frontier model API, you are sharing a GPU, a PCIe fabric, a continuous batch, and a KV-cache budget with workloads you cannot see. Your p99 is a function of their bursts. The economics of large-scale inference depend on multiplexing tenants tightly enough that hardware utilization stays north of 60-70%, which means your tail latency is structurally coupled to the largest, jankiest, lumpiest tenant on the same shard. You are not buying capacity; you are buying a slice of a queue that someone else is also standing in.

Your LLM Span Is Lying: What APM Tools Don't Show About Inference Latency

· 8 min read
Tian Pan
Software Engineer

Your LLM call took 2,340 ms. Your APM span says so. That number is the most expensive lie in your observability stack, because four completely different failure modes all render as the same opaque purple bar. A prefill surge on a long prompt. A cold KV-cache on a tenant you haven't hit in an hour. A noisy neighbor in the provider's continuous batch. A silent routing change that parked your traffic in a different region. Same span. Same duration. Same p99 alert. Four different post-mortems.

The distributed-tracing discipline that worked for microservices — one span per network hop, a duration, a few tags — does not survive contact with hosted inference. An LLM call is not one thing. It's a pipeline of phases with radically different scaling characteristics, running on shared hardware whose behavior depends on who else is in the queue. Treating that as a single opaque span is how you end up spending three days debugging "the model got slow" when the model didn't move at all.
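A minimal sketch of splitting one opaque duration into phases, assuming the response arrives as an iterable of text chunks. `timed_stream` and the timing keys are names invented for this example. The clock here starts when iteration begins, so a real client would want to start it when the request is actually sent.

```python
import time
from typing import Iterable, Iterator


def timed_stream(chunks: Iterable[str], timings: dict[str, float]) -> Iterator[str]:
    """Yield chunks unchanged while recording time-to-first-token and decode time.

    `timings` is filled in as the stream is consumed: "ttft_s" once the first
    chunk arrives, "total_s" and "decode_s" once the stream is exhausted.
    """
    start = time.perf_counter()
    first = None
    for chunk in chunks:
        if first is None:
            first = time.perf_counter()
            timings["ttft_s"] = first - start
        yield chunk
    end = time.perf_counter()
    timings["total_s"] = end - start
    timings["decode_s"] = end - first if first is not None else 0.0
```

Attaching `ttft_s` and `decode_s` as separate span attributes does not identify the cause by itself, but it does separate "slow to start" from "slow to finish", which is the first fork in the post-mortem.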

Markdown Beats JSON: The Output Format Tax You're Paying Without Measuring

· 11 min read
Tian Pan
Software Engineer

Most teams flip JSON mode on the day they ship and never measure what it costs them. The assumption is reasonable: structured output is a correctness win, so why wouldn't you take it? The answer is that strict JSON-mode constrained decoding routinely shaves 5–15% off reasoning accuracy on math, symbolic, and multi-step analysis tasks, and nobody notices because the evals were run before the format flag was flipped — or the evals measure parseability, not quality.

The output format is a decoding-time constraint, and like every constraint it warps the model's probability distribution. The warp is invisible when you look at logs: the JSON is valid, the schema matches, the field types line up. What you cannot see in the logs is the reasoning that the model would have produced in prose but could not fit inside the grammar you gave it. The format tax is real, well-documented in the literature, and almost universally unmeasured in production.

This post is about when to pay it, how to stop paying it when you don't have to, and what a format-choice decision tree actually looks like for engineers who want structured output and accuracy at the same time.

Multi-Model Reliability Is Not 2x: The Non-Linear Cost of a Second LLM Provider

· 13 min read
Tian Pan
Software Engineer

The naive calculation goes like this. Our primary provider has 99.3% uptime. Add a second provider with similar uptime and independent failures, and simultaneous failure drops to roughly 0.005%. Multiply cost by two, divide risk by roughly 140. Engineering leadership signs off on the 2x budget and the oncall rotation stops paging on provider outages. The spreadsheet says this is the best reliability investment on the roadmap.

Six months later the spreadsheet is wrong. The eval suite takes 3x as long to run, prompt changes need two PRs, the weekly regression report has two columns that disagree with each other, and nobody can remember which provider the staging fallback is currently routing to. The 2x budget is closer to 4–5x once the team tallies the human hours spent keeping both paths calibrated. The second provider is still technically serving traffic, but half the features have been quietly pinned to one side because keeping both in sync stopped being worth it.

This is the multi-model cost trap. The reliability math is correct; the operational math is the part teams get wrong. What follows is the cost decomposition of going multi-provider, the single-provider-with-degraded-mode option most teams should try first, and the narrow set of criteria that actually justify the nonlinear complexity.
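The reliability arithmetic can be reproduced in a few lines. This sketch assumes both providers have the same 99.3% uptime and that their outages are fully independent, which is the optimistic assumption the naive calculation rests on; the function name is made up for this example.

```python
def joint_downtime(uptime_a: float, uptime_b: float) -> float:
    """Probability that both providers are down at once, assuming independence."""
    return (1 - uptime_a) * (1 - uptime_b)


single = 1 - 0.993                    # downtime with one provider
both = joint_downtime(0.993, 0.993)   # downtime with two independent providers
risk_reduction = single / both        # how many times rarer a full outage becomes

print(f"single provider down: {single:.2%}")
print(f"both providers down:  {both:.4%}")
print(f"risk reduction:       {risk_reduction:.0f}x")
```

Under these assumptions the joint downtime is about 0.0049% and the reduction is about 143x. Any correlation between the two providers' outages (a shared cloud region, a shared upstream dependency) makes the real number worse.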

The Output Commitment Problem: Why Streaming Self-Correction Destroys User Trust More Than the Original Error

· 10 min read
Tian Pan
Software Engineer

A user asks your agent a question. Tokens start flowing. Three sentences in, the model writes "Actually, let me reconsider — " and pivots to a different answer. The revised answer is better. The user closes the tab.

This is the output commitment problem, and it is one of the most consistently underestimated UX failures in shipped AI products. The engineering mindset treats self-correction as a feature — the model noticed its own error, that is the system working as intended. The user-perception mindset treats it as a disaster — the product demonstrated, live, that its first confident claim was wrong. Those two readings are both correct, and they do not reconcile on their own.

The core asymmetry is that streaming makes thinking legible, and legible thinking is auditable thinking. A model that hallucinated silently and then produced a clean final answer would look competent. The same model, streaming every half-thought, looks like it is flailing. The answer quality is identical. The perception is not.

Pattern-Matching Failures: When Your LLM Solves the Wrong Problem Fluently

· 11 min read
Tian Pan
Software Engineer

A user pastes a long, complicated bug report into your AI assistant. It looks like a classic null-pointer question, with the same phrasing and code layout as thousands of Stack Overflow posts. The model responds confidently, cites the usual fix, and sounds authoritative. The user thanks it. The bug is still there. The report was actually about a race condition; the null-pointer framing was incidental to how the user described the symptom.

This is the single hardest bug class to catch in a production LLM system. The model did not refuse. It did not hedge. It did not hallucinate a fake API. It solved the wrong problem, fluently, and everyone downstream — the user, your eval pipeline, your guardrails — saw a plausible on-topic answer and moved on. I call these pattern-matching failures: the model latched onto surface features of the query and produced a confident answer to something adjacent to what was actually asked.

Your Planner Knows About Tools Your User Can't Call

· 9 min read
Tian Pan
Software Engineer

A free-tier user opens your support chat and asks, "Can you issue a refund for order #4821?" Your agent replies, "I'm not able to issue refunds — that's a manager-only action. You could escalate via the dashboard, or I can transfer you." The refusal is correct. The ACL at the refund tool is correct. And you have just told an anonymous user that a tool named issue_refund exists, that it is gated by a role called manager, and that your platform accepts order IDs of the shape #NNNN.

Your planner knows about tools your user can't call. That asymmetry (full catalog visible to the reasoning layer, partial catalog executable at the action layer) is where most agent authorization goes quietly wrong. ABAC at the tool boundary catches the unauthorized invocation. It doesn't catch the capability disclosure that already happened one token earlier, in the plan, the refusal, or the "helpful" suggestion of a workaround.
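One possible mitigation is to filter the catalog before the planner ever sees it, so the model cannot describe a tool the caller is not allowed to use. The sketch below is an assumption-laden illustration: the `Tool` record, the role names, and the linear role ranking are all hypothetical, and a real ABAC policy would be richer than a single rank comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    required_role: str


# Hypothetical role hierarchy; higher rank means more privilege.
ROLE_RANK = {"anonymous": 0, "user": 1, "manager": 2}


def visible_tools(catalog: list[Tool], role: str) -> list[Tool]:
    """Return only the tools this role may call; unknown roles fail closed."""
    rank = ROLE_RANK.get(role, 0)  # unknown caller role gets the lowest rank
    return [
        t for t in catalog
        # An unknown required_role is treated as unreachable.
        if ROLE_RANK.get(t.required_role, float("inf")) <= rank
    ]
```

The ACL at the tool boundary still has to stay in place. Filtering the planner's view only removes the disclosure path; it does not replace enforcement at invocation time.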