30 posts tagged with "inference"

The Deterministic Seed Your Provider Treated as a Hint, Not a Contract

· 10 min read
Tian Pan
Software Engineer

The CI test was a single assertion: same model, same temperature, same prompt, same seed, same output string. It passed on every developer's laptop, passed on the first hundred CI runs, and then flaked once every fifty runs for three weeks before anyone admitted the pattern was real. The first hypothesis was the obvious one — a non-deterministic dependency somewhere in the test harness — and three days of investigation found nothing. The actual cause was sitting in a footnote on the provider's API reference: "seed provides best-effort determinism." The team had read the parameter name and assumed a contract. The provider had documented a hint.

This is a specific failure mode of hosted inference that catches teams who design test infrastructure around a single mental model: the model is a pure function of its inputs, and the seed is what makes the function reproducible. Both halves of that model are wrong in production, and the gap between the API surface and the underlying physics is wide enough that teams build entire eval and regression-test stacks on top of an assumption their provider explicitly disclaimed.
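If the seed is only a hint, one option is to have the regression test measure how often repeated calls agree and assert a threshold, rather than assert exact equality on a single call. The sketch below is a minimal illustration under stated assumptions: `generate` is a hypothetical zero-argument callable wrapping whatever provider client you use, and `fake_provider_factory` is a stand-in that simulates a one-in-fifty flake. Neither is a real provider API.

```python
from collections import Counter
from typing import Callable


def agreement_rate(generate: Callable[[], str], runs: int = 20) -> float:
    """Fraction of runs that returned the single most common output."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    outputs = Counter(generate() for _ in range(runs))
    most_common_count = outputs.most_common(1)[0][1]
    return most_common_count / runs


def fake_provider_factory(flake_every: int) -> Callable[[], str]:
    """Hypothetical stand-in: returns a different string on every
    `flake_every`-th call, simulating best-effort determinism."""
    calls = {"n": 0}

    def generate() -> str:
        calls["n"] += 1
        return "variant" if calls["n"] % flake_every == 0 else "expected"

    return generate


# 100 calls with a flake on calls 50 and 100: 98 of 100 outputs agree.
rate = agreement_rate(fake_provider_factory(flake_every=50), runs=100)
print(f"agreement rate: {rate:.2f}")
```

A test built this way fails when agreement drops below a level you chose deliberately, which is a different signal from a one-in-fifty flake on an equality assertion.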

The GPU Reservation Your Batch Workload Starved Your Real-Time Path On

· 9 min read
Tian Pan
Software Engineer

The nightly fine-tune job starts at 02:00 UTC. It walks into the shared GPU pool, takes every slot it can find, and holds them. By 09:30, when the first inference traffic of the business day arrives, the autoscaler tries to claim capacity that has been continuously occupied for seven and a half hours. The first ninety minutes of the morning run at roughly four times the baseline p99 latency. The dashboard reports a "noisy morning tail" that the inference team attributes to user behavior, because the actual contention lives in a job queue nobody on the inference team owns.

This is the GPU-sharing failure mode that the cost-attribution slide in your capacity review does not capture. The sharing was sold as a utilization win — train at night, serve in the day, fill the trough. What actually shipped was a latency tail you cannot escape until the pool is partitioned by latency class, not by team or by clock.

The Slow Turn That Wasn't Yours: KV Cache Eviction Mid-Conversation

· 10 min read
Tian Pan
Software Engineer

A conversation has been moving along on a single Claude session for forty minutes. Eleven turns, each averaging 800ms time-to-first-token, each cheap because the 28,000-token prefix is hitting the prompt cache. Turn twelve arrives and TTFT is 3.4 seconds. The transcript hasn't changed shape. The model didn't switch. The network is fine. Cached input tokens drop from 27,800 to 0. The next turn's prefill bill is paid in full, from the first token.

You go looking for the cause in your traces and find nothing that names it. There is no event in your logs labeled "another tenant's burst evicted you." The only honest reading of the spike is that some other customer's prompt, somewhere on the same GPU pool, made the scheduler decide your warm prefix was the cheapest thing to drop. You cannot replay the turn. You cannot prove the eviction. The cache state at that moment was a function of strangers' traffic, and that traffic is not in your trace because it was never yours to see.
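You cannot prove the eviction, but you can at least flag the turns where it probably happened. The sketch below assumes you already log a cached-input-token count and TTFT per turn. The `TurnUsage` record and its field names are hypothetical, not a provider schema, and the figure for turn 10 is made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TurnUsage:
    turn: int
    cached_input_tokens: int  # hypothetical field: tokens served from the prompt cache
    ttft_ms: float            # time to first token, in milliseconds


def suspected_evictions(turns: List[TurnUsage], min_prior_cached: int = 1024) -> List[int]:
    """Return turn numbers where cached tokens collapsed to zero
    immediately after a turn that had a warm prefix."""
    flagged = []
    for prev, cur in zip(turns, turns[1:]):
        if prev.cached_input_tokens >= min_prior_cached and cur.cached_input_tokens == 0:
            flagged.append(cur.turn)
    return flagged


history = [
    TurnUsage(turn=10, cached_input_tokens=27_000, ttft_ms=800),  # illustrative value
    TurnUsage(turn=11, cached_input_tokens=27_800, ttft_ms=800),
    TurnUsage(turn=12, cached_input_tokens=0, ttft_ms=3400),
]
flagged = suspected_evictions(history)
print(flagged)
```

The output names the turn, not the cause. It only separates "our prefix went cold" from "our code got slower" when you read the trace later.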

The Nightly Batch Job That Quietly Became a Latency-Critical Service

· 10 min read
Tian Pan
Software Engineer

It started as a cron job. Every night at 2 a.m., a script woke up, pulled the day's records, ran them through a model, wrote the results to a table, and went back to sleep. It was the simplest possible shape for the problem, and for a year it was exactly the right shape. Nobody thought about it because nobody needed to.

Then someone asked if the results could be ready by 8 a.m. instead of noon. Then someone asked if a user could trigger a run for a single record on demand. Then a product manager asked if it could "feel instant" inside the app. Each request was reasonable. Each change was small. And at no point did anyone open a document titled "Re-architecting the inference pipeline," because at no point did any single change feel like a rewrite.

Eighteen months later you have a latency-critical online service wearing the body of a batch job. It has a p99 nobody measures, a queue nobody drains, and a failure mode where one bad record stalls a user-facing request because the pipeline was built to retry the whole batch. This is one of the most common architectural failures in AI systems, and it almost never shows up as a decision. It shows up as a slow accumulation of reasonable yeses.

Warm Pools and Cold Truths: The Hidden Latency Floor of Serverless LLM Inference

· 9 min read
Tian Pan
Software Engineer

Autoscaling your GPU inference to zero looks like obvious cost discipline. The GPU is the most expensive line item on the bill, traffic is bursty, and the idle hours are pure waste. So you turn on scale-to-zero, watch the cloud invoice drop, and congratulate yourself.

Then a user shows up after a quiet stretch, and their first request takes sixty seconds to return a single token. Production deployments running serverless LLM inference routinely report cold starts exceeding 40 seconds before the first token appears, against roughly 30 milliseconds per token once the model is warm. That is a gap of more than a thousand-fold between the cold path and the warm path, and it is entirely a function of how idle your traffic happens to be.

This is the trade nobody puts on the slide. Scale-to-zero does not eliminate cost; it converts a steady dollar cost into a spiky latency cost, and then hides that latency cost in the p99 tail where the dashboard rarely looks.
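The arithmetic behind that trade is short enough to write down. The sketch below uses the 40-second and 30-millisecond figures from above. The 2% cold-request fraction is an assumption chosen purely for illustration, and `blended_latency_seconds` is a hypothetical helper.

```python
def blended_latency_seconds(cold_fraction: float, cold_s: float, warm_s: float) -> float:
    """Mean latency when a fraction of requests land on a cold replica."""
    if not 0.0 <= cold_fraction <= 1.0:
        raise ValueError("cold_fraction must be between 0 and 1")
    return cold_fraction * cold_s + (1.0 - cold_fraction) * warm_s


COLD_START_S = 40.0   # cold start before the first token (figure from the text)
WARM_TOKEN_S = 0.030  # per-token latency once warm (figure from the text)

cold_to_warm_ratio = COLD_START_S / WARM_TOKEN_S

# Assumption for illustration: 2% of requests arrive after an idle stretch.
mean_latency = blended_latency_seconds(0.02, COLD_START_S, WARM_TOKEN_S)

print(f"cold/warm ratio: {cold_to_warm_ratio:.0f}x")
print(f"mean latency at 2% cold: {mean_latency:.3f} s")
```

Under that assumption the rare cold requests account for almost all of the mean, while the median request still looks warm.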

LLM Tail Latency: Why Your P99 Is a Disaster When P50 Looks Fine

· 10 min read
Tian Pan
Software Engineer

Your LLM API returns a median (P50) latency of 800 milliseconds. Your dashboard is green. Your SLAs say "under two seconds." Then a user files a support ticket: "it just spins for thirty seconds and then gives up." You check the logs and see a P99 of 28 seconds.

That gap — a 35x ratio between median and tail latency — is not a fluke. It is a structural property of how LLMs work, and it will not go away by tuning your timeouts.
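To see how a green median and a 28-second tail coexist, compute both percentiles over the same samples. The sketch below uses a nearest-rank percentile and a synthetic sample set built to match the numbers above. The 98-to-2 split is an assumption made only so the example reproduces them.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic data: 98 requests at 800 ms, 2 requests at 28 s.
latencies_ms = [800.0] * 98 + [28_000.0] * 2

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
tail_ratio = p99 / p50

print(f"P50={p50:.0f} ms  P99={p99:.0f} ms  ratio={tail_ratio:.0f}x")
```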

Diffusion Models in Production: The Engineering Stack Nobody Discusses After the Demo

· 10 min read
Tian Pan
Software Engineer

Your image generation feature just went viral. 100,000 requests are coming in daily. The API provider's rate limit technically accommodates it. Latency crawls to 12 seconds at p95. Your NSFW classifier is flagging legitimate medical illustrations. A compliance audit surfaces that California's AI Transparency Act, signed in September 2024, carries watermarking requirements. Support has 50 open tickets from users whose content was silently blocked. By the time you realize you need a real production stack, you've already burned two weeks in crisis mode.

This is the moment "just call the API" fails—not because the API is bad, but because the demo's success exposes every assumption you made about inference latency, content policy, moderation fairness, and regulatory compliance. The engineering work nobody shows you in tutorials lives here.

Thinking Budgets: When Extended Reasoning Models Actually Make Economic Sense

· 10 min read
Tian Pan
Software Engineer

A surprising number of AI teams default to extended thinking on every query once they gain access to an o3-class or Claude extended thinking model. The logic seems obvious: smarter reasoning equals better outputs, so why not always enable it? The problem is that this reasoning fails to account for a basic fact of how test-time compute scaling works in practice. Extended thinking dramatically improves performance on a specific class of tasks, degrades quality on others, and can inflate your inference costs by 5–30x across the board. The teams getting the most value from these models treat the reasoning budget as an explicit decision — one with the same weight as model selection or prompt engineering.

This post lays out the task taxonomy, the cost structure, and the routing decision framework that distinguishes teams who use thinking budgets strategically from teams who are just paying a premium for an illusion of quality.

The Budget Inversion Trap: Why Your Most Valuable AI Features Get the Cheapest Inference

· 8 min read
Tian Pan
Software Engineer

Most teams say they optimize AI inference costs by routing simpler queries to cheaper models. That sounds reasonable, but in practice it runs backwards. The queries that go to cheap models first aren't the simple ones. They're the complex ones, because those are the expensive ones your FinOps dashboard flagged.

The result: your contract renewal workflow, the one that closes six-figure deals, runs on a model that hallucinates clause references. Your customer support triage — entry-level stuff, genuinely low-stakes — gets frontier model treatment because nobody complained about it yet.

This is the budget inversion trap. It's not caused by negligence. It's the predictable output of applying cost pressure without value context.

The Shadow Compute Tax: Why Your AI Inference Bill Is Bigger Than Your Users Deserve

· 9 min read
Tian Pan
Software Engineer

You're being charged for tokens that no user ever read. Not because of bugs, not because of vendor pricing tricks — but because your system is working exactly as designed, firing off background inference work that looked smart on a whiteboard but burns real budget on every request.

This is the shadow compute tax: the fraction of your inference spend that goes toward AI work that is speculative, premature, or structurally guaranteed never to reach a user. It's invisible in your dashboards until suddenly it isn't, and by then it's baked into your cost model as an assumption.
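Making the tax visible starts with tagging each inference call by whether its output ever reached a user. The sketch below assumes you can attach such a flag at logging time. The `InferenceCall` record and the token counts are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class InferenceCall:
    tokens: int          # total billed tokens for the call
    shown_to_user: bool  # hypothetical flag: did any user ever see the output?


def shadow_fraction(calls: List[InferenceCall]) -> float:
    """Share of billed tokens spent on output that no user ever saw."""
    total = sum(c.tokens for c in calls)
    if total == 0:
        return 0.0
    unseen = sum(c.tokens for c in calls if not c.shown_to_user)
    return unseen / total


calls = [
    InferenceCall(tokens=1000, shown_to_user=True),   # the answer the user read
    InferenceCall(tokens=600, shown_to_user=False),   # speculative prefetch, discarded
    InferenceCall(tokens=400, shown_to_user=False),   # background summary, never opened
]
fraction = shadow_fraction(calls)
print(f"shadow compute: {fraction:.0%} of tokens")
```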

Silent Quantization: Why the Model You Pay For Today Isn't the Model You Paid For Last Quarter

· 11 min read
Tian Pan
Software Engineer

The model name on your invoice is the same as it was last quarter. The version string in the API response hasn't changed. The model card and pricing page read identically. And yet your eval scores have drifted half a point downward, your refusal patterns have shifted in ways your prompts didn't ask for, and a handful of customer complaints came in last Tuesday about output that "feels different." You debug your code. You don't find anything. The code didn't change. The weights did.

Silent quantization is the gap between the model you contracted for and the model the provider is actually serving. It happens because inference economics keep tightening — every dollar of GPU capacity has to feed more requests this quarter than last — and the cheapest way to absorb that pressure is to re-host the same model name on cheaper precision tiers. FP16 becomes FP8. FP8 becomes FP4 in some routes. Mixed-precision shards get swapped in. The version string doesn't move because the version string was never a precision contract; it was a marketing contract.

The Batch-Tier Inference Question: When 50% Off Reshapes Your Architecture

· 11 min read
Tian Pan
Software Engineer

The cheapest inference dollar in your bill is the one you're paying twice. Every major model provider now offers a batch tier at roughly half the price of synchronous inference in exchange for accepting a completion window measured in hours rather than milliseconds. Most engineering organizations either ignore the option entirely, or shove a single nightly cron at it and declare the savings booked. Both responses leave 30–50% of total inference spend on the floor — not because the discount is small, but because batch isn't a coupon. It is a different product surface with its own SLAs, its own retry semantics, and its own failure modes, and the teams that treat it as a billing optimization end up either underusing it or shipping subtle regressions that take weeks to attribute.

The technical question is not "should we use batch?" The technical question is which actions in your system are actually synchronous in the user-perceived sense, which ones the engineering org has accidentally treated as synchronous because the developer experience was easier, and which ones can be re-shaped into jobs without a downstream consumer assuming the result is fresh. Answering that requires a workload audit, an architectural shift from request-shaped to job-shaped contracts, and an honest mapping of every agent action to a latency tier based on user expectation rather than developer convenience.
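The audit described above can start as a small table of actions and a rule for assigning each one a tier. The sketch below is one possible shape, not a prescribed method. `assign_tier` and `spend_after_batch` are hypothetical helpers, and the 24-hour completion window and the 60% batch-eligible share are assumptions. Only the roughly-half-price discount comes from the text.

```python
from enum import Enum


class Tier(Enum):
    SYNC = "sync"
    BATCH = "batch"


def assign_tier(user_waits: bool, freshness_budget_s: float,
                batch_window_s: float = 24 * 3600) -> Tier:
    """Send an action to batch only if no user is waiting on it and its
    consumers tolerate results as old as the batch completion window."""
    if user_waits or freshness_budget_s < batch_window_s:
        return Tier.SYNC
    return Tier.BATCH


def spend_after_batch(sync_spend: float, batch_eligible_fraction: float,
                      discount: float = 0.5) -> float:
    """Total spend after moving the eligible share of work to the discounted tier."""
    if not 0.0 <= batch_eligible_fraction <= 1.0:
        raise ValueError("batch_eligible_fraction must be between 0 and 1")
    return sync_spend * (1.0 - batch_eligible_fraction * discount)


chat_reply = assign_tier(user_waits=True, freshness_budget_s=2)
weekly_digest = assign_tier(user_waits=False, freshness_budget_s=7 * 24 * 3600)

# Assumption: 60% of a 10,000-unit monthly spend turns out to be batch-eligible.
new_spend = spend_after_batch(10_000.0, batch_eligible_fraction=0.6)
print(chat_reply.value, weekly_digest.value, round(new_spend))
```

The rule is deliberately conservative: anything with a freshness budget shorter than the completion window stays synchronous, which guards against a downstream consumer assuming a batch result is fresh.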